[Help] N-Gram, ngram

清风出袖 · 2005-09-01

Why is it called "gram"? Are there any specific reasons for the naming? Thanks a lot for your kind explanation!

[This post was edited by xujiajin on 2005-10-01 at 10:16:36]

动态语法 · 2005-09-02

Re: [Help] N-Gram

-gram
suffix
Something written or drawn; a record: cardiogram.
(dictionary.com)

Similar to "character", but more general in meaning.

Main Entry: n-gram
Part of Speech: noun
Definition: a sequence of variable characters that stands for a word or string of words in a corpus.
(dictionary.com)

xiaoz · 2005-09-02

But in corpus linguistics, it typically refers to the latter sense, i.e. a string of tokens, not of characters.

清风出袖 · 2005-09-02

Thanks a lot for your kind explanations! But tokens, -grams, characters and words probably each have a specific reference in linguistics, with slight nuances. Right? So, in a word, we may take them as the same in general!

wzli · 2005-09-04

Re: [Help] N-Gram

Not exactly. A token in a corpus is a running word, and the term contrasts with 'type': the occurrences of a type are its tokens. We don't often say 'character' for the English language. The word 'word' is highly ambiguous, so one sometimes prefers 'form' or 'word form' where 'word' would otherwise be used. 'Form' is also a very important idea promoted by Sinclair along with 'sense'; the two, as Sinclair was able to show, are quite inseparable. And 'gram' is quite computational, indicating any string that has spaces on both sides. A 3-gram is therefore a cluster consisting of three contiguous strings, whatever words they are. N-grams vary in length, i.e. in the number of strings in the cluster. The distinction between 'a character' and 'a word' is much trickier and finer in Chinese linguistics; see Pan Wenguo's Character-Based Studies.

[This post was edited by xiaoz on 2005-09-04 at 00:19:22]
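To make the "string with spaces on both sides" idea concrete, here is a minimal sketch in Python. It assumes plain whitespace tokenisation (which is only a rough fit for English and no fit at all for unsegmented Chinese); the function name `word_ngrams` and the sample sentence are made up for illustration.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous n-token sequences from whitespace-delimited text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "the cat sat on the mat and the cat slept"

# Every window of three contiguous tokens is a 3-gram, whatever the words are.
trigrams = word_ngrams(sample, 3)

# Counting 2-grams shows which clusters recur.
bigram_counts = Counter(word_ngrams(sample, 2))

# Tokens are running words; types are the distinct forms among them.
tokens = sample.split()
types = set(tokens)

print(trigrams[:2])
print(bigram_counts.most_common(1))
print(len(tokens), "tokens,", len(types), "types")
```

The same window logic works for any n; only the tokenisation step decides whether the "grams" are words, characters or something else.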

xiaoz · 2005-09-04

In reply to No. 5 -

1) Terms such as "word form" are typically used in relation to "lemma", e.g. "go", "goes", "going", "went" and "gone" are different word forms of GO.

2) Some people at the 1st Int'l Symposium on Contrastive and Translation Studies of English and Chinese (Shanghai, 08/2002, organised by Pan) proposed a character-based approach to the study of Chinese. But in my view such an approach is misleading, as many Chinese characters are only meaningful in combination with other characters.

清风出袖 · 2005-09-04

Thanks a lot, Dr. Xiao and Dr. Li, for your detailed explanations! Could you tell me exactly which publishing house released Pan Wenguo's Character-Based Studies? Is Pan Wenguo written as 潘文国? And what is the title of the book in Chinese? Thanks a lot!

xiaoz · 2005-09-04

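《字本位与汉语研究》 (roughly, "The Character-Based Approach and Chinese Language Studies"), Shanghai: East China Normal University Press, 2002.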

xiaoz · 2005-09-04

Taking the character as the basis to bridge ancient and modern: a review of Pan Wenguo's 《字本位与汉语研究》

2003-05-28 11:49:57 Author: Shang Xin (尚新)

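Chinese language studies currently face two major problems. First, there is a rupture between the study of Classical Chinese and the study of Modern Chinese, a situation found nowhere else in the world. Second, all current theories of Chinese language research are imported, and some basic problems of Chinese grammar have still not been satisfactorily resolved. In recent years, dissatisfaction with the state of Modern Chinese research has grown ever louder. Some linguists in China have begun to reflect seriously on the awkward position of Chinese language research and to take rational stock of a century of work, and Pan Wenguo's new book 《字本位与汉语研究》 (hereafter 《字本位》) is undoubtedly among the most profound and distinctive contributions to this wave of reflection.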

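Ever since 《马氏文通》, the Chinese character has been regarded as merely the written "form" and treated unfairly; to this day the "character" (字) still suffers discrimination from certain entrenched ideas in the linguistics camp. What role do Chinese characters actually play in Chinese language research? Answering this requires not only an account of the characteristics of Chinese characters but, most importantly, a settled view of language. Pan Wenguo holds that if Chinese linguistics is to be revitalized, it must start from deep philosophical reflection and build a philosophy of language of its own for Chinese. For many years, Pan's research has centred on exploring the nature of language and building such a philosophy. Without its own philosophy of language and philosophy of grammar, Chinese language research can only be reduced to a footnote to Western theory. Which Western linguistic theory, one may ask, did not arise and develop on a solid philosophical foundation? Seen this way, what we should follow is the overall pattern of Western language research: the participation of more philosophers, cultural scholars, psychologists, sociologists and others in the study of Chinese. And the first step is to have a clear understanding of "what language is".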

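After long exploration, Pan Wenguo concludes that language is the way and the process by which human beings cognize the world and express that cognition. The philosophical significance of this definition is that it first acknowledges language to be the product of objective reality acted upon by human thought, then stresses the reverse effect of language on human cognition, and thereby establishes linguistics as one of the human sciences. The definition also matters because it leads to a reassessment of the status of Chinese characters: it admits that characters came after speech, yet stresses that, once formed, characters constrained the development of the Chinese language. It is in this sense that Chinese characters deserve a place in Chinese linguistics. This does not mean that all research on characters is linguistic in nature; what is meant is the study of graphic form (语形学) above the level of the 形位, and stressing this draws a clear line against graphemics (字位学). In this way the Chinese character, "one body with three aspects" (一体三相), finds its proper place in Chinese linguistics.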

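Only by taking the character as the basic unit can the rupture between research on ancient and modern Chinese be bridged. Traditional Chinese phonology (音韵), script studies (文字) and exegetics (训诂) examined the structure of the "character" from different angles, with the study of "character meaning" at the core. Once the character is established as the basic structural unit of Chinese language research, the common axis of ancient and modern Chinese studies falls into place. In the theoretical framework of 《字本位》, the character is the foundation: below the character lie phonology (音韵学) and 形位学; above it lie linguistic units such as 辞, 读, 句 and 篇 (roughly: expression, pause unit, sentence and text). Such research not only fits the nature of the Chinese language and Chinese characters and offers a way to connect ancient and modern grammatical studies, but can also draw on modern general linguistic theory and methodology, so that traditional scholarship is elevated and modernized and Chinese language research is aligned with international language research on all fronts.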

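Only by taking the character as the basic unit can Chinese language research truly catch up with international language research. If bridging ancient and modern Chinese studies concerns what is particular to Chinese, then aligning with international research concerns what is universal to human language. The past century of Chinese language research was built on the word-based approach and grammatical system of the Indo-European languages, and the result is that the face of Chinese has become ever more blurred. Practice has shown this to be a mistaken combination. How, then, can a successful alignment be achieved, and from what standpoint? Pan Wenguo has found that standpoint: the character as the basic unit. The theory of 《字本位》 establishes correspondence of system, commonality of methodology and complementarity of theory between Chinese and Western language research. The creation of 形位学 combines traditional theories of character formation with contemporary Western word-formation theory, building a system of Chinese 语形学 that synthesizes upward, analyzes downward and corresponds to Western grammar, thus avoiding the embarrassment of Chinese having only "syntax" and no "morphology". The theory of 章句学, built on traditional 章句学 and drawing on Western syntax and discourse studies, takes generation and regulation as its supporting framework and studies the generation and regulation of 辞 and 句 in Chinese, as well as the principles by which texts are composed and controlled. The theory of 字义学, built on traditional Chinese exegetics (训诂学) and combined with Western semantic theory, studies character meaning with the theory and methodology of modern general linguistics; it not only aligns the semantics of Modern Chinese with international research on a character basis, but also gives traditional exegetics a new lease of life as it reaches the wider world.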

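《字本位》 has its own theoretical character. It looks at Chinese language research as a whole, strives to explore the philosophy of the Chinese language, and sets out to resolve a series of major issues, including the view of language, of grammar, of Chinese and of the basic structural unit. It breaks out of the framework of modern and contemporary Western language research and so constructs a theoretical system of linguistics with Chinese characteristics. The discovery of the law of sound-meaning interaction (音义互动律) is a striking example, and an inevitable result of studying Chinese on a character basis. "Chinese is a semantics-oriented language, and Chinese is also a 音足型 language": that is, meaning and syllable interact and together shape the individuality of Chinese. Guo Shaoyu (郭绍虞), Zhao Yuanren (赵元任), Lü Shuxiang (吕叔湘) and others had all suggested that sound (rhythm) can affect word formation and syntax. But it was Pan Wenguo who first generalized this phenomenon as the "law of sound-meaning interaction" and treated it as a grammatical device. Exploring this law more fully and from more angles will surely give Chinese language research a new look.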

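The theory of 《字本位》 is significant for the history of linguistics. First, it takes the essence of traditional Chinese language study, the multidimensional study of the character, as its basis and rebuilds the grammatical system of Chinese in line with international academic norms, promoting the modernization of the tradition. Second, it advocates language research in which China and the West engage in dialogue as equals, so that Chinese language research can gradually step out of the shadow of the West. Third, it not only breaks out of the framework of modern and contemporary Western language research and builds a theoretical system of linguistics with Chinese characteristics, but also dares to propose and establish new terms (such as 形位学, 语形学, 字义学 and 音义互动律). Finally, the publication of 《字本位》 shows that character-based Chinese language research is beginning to find its way out of difficulty; it embodies the deeper meaning of the human sciences, the basic spirit of dialectical sublation, and the basic stance of autonomy and equality. Amid a stampede of followers, to face one's own tradition squarely and to look at the other as an equal takes uncommon insight and extraordinary courage. We therefore say that the author's assessment of Zhang Zhigong (张志公) applies equally to the author himself: "We may not agree with his views, but we cannot help admiring his spirit of reflection. It is precisely this spirit that drives the continued development of Chinese linguistics and indeed of every undertaking."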

wzli · 2005-09-04

Re: [Help] N-Gram

More about 'form'
Yes, you can say a form has something to do with the lemma, in that different 'forms' can be lemmatized as one base form. But I feel there's more to it. In corpus investigation, a unique form is not simply morphologically or grammatically distinctive, but carries a unique meaning. So 'certainly' is not simply the adverb form of the adjective 'certain', but is used with a different meaning. According to Sinclair, form is closely associated with meaning (or sense), just like the two sides of a coin. Interpreted this way, the practice of lemmatization is rooted in a non-corpus-based grammatical idea: that one invents a battery of parts of speech and makes all the words fit into the slots. Nowadays we tend to look at each individual form as unique and observe it in its context. In this light, much of the work of tagging and parsing is but a makeshift. So a clean text can always be a good place to start.

About the 'character-based' approach
I read Pan's book carefully and feel he provides a powerful argument, which is not to be discarded easily. Monosyllabic characters still occupy an important place in Chinese, and they might be a good place to start.

xiaoz · 2005-09-04

In reply to No. 10 -

1) form vs. word form
For the sake of clarity, it is desirable to maintain a distinction between "form" (as opposed to meaning or function) and "word form" (as opposed to lemma).

2) lemmatisation, POS tagging and parsing
These are three types of corpus annotation; there are of course many other types. While POS tagging and parsing are closely associated, lemmatisation is clearly a distinct annotation type, which means lemmatisation is not rooted in POS tagging or parsing. They are not to be conflated.

3) form, pattern, patterning, "functionally complete unit of meaning"
"Form" is conventionally associated with "function", but it has also been used in relation to "meaning". Hunston and Partington et al refer to it as pattern or patterning, while Sinclair and Tognini-Bonelli et al call it a "functionally complete unit of meaning".

4) plain text vs. corpus annotation

There are a number of criticisms of corpus annotation, notably by so-called "corpus-driven linguists". The plain text argument is one of them. Below is a section that deals with these issues.

Corpus annotation = added value

Like corpus markup, annotation adds value to a corpus. Leech (1997a: 2) maintains that corpus annotation is ‘a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development.’ Both Leech (ibid: 4-5) and McEnery (2003: 454-455) suggest that there are at least four advantages for corpus annotation.
Firstly, it is much easier to extract information from annotated corpora in a number of ways. Leech (ibid) observes, for example, that without part-of-speech tagging (see unit 4.4.1), it is difficult to extract left as an adjective from a raw corpus as its various meanings and uses cannot be identified from its orthographic form or context alone. For example, the orthographic form left with a meaning opposite to right can be an adjective, an adverb or a noun. It can also be the past or past participle form of leave. With appropriate part-of-speech annotations these different uses of left can be readily distinguished apart. Corpus annotation also enables human analysts and machines to exploit and retrieve analyses of which they are not themselves capable (McEnery 2003: 454). For example, even if you do not know Chinese, given a suitably annotated Chinese corpus, you are able to find out a great deal about Chinese using that corpus (see case study 6 in Section C). Speed of data extraction is another advantage of annotated corpora. Even if one is capable of undertaking the required linguistic analyses, one is quite unlikely to be able to explore a raw corpus as swiftly and reliably as one can explore an annotated corpus if one has to start by annotating the corpus oneself.
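As a rough illustration of the "left" example, the sketch below retrieves sentences in which a word carries a given part-of-speech tag. The four sentences and the coarse tag labels (ADJ, ADV, NOUN, VERB and so on) are invented for this sketch and do not come from any particular tagset or corpus; `find_sentences` is a hypothetical helper name.

```python
# A tiny hand-tagged "corpus": each sentence is a list of (word, tag) pairs.
# Sentences and tag labels are invented for illustration only.
tagged_corpus = [
    [("She", "PRON"), ("left", "VERB"), ("early", "ADV")],
    [("He", "PRON"), ("raised", "VERB"), ("his", "DET"), ("left", "ADJ"), ("hand", "NOUN")],
    [("Turn", "VERB"), ("left", "ADV"), ("here", "ADV")],
    [("Keep", "VERB"), ("to", "ADP"), ("the", "DET"), ("left", "NOUN")],
]

def find_sentences(corpus, word, tag):
    """Return plain-text sentences in which `word` occurs with the tag `tag`."""
    hits = []
    for sentence in corpus:
        if any(w.lower() == word and t == tag for w, t in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits

# With tags, the adjective use is one lookup away.
print(find_sentences(tagged_corpus, "left", "ADJ"))

# Without tags, a search on the orthographic form returns every use of "left".
print([" ".join(w for w, _ in s) for s in tagged_corpus
       if any(w.lower() == "left" for w, _ in s)])
```

The second query is what a raw corpus gives you: all four sentences, with the sorting left to the analyst.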
Secondly, an annotated corpus is a reusable resource, as annotation records linguistic analyses within the corpus that are then available for reuse. Considering that corpus annotation tends to be costly and time consuming, reusability is a powerful argument in favour of corpus annotation (cf. Leech 1997a: 5).
Thirdly, an advantage of corpus annotation, related to reusability, is multi-functionality. A corpus may have originally been annotated with one specific purpose in mind. However, corpus analyses may be reused for a variety of applications and even for purposes not originally envisaged.
Finally, corpus annotation records a linguistic analysis explicitly. As such, the corpus annotation stands as a clear and objective record of analysis that is open to scrutiny and criticism (cf. McEnery 2003), a laudable goal.
In addition to these advantages we can also note that corpus annotation, like a corpus per se, provides a standard reference resource. While a corpus may constitute a standard reference for the language variety which it is supposed to represent, corpus annotation provides a stable base of linguistic analyses, objectively recorded, so that successive studies can be compared and contrasted on a common basis.
Having outlined the advantages of corpus annotation, it is necessary to address some of the criticisms of corpus annotation. Four main criticisms of corpus annotation have been presented over the past decade.
The first criticism is that corpus annotation produces cluttered corpora. Hunston (2002: 94) argues that ‘[h]owever much annotation is added to a text, it is important for the researcher to be able to see the plain text, uncluttered by annotational labels.’ While we agree that the plain text is important in a corpus analysis, especially in observing the patterning of words, corpus annotation does not necessarily obscure such a patterning, because most corpus exploration tools (e.g. WordSmith, MonoConc, SARA and Xaira, see Section C) do indeed allow users to suppress annotation in search results so as to allow users to view the plain text. As such this criticism is more directed at corpus browsing/retrieval tools rather than at corpus annotation per se.
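Suppressing annotation for display is mechanically simple. The sketch below assumes a `word_TAG` inline format, which is only one of many annotation formats and is not claimed to be what the tools named above use internally; `strip_tags` is a hypothetical helper.

```python
def strip_tags(annotated):
    """Drop the final _TAG suffix from each whitespace-delimited token.

    Assumes an inline word_TAG format; tokens without an underscore
    are returned unchanged.
    """
    return " ".join(token.rsplit("_", 1)[0] for token in annotated.split())

annotated_line = "She_PRON left_VERB early_ADV"

# The annotated line is kept on disk; the plain text is derived on demand.
print(strip_tags(annotated_line))
```

Because the plain text can always be regenerated from the annotated text, but not the other way round, keeping the annotation loses nothing for those who prefer to read the text uncluttered.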
A second criticism is that annotation imposes a linguistic analysis upon a corpus user. While it is true that corpus annotation is fundamentally interpretative in nature, there is no compulsion that corpus users accept that analysis. They can impose their own interpretations if they will or simply ignore the annotation. The plurality of interpretations of a text is something that must be accepted from the outset when undertaking corpus annotation (cf. McEnery 2003: 456). Yet just leaving a corpus unannotated does not mean that there is no process of interpretation occurring when the corpus is analyzed. Rather, the lack of annotation simply disguises the fact that such multiple-interpretations still occur when researchers use a raw corpus. The analysis still happens, it is simply hidden from clear view. Corpus annotation should be recognized as an advantage rather than a weakness in this respect as it provides an objective record of an explicit analysis open for scrutiny; failing to annotate is not simply a failure to analyze. Failing to annotate does, however, ensure that the analysis is difficult, or indeed impossible, to recover.
A further criticism is that annotation may ‘overvalue’ a corpus, making it less readily accessible, updateable and expandable (cf. Hunston 2002: 92-93). Annotation does not necessarily make a corpus less accessible. For example, many parsed (e.g. the Lancaster Parsed Corpus and the Susanne corpus, see unit 7.4) and prosodically annotated corpora (e.g. the London-Lund Corpus and the Lancaster/IBM Spoken English Corpus, see unit 7.5) are publicly available. Corpus builders are usually happy to make their corpora available as widely as possible in spite of (or sometimes because of) the huge effort that they have put into annotation. Funders are also often prepared to finance corpus construction because a valuable annotated resource will be made widely available. Public funding bodies are particularly unlikely to fund corpus building projects which do not result in a readily accessible resource. A more common reason for not making an annotated corpus (or indeed a raw corpus) publicly available is that the copyright issues related to the corpus data prohibit it (see unit 9). Copyright, not annotation, is the greater force in favour of restriction. The arguments relating to updating and expansion are also questionable. Unlike a monitor corpus, which is constantly updated to track rapid language change (see unit 7.9 for further discussion), most corpora are sample corpora. A sample corpus is designed to represent a particular language variety at a particular time. For example, the LOB and Brown corpora are supposed to represent written British and American English in the early 1960s. There are indeed ‘updates’ for the two corpora: FLOB and Frown (see unit 7.4). The two updated corpora respectively represent written British and American English in the early 1990s and can be used to track slower paced language change (see unit 15.5). The need for constant expansion is only related to the dynamic monitor corpus model. It does not necessarily apply as an argument to sample corpora. Given that most corpora are sample corpora, the expandability argument is hardly important, as with a sample corpus size is typically determined when the corpus is designed. Once the corpus is created, there is generally no need for expansion.
The final criticism is related to the accuracy and consistency of corpus annotation. There are three basic methods of annotating a corpus: automatic, computer-assisted and manual (see unit 4.3). On the one hand, as Hunston (2002: 91) argues, ‘an automatic annotation program is unlikely to produce results that are 100% in accordance with what a human researcher would produce; in other words, there are likely to be errors.’ Such errors also occur when humans alone analyze the texts; even the best linguist at times makes mistakes. Introducing a human factor into annotation may have another implication: as Sinclair (1992) argues, the introduction of a human element in corpus annotation, as in manual or computer-assisted annotation, results in a decline in the consistency of annotation. Taking the two points together, one might wonder why any linguist has ever carried out an analysis, as it would have been inaccurate and inconsistent! One must conclude that they have done so, and that annotators continue to do so because while inconsistency and inaccuracy in analyses are indeed observable phenomena, their impact upon an expert human analysis has been exaggerated. Also, the computer is not a sure-fire means of avoiding inaccuracy or inconsistency: the two points may also apply to machine analyses. Automatic annotation is not error-free, and it may be inconsistent. If resources are altered for an annotation program (the lexicon changed, rules rewritten) then over time the output of the program will exhibit inconsistency on a scale that may well exceed that displayed by human analysts. So what should we use for corpus annotation, human analysts or the computer? Given that the value of corpus annotation is well recognized, the human analyst and the machine should complement each other, providing a balanced approach to accuracy and consistency that seeks to reduce inaccuracy and inconsistency to levels tolerable to the research question that the corpus is intended to investigate.
It is clear from the above discussion that all of the four criticisms of corpus annotation can be dismissed, with caveats, quite safely. Annotation only means undertaking and making explicit a linguistic analysis. As such, it is something that linguists have been doing for centuries.

5) character-based approach
In terms of type frequency, monosyllabic words do not take a prominent place in Chinese, though some of them are very frequently used (see the token frequencies below). The following statistics are based on the 1 million word LCMC corpus (written) and the 0.92 million word Lancaster Los Angeles Spoken Chinese Corpus:

1-gram: type: 3941; token: 1011726
2-gram: type: 32449; token: 624330
3-gram: type: 8451; token: 42359
4-gram: type: 5766; token: 17101
4+-gram: type: 1710; token: 4593

It is clear that the character-based approach cannot solve many problems. The word-based approach, in contrast, also treats monosyllabic words as words.
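For readers who want the proportions rather than the raw counts, here is a small sketch that converts the figures above into shares of all types and all tokens. It assumes, from the context of this post, that "n-gram" here means a word n characters long; the variable and function names are made up for illustration.

```python
# (types, tokens) per word length, copied from the figures in this post.
# Assumption: "n-gram" here means a word written with n characters.
counts = {
    "1": (3941, 1011726),
    "2": (32449, 624330),
    "3": (8451, 42359),
    "4": (5766, 17101),
    "4+": (1710, 4593),
}

total_types = sum(types for types, _ in counts.values())
total_tokens = sum(tokens for _, tokens in counts.values())

def shares(length):
    """Return (share of all types, share of all tokens) for one word length."""
    types, tokens = counts[length]
    return types / total_types, tokens / total_tokens

for length in counts:
    type_share, token_share = shares(length)
    print(f"{length}-gram: {type_share:.1%} of types, {token_share:.1%} of tokens")
```

On these figures, one-character words come to roughly 7.5% of the types but roughly 59.5% of the tokens, which is the contrast between type frequency and token frequency drawn above.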

清风出袖 · 2005-09-04

Thanks a lot, Dr. Xiao and Dr. Li Wenzhong!

wzli · 2005-09-05

Re: [Help] N-Gram

Thanks a lot for this meticulous discussion.
I should clarify that lemmatization and POS tagging are, self-evidently, two different processes. But lemmatization is still based on preconceived notions about which word forms should be lemmatized to what. One can lemmatize 'certain' and 'certainly' as one, yet they might be different words. I once discussed the use of 'necessary', 'necessity' and 'necessarily'; the three forms mean different things and are used in different contexts. But I do not deny the value of lemmatization or of any other form of annotation. It all depends on how one conducts one's research. What I want to say is that one has to be cautious when using an annotated corpus -- it is like playing in someone else's garden.

xiaoz · 2005-09-05

RE: lemmatisation -

Agreed. Different word forms of a head word (lemma) can behave differently, e.g. in collocations.

Here is the rationale for lemmatisation. The word is a basic unit in linguistics, and is one of the units around which Xaira's analysis procedures are built. However, the lexicon of a language does not consist of an undifferentiated mass of words. Some words make up families. For instance, the English words "be", "was", "is", "are" and "were" clearly make up a related group: they are different morphosyntactic forms of a single lexical item. Lemmata (this word is the plural of lemma) and lemmatisation schemes give us a way to formally describe these relationships among words.
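In its simplest form, a lemmatisation scheme is a mapping from word forms to lemmata. The sketch below uses a tiny hand-built table covering only the forms mentioned in this thread; it is an illustration of the idea, not how Xaira or any real lemmatiser is implemented, and `LEMMA_TABLE`, `lemmatise` and `group_by_lemma` are hypothetical names.

```python
from collections import defaultdict

# Hand-built table covering only the examples from this thread (illustrative).
LEMMA_TABLE = {
    "be": "be", "was": "be", "is": "be", "are": "be", "were": "be",
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
}

def lemmatise(token):
    """Map a word form to its lemma; unknown forms map to themselves."""
    form = token.lower()
    return LEMMA_TABLE.get(form, form)

def group_by_lemma(tokens):
    """Group word forms into families under their lemma, keeping text order."""
    families = defaultdict(list)
    for token in tokens:
        families[lemmatise(token)].append(token)
    return dict(families)

print(group_by_lemma(["is", "went", "are", "gone", "corpus"]))
```

Note that the table itself encodes a decision about which forms belong together, which is exactly the point raised earlier about 'certain' and 'certainly'.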

While lemmatization is important in vocabulary studies and lexicography, e.g. in studying the distribution pattern of lexemes and improving dictionaries and computer lexicons, the usefulness of lemmatization depends on how inflectional a language is. For highly inflectional languages like Russian and Spanish, where a lemma covers a large number of inflectional variants, lemmatization is particularly useful, whereas for non-inflectional languages like Chinese, lemmatization is of limited use. As English is a language with simple inflectional morphology, which only inflects verbs for tense and nouns for plurality, lemmatization ‘may be considered somewhat redundant’ for English (Leech 1997a: 15). That may explain why, although quite accurate software is available for this purpose, few English corpora are lemmatized.

xujiajin · 2005-09-06

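The character-based (字本位) view was first proposed by Xu Tongqiang (徐通锵) of Peking University; see his 《语言论》. Pan Wenguo is only one of its many later followers.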
