Collocation Frequency: Index of Concept Saliency

xujiajin

Administrator
Staff member
Durian, David. 2002a. "Speak My Language: Collocation Frequency as an Index of Concept Saliency." Unpublished Manuscript, Northern Illinois University and North Central Regional Educational Laboratory. © 2002 David Durian. All Rights Reserved.

Speak My Language: Collocation Frequency as an Index of Concept Saliency
http://forum.corpus4u.org/upload/forum/2005102916315047.rar

David Durian
Northern Illinois University and North Central Regional Educational Laboratory

Durian, 2002b, "Speak My Language"

1. Introduction

Since the release of the Brown Corpus in electronic format in the early 1960s, researchers have used natural language corpora to study probable trends in the occurrence of collocations in a number of areas of language study, including biblical and literary studies, lexicography, dialect studies, language education studies, and studies of grammar. Of the wealth of linguistic literature investigating collocations in these areas, however, almost none has dealt specifically with semantic collocation as an index of concept saliency, or with how semantic collocation occurs as a continuum-like "markedness" phenomenon in discourse. Much of the literature that has investigated collocation in natural language discourse has focused on the use of lexical markers in discourse cohesion (e.g., Halliday and Hasan 1976; Mel'čuk and Pertsov 1987), the creation of dictionary entries (e.g., Cowie 1981; Benson 1989), the automatic generation of text using collocational knowledge (e.g., Smadja and McKeown 1991; Radev and McKeown 1997), lexical studies of probable language occurrence based on corpus data (e.g., Kjellmer 1982; Sinclair 1991; Renouf 1992), the automated translation of text from one language to another (e.g., Kupiec 1993; Dagan and Church 1994; Smadja et al. 1996), or the investigation of contextual theories of meaning (e.g., Firth 1957; Halliday 1966; Sinclair 1995), and has therefore dealt with these topics only fleetingly, if at all. Of the literature that has dealt with issues surrounding these topics more directly, the approach taken has been to look at the general aspects of semantic collocation as an index of concept saliency within term weighting approaches to word sense disambiguation (e.g., Brown et al. 1991; Church et al. 1989; Dagan and Itai 1994), and not to investigate how semantic collocations specifically function as an index of concept saliency within a "markedness" continuum.

Because of the lack of investigation into semantic collocation as a spectrum-like index of concept saliency, I have decided to discuss that topic here by conducting a study that makes use of a corpus of collocation data culled from actual Web site user search strings as entered by K-12 educators from a seven-state region in the Midwestern United States. Using the methods of both qualitative and quantitative analysis, I will investigate the statistical and psychological implications of high-frequency semantic collocations in corpus data as they reflect the concepts available in the mental lexicon of these educators, and in doing so, I will demonstrate that concept saliency can be indexed by the occurrence of semantic collocations in discourse. I will also show through my qualitative analysis of high-frequency collocations in the corpus that semantic collocations exist along a frequency-based, continuum-like spectrum of "markedness" for speakers possessing these collocations in their lexicon.

However, before I conduct this quantitative and qualitative analysis of the data (sections 4 and 5), I shall first briefly discuss, in section 2, some conceptual notions and prototypical characteristics of the phenomenon of semantic collocation as it occurs in natural language discourse. This discussion will be followed in section 3 by a description of the corpus data analyzed in this study.

2. Characteristics of Collocation in Discourse

As most researchers who have studied the occurrence of collocation in natural language discourse will attest, coming up with an exact definition of collocation can be a daunting task, regardless of the genre under study (for a survey of this problem in language studies, see Kennedy 1998; Manning and Schütze 1999; and McKeown and Radev 2000). This is especially true when trying to define semantic collocation, since this type of collocation deals with the "fuzzy" area of meaning beyond the surface-level occurrence of two or more words together in a discourse context. However, attempts at defining semantic collocation have been made in the extensive literature on this lexical phenomenon, based on the analysis of observable and quantifiable characteristics of the occurrence of collocation in natural language discourse, and in the following pages, I will discuss these characteristics as they relate to the study of semantic collocation presented in this paper.

As McKeown and Radev (2000) note, in much of the linguistic literature available on the subject, semantic collocations are usually described in terms of a continuum of lexical relations, with free-word combinations placed on one end of the spectrum, idiomatic expressions placed on the other end, and collocations occupying an area in the middle of the continuum. Within this continuum, a prototypical free-word combination can be described as a composite unit consisting of two or more words in which each of the words within the combination can be replaced by another word without the meaning of that composite unit being seriously modified by that replacement, but at the same time, if one of the words within the combination is omitted, a reader or listener will not be able to easily infer it from the remaining words.

In contrast, a prototypical idiomatic expression can be described as a rigid word combination in which none of the words within the combination can be replaced by another word or omitted without the meaning of that composite unit being seriously modified by that replacement. Additionally, the meaning of the idiomatic expression is characterized by limited compositionality, in that its meaning cannot be understood from the meaning of its constituent parts, but instead must be understood beyond the sum of those parts semantically.

Semantic collocations fall between these two extremes in that they exhibit characteristics of both free-word combinations and idiomatic expressions. Like prototypical free-word combinations, semantic collocations are composite units consisting of two or more words in which each of the words within the combination can be replaced by another word without the meaning of that composite unit being seriously modified by that replacement, but, unlike free-word combinations, a replacement word can only be used within the combination if it is a closely related synonym of the word it is replacing. At the same time, semantic collocations, like prototypical idiomatic expressions, are characterized by limited compositionality, in that the meaning of the expression under analysis cannot be predicted merely by the sum of its parts, but instead must be understood beyond the sum of those parts semantically. However, unlike prototypical idiomatic expressions, the meanings of the individual constituent words of a semantic collocation usually provide the user with a context from which to glean the meaning beyond the words, and in this way, the constituent words provide "hints" to the meaning of the composite expression. Beyond these somewhat "fuzzy" characteristics, semantic collocations have also been shown in the corpus-based research literature to exhibit a number of characteristics that can be quantifiably analyzed.
Researchers who have conducted quantitative studies of collocation in natural language discourse have noted that semantic collocations occur in corpora as "specific combinations of words which co-occur more often than the frequencies of the constituents of the combination would lead us to expect" (Kjellmer 1982: 25) but less often than their constituent words occur on their own. Researchers have also noted that, semantically, these collocations tend to function like idioms in that they are associated with cognitive phenomena reflected in language, phenomena that can be statistically analyzed and noted in corpora through the occurrence of a large number of pre-constructed phrases constituting specific word choices that cannot be further analyzed although they appear to be constructed of analyzable segments (Sinclair 1991). These researchers have also shown that semantic collocations occur more frequently in discourse containing specialized terminological phrases and technical terms, especially in content domain-specific discourse (McKeown and Radev 2000).

Because semantic collocations can occur frequently in documents, as shown by Kjellmer (1982), but occur less frequently than the words that compose them, the occurrence of semantic collocations in discourse appears to be a "marked" phenomenon when compared with the occurrence of these more generally occurring constituent words. They appear to be a "marked" phenomenon because collocations tend to contain less psychologically salient, more specialized, more knowledge domain-specific information, as well as a meaning embedded "beyond" the word combination represented by the collocation when used by speakers, and thus, based on the specific discourse context in which they occur, this "extended meaning" would not be universally known to all speakers of a specific language, whether they be native or non-native speakers of that language.
This would mark their occurrence in discourse in contrast to the occurrence of "unmarked" words and phrases, such as a speaker's use of the constituent words contained within a collocation individually, which would occur without this "extended meaning" in a more generalized discourse context, and therefore, would be more universally understood by all speakers of a language.

Although all collocations are essentially "marked" cases in the universal mental lexicon of a language when compared to other more generally occurring lexical items, within this "marked" class of semantic collocations, some collocations are more "marked" than others, while others are less "marked," since the occurrence of semantic collocations in discourse, just as the occurrence of any other lexical item in a discourse, can be said to be a frequency-based phenomenon. This implies that, for speakers possessing knowledge of them in their mental lexicon, semantic collocations can be placed on a continuum of "markedness," in which more frequently occurring collocations will be less "marked," while less frequently occurring collocations will be more "marked." At the "high" end of this continuum would be placed those "marked" collocations which are most frequently occurring, most salient, and thus least prototypically "marked," and on the "low" end of this spectrum would be placed those "marked" collocations which are least frequently occurring, least salient, and thus most prototypically "marked."

As this paper is being written, the continuum-like nature of "markedness" has yet to be proven conclusively in the literature on semantic collocations, although it has been proven to hold true in other areas of linguistic inquiry. For example, using data first analyzed in Berlin and Kay's (1969) cross-linguistic study of the color naming conventions used by speakers of 98 different languages throughout the world, which was later reformulated in Kay (1975), Witkowski and Brown (1977) were able to show that "marked" and "unmarked" classes of color terms can exist as a series of "marked/unmarked" flexible contrast-set relations in a gradable, spectrum-like relationship cross-linguistically based on their frequency of occurrence. Likewise, Brown's (1984) linguistic analysis of naming conventions for the cross-cultural encoding sequence of both folk botanical and folk zoological life forms across over 140 languages reports similar findings about "markedness" relations within the domain of folk naming and classification. However, as I will demonstrate throughout the rest of this paper, based on the evidence of my quantitative and qualitative analysis of the occurrence of high-frequency collocations in the NCREL Search String Corpus, semantic collocations can exist on a continuum of "markedness" for speakers that possess knowledge of them in their lexicon, and those collocations that are the most frequent are in fact the most salient. I will also demonstrate that, due to the scalar nature of the "markedness" of semantic collocations, high-frequency semantic collocations can prove to be an index of concept saliency that reflects the knowledge available to speakers who possess these collocations in their mental lexicon.

3. The Data

The collocation data analyzed in this study were culled from actual user searches entered into the main search engine of the North Central Regional Educational Laboratory's NCREL Web Site and have been compiled as the NCREL Search String Corpus. The NCREL Search String Corpus contains search strings representing searches conducted globally throughout the entire NCREL Web Site by K-12 educators in a seven-state region in the Midwestern United States and consists of regular expression searches varying in length from a single word to multiple words entered as search phrases by NCREL Web Site users over a year-long period between late January 1, 2001 and December 31, 2001 (excluding three weeks of data from approximately August 13, 2001 to September 1, 2001, which were lost due to functional problems with NCREL's search string retrieval program during that time period). In sum, the NCREL Search String Corpus consists of 170,139 total words divided into 49 search result .txt files that were then grouped together in the Web-based software program TEXTANT for analysis.

4. Quantitative Analysis

Statistical analysis of fixed- and variable-phrase collocations in the NCREL Search String Corpus was conducted using a word/token frequency analysis performed in TEXTANT. TEXTANT is a Web-based statistical analysis software tool that allows users to perform statistical word/token frequency analyses of individual words within a data set, multi-level statistical word/token frequency analyses of fixed- and variable-phrase collocations within the data set, and multi-level analyses of text using stop lists and t-scores. For this analysis, the stop list feature of TEXTANT was utilized so that functional and non-relevant bigrams and lexical word class items such as personal pronouns, modal and auxiliary verbs, coordinating and subordinating conjunctions, prepositions, and determiners could be eliminated from the observed results. In addition, t-scores were calculated during this analysis so that the items could be rank ordered on an ordinal scale.

T-scores proved useful in this analysis because they look at the mean and variance of lexical items occurring in the Web search data and provide ranking information for the collocation word tokens (noted as bigrams in table 1, below) based on their probable frequency of occurrence. In essence, t-scores are useful in this context because they allow the collocation word tokens to be ranked more clearly as individually occurring lexical items than simple frequency distribution analysis allows. This is particularly useful in the situation noted in table 1, below:

bigram                    Freq   t-score
inner language            25     4.971
early childhood           25     4.968
multiple intelligences    25     4.914

Table 1: Example of T-Score Ranking

In this situation, without the use of t-scores, there would be no way to reliably determine the actual rank order of three items that each occur individually 25 times in the data. But, with the use of the t-scores, a reliable statistical determination of their probable occurrence can be made. For the NCREL Search String Corpus, the t-score rankings of the individual items ranged from -95.27 to 26.39. On the scale provided by the t-scores, those items that ranked above 0.00 are considered to be positive collocations, with the probability that these items occur simply by chance being relatively low, while those items ranked below 0.00 are considered to be negative collocations, with the probability that these items occur simply by chance being relatively high. To conserve space, for the purposes of this paper, although rankings of all collocations in the data set were made available from the analysis conducted in TEXTANT, only the 25 most frequently occurring two-word fixed- and variable-phrase collocations (also referred to as bigram collocations since, put simply, these collocations consist of two words) based on t-score rankings will be displayed and discussed. Only bigram collocations are studied in this analysis because these are the terms that most clearly indicate salient concepts as entered by users.

To capture the diversity of the collocation data contained within the NCREL Search String Corpus, both fixed-phrase and variable-phrase collocation analyses of the data were performed. Performing both types of analysis allowed for the discovery of collocations that occur not only as tightly bounded lexical units of text (i.e., collocations revealed by fixed-phrase analysis), but also collocations that occur as text units related to each other by unbounded lexical co-occurrence within a larger discourse frame, in this case, the discourse frame of a lexical item and four additional words on either side of that item (i.e., collocations revealed as variable-phrase collocations).
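The difference between the two kinds of analysis described above can be illustrated with a minimal Python sketch. It is not TEXTANT's implementation, which is not documented in this paper; the stop list, the sample search strings, and all function names are hypothetical. The sketch assumes that stop words are removed before pairs are formed and that a variable-phrase pair is an ordered pair of words at most four positions apart.

```python
from collections import Counter

# Hypothetical stop list; the list actually used in TEXTANT is not given here.
STOP_WORDS = {"the", "of", "and", "in", "for", "a", "to"}

def fixed_bigrams(tokens):
    """Adjacent word pairs (fixed-phrase analysis)."""
    return list(zip(tokens, tokens[1:]))

def window_bigrams(tokens, span=4):
    """Ordered pairs whose second word lies within `span` words of the first
    (variable-phrase analysis)."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + span]:
            pairs.append((left, right))
    return pairs

def count_collocations(queries, extractor, min_freq=2):
    """Count pairs over all search strings, keeping those seen at least min_freq times."""
    counts = Counter()
    for query in queries:
        tokens = [w for w in query.lower().split() if w not in STOP_WORDS]
        counts.update(extractor(tokens))
    return {pair: n for pair, n in counts.items() if n >= min_freq}

# Hypothetical search strings, for illustration only.
queries = [
    "engaged learning",
    "indicators of engaged learning",
    "engaged student learning",
    "engaged student learning",
]
fixed = count_collocations(queries, fixed_bigrams)
window = count_collocations(queries, window_bigrams)
print(fixed[("engaged", "learning")], window[("engaged", "learning")])
```

In this toy data the pair "engaged learning" is counted more often by the window-based analysis than by the adjacent-pair analysis, because the window also captures search strings in which another word intervenes.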
In addition, the use of both types of collocation analysis allows a comparative data set to be created, which, as I will show, is important in determining concept saliency because it allows collocational frequency rankings to be verified in greater detail across the data set, and aids in the detection of important conceptual trends within the larger data set that might not otherwise be revealed with only one type of collocational analysis.

As table 2, below, illustrates, the 25 bigram fixed-phrase collocations most highly ranked within the data set analyzed for this pilot study are the following:

Bigram                     Freq   t
engaged learning           715    25.518
professional development   551    23.058
lesson plans               406    19.933
early childhood            395    19.732
school improvement         379    18.474
critical issue             344    18.387
graphic organizers         327    17.987
collaborative classroom    287    16.552
multicultural education    246    14.907
high school                240    14.469
parent involvement         200    14.023
childhood education        205    13.476
middle school              202    13.444
multiple intelligences     180    13.362
beyond bell                178    13.305
student achievement        183    13.259
research say               179    13.213
staff development          178    12.959
picture machine            157    12.473
school reform              180    12.284
what research              161    12.218
critical issues            153    12.108
cooperative learning       159    12.007
amazing picture            145    11.988
special education          156    11.987

Total number of word tokens: 147,508
Total number of collocations with length of 2: 85,647

Table 2: "Top 25" Fixed-Phrase Collocations (length of 2, minimum frequency of 2, with stop list applied)

The data contained in table 2 were obtained using the fixed-phrase search capability of TEXTANT. The collocation length for this search was set at two words per collocation, and the minimum frequency of occurrence was set at two instances for a collocation to be counted in the analysis. For this search, a total of 85,647 bigram fixed-phrase collocations were found, composed of a total of 147,508 word tokens.
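The paper does not state the formula TEXTANT uses for the t column. As a sketch only, the following Python function implements the t-score as it is commonly formulated for collocations (see, e.g., Manning and Schütze 1999), under the usual approximation that the sample variance equals the observed relative frequency of the bigram. The unigram counts in the example are hypothetical; only the pair frequency of 25 and the token total of 147,508 are taken from tables 1 and 2.

```python
import math

def t_score(pair_count, first_count, second_count, total_tokens):
    """Common collocation t-score: (observed - expected) / sqrt(observed).

    expected is the pair count predicted if the two words occurred
    independently of one another.
    """
    expected = first_count * second_count / total_tokens
    return (pair_count - expected) / math.sqrt(pair_count)

TOTAL_TOKENS = 147_508  # word tokens reported for the fixed-phrase search

# Two bigrams that each occur 25 times, with hypothetical constituent counts.
rare_parts = t_score(25, first_count=30, second_count=40, total_tokens=TOTAL_TOKENS)
common_parts = t_score(25, first_count=300, second_count=400, total_tokens=TOTAL_TOKENS)
print(rare_parts > common_parts)
```

Under this formulation, two bigrams with the same frequency receive different scores whenever their constituent words differ in frequency, which is the kind of tie-breaking illustrated in table 1: the bigram built from rarer words is less likely to have arisen by chance and so ranks higher.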

As the rankings in table 2 show, the most highly ranked bigram fixed-phrase collocation in the NCREL Search String Corpus was the term "engaged learning," occurring a total of 715 times throughout the course of the entire year-long data set. Based on the mean and variance of its occurrence within the set, it received a t-score of 25.52. As the table also shows, the twenty-fifth-ranked bigram collocation is the term "special education," which occurs a total of 156 times throughout the course of the data set. Based on the mean and variance of its occurrence within the set, it received a t-score of 11.99.

As table 3, below, illustrates, the 25 bigram variable-phrase collocations most highly ranked within the data set analyzed for this pilot study are the following:

Bigram                     Freq   t
engaged learning           733    26.392
professional development   572    23.686
lesson plans               406    20.027
early childhood            395    19.794
school improvement         390    19.194
critical issue             342    18.402
graphic organizers         327    18.029
collaborative classroom    294    16.929
multicultural education    254    15.505
high school                243    15.013
parent involvement         215    14.598
based learning             226    14.240
middle school              210    14.065
childhood education        209    13.985
early education            211    13.925
student achievement        184    13.414
multiple intelligences     180    13.386
beyond bell                179    13.358
research say               180    13.323
staff development          180    13.201
school reform              187    13.046
what say                   166    12.831
technology plan            184    12.789
what research              165    12.582
picture machine            157    12.497

Total number of word tokens: 147,511
Total number of bigram collocations in window of 4: 151,406

Table 3: "Top 25" Variable-Phrase Collocations (length of 4 either side, minimum frequency of 2, with stop list applied)

The data contained in table 3 were obtained using the variable-phrase search capability of TEXTANT. The collocation length for this search was set at four words per side per collocation, and the minimum frequency of occurrence was set at two instances for a collocation to be counted in the analysis. For this search, a total of 151,406 bigram variable-phrase collocations were found, composed of a total of 147,511 word tokens. The rankings in table 3 show that the most highly ranked bigram variable-phrase collocation in the NCREL Search String Corpus was the term "engaged learning," occurring a total of 733 times throughout the course of the entire year-long data set. Based on the mean and variance of its occurrence within the set, it received a t-score of 26.39. As the table also shows, the twenty-fifth-ranked bigram collocation in the corpus is the term "picture machine," which occurs a total of 157 times throughout the course of the set. Based on the mean and variance of its occurrence within the data set, it received a t-score of 12.50.

As the statistical analysis reveals, the data contained in tables 2 and 3 overlap heavily, with all but 10 of the 29 bigram collocation word tokens occurring in both analyses, albeit at different levels of ranking for 6 of the tokens and similar levels of ranking for 13 of the tokens. This overlap between data sets indicates that, for the 11 most highly ranking bigram collocations (either fixed-phrase or variable-phrase), as well as the 13th and 16th most highly ranking ("middle school" and "student achievement" respectively), the order of these collocations correlates exactly in terms of ranking, regardless of the Web search discourse context in which these items occur.
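The exact rank correspondence just described can be checked against the two rank orders given in tables 2 and 3. The short Python sketch below is an illustration only, not part of the original TEXTANT analysis; the variable and function names are hypothetical. It lists the bigrams that occupy the same rank in both tables.

```python
# Bigrams in rank order, transcribed from tables 2 and 3.
TABLE_2 = [
    "engaged learning", "professional development", "lesson plans",
    "early childhood", "school improvement", "critical issue",
    "graphic organizers", "collaborative classroom", "multicultural education",
    "high school", "parent involvement", "childhood education", "middle school",
    "multiple intelligences", "beyond bell", "student achievement",
    "research say", "staff development", "picture machine", "school reform",
    "what research", "critical issues", "cooperative learning",
    "amazing picture", "special education",
]
TABLE_3 = [
    "engaged learning", "professional development", "lesson plans",
    "early childhood", "school improvement", "critical issue",
    "graphic organizers", "collaborative classroom", "multicultural education",
    "high school", "parent involvement", "based learning", "middle school",
    "childhood education", "early education", "student achievement",
    "multiple intelligences", "beyond bell", "research say",
    "staff development", "school reform", "what say", "technology plan",
    "what research", "picture machine",
]

def same_rank(first, second):
    """Return (rank, bigram) for every bigram holding the same rank in both lists."""
    return [
        (rank, a)
        for rank, (a, b) in enumerate(zip(first, second), start=1)
        if a == b
    ]

agreeing = same_rank(TABLE_2, TABLE_3)
for rank, bigram in agreeing:
    print(rank, bigram)
```

The bigrams listed are those at ranks 1 through 11, followed by "middle school" at rank 13 and "student achievement" at rank 16.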
Likewise, the co-occurrence of the 6 overlapping bigram collocations placed lower than 11th in both lists ("multiple intelligences," "childhood education," "research say," "staff development," "school reform," and "what research") indicates that the rankings of these items also correlate, regardless of the Web search phrase discourse context in which the items occur, although the rankings correlate less perfectly than those of the 11 most highly ranking bigram collocation items do.

5. Qualitative Analysis

As my qualitative analysis will now demonstrate, the higher frequency of occurrence of these 19 bigram collocations, as well as the correspondence of their occurrence in both fixed- and variable-phrase contexts, is significant for indicating that these items are highly salient concepts for the NCREL users observed in this study. I will also demonstrate that, due to the scalar nature of the "markedness" of semantic collocations, high-frequency collocations can prove to be an index of concept saliency that reflects the knowledge available to speakers who possess these collocations in their mental lexicon.

To begin this discussion, I will now discuss the extremely high ranking correspondence among 19 out of 29 of the most frequently occurring fixed- and variable-phrase collocations that was noted in section 4. The fact that these collocations occur as frequently as they do within the corpus data is not surprising, as the concepts and issues these collocation tokens represent are universal concerns that would be shared by all educators within the class of speakers represented by the sample population observed in this study (namely, K-12 educators in a seven-state region of the Midwestern United States), regardless of the educational level at which any individual member of that population is involved within the educational system. Because these concerns are universal to all members of the sample population, we would expect them to be present in the mental lexicon of these speakers, and thus, we would also expect them to understand the extended semantic meanings of these semantic collocations. Because all members of the sample population would understand the extended semantic meanings of the collocations, it is also not
surprising to find that these items are most salient since they are, in fact, most widely understood by members from across the sample population.

However, although the universal nature of many of the concepts encoded by the 19 most highly correlating semantic collocations in this study explains their high salience, this data set also contains concepts that occur at different frequencies and rankings, and thus shows that certain concepts are more salient universally for these speakers than others. What this implies is the notion that I first mentioned in section 2 of this paper: that the semantic collocations in this data set exist on a continuum of "markedness," in which some of the collocations are more "marked" and thus less psychologically salient for users within the overall sample population, while other concepts are less "marked," and thus operate within the discourse more like prototypically "unmarked" semantic collocations than their more "marked" counterparts.

As I will now discuss, these data occur as a scalar set of "marked" relations because the collocations within the set represent lexicalizations of concepts present to different extents in the multiple mental lexicons that the members of various subgroups of the sample population access when they use these collocations in discourse contexts. In other words, although the data observed in this set reflect the overall mental lexicon of speakers within the particular sample population under scrutiny in this study (K-12 educators from a seven-state region in the Midwestern United States), within this sample population exist several more subgroups, each of which will have access to the concepts reflected in the semantic collocations occurring within the Corpus to differing degrees, due to their membership in these specific subgroups. Among the subgroups composed of these speakers would be the following: elementary school teachers; middle school teachers; high school teachers; district-level administrators, such as superintendents; school-level administrators, such as deans and principals; school, district, and state-level policymakers; educational researchers; and even possibly college-level professors and administrators.

Since the members of each of these subgroups participate in the more general educational system in different ways, they would also recognize certain concepts as being more salient for themselves than other concepts, as their knowledge concerns would be dependent to some extent on their membership within these subgroups. For example, high school teachers might be aware of middle school-specific educational issues, but if these issues are not pertinent to their own day-to-day work as high school educators, they either might not be interested in these issues at all, or they might only maintain a cursory interest in the topics, perhaps because they might find bits and pieces of an issue to be relevant to their work, but not the whole issue as a salient concept. But at the same time, since they are also aware of more general issues that impact the day-to-day lives of educators at all levels of the educational system (issues such as professional development or the implementation of engaged learning principles into the curriculum), they will likely be more interested in these types of issues, and thus will recognize these concepts as being more salient to them in their professional lives. In this way, their membership in a particular subclass of educator affects their access to a mental lexicon of the collocations used to represent their knowledge of educational concepts, in that this lexicon will contain the more universal, highly salient concepts available in the mental lexicon of all members of the class of educators, but will also include more specific, less universally salient concepts that would only be available in the mental lexicons of their particular subgroup as highly salient concepts.
This explains why they would possess highly salient lexical knowledge of more universal educational concepts while also possessing highly salient lexical knowledge of certain, more specific concepts within the larger class of concepts represented by the collocations found in the corpus data.
--------------------------------------------------------------------------------
Page 16
Durian, 2002b, "Speak My Language"16As each group will conceptualize the information made available to them differently because of their participation in different subgroup memberships, this will have an overall impact on the saliency of the concepts represented in semantic collocations contained in the corpora data both within groups and across groups, which means, ultimately, that the frequencies of the semantic collocation data will be affected as a result of these group membership differences. In theory, this means that more universal concepts known to all members of the larger sample population will occur more frequently, while more specific concepts that are more salient within the subgroups but less salient within the total sample population group will occur less frequently overall, since this information will not be as "shared" as the more universal information. As the NCREL Search String Corpus data show, the semantic collocations do occur at different levels of frequency, with more universal educational concepts occurring more frequently, for the most part, and more specific educational concepts which are more salient only in the mental lexicons of certain subgroups of this population occurring somewhat less frequently. For example, in both tables 2 and 3, we see that 8 of the top 11 most frequently occurring corresponding semantic collocations that lexicalize highly salient information for our sample population―"engaged learning," "professional development," "lesson plans," "school improvement," "graphic organizers," "collaborative classroom," "multicultural education," and "parental involvement"―are all universal concepts that would be "shared knowledge" understood by the entire class of speakers and thus, occur at the very top of the distribution. 
But as we move down the distribution tables, we see semantic collocations lexicalizing more specific concepts: 2 of the 13 most frequently corresponding semantic collocations, "early childhood" and "high school," represent concepts that would be specifically salient only to certain subgroups (in this case, perhaps kindergarten teachers and high school teachers) within the sample population. This trend continues as we move down the distribution tables beyond the top 11 collocations, where we see that 4 more of the 19 highly corresponding collocations noted in section 5 ("multiple intelligences," "school reform," "student achievement," and "staff development") lexicalize highly salient universal concepts, while 2 of the 6 less perfectly corresponding collocations noted in section 5 ("childhood education" and "middle school") represent more specifically salient concepts that are not as universally available in the lexicon of the sample population. This trend shows that the collocation data found in the NCREL Search String Corpus have a continuum-like nature, in which more universal, and thus more prototypically "unmarked," collocations occur the most frequently, followed by more specific, more prototypically "marked" collocations. As the frequency of the semantic collocations decreases, the saliency of the concepts they represent also decreases: the conceptual information they encode becomes more specific and less universal, and the lexical information they represent becomes increasingly "marked." Because this continuum is sensitive to saliency, it can itself serve as a lexical representation model for indexing concept saliency within the mental lexicon of the speakers who possess these collocations.
6. Conclusion
As I have shown throughout this paper, based on my quantitative and qualitative analysis of the occurrence of high-frequency collocations in the NCREL Search String Corpus, semantic collocations appear to exist on a continuum of "markedness." Within this continuum, more frequently occurring, highly salient concepts occur at one end and lexicalize information that is more universal for speakers, and thus more prototypically "unmarked," while less salient, less frequently occurring concepts occur at the other end and lexicalize information that is more specific, and thus more prototypically "marked." As I have also shown through my qualitative analysis, due to the scalar nature of the "markedness" of semantic collocations, their occurrence can serve as an index of concept saliency that reflects the knowledge available to speakers who possess these collocations in their mental lexicon.
Ultimately, the data in this study also seem to show that semantic collocations, as an index of concept saliency, have implications beyond the realm of this study. One implication is that semantic collocations could be used in the design of category structures built to present information to speakers, by utilizing the lexical representation model of concepts that the continuum-like nature of the "markedness" of collocations makes available. Although this finding raises new data issues that would need to be explored in more detail beyond the scope of this study, the data presented here suggest that frequency-based studies of the occurrence of collocations could be used for such a purpose. At present, I am exploring this issue using the data set presented here, and I plan to present the findings at a later date (see endnote 3).
Endnotes
1. The work presented in this paper was funded in part by the U.S. Department of Education under a grant from the Office of Educational Research and Improvement (OERI). Specifically, the funding here applies to sections 3 and 4, which were completed under the auspices of the North Central Regional Educational Laboratory (NCREL).
2. The full results of this analysis can be viewed in real time online via the TEXTANT NCREL Web site at http://131.156.77.64/bowie/textantncrelexperimental.cgi. Researchers who wish to obtain files of the original data for further analysis can do so by e-mailing the author at ddurian@ncrel.org.
3. These findings will be presented in Durian (forthcoming).
References
Benson, Morton. 1989. "The Structure of the Collocational Dictionary." International Journal of Lexicography 2 (1): 1-14.
Berlin, Brent, and Paul Kay. 1969. Basic Color Terms: Their Universality and Evolution. Berkeley, CA: University of California Press.
Brown, Cecil H. 1984. Language and Living Things: Uniformities in Folk Classification and Naming. New Brunswick, NJ: Rutgers University Press.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. "Word-Sense Disambiguation Using Statistical Methods." In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 264-270). Berkeley, CA: University of California Press.
Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. "Using Statistics in Lexical Analysis." In Uri Zernik. (Ed.). Lexical Acquisition: Exploring Online Resources to Build a Lexicon (pp. 116-164). Hillsdale, NJ: Lawrence Erlbaum.
Cowie, A. P. 1981. "The Treatment of Collocations and Idioms in Learner's Dictionaries." Applied Linguistics 2 (3): 223-235.
Dagan, Ido, and Kenneth Church. 1994. "TERMIGHT: Identifying and Translating Technical Terminology." In Proceedings from the 4th Annual Conference on Applied Natural Language Processing (pp. 34-40). Stuttgart, Germany.
Dagan, Ido, and Alon Itai. 1994. "Word Sense Disambiguation Using a Second Language Monolingual Corpus." Computational Linguistics 20 (4): 563-596.
Durian, David. "Key Word Indices: Implications for the Analysis of Collocation Frequency in Database and Web Site Design." Forthcoming, Fall 2002.
Firth, John Rupert. 1957. "A Synopsis of Linguistic Theory, 1930-1955." In Studies in Linguistic Analysis (pp. 1-32). Oxford: Philological Society. Reprinted in F.R. Palmer. (Ed.). 1968. Selected Papers of J.R. Firth, 1952-1959. London: Longman.
Halliday, M.A.K. 1966. "Lexis as a Linguistic Level." In C.E. Bazell, J.C. Catford, M.A.K. Halliday, and R.H. Robins. (Eds.). 1966. In Memory of J.R. Firth (pp. 148-162). London: Longman.
Halliday, M.A.K., and Ruqaiya Hasan. 1976. Cohesion in English. London: Longman.
Kay, Paul. 1975. "Synchronic Variability and Diachronic Change in Basic Color Terms." Language in Society 4 (3): 257-270.
Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. New York: Addison-Wesley Longman Limited.
Kjellmer, Göran. 1982. "Some Problems Relating to the Study of Collocations in the Brown Corpus." In Stig Johansson. (Ed.). 1982. Computer Corpora in English Language Research. Bergen: Norwegian Computing Center for the Humanities.
Kupiec, Julian. 1993. "An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora." In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (pp. 17-22). Columbus, OH: Association for Computational Linguistics.
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
McKeown, Kathleen, and Dragomir Radev. 2000. "Collocations." In Robert Dale, Herman Moisl, and Harold Somers. (Eds.). 2000. Handbook of Natural Language Processing (pp. 507-521). New York: Marcel Dekker, Inc.
Mel'cuk, Igor A., and Nikoli V. Pertsov. 1987. Surface-Structure of English: A Formal Model in the Meaning-Text Theory. Philadelphia, PA: Benjamins.
Radev, Dragomir, and Kathleen McKeown. 1997. "Building a Generation Knowledge Source Using Internet-Accessible Newswire." In Proceedings of the 5th Conference of Applied Natural Language Processing (pp. 221-228). New York: Columbia University.
Renouf, Antoinette. 1992. "What Do You Think of that? A Pilot Study of the Phraseology of the Core Words of English." In Gerhard Leitner. (Ed.). 1992. New Directions in English Language Corpora: Methodology, Results, Software Developments. Berlin: Mouton de Gruyter.
Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, John. (Ed.). 1995. Collins COBUILD English Dictionary. London: Harper Collins.
Smadja, Frank, and Kathleen McKeown. 1991. "Using Collocations for Language Generation." Computational Intelligence 7 (4): 222-248.
Smadja, Frank, Kathleen McKeown, and Vasileios Hatzivassiloglou. 1996. "Translating Collocations for Bilingual Lexicons: A Statistical Approach." Computational Linguistics 22 (1): 1-38.
Witkowski, Stanley R., and Cecil H. Brown. 1977. "An Explanation of Color Nomenclature Universals." American Anthropologist 79 (1): 50-57.
 
Re: Collocation Frequency

Here is a downloadable link:

Durian, D. (2002a). Speak my language: Collocation frequency as an index of concept saliency. Unpublished paper. A PDF version is available at http://www.ling.ohio-state.edu/~ddurian/Collo.pdf. If you still cannot download it, the full paper is attached as a PDF at the link below.

This paper investigates the idea that collocations can exist in natural language corpus data as a series of items on a continuum of "marked" relations, with some items being more "marked," and other items being less "marked," than other items in the set. It also looks at how this continuum functions as an index of concept saliency for items contained in the data set. To investigate these ideas, David uses data gathered from Web site users collected for the design of the NETRO Web site. This is the second of four papers that David has written involving the NETRO Web site data.

http://forum.corpus4u.org/upload/forum/2005110113032689.pdf
 