清风出袖
Senior Member
The LIVAC (Linguistic Variations in Chinese Speech Communities) synchronous corpus, pioneered by the Language Information Sciences Research Centre at City University of Hong Kong, contains texts from representative Chinese newspapers and electronic media in Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. Materials from these diverse communities are collected in a synchronized fashion, which offers an innovative "Window" approach for a wide variety of comparative studies and useful IT applications.
Analyzed by various linguistic units (e.g. characters, words, sentences), the LIVAC corpus serves many purposes. In particular, it provides an important database and means for in-depth investigation of lexical development, including the evolution of new concepts and their expressions, in contemporary Chinese.
All corpus texts have undergone automatic segmentation, and the results have been manually verified. A lexical database is derived from the segmented texts. Apart from ordinary words, words expressing new concepts or undergoing sense shifts, as well as region-specific words from the six communities, are singled out. The database is thus a rich resource for research into linguistics, sociolinguistics, and Chinese language and society. Up-to-date quantitative data on the Chinese language are also particularly useful for applications in Information Technology, including the development of search engines and machine translation systems in language engineering.
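To make the idea of a lexical database derived from segmented texts a bit more concrete, here is a minimal Python sketch. It is not LIVAC's actual pipeline: the sample sentences, the function names `build_lexicon` and `region_specific`, and the rule "attested in only one community" are all assumptions made for illustration, and the toy texts are written in traditional characters throughout so that script differences do not count as lexical differences.

```python
from collections import Counter

# Hypothetical, already-segmented sample texts keyed by community.
# Real corpus texts would come from verified word segmentation.
segmented = {
    "Hong Kong": "的士 司機 在 機場 等候 乘客".split(),
    "Beijing": "出租車 司機 在 機場 等候 乘客".split(),
    "Taiwan": "計程車 司機 在 機場 等候 乘客".split(),
}


def build_lexicon(texts):
    """Map each word type to a Counter of its frequency per community."""
    lexicon = {}
    for community, tokens in texts.items():
        for word in tokens:
            lexicon.setdefault(word, Counter())[community] += 1
    return lexicon


def region_specific(lexicon):
    """Return words attested in exactly one community in this sample."""
    return {
        word: next(iter(counts))
        for word, counts in lexicon.items()
        if len(counts) == 1
    }


lexicon = build_lexicon(segmented)
unique = region_specific(lexicon)

print(len(lexicon), "word types")
for word, community in unique.items():
    print(word, "appears only in", community)
```

In this toy sample the three words for "taxi" come out as the only community-specific entries, while shared words such as 司機 are counted once per community. A real analysis would need far more data and a better criterion than simple presence or absence.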
Fresh textual materials for the corpus have been collected every four days since July 1995, with a 10-year time span planned for the collection to capture the salient pre- and post-millennium evolution of the cultural and social fabric of the diverse Chinese speech communities. As of January 2005, this unique and growing corpus contains over 150 million Chinese characters and over 720,000 word types, and is still expanding.
The Centre has launched a bi-weekly Celebrity Roster listing the top 25 celebrities in Beijing, Shanghai, Hong Kong and Taiwan according to their media exposure in Chinese newspapers, and similar indices for place names and common words. Comments and feedback are welcome.
Free but limited access to LIVAC is available at
http://livac.org/.