VMP: Discourse (Lexical) Analysis Tools

动态语法

Administrator
Staff member
Full name: Vocabulary-Management Profile (VMP)

1) This is primarily for English text analysis.

2) Online, free, and works pretty fast.

3) Various interesting statistical tools for word lists with frequency
information, type-token ratio, curve graphics, concordance (?),
information distribution tendencies (fractal dimension), etc.
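For readers unfamiliar with the first two statistics, here is a minimal sketch of a word frequency list and a type-token ratio. It is not VMP's own code: the tokenizer and the function names (`tokenize`, `word_frequencies`, `type_token_ratio`) are assumptions made for illustration, and VMP may count words differently.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters, allowing
    # internal apostrophes or hyphens (e.g. "out-of-context").
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def word_frequencies(text):
    # Word list with frequency information: token -> count.
    return Counter(tokenize(text))

def type_token_ratio(text):
    # Number of distinct words (types) divided by running words (tokens).
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "Markup adds value to a corpus. Markup is essential in corpus building."
freqs = word_frequencies(sample)
ttr = type_token_ratio(sample)
print(freqs.most_common(2), round(ttr, 3))
```

The sample has 12 tokens and 10 types, since "markup" and "corpus" each occur twice.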

Article by the author in Language:

http://forum.corpus4u.org/upload/forum/2005072606381241.pdf

URL: http://www.missouri.edu/~youmansc/vmp/index.shtml

[This post was edited by the author on 2005-07-26 at 09:58:51]
 

xiaoz

Permanent Super Administrator
Staff member
Re: VMP: Discourse (Vocabulary) Analysis Tools

I used the following paragraph as a test:

Corpus markup is important for at least three reasons. First, as noted in unit 2, the corpus data basically consists of samples of used language. This means that these examples of linguistic usage are taken out of the context in which they originally occurred and their contextual information is lost. Burnard (2002) compares such out-of-context examples to a laboratory specimen and argues that contextual information (i.e. metadata, or 'data about data') is needed to restore the context and to enable us to relate the specimen to its original habitat. In corpus building, therefore, it is important to recover as much contextual information as practically possible to alleviate or compensate for such a loss (see unit 10.8 for further discussion). Second, while it is possible to group texts and/or transcripts of similar quality together and name these files consistently (e.g. as happens with the LOB and Brown corpora, see unit 7.4), filenames can provide only a tiny amount of extra-textual information (e.g. text types for written data and sociolinguistic variables of speakers for spoken data) and no textual information (paragraph/sentence boundaries and speech turns) at all. Yet such data is of great interest to linguists and thus should be encoded, separately from the corpus data per se, in a corpus (see unit 3.3). Markup adds value to a corpus and allows for a broader range of research questions to be addressed as a result. Finally, pre-processing written texts, and particularly transcribing spoken data, also involves markup. For example in written data, when graphics/tables are removed from the original texts, placeholders must be inserted to indicate the locations and types of omissions; quotations in foreign languages should also be marked up. In spoken data, pausing and para-linguistic features such as laughter need to be marked up. 
Corpus markup is also needed to insert editorial comments, which are sometimes necessary in pre-processing written texts and transcribing spoken data. What is done in corpus markup has a clear parallel in existing linguistic transcription practices. Markup is essential in corpus building.

Statistics and graph:

http://forum.corpus4u.org/upload/forum/2005072609003563.pdf

The downloaded result:

http://forum.corpus4u.org/upload/forum/2005072609010995.pdf

The results are impressive, but I am not sure why some items in the wordlist do not occur in my input text.
 

动态语法

Administrator
Staff member
Re: VMP: Discourse (Vocabulary) Analysis Tools

It may have to do with the file format. I remember reading somewhere
that files need to be formatted with line breaks. It looks like
some of the extra 'words' (e.g. cor, pus) may be due to misplaced
line breaks.
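To illustrate the guess, here is a small sketch of how breaking a long line at a fixed width, regardless of word boundaries, would produce fragments like these. This is only a hypothesis about the cause: the fixed width and the helper names (`hard_break`, `spurious_tokens`) are made up for the example, and nothing here is confirmed VMP behavior.

```python
import re

def hard_break(text, width):
    # Cut the text every `width` characters, ignoring word boundaries.
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters; a newline ends a token.
    return re.findall(r"[a-z]+", text.lower())

def spurious_tokens(original, processed):
    # Tokens found in the processed text that never occur in the original.
    return sorted(set(tokenize(processed)) - set(tokenize(original)))

text = "Corpus markup is important"
broken = hard_break(text, 3)  # a width of 3 is arbitrary, chosen to split "Corpus"
extra = spurious_tokens(text, broken)
print(extra)
```

The same `spurious_tokens` check can be run on the real input and the downloaded word list to see exactly which items were not in the original paragraph.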
 

xiaoz

Permanent Super Administrator
Staff member
The paragraph is one line, without line breaks. Maybe the line was too long to process properly. But I am not sure.
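One way to test this would be to wrap the paragraph at word boundaries before submitting it, so that no line is very long and no word is split. The sketch below uses Python's standard `textwrap` module. The width of 70 is an arbitrary choice, and whether this actually removes the extra items from the VMP word list is untested.

```python
import textwrap

def wrap_at_word_boundaries(text, width=70):
    # Insert line breaks only between words; never split a word,
    # even a hyphenated or very long one.
    return textwrap.fill(
        text,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )

paragraph = (
    "Corpus markup is important for at least three reasons. "
    "First, as noted in unit 2, the corpus data basically consists "
    "of samples of used language."
)
wrapped = wrap_at_word_boundaries(paragraph)
print(wrapped)
```

The wrapped text contains exactly the same words in the same order as the original, so any remaining spurious items would point to a cause other than line length.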
 