Some questions about using WordSmith, e.g. lemmatization

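My problem is that many of the words in my frequency list are not in their base form: some are past-tense forms and some are plurals, e.g. boxes, whereas what I want is box. I have already run the lemmatization, but boxes occurs once and box does not occur at all, so the entry is still boxes. In this situation, how can I get every word in the list into its base form?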

Thanks, everyone :)
 
Please take a look

For example, impairs occurs once and impaired occurs 3 times; after lemmatization I get impairs with 4, rather than impair with 4.
 
Re: Which board should questions about using WordSmith be discussed on?

2005070309372117.jpg
 
You did it the right way. The problem is that the root form impair does not appear in your corpus (which must be very small?), so WST could not join the inflected forms to the root.
 
If your corpus is not large, you can check the list (mark and join) by hand, or use the BNC list I posted.
 
The general problem behind this post is how your lemmatization is done. If it is based on an existing corpus, then cases like 'impairs' as the root form will occur. If, as suggested by XIAOZ, one starts with a word list (preferably some well known lists), then this problem could be avoided. But you may, on the other hand, end up with a lot of zero occurrences when some of the words do not show up in your corpus.
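To make the list-based approach concrete, here is a minimal sketch in Python. It is not how WST does it internally; it only illustrates the idea of joining forms to a headword that comes from an external lemma list, so the headword does not have to occur in the corpus. The "headword -> form, form" line format and the function names parse_lemma_list and lemmatize_counts are assumptions made for this example.

```python
from collections import Counter


def parse_lemma_list(lines):
    """Map every inflected form to its headword.

    Assumed line format (for illustration only): "headword -> form1, form2".
    """
    form_to_head = {}
    for line in lines:
        line = line.strip()
        if not line or "->" not in line:
            continue  # skip blank or malformed lines
        head, forms = line.split("->", 1)
        head = head.strip().lower()
        form_to_head[head] = head  # the headword maps to itself
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                form_to_head[form] = head
    return form_to_head


def lemmatize_counts(word_counts, form_to_head):
    """Sum raw frequencies under the headword taken from the lemma list."""
    lemma_counts = Counter()
    for word, freq in word_counts.items():
        # Words missing from the lemma list are kept under their own form.
        head = form_to_head.get(word.lower(), word.lower())
        lemma_counts[head] += freq
    return lemma_counts


# Figures from the thread: "impair" and "box" never occur in the corpus.
lemma_lines = ["impair -> impairs, impaired, impairing", "box -> boxes"]
word_counts = {"impairs": 1, "impaired": 3, "boxes": 1}

result = lemmatize_counts(word_counts, parse_lemma_list(lemma_lines))
for lemma, freq in result.most_common():
    print(lemma, freq)
```

With the figures from this thread, impair ends up with 4 and box with 1, even though neither root occurs in the corpus. This sketch only reports headwords that have at least one form in the corpus; if you instead start from every headword in the list, you get the zero-occurrence entries mentioned above.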
 
Re: Which board should questions about using WordSmith be discussed on?

Quoting 动态语法, posted 2005-7-4 6:11:55:
The general problem behind this post is how your lemmatization is done. If it is based on an existing corpus, then cases like 'impairs' as the root form will occur. If, as suggested by XIAOZ, one starts with a word list (preferably some well known lists), then this problem could be avoided. But you may, on the other hand, end up with a lot of zero occurrences when some of the words do not show up in your corpus.

Indeed so, especially if a corpus is small. I have posted a fully lemmatized wordlist based on the BNC (World Edition) for those who need it (in the Native Corpora section).
 
Loading a corpus into WordSmith: some tips

When all of the files for a corpus are stored in a few folders, loading the corpus is quite straightforward. For some large corpora, however, the files are spread across many directories. In the BNC (World Edition), for example, the 4054 files are stored in as many as 175 directories/subdirectories, so loading the corpus is tedious and time-consuming.

Here is an easy way - using "Get" in Choosing texts in WS3 and "Favorite" in Choosing texts in WS4. But you need to create a text file indicating the full path for each file. Here are some examples.

The whole BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604263563.zip
The written component of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604270144.zip
The spoken component of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604273342.zip
The demographically sampled part of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604275460.zip

If you are using the BNC, you can change the drive name (E:\, F:\ etc) in these files to the drive on which your copy of the corpus is stored (using Notepad to Replace All, not case sensitive).
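
If your corpus is not the BNC, or you would rather not edit the lists by hand, both steps can be scripted. The following is a rough Python sketch, not part of WordSmith: build_file_list walks a corpus folder and writes the full path of every file, and change_drive does the same job as Notepad's Replace All on the drive prefix. The function names are made up for this example, and one path per line in a plain text file is an assumption about what "Get"/"Favorite" expects, so check it against one of the lists above.

```python
import os


def build_file_list(corpus_root, list_path, encoding="utf-8"):
    """Write the full path of every file under corpus_root, one per line.

    One path per line is an assumption about the expected list format.
    Plain ASCII paths are byte-identical in UTF-8 and ANSI.
    """
    paths = []
    for folder, _subfolders, filenames in os.walk(corpus_root):
        for name in filenames:
            paths.append(os.path.abspath(os.path.join(folder, name)))
    paths.sort()
    with open(list_path, "w", encoding=encoding) as out:
        out.writelines(path + "\n" for path in paths)
    return len(paths)


def change_drive(list_path, old_drive, new_drive, encoding="utf-8"):
    """Replace a leading drive prefix (e.g. "E:\\") on each line, ignoring case."""
    with open(list_path, encoding=encoding) as src:
        lines = src.read().splitlines()
    changed = 0
    updated = []
    for line in lines:
        if line.lower().startswith(old_drive.lower()):
            line = new_drive + line[len(old_drive):]
            changed += 1
        updated.append(line)
    with open(list_path, "w", encoding=encoding) as out:
        out.writelines(line + "\n" for line in updated)
    return changed
```

For example, build_file_list(r"F:\BNC", r"C:\bnc_files.txt") would list every file below F:\BNC (both paths are placeholders), and change_drive(r"C:\bnc_files.txt", "E:\\", "F:\\") would repoint a downloaded list from drive E to drive F. Save the list outside the corpus folder so that it does not end up listing itself.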
 
Re: Which board should questions about using WordSmith be discussed on?

Very useful tips. xiaoz, you are great.
 
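Re: Some questions about using WordSmith, e.g. lemmatization

How do we use WordSmith to search the BNC corpus? We downloaded and unzipped the WordSmith software, but we still cannot open the corpus. What could be going wrong?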
 
Re: Some questions about using WordSmith, e.g. lemmatization

Could someone explain the basic steps for searching the BNC online with WordSmith? Thanks!
 
Re: Some questions about using WordSmith, e.g. lemmatization

WordSmith cannot search the BNC online; it is normally used to search corpora stored on the local machine.
 
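How is the context function in AntConc's advanced concordance search used?

How is the context function in the advanced search of AntConc's concordance tool used? Could someone explain with an example? Many thanks.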
 