


You did it the right way. The problem was that the root form impair did not appear in your corpus (it must be very small?), so WST could not join the other forms to the root.
If your corpus is not large, you can check the list (mark and join) by hand, or use the BNC list I posted.
The general problem behind this post is how your lemmatization is done. If it is based on an existing corpus, then cases like 'impairs' ending up as the root form will occur. If, as suggested by XIAOZ, one starts with a word list (preferably a well-known one), this problem can be avoided. On the other hand, you may end up with a lot of zero-frequency entries when some of the words do not show up in your corpus.
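
To make the trade-off concrete, here is a minimal Python sketch of list-based lemmatization. It is not how WordSmith works internally; the function name join_lemmas and the toy frequencies and lemma list are made up for illustration.

```python
from collections import Counter


def join_lemmas(freqs, lemma_list):
    """Sum the frequencies of all forms listed under each headword.

    freqs      -- word form -> frequency in the corpus
    lemma_list -- headword -> list of its inflected forms (hypothetical format)
    """
    joined = {}
    for headword, forms in lemma_list.items():
        # Count the headword itself plus every listed form; forms that are
        # missing from the corpus simply contribute 0.
        joined[headword] = sum(freqs.get(form, 0) for form in {headword, *forms})
    return joined


# Toy data: the root form "impair" never occurs in this small corpus.
freqs = Counter({"impairs": 3, "impaired": 5, "walk": 2})
lemma_list = {
    "impair": ["impairs", "impaired", "impairing"],
    "run": ["runs", "ran", "running"],
}

joined = join_lemmas(freqs, lemma_list)
print(joined)
```

With a list as the starting point, "impairs" and "impaired" are still joined under "impair" even though the root itself is absent from the corpus, while "run" shows up with a frequency of zero because none of its forms occur.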

Quoting the post by 动态语法 of 2005-7-4 6:11:55:
The general problem behind this post is how your lemmatization is done. If it is based on an existing corpus, then cases like 'impairs' ending up as the root form will occur. If, as suggested by XIAOZ, one starts with a word list (preferably a well-known one), this problem can be avoided. On the other hand, you may end up with a lot of zero-frequency entries when some of the words do not show up in your corpus.

Indeed so, especially if the corpus is small. I have posted a fully lemmatized wordlist based on the BNC (World Edition) for those who need it (in the Native Corpora section).
Loading a corpus into WordSmith: some tips

When all the files of a corpus are stored in a few folders, loading the corpus is quite straightforward. For some large corpora, however, the files are spread over many directories. In the BNC (World Edition), for example, the 4054 files are stored in as many as 175 directories/subdirectories, so loading the corpus by hand is tedious and time-consuming.

Here is an easy way: use "Get" in Choosing texts in WS3, or "Favorite" in Choosing texts in WS4. You first need to create a text file giving the full path of each file. Here are some examples.

The whole BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604263563.zip
The written component of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604270144.zip
The spoken component of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604273342.zip
The demographically sampled part of the BNC (World Edition): http://www.corpus4u.org/upload/forum/2005070604275460.zip

If you are using the BNC, you can change the drive name (E:\, F:\ etc) in these files to the drive on which your copy of the corpus is stored (use Replace All in Notepad; the match does not need to be case sensitive).
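
For a corpus other than the BNC, a list of this kind can be generated with a short script. The following is only a sketch in Python: it assumes the list is a plain text file with one full path per line, and the function names, the example folder and the output file name are all hypothetical. Check one of the zip files above for the exact layout your version of WordSmith expects.

```python
import os


def build_file_list(corpus_root):
    """Return the full path of every file under corpus_root, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(corpus_root):
        for name in filenames:
            paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def change_drive(paths, old_prefix, new_prefix):
    """Swap a leading drive prefix (e.g. "E:\\" -> "F:\\"), ignoring case."""
    changed = []
    for path in paths:
        if path.lower().startswith(old_prefix.lower()):
            path = new_prefix + path[len(old_prefix):]
        changed.append(path)
    return changed


def write_file_list(paths, list_file):
    """Write one path per line (assumed format, see the note above)."""
    with open(list_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


# Example use (the folder and file names here are made up):
# paths = build_file_list(r"E:\MyCorpus")
# write_file_list(paths, "mycorpus_files.txt")
```

change_drive does the same job as Replace All in Notepad, but only touches the start of each path, so a folder name further down the path is never altered by accident.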

Very useful tips. xiaoz, you are great.
Re: Some questions about using WordSmith, such as lemmatization


