[Help] Xaira word list

xudekuan

Moderator
Word list obtained with Xaira 1.6:
<entry freq="2"><lemma>and</lemma></entry>
<entry freq="1"><lemma>are</lemma></entry>
<entry freq="1"><lemma>bibliographic</lemma></entry>
<entry freq="1"><lemma>built</lemma></entry>
<entry freq="1"><lemma>by</lemma></entry>
<entry freq="1"><lemma>components</lemma></entry>
<entry freq="1"><lemma>corpus</lemma></entry>
<entry freq="1"><lemma>details</lemma></entry>
<entry freq="1"><lemma>for</lemma></entry>
<entry freq="1"><lemma>forms</lemma></entry>
<entry freq="1"><lemma>fullwidth</lemma></entry>
<entry freq="1"><lemma>general</lemma></entry>
<entry freq="1"><lemma>generated</lemma></entry>
<entry freq="2"><lemma>header</lemma></entry>
<entry freq="1"><lemma>headers</lemma></entry>
<entry freq="1"><lemma>ideographshalfwidth</lemma></entry>
<entry freq="1"><lemma>in</lemma></entry>
<entry freq="2"><lemma>indexer</lemma></entry>
<entry freq="1"><lemma>indextools</lemma></entry>
<entry freq="1"><lemma>individual</lemma></entry>
<entry freq="1"><lemma>list</lemma></entry>
<entry freq="1"><lemma>made07</lemma></entry>
<entry freq="1"><lemma>of</lemma></entry>
<entry freq="2"><lemma>pos</lemma></entry>
<entry freq="1"><lemma>postypetype</lemma></entry>
<entry freq="1"><lemma>provided</lemma></entry>
<entry freq="2"><lemma>punctuationcjk</lemma></entry>
<entry freq="1"><lemma>status</lemma></entry>
<entry freq="1"><lemma>symbols</lemma></entry>
<entry freq="1"><lemma>sysid</lemma></entry>
<entry freq="1"><lemma>texts</lemma></entry>
<entry freq="1"><lemma>the</lemma></entry>
<entry freq="1"><lemma>toolscorpus</lemma></entry>
<entry freq="1"><lemma>toolslanguage</lemma></entry>
<entry freq="1"><lemma>unified</lemma></entry>
<entry freq="1"><lemma>unknown</lemma></entry>
<entry freq="1"><lemma>”</lemma></entry>
<entry freq="33"><lemma>、</lemma></entry>
<entry freq="472"><lemma>。</lemma></entry>
<entry freq="164"><lemma>一</lemma></entry>
<entry freq="4"><lemma>一下</lemma></entry>
<entry freq="2"><lemma>一下子</lemma></entry>
<entry freq="1"><lemma>一二</lemma></entry>


The problem is that the source file is entirely in Chinese, annotated as follows:
<pos type="ns">黑鲨洋</pos>
<pos type="m">1</pos>
<pos type="nh">老七叔</pos>
<pos type="dt">新</pos>
<pos type="v">搞</pos>
<pos type="u">了</pos>
<pos type="m">一</pos>
<pos type="q">条</pos>
<pos type="n">船</pos>
<pos type="w">,</pos>
<pos type="v">请</pos>
<pos type="nh">曹莽</pos>
<pos type="v">入伙</pos>


The file contains no English words such as "and" or "are", yet they appear in the word list. How can I obtain the real word list?
 
Yes, the Xaira wordlist appears to include the tokens from the automatically generated corpus header.
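As a rough workaround for a list that has already been exported, the header tokens could be filtered out afterwards. The sketch below is only an illustration and not a Xaira feature: it assumes the export has the `<entry freq="..."><lemma>...</lemma></entry>` shape shown in the question, and it drops every lemma made up purely of ASCII characters. The function name `split_entries` is hypothetical. Note that this heuristic would also drop genuine ASCII tokens from the texts (for example the numeral `1` in the annotated sample), so the dropped list is returned for inspection.

```python
import re

# Matches one word-list entry in the shape shown in the question (assumed).
ENTRY = re.compile(r'<entry freq="(\d+)"><lemma>(.*?)</lemma></entry>')


def split_entries(wordlist_text):
    """Return (kept, dropped) lists of (lemma, freq) pairs.

    A lemma is dropped when it consists only of ASCII characters,
    which is a heuristic for "probably came from the English header".
    """
    kept, dropped = [], []
    for freq, lemma in ENTRY.findall(wordlist_text):
        target = dropped if lemma.isascii() else kept
        target.append((lemma, int(freq)))
    return kept, dropped


# A few entries copied from the word list in the question.
sample = """<entry freq="2"><lemma>and</lemma></entry>
<entry freq="1"><lemma>made07</lemma></entry>
<entry freq="472"><lemma>。</lemma></entry>
<entry freq="164"><lemma>一</lemma></entry>"""

kept, dropped = split_entries(sample)
```

Checking `dropped` by hand before discarding it is advisable, since the filter cannot tell header words from real Latin-script tokens in the corpus.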
 
Those words come from the corpus header that Xaira generates automatically when it indexes a corpus. They have nothing to do with the content of any particular corpus. To get an XML wordlist without those words, open the indexed corpus, start a word query, press Lookup to get the whole word list, and choose to save the list in XML format.
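To cross-check whatever list Xaira produces, the frequencies can also be counted directly from the annotated source file, independently of Xaira. This is a minimal sketch under assumptions: it relies only on the `<pos type="...">token</pos>` markup shown in the question, uses a regular expression rather than an XML parser because the fragment has no root element, and the function name `count_tokens` and its `skip_types` parameter are made up for illustration.

```python
import re
from collections import Counter

# Matches one annotated token in the shape shown in the question (assumed).
POS_ELEMENT = re.compile(r'<pos type="([^"]*)">(.*?)</pos>', re.DOTALL)


def count_tokens(text, skip_types=()):
    """Count token frequencies, ignoring tokens whose type is in skip_types."""
    counts = Counter()
    for pos_type, token in POS_ELEMENT.findall(text):
        if pos_type in skip_types:
            continue
        counts[token] += 1
    return counts


# A few lines copied from the annotated sample in the question.
sample = """<pos type="v">搞</pos>
<pos type="u">了</pos>
<pos type="m">一</pos>
<pos type="q">条</pos>
<pos type="n">船</pos>
<pos type="w">,</pos>
<pos type="v">请</pos>"""

# Skip the "w" type, which marks the comma in the sample.
counts = count_tokens(sample, skip_types=("w",))
for token, freq in counts.most_common():
    print(freq, token)
```

For a real corpus file, the text would be read with `open(path, encoding="utf-8")` or whatever encoding the file actually uses.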
 
Thanks a lot, dear Xiao.
And is the English counterpart of the Chinese LCMC available for comparative studies?
 
I don't see a point in comparing a Chinese wordlist with an English wordlist...
 