Chinese text processing FAQ

armstrong

Senior member
Re: Chinese text processing FAQ

Thanks again, Dr. Hong, but all the links you provided above are dead.
 

armstrong

Senior member
Re: Chinese text processing FAQ

Dr. Hong, I tried adding numbers to the sentences as you instructed and it worked. However, I failed to convert the .txt file to .xml format, even though I downloaded the source code and software.
Could you please tell me why?
 

hazhihan

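Junior member
Re: Chinese text processing FAQ

I wrote a VBA program; please try it out. :D
Open the text to be processed in Word, press Alt+F11 to open the Visual Basic editor, double-click "ThisDocument" in the "Project" window on the left, paste the code from the attachment vba.txt into the blank pane on the right, and press F5 to run it. Finally, save the processed text with Save or Save As.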
 


laohong

Administrator
Staff member
Re: Chinese text processing FAQ

...but I failed to convert the .txt format to .xml format ...please tell me why?
After you have added the <s n="xxx"> tag to the start of each sentence/paragraph, you still need to add a closing tag </s> at the end of each sentence/paragraph. This can be done easily in EditPlus: Search, Replace, tick Regular expression, Find What "\n" and Replace With "</s>\n" (without the quotation marks).
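
The same step can also be scripted instead of done in an editor. The following is a minimal Python sketch, not part of the original instructions. The function name close_sentences is hypothetical, and the sketch assumes each sentence/paragraph sits on its own line and already starts with an <s n="..."> tag. Unlike a bare "\n" replacement, it leaves blank lines alone and also closes a last line that has no trailing newline.

```python
def close_sentences(text: str) -> str:
    """Append </s> to every line that opens with an <s ...> tag."""
    closed = []
    for line in text.splitlines():
        stripped = line.rstrip()
        # Only touch lines that were numbered and are not closed yet.
        if stripped.startswith("<s ") and not stripped.endswith("</s>"):
            stripped += "</s>"
        closed.append(stripped)
    return "\n".join(closed) + "\n"


if __name__ == "__main__":
    sample = '<s n="1">First sentence.\n\n<s n="2">Second sentence.'
    print(close_sentences(sample), end="")
```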

Then, add the following lines to the beginning of the text:
<?xml version="1.0" encoding="utf-8"?>
<Doc>
<FileInfo>
<TextType>News</TextType>
<FileID>001</FileID>
</FileInfo>
<Text>


You can change <FileInfo>, <TextType>, <FileID>, etc. to whatever you like, but make sure every opening tag is paired with a closing tag.

Finally add the two lines below to the end of your text:
</Text>
</Doc>


Now you are ready to save the text in XML format with UTF-8 encoding. If you have Xaira, you can test the file by indexing it with Xaira.
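
As a rough illustration of the wrapping steps above, here is a Python sketch. The names wrap_document and is_well_formed are hypothetical; the element names and default values are the ones shown in this post. It adds the header and footer around an already tagged body and then checks that the result parses, using Python's standard xml.etree.ElementTree module. This only tests well-formedness and is not a substitute for indexing the file in Xaira.

```python
import xml.etree.ElementTree as ET

HEADER = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<Doc>\n"
    "<FileInfo>\n"
    "<TextType>{text_type}</TextType>\n"
    "<FileID>{file_id}</FileID>\n"
    "</FileInfo>\n"
    "<Text>\n"
)
FOOTER = "</Text>\n</Doc>\n"


def wrap_document(body: str, text_type: str = "News", file_id: str = "001") -> str:
    """Put the header lines before the tagged body and the footer after it."""
    header = HEADER.format(text_type=text_type, file_id=file_id)
    return header + body.rstrip("\n") + "\n" + FOOTER


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    doc = wrap_document('<s n="1">你好</s>\n<s n="2">Hello</s>\n')
    print(is_well_formed(doc))
```

The resulting string can be written out with pathlib.Path("001.xml").write_text(doc, encoding="utf-8"); the file name here is only an example.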
 

armstrong

Senior member
Re: Chinese text processing FAQ

Huge thanks, Dr. Hong. It worked.
 

xujiajin

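Administrator
Staff member
Re: Chinese text processing FAQ

I wrote a VBA program; please try it out. :D
Open the text to be processed in Word, press Alt+F11 to open the Visual Basic editor, double-click "ThisDocument" in the "Project" window on the left, paste the code from the attachment vba.txt into the blank pane on the right, and press F5 to run it. Finally, save the processed text with Save or Save As.
May I ask what this is for?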
 

armstrong

Senior member
Re: Chinese text processing FAQ

It adds sequence numbers to the start of each sentence or paragraph.
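
For readers who do not have the vba.txt attachment, the numbering step can be sketched in Python. This is not hazhihan's program; the function name number_paragraphs is hypothetical, and the sketch assumes one sentence or paragraph per line and the <s n="..."> tag format used elsewhere in this thread. It also escapes characters such as & and <, which would otherwise break the XML later on. Closing </s> tags are left out here because the thread adds them as a separate step.

```python
from xml.sax.saxutils import escape


def number_paragraphs(text: str) -> str:
    """Prefix every non-empty line with an <s n="N"> tag, counting from 1."""
    numbered = []
    count = 0
    for line in text.splitlines():
        content = line.strip()
        if not content:
            continue  # blank lines carry no sentence, so they get no number
        count += 1
        numbered.append(f'<s n="{count}">{escape(content)}')
    return "\n".join(numbered) + "\n"


if __name__ == "__main__":
    print(number_paragraphs("第一段\n\nSecond paragraph, with A & B\n"), end="")
```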
 

xujiajin

Administrator
Staff member
Re: Chinese text processing FAQ

I see, that is useful. Nice work, thanks.
 

刘语料

Banned user
Re: Chinese text processing FAQ

I have the following question:
1. We are in a period of decisive historical significance.
2. <w pos="PRP">We</w><w pos="VBP" lemma="be">are</w><w pos="IN">in</w><w pos="DT">a</w><w pos="NN">period</w><w pos="IN">of</w><w pos="JJ">decisive</w><w pos="JJ">historical</w><w pos="NN">significance</w>.

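Sentence 1 above has to go through at least POS tagging, then lemmatization, and then XML markup before it becomes sentence 2.
Is there a software suite that can do all of this? Is there software that does POS tagging together with lemmatization? (I have used TreeTagger, but the results were not good.)

Thanks!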
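
The last step, turning tagger output into the markup of sentence 2, can be sketched in Python. This does not recommend or replace a tagger. It assumes the input is already tagged, with one token per line and tab-separated token, POS tag, and lemma, which is the layout TreeTagger prints when token and lemma output are enabled. The function name tagged_to_xml is hypothetical. Following sentence 2, the lemma attribute is written only when the lemma differs from the token, and pure punctuation is left untagged.

```python
from xml.sax.saxutils import escape, quoteattr


def tagged_to_xml(tagged_lines):
    """Convert 'token<TAB>pos<TAB>lemma' lines into a string of <w> elements."""
    parts = []
    for line in tagged_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that are not token/pos/lemma triples
        token, pos, lemma = fields
        if not any(ch.isalnum() for ch in token):
            parts.append(escape(token))  # punctuation stays outside <w>
            continue
        attrs = f"pos={quoteattr(pos)}"
        if lemma.lower() != token.lower() and lemma != "<unknown>":
            attrs += f" lemma={quoteattr(lemma)}"
        parts.append(f"<w {attrs}>{escape(token)}</w>")
    return "".join(parts)


if __name__ == "__main__":
    sample = ["We\tPRP\twe", "are\tVBP\tbe", "in\tIN\tin", ".\tSENT\t."]
    print(tagged_to_xml(sample))
```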
 
Re: Chinese text processing FAQ

Experts, I am a newcomer. May I ask why many of the links in this forum open the forum home page instead of the correct page? Please let me know. I would be very grateful!
 

xujiajin

Administrator
Staff member
Re: Chinese text processing FAQ

I have just manually fixed all of the broken links. They should work now.
 

bigwind

Junior member
Re: Chinese text processing FAQ

It seems ICTCLAS no longer works. I tried it and it failed to segment words. Does it only work in 2008? How can I make it work in 2009?
 

xujiajin

Administrator
Staff member
Re: Chinese text processing FAQ

Set your system clock to 2008 or earlier.
 

bigwind

Junior member
Re: Chinese text processing FAQ

Thank you, Mr. Xu. Yes, simply setting the clock back makes it work again.
 

armstrong

Senior member
Re: Chinese text processing FAQ

It seems that Scott Piao's MCLT cannot convert English text from ANSI encoding to UTF-8. Does anyone know which software can convert English text to UTF-8, preferably in batch?
Thanks!
 

laohong

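Administrator
Staff member
Re: Chinese text processing FAQ

It seems that Scott Piao's MCLT cannot convert English text from ANSI encoding to UTF-8. Does anyone know which software can convert English text to UTF-8, preferably in batch?
Thanks!
I suggest you use this: an excellent free Traditional/Simplified Chinese conversion tool developed in Taiwan, which I have used for many years. To convert English ANSI text to Unicode, choose the GBK-to-Unicode conversion. It is simple and easy to use; give it a try.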

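ConvertZ v8.02

ConvertZ is a Chinese encoding converter that lets you easily convert plain-text files or clipboard contents between Big5, GBK, Unicode, UTF-8, JIS, Shift-JIS, and EUC-JP, solving the communication problems that arise when different regions use different encodings.

Features:

1. Traditional Chinese, Simplified Chinese, and English interfaces, suitable for Traditional Chinese, Simplified Chinese, and other versions of Windows.
2. Converts Big5/GBK/HZ/Unicode/UTF-8 files freely between these encodings.
3. Previews the text before and after conversion.
4. Clipboard encoding conversion, available from the main window, from the right-click menu of the toolbar icon, or through a hotkey.
5. Some Simplified characters correspond to several Traditional characters (for example 〔干、幹、乾〕〔划、劃〕〔里、裡〕〔發、髮〕〔郁、鬱〕〔松、鬆〕〔余、餘〕); the program corrects such wrongly mapped characters automatically during conversion. Users can edit the built-in "vocabulary correction list" to improve recognition accuracy.
6. Automatically updates the charset value in the <Meta> tag of HTML files.
7. Text conversion forwarding: type Chinese text into the input box and send the converted result to a specified program.
8. Command-line support.
9. CF_HTML conversion support: formatting is preserved when converting clipboard text from Office, IE, Outlook, and similar documents.
10. Restores Unicode numeric notation () to text in the target encoding.
11. Converts the encoding of ID3/APE/OGG tags in MP3/APE/OGG files.


Platforms:
Windows 9x/ME/NT/2000/XP/2K3

Interface:
When running, the program appears as a toolbar at the top of the screen (hidden automatically by default). Users can also right-click the small icon in the Windows taskbar and run the commands from the menu.

Installation:
No installation is required. Extract all files to a new folder and run convertz.exe directly.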
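
For batch conversion without a GUI tool, a short Python sketch is one alternative. It is not related to ConvertZ, and the function name convert_folder and its parameters are hypothetical. It assumes all input files share one known source encoding. On Simplified Chinese Windows, "ANSI" usually means GBK; on Western Windows it is usually cp1252. Pure ASCII English text has the same bytes in GBK and UTF-8, so such files come out unchanged.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, src_encoding="gbk", pattern="*.txt"):
    """Re-encode every matching file in src_dir as UTF-8 into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob(pattern)):
        # Work on bytes so that line endings are preserved exactly.
        text = path.read_bytes().decode(src_encoding)
        (dst / path.name).write_bytes(text.encode("utf-8"))
        converted.append(path.name)
    return converted


if __name__ == "__main__":
    # "corpus_ansi" and "corpus_utf8" are example folder names only.
    if Path("corpus_ansi").is_dir():
        print(convert_folder("corpus_ansi", "corpus_utf8"))
```

Writing to a separate output folder keeps the originals intact in case the assumed source encoding turns out to be wrong.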
 

armstrong

Senior member
Re: Chinese text processing FAQ

Thank you, Laohong, for providing the software.
 
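Re: Chinese text processing FAQ

Could anyone recommend a few Chinese word-frequency analysis programs, preferably ones that can count directly on a corpus that has already been segmented?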
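
Until someone recommends a dedicated program, counting words in an already segmented corpus can be sketched in a few lines of Python. The function name word_frequencies is hypothetical. The sketch assumes words are separated by whitespace and, optionally, that each word carries a POS tag after a slash (for example 语料/n), which the strip_tags option removes before counting.

```python
from collections import Counter


def word_frequencies(segmented_text, strip_tags=False):
    """Count whitespace-separated tokens in pre-segmented text."""
    tokens = segmented_text.split()
    if strip_tags:
        # Keep only the part before the last slash, e.g. "语料/n" -> "语料".
        tokens = [tok.rsplit("/", 1)[0] for tok in tokens]
    return Counter(tokens)


if __name__ == "__main__":
    freq = word_frequencies("我 喜欢 语料库 , 我 喜欢 分词")
    for word, count in freq.most_common(3):
        print(word, count)
```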
 