Seeking partners to build a Tibetan corpus

xujiajin

Administrator
Staff member
#2
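Interesting. What you need is not hard to meet.
As long as you have electronic Tibetan texts and save them as Unicode, that is enough. What I don't know is how written Tibetan marks word boundaries: are there spaces between words, or are the words run together?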
 

xujiajin

Administrator
Staff member
#4
In theory this should not be a problem, although Mike Scott does not seem to have taken Tibetan into account.
There is some good news, though:
if you know a language with an interesting writing system you might earn a free copy
http://www.lexically.net/wordsmith/free_copy.htm
 

僧梦

Junior member
#5
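I already have 1 GB of electronic Tibetan text on hand. It still needs an encoding conversion, but I can handle that part myself.
Tibetan has no spaces between words, so the segmentation task is much like the one for Chinese. Tibetan grammar is relatively strict, though, so the rule-based analysis should be fairly manageable.
Thanks in advance for your help.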
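To make the segmentation point concrete, here is a minimal sketch in Python, assuming the text is already in Unicode. It splits on the tsheg (U+0F0B, the mark between syllables) and the shad (U+0F0D), then groups syllables into words by greedy longest match against a lexicon. The function names and the two-entry lexicon are hypothetical and only for illustration; a real system would need a full dictionary and rules for particles.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark
SHAD = "\u0f0d"   # Tibetan clause/sentence-final mark

# Four syllables written with escapes so the code points are unambiguous.
BKRA = "\u0f56\u0f40\u0fb2"
SHIS = "\u0f64\u0f72\u0f66"
BDE = "\u0f56\u0f51\u0f7a"
LEGS = "\u0f63\u0f7a\u0f42\u0f66"

def split_syllables(text):
    """Split Tibetan text into syllables on tsheg, shad, and whitespace."""
    parts = re.split("[" + TSHEG + SHAD + r"\s]+", text)
    return [p for p in parts if p]

def max_match(syllables, lexicon, max_len=4):
    """Greedy forward longest match over syllables.

    A syllable sequence not found in the lexicon falls back to
    single-syllable words.
    """
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(TSHEG.join(candidate))
                i += n
                break
    return words

# Toy lexicon (hypothetical): each entry is a tuple of syllables.
LEXICON = {(BKRA, SHIS), (BDE, LEGS)}

sample = BKRA + TSHEG + SHIS + TSHEG + BDE + TSHEG + LEGS + SHAD
syllables = split_syllables(sample)
words = max_match(syllables, LEXICON)
print(len(syllables), "syllables ->", len(words), "words")
```

Syllable splitting is the easy part because the tsheg is explicit in the text. The word grouping is where the real difficulty lies, much as with Chinese.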
 

Haiyang Ai

Administrator
Staff member
#6
WST4 supports Unicode, so it can handle essentially any language, and Tibetan should be no exception.
Word segmentation may be rather hard to do.
 

xiaoz

Super administrator forever
Staff member
#7
As Ocean said, the difficult part may be tokenisation/POS tagging. For concordancing, Xaira can be used. We have tested it on similar languages in South Asia.
 

僧梦

Junior member
#8
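By the way, the Tibetan texts cannot be saved as Unicode yet, because Microsoft has not yet released Unicode support for Tibetan, although I hear it is already in Vista. What should I do?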
 

laohong

Administrator
Staff member
#9
Re: Seeking partners to build a Tibetan corpus

There are only a few people working on Tibetan languages and computing. Have a look at this guy's page:

Paul G. Hackett
http://www.columbia.edu/~ph2046/RnD/Hackett/

You can get some information at:

IATS-X Tibetan Information Technology Panel
Links to the papers and presentations given in the “Tibetan Information Technology (IT) Panel” at the Tenth International Association of Tibetan Studies Conference, Oxford, 6-12 September, 2003.

Tibetan and Computing
Links to a number of resources for Tibetan encoding and data manipulation on a variety of platforms: Macintosh, Palm Pilot, and Windows/Intel PCs.

Tibetan Bibliography
Links to a number of resources for Tibetan Bibliographic information.
 

laohong

Administrator
Staff member
#10
This paper is also quite relevant:

Title: A Syntactically Annotated Corpus of Tibetan

by
Andreas Wagner, Bettina Zeisler
SFB 441, University of Tuebingen


Abstract
This paper describes the creation of a syntactically annotated Tibetan corpus. This corpus forms a part of the TUSNELDA collection of corpora and databases for linguistic research. It will ultimately comprise spoken and written Tibetan texts originating from different regions and historical epochs. These texts are annotated with several kinds of linguistic information, in particular POS tags, phrases, argument structures of verbs, clauses and sentences, as well as several kinds of discourse units and textual segments. The annotation is done in XML. The primary research interest which guides the development of the corpus is the investigation of cross-clausal references, especially the relation between empty arguments (i.e. arguments not overtly realised in a clause) and their antecedents in previous clauses. For this purpose, such references are explicitly encoded so that they can be qualitatively and quantitatively evaluated with the help of standard XML techniques such as XPath search and XSLT transformations. Apart from this primary research interest, we expect that our corpus will be useful for other projects concerning Tibetan and related languages. Like other data in TUSNELDA, it will be made accessible via a WWW query interface.

Keyword(s): corpus, XML, Tibetan, syntax, case roles

Full Paper is available here:
http://gandalf.aksis.uib.no/non/lrec2004/pdf/293.pdf
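As a rough illustration of the kind of XPath search the abstract mentions, here is a small Python sketch using the standard library. The element and attribute names (`text`, `s`, `w`, `pos`) and the placeholder tokens are assumptions made up for this example; they are not the actual TUSNELDA annotation scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature annotation, NOT the TUSNELDA schema:
# sentences <s> containing words <w> with a POS attribute.
SAMPLE = """
<text>
  <s id="s1">
    <w pos="N">w1</w>
    <w pos="V">w2</w>
  </s>
  <s id="s2">
    <w pos="V">w3</w>
  </s>
</text>
""".strip()

root = ET.fromstring(SAMPLE)

# XPath-style query: every word tagged as a verb, anywhere in the text.
verbs = [w.text for w in root.findall(".//w[@pos='V']")]

# Simple quantitative evaluation: number of verbs per sentence.
verbs_per_sentence = {
    s.get("id"): len(s.findall("w[@pos='V']")) for s in root.findall("s")
}

print(verbs)
print(verbs_per_sentence)
```

ElementTree supports only a subset of XPath. Full XPath or XSLT would need a library such as lxml.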
 

僧梦

Junior member
#11
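Thanks for the information above. I also saw the one demonstrated at the Tenth International Association of Tibetan Studies conference; it was a database about a particular monastery or cultural phenomenon in Tibet. I have not come across the latter yet, and it may be useful to me. Unfortunately my English is too poor, so I am looking for someone to translate it ^_^. Could you point me to some Chinese-language materials and software tools? Many thanks to all the experts here.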
 

laohong

Administrator
Staff member
#12
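This website has introductions to Tibet and the Tibetan language, and its "Tools" section lists text-processing software. Apart from QuillDriver, the video and audio transcription tool, I have not tried any of them, so have a look for yourself. For anyone who does not know much about Tibet, it is an encyclopedic site. Here is the site's own description:

The Tibetan and Himalayan Digital Library (雪域数码图书馆) is a website that uses network technology to bring together knowledge about the Tibetan plateau and the Himalayan region. The site serves visitors from around the world free of charge. To meet the needs of different visitors, it offers multimedia learning materials in multiple languages, along with extensive research resources on the region's environment, culture, and history.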

http://www.thdl.org/xml/show.php?xml=thdlhp.xml&lng=chi
 

僧梦

Junior member
#14
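I have looked at those tools. Each does only one thing, and they do not meet my needs. Are there any better ones? Any pointers from the experts would be welcome. Haha, I am getting greedier and greedier.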
 