Design of A Subtitle Corpus (MMSC) and Its Applications

zephyr

Grand Wizard
ABSTRACT
Corpora, especially parallel corpora, have become indispensable instruments in many areas of linguistic research, including translation studies and natural language processing. However, owing to the limited sources of bilingual or multilingual materials, the development of parallel corpora has lagged far behind that of other types of corpora.
Meanwhile, with the spread of DVDs and the Internet, film and television subtitles (captions), which are bilingual or multilingual by nature, have become easier to obtain, and their volume, already huge, is growing fast.
Therefore, the author attempts to build a parallel corpus from the voluminous subtitles available online or on DVDs, namely the Mass Media Subtitle Corpus, or MMSC for short. The MMSC is designed to be open and extensible, with a framework that allows easy access as well as convenient management and maintenance. At the completion of the thesis, the MMSC contains no fewer than 1,500,000 words and 100,000 parallel units, and it is expected to receive many more texts from users and donors in due course.
The present paper centers on the creation of the MMSC and then introduces several test studies conducted on it, in order to discuss its possible uses in academic areas such as translation studies, translator training and English teaching.
The creation of the MMSC involves several steps, including overall design, subtitle selection and collection, text alignment, text annotation, and the design of the concordance platform and maintenance interface. For text alignment, the author proposes a new alignment algorithm designed specifically for subtitles, which differs from traditional algorithms that take statistical approaches.
The applications of the MMSC are exemplified at the end of the thesis through a few pilot studies.
Finally, potential uses of the MMSC and one possible upgraded version are suggested.
Keywords: corpus, film, television, subtitle, translation

Zhu Chong. Chengdu: UESTC, 2007
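
The abstract does not describe the subtitle-specific alignment algorithm, so the following is only an illustration of why subtitles lend themselves to non-statistical alignment, not the author's method. It assumes that both language tracks carry time-codes and pairs lines by how much their display intervals overlap. The names `Cue`, `overlap`, `align_by_time` and the `min_ratio` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cue:
    start: float  # display start, in seconds
    end: float    # display end, in seconds
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(source: List[Cue], target: List[Cue],
                  min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Attach each target cue to the source cue it overlaps most in time.

    A target cue is kept only if the overlap covers at least min_ratio of
    its own duration (an assumed threshold, chosen for illustration).
    """
    units = [(s, []) for s in source]
    for t in target:
        duration = t.end - t.start
        if duration <= 0 or not source:
            continue
        best = max(range(len(source)), key=lambda i: overlap(source[i], t))
        if overlap(source[best], t) / duration >= min_ratio:
            units[best][1].append(t)
    # One parallel unit per source cue; several target lines may be merged.
    return [(s.text, " ".join(t.text for t in ts)) for s, ts in units]


english = [Cue(1.0, 3.0, "Where are you going?"),
           Cue(3.5, 6.0, "Home. I have had enough.")]
chinese = [Cue(1.1, 2.9, "你要去哪儿?"),
           Cue(3.6, 4.8, "回家。"),
           Cue(4.9, 6.1, "我受够了。")]

for en, zh in align_by_time(english, chinese):
    print(en, "|", zh)
```

In this made-up example the two short Chinese lines fall inside the display interval of the second English line, so they are merged into one parallel unit. Real subtitle files from different releases may have shifted or rescaled timings, which a sketch like this does not handle.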
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

I was excited to find someone doing similar research when I first visited the link Laohong posted. When I started my thesis last year, I already had in mind a potential upgraded version of the MMSC (Mass Media Subtitle Corpus), namely the Multi-Media Subtitle Corpus. During the creation of the MMSC, the time-codes (or time-cues) were kept intact with each line. The corpus could therefore be upgraded to a multimedia corpus by linking media files and subtitles through these time-cues, using an algorithm much the same as the one shown by Dr. Sato.
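
As a rough illustration of how the stored time-cues could link a corpus line to a media file, here is a minimal sketch. It assumes SubRip-style cues of the form `HH:MM:SS,mmm`; the thread does not say which format the MMSC stores, and the function names and the `padding` parameter are hypothetical.

```python
import re

# Assumed cue format: hours:minutes:seconds,milliseconds (a dot is also accepted).
TIME_CUE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def cue_to_seconds(cue: str) -> float:
    """Convert a time-cue such as '00:01:02,500' to seconds."""
    match = TIME_CUE.fullmatch(cue.strip())
    if match is None:
        raise ValueError(f"unrecognised time-cue: {cue!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def clip_window(start_cue: str, end_cue: str, padding: float = 0.5):
    """Return (start, end) in seconds for the clip around one subtitle line.

    padding widens the window slightly so speech is not cut off; the start
    is clamped so it never goes below zero.
    """
    start = max(0.0, cue_to_seconds(start_cue) - padding)
    end = cue_to_seconds(end_cue) + padding
    return start, end


print(clip_window("00:01:02,500", "00:01:05,000"))
```

A concordance hit that carries its start and end cues could pass the resulting window to whatever player or clip extractor is used, so only the cues need to be stored with each line.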
Thank you, Laohong.
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Congratulations on your excellent work! It seems that you are using Tomcat. Have you tried to link the search results to the video clips? Any plans to release it to the friends here as beta testers?
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Thanks.
Yes. I am using Java/Tomcat.
When I completed my thesis, I had not tried to link the search results to video files. The programming is relatively easy, whereas making video files from DVDs would be much nastier. Besides, the concordancer is quite shabby and slow. It still needs a lot of improvement.
I tried to upload excerpts of my thesis but failed. The network at UESTC has become so slow recently that I am even having trouble accessing corpus4u.
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Testing a picture link: this is a sample XML file from the MMSC.
xmlsample.JPG
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Did you want to post this image?

xmlsample.JPG
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Exactly.
If I could, I would upload a test version of the MMSC here.
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

wow, very promising:)
awaiting your further reports...
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

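Hi!
I am really interested in your research. Could you e-mail me your paper 影视字幕平行语料库研制的重要性与可行性研究 (A Study of the Importance and Feasibility of Developing a Parallel Corpus of Film and Television Subtitles) at toby2006toby@163.com? Thanks! :)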
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

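Hello, Teacher Zhu. I am a graduate student in linguistics and would like to write my thesis on film and television corpora. I look forward to getting in touch with you.
Zhang Lin
;) My e-mail: lynn_personnel@126.com, QQ: 236178854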
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

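lynn, I have recently been building a film and television corpus, and the subtitle collection is nearly done, so we could collaborate or exchange ideas (QQ: 774741570). It cannot be compared with Zhu Chong's corpus, though; mine is still a raw corpus. Perhaps because there are too many texts, I am having trouble generating a wordlist with WordSmith 4.0, so it looks like I will have to split the texts into categories.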
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

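Hello, Teacher Zhu Chong. I am also very interested in film and television corpora and am building one. It is currently at the sorting stage and has not been annotated yet. If it is only a raw corpus, from which angles can it be studied, and how should these film and television materials be classified? Do you have any relevant materials? I would be grateful for your advice.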
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

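In 2003 I built a Chinese-English film subtitle corpus of 1 million sentence pairs. It can still be searched. The unprocessed total exceeds 10 million sentence pairs.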
 
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

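Is there a free film and television corpus available online now? Or does anyone have a beta version? I urgently need a corpus of this kind. Many thanks!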
 