Design of A Subtitle Corpus (MMSC) and Its Applications


Corpora, especially parallel corpora, have become indispensable instruments in many areas of linguistic research, including translation studies and natural language processing. However, because sources of bilingual or multilingual material are limited, the development of parallel corpora has lagged far behind that of other types of corpora.
Meanwhile, with the spread of DVDs and the Internet, film and television subtitles (captions), which are bilingual or multilingual by nature, have become easier to obtain, and their volume, already huge, is growing fast.
The author therefore attempts to build a parallel corpus from the voluminous subtitles available online or on DVDs: the Mass Media Subtitle Corpus, or MMSC for short. The MMSC is designed to be open and extensible, with a framework that allows easy access as well as convenient management and maintenance. At the completion of the thesis, the MMSC contains no fewer than 1,500,000 words and 100,000 parallel units, and it is expected to receive many more texts from users and donors in due course.
The present paper centers on the creation of the MMSC, and then introduces several test studies conducted on it in order to discuss its possible uses in academic areas such as translation studies, translator training, and English teaching.
The creation of the MMSC involves several steps, including overall design, subtitle selection and collection, text alignment, text annotation, and the design of the concordance platform and maintenance interface. For text alignment, the author proposes a new alignment algorithm designed specifically for subtitles, which differs from traditional algorithms that take statistical approaches.
The applications of the MMSC are exemplified at the end of the thesis through a few pilot studies.
Finally, potential uses of the MMSC and one possible upgraded version are suggested.
Keywords: corpus, film, television, subtitle, translation

Zhu Chong. Chengdu: UESTC, 2007
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

I was excited to find someone doing similar research when I first visited the link Laohong posted. When I started my thesis last year, I already had in mind a potential upgraded version of the MMSC (Mass Media Subtitle Corpus), namely the Multi-Media Subtitle Corpus. During the creation of the MMSC, the time-codes (or time-cues) were kept intact with each line. The corpus could therefore be upgraded to a multimedia corpus once media files and subtitles are linked by these time-cues, through a working algorithm much the same as the one shown by Dr. Sato.
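As a rough illustration of why keeping the time-cues is useful, here is a minimal Python sketch that pairs the lines of two subtitle tracks by how much their time spans overlap. It is not the alignment algorithm of the thesis, which is not described in this thread; the SRT-style cue format `HH:MM:SS,mmm`, the function names, and the sample lines are all assumptions made for the example.

```python
import re

# Assumed cue format: SRT-style "HH:MM:SS,mmm" (a dot before the millis is also accepted).
CUE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def cue_to_ms(cue: str) -> int:
    """Convert a time-cue such as '00:01:02,500' to milliseconds."""
    match = CUE.fullmatch(cue.strip())
    if match is None:
        raise ValueError(f"unrecognised time-cue: {cue!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def overlap_ms(a, b):
    """Length of the overlap between two (start, end) spans, in milliseconds."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def pair_by_overlap(source, target):
    """Pair each source line with the target line whose span overlaps it most.

    Each line is a (start_cue, end_cue, text) tuple. A source line with no
    overlapping target line is paired with None.
    """
    pairs = []
    for s_start, s_end, s_text in source:
        s_span = (cue_to_ms(s_start), cue_to_ms(s_end))
        best_text, best_overlap = None, 0
        for t_start, t_end, t_text in target:
            t_span = (cue_to_ms(t_start), cue_to_ms(t_end))
            current = overlap_ms(s_span, t_span)
            if current > best_overlap:
                best_text, best_overlap = t_text, current
        pairs.append((s_text, best_text))
    return pairs


# Hypothetical sample lines, not taken from the MMSC.
english = [
    ("00:00:01,000", "00:00:03,000", "Hello."),
    ("00:00:04,000", "00:00:06,500", "Where are you going?"),
]
chinese = [
    ("00:00:01,100", "00:00:02,900", "你好。"),
    ("00:00:04,200", "00:00:06,400", "你要去哪儿?"),
]
pairs = pair_by_overlap(english, chinese)
```

A real subtitle pair would also need handling for lines that are split or merged differently in the two languages, which this one-to-one sketch ignores.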
Thank you, Laohong.
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Congratulations on your excellent work! It seems that you are using Tomcat. Have you tried linking the search results to the video clips? Any plans to release it to the friends here as beta testers?
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Yes. I am using Java/Tomcat.
As of the completion of my thesis, I had not tried to link the search results to video files. The programming is relatively easy, whereas making video files from DVDs would be much nastier. Besides, the concordancer is quite shabby and slow. Much improvement is still needed.
I tried to upload excerpts of my thesis but failed. The network at UESTC has become so slow recently that I am even having trouble accessing corpus4u.
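The programming side of linking a search hit to its video might look something like the following Python sketch, which turns the time-cues stored with a concordance hit into a start offset and a duration (in seconds) that a media player or cutting tool could use. The one-second padding, the function names, and the cue format `HH:MM:SS,mmm` are assumptions for illustration; nothing here is taken from the MMSC code itself, which runs on Java/Tomcat.

```python
def cue_to_seconds(cue: str) -> float:
    """Convert a time-cue such as '00:12:30,250' to seconds."""
    hms, _, millis = cue.strip().replace(".", ",").partition(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis or 0) / 1000


def clip_window(start_cue: str, end_cue: str, padding: float = 1.0):
    """Return (offset, duration) in seconds for the clip around one subtitle line.

    `padding` (an assumed default of one second) is added on both sides so the
    clip does not start or stop abruptly; the offset never goes below zero.
    """
    start = max(0.0, cue_to_seconds(start_cue) - padding)
    end = cue_to_seconds(end_cue) + padding
    return start, end - start


# Hypothetical hit: a line shown from 00:12:30,250 to 00:12:33,750.
offset, duration = clip_window("00:12:30,250", "00:12:33,750")
```

With such a window, the video itself would not need to be cut in advance: a player that can seek to `offset` and stop after `duration` would be enough, provided the media file is available.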
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Testing a picture link: a sample XML file from the MMSC.
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Did you want to post this image?

Re: Design of A Subtitle Corpus (MMSC) and Its Applications

If I could, I would like to upload a test version of the MMSC here.
Re: Design of A Subtitle Corpus (MMSC) and Its Applications

wow, very promising:)
awaiting your further reports...
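Re: Design of A Subtitle Corpus (MMSC) and Its Applications

I am really interested in your research. Could you e-mail your paper 影视字幕平行语料库研制的重要性与可行性研究 (roughly, "A Study of the Importance and Feasibility of Building a Parallel Corpus of Film and Television Subtitles") to my mailbox? Thanks! :)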
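Re: Design of A Subtitle Corpus (MMSC) and Its Applications

lynn, I have recently been building a film and television corpus, and the subtitle collection is nearly done. We could cooperate or exchange ideas; my QQ is 774741570. It cannot compare with Mr. Zhu Chong's film and television corpus, though: mine is still a raw corpus. Probably because there are too many texts, generating a wordlist with WordSmith 4.0 is difficult, so it looks like I will have to split the texts into categories.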
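Re: Design of A Subtitle Corpus (MMSC) and Its Applications

Is there a free film and television corpus available online now? Or does anyone have a beta version? I urgently need a corpus of this kind. Many thanks!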