From Corpora list
Some time ago I posted a query, "comparable corpora and computer-aided translation", to ask about progress in applying comparable corpora to computer-aided translation and to request suggested readings. Here is a belated summary of the replies. I would like to thank all of the colleagues below for their contributions.
All the best
Xiaotian Guo
SOAS & New Vision Language Centre
-----------------------------------------------------
1. Gill Philip recommends an article of hers: Gill Philip (2009) Arriving at equivalence: Making a case for comparable general reference corpora in translation studies. In Allison Beeby, Patricia Rodríguez Inés & Pilar Sánchez-Gijón (eds) Corpus Use and Translating: Corpus use for learning to translate and learning corpus use to translate, pp. 59-73. Amsterdam / Philadelphia: John Benjamins
2. Paul Rayson replies as follows:
You should have a look at the output from the ASSIST project involving Lancaster and Leeds. Papers are available from:
http://ucrel.lancs.ac.uk/projects/assist/
http://www.comp.leeds.ac.uk/ssharoff/
3. Dominic Widdows stresses the usefulness of comparable corpora and suggests a paper, as follows:
One paper on finding translations without parallel corpora is:
Learning Bilingual Lexicons from Monolingual Corpora Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein, ACL 2008
http://www.eecs.berkeley.edu/~aria42/pubs/acl2008-unsup-bilexicon.pdf
In general I think there has been a lot of good work that uses language models for the target language built from large monolingual corpora. E.g., you can use a smaller parallel French-English corpus to translate into English, and a large English-only corpus to help "clean up" your translation, to make sure your English translation is "reasonable English". At least, that's my cartoon view of the general idea; I'm sure there are many experts out there who can enrich or correct this summary.
4. Nitin Madnani enriches the list of readings as follows:
You may also look at the following papers/resources on leveraging comparable data for SMT:
(a) Language and Translation Model Adaptation using Comparable Corpora Matthew Snover, Bonnie J. Dorr, and Richard Schwartz. EMNLP 2008
(b) Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.
(c) The proceedings for the workshop on Building and Using Comparable Corpora (http://comparable2009.ust.hk/). There have been two so far, I believe.
5. Yannick Versley recommends a paper from the perspective of computational linguistics:
This is also a bit on the computational side (rather than applied corpus linguistics), but it may be interesting: Pekar V., Mitkov R., Blagoev D., and Mulloni A. (2007). Finding Translations for Low-Frequency Words in Comparable Corpora. In Proceedings of the CONTEXT-07 Workshop on "Contextual Information in Semantic Space Models" (CoSMo-2007). Roskilde, Denmark. pp. 17-25. http://home.wlv.ac.uk/~in8113/papers/cosmo07_pekar_et_al.pdf
6. Stella Tagnin mentions two papers (one written in Portuguese) as follows:
British vs. American English, Brazilian vs. European Portuguese: how close or how far apart? - a corpus-driven study (Frankfurt am Main: Lodz Studies in Language 9, 2004, p. 193-208)
Stella E. O. Tagnin & Elisa Duarte Teixeira (http://www.fflch.usp.br/dlm/comet/artigos/BRITISH VS. AMERICAN ENGLISH.pdf)
A identificação de equivalentes tradutórios em corpora comparáveis (Anais do I Congresso Internacional da ABRAPUI: Belo Horizonte, 3 a 6 de junho de 2007)
Stella E. O. Tagnin
(http://www.fflch.usp.br/dlm/comet/Novo/Stella_Abrapui 2007_artigo.pdf)
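As an illustration of the "cartoon view" in reply 3 (using a large English-only corpus to check that a translation is "reasonable English"), here is a minimal sketch of the idea. It is not taken from any of the papers above: the toy corpus, the candidate translations, and the function names (train_bigram_lm, log_prob, rerank) are all assumptions made for illustration, and the model is a plain add-one-smoothed bigram language model rather than anything a real system would use.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their histories over whitespace-tokenised sentences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        vocab.update(tokens[1:])                 # words the model can predict
    # +1 reserves one slot for unseen words (a simplifying assumption).
    return histories, bigrams, len(vocab) + 1


def log_prob(sentence, lm):
    """Add-one-smoothed bigram log-probability of a sentence."""
    histories, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total


def rerank(candidates, lm):
    """Order candidate translations from most to least fluent under the LM."""
    return sorted(candidates, key=lambda c: log_prob(c, lm), reverse=True)


# Hypothetical stand-in for a "large English-only corpus".
english_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat is on the mat",
]
# Hypothetical outputs of a translation step trained on a small parallel corpus.
candidates = ["cat the mat on sat the", "the cat sat on the mat"]

lm = train_bigram_lm(english_corpus)
best = rerank(candidates, lm)[0]
print(best)
```

In a real system the language model score would normally be combined with a translation model score rather than used alone; the sketch only shows the "clean up" half of the idea.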