ParaConc: Concordance Software for Multilingual Parallel Corpora
http://forum.corpus4u.org/upload/forum/2006010819383527.pdf
1. Alignment
The successful searching and analysis of parallel texts
depends on the presence of aligned text segments in each
language corpus (and, of course, on the availability of
parallel corpora). The alignment, an indication of
equivalent text segments in the two languages, typically
uses the sentence as the basic alignment unit. Such an
alignment is naturally not strictly one-to-one throughout
the texts: occasionally a sentence in Language A may, for
example, be equivalent to two sentences in Language B, or
may be absent from Language B altogether. (More difficult
problems arise in cases where the translation of one
sentence in Language A is distributed over several
sentences in Language B.) The size of the aligned
segments is not fixed by the software, however. It would be
possible to work with paragraphs as the basic alignment
unit, but the results of a search would then be more
cumbersome to read, because the translation of a word or
phrase would be embedded within a large amount of text.
This is especially problematic when the user does not know
one of the languages well.
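The alignment types described above (one-to-one, one-to-two, and a sentence with no counterpart) can be illustrated with a small sketch. This is not ParaConc's internal format: the `Bead` type, the `search` function, and the sample sentences are all hypothetical, chosen only to show how a search hit in Language A retrieves its aligned segment in Language B.

```python
from typing import List, Tuple

# Hypothetical representation: each alignment unit pairs zero or more
# Language A sentence indices with zero or more Language B indices.
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]

lang_a = [
    "The cat sat down.",
    "It was tired because it had walked all day.",
    "Nobody noticed.",
]
lang_b = [
    "Le chat s'est assis.",
    "Il était fatigué.",
    "Il avait marché toute la journée.",
]

alignment: List[Bead] = [
    ((0,), (0,)),    # one sentence in A aligned with one sentence in B
    ((1,), (1, 2)),  # one sentence in A equivalent to two sentences in B
    ((2,), ()),      # sentence in A absent from B altogether
]


def search(term: str, source: List[str], target: List[str],
           beads: List[Bead]) -> List[Tuple[str, str]]:
    """Return (source segment, target segment) pairs whose source
    segment contains the search term (case-insensitive)."""
    hits = []
    for a_ids, b_ids in beads:
        a_text = " ".join(source[i] for i in a_ids)
        if term.lower() in a_text.lower():
            b_text = " ".join(target[j] for j in b_ids)
            hits.append((a_text, b_text))
    return hits


results = search("tired", lang_a, lang_b, alignment)
for a_text, b_text in results:
    print(a_text, "|", b_text)
```

With sentence-sized units, the segment returned for a hit stays short. If the units were whole paragraphs, the same lookup would return the entire paragraph on each side, which is the cumbersome result described above.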
The alignment utility in ParaConc is semi-automatic.
When files are loaded, the user enters information about
the format of the files either through reference to SGML
tags or via specifications of patterns. The user specifies
the form of headings and the form of paragraphs.
ParaConc uses the information to align the documents at
this level and the user can make adjustments by
merging/splitting units, as appropriate. Sentence-level
alignment, if it is not indicated by SGML tags, is performed
using the Gale-Church algorithm (Gale and Church,