[Repost] Spanish corpus with integrated search functions

Haiyang Ai

From Linguistlist.org


Spanish corpus with integrated search functions
Date: 22-Jan-2005
From: Craig Schulenberg <cschulen2004aol.com>
Subject: Spanish corpus with integrated search functions


As an outgrowth of our efforts to develop a Parser/Tagger for Spanish we
have created a prototype program (Literature Assistant) which integrates a
corpus (which has been processed by our Parser) with a 'Reader' interface
and some powerful search functions. This program is entirely self-contained
and employs an extremely fast database of our own design. We have no
intention of developing this program into a commercial product; rather, it
is a research tool which is of great assistance to us in identifying the
(many) weaknesses in our Parser, and in our Dictionary. We would
appreciate feedback on the design and features of this software approach,
and would be interested in collaborative efforts on Parser/Tagger
implementations and corpus search algorithms. The Literature Assistant
runs in a DOS window on a PC.

The corpus includes 700 works (mostly novels), and menu screens allow
selecting an author, a work, and (finally) a chapter or bookmark. The user
then sees a 'Reader' screen which shows the complete text, and allows rapid
page up/down, top-of-text, and end-of-text positioning. When a word or
phrase is highlighted (by moving the cursor), the definition is shown
(drawn from our 48,000-word Dictionary). Conjugated verb forms are
referenced back to their infinitives and their definitions (based on our
13,066-verb database). If a highlighted word is selected, a second screen
immediately appears which shows 'all' sentences in the corpus that use the
same word/verb. On this second screen any of the cross-referenced works
can then be 'jumped to' by selecting that particular sentence. In this case
the user is positioned in the Reader Screen for this newly selected work.
In this way all of the texts may be traversed by following these links
between the two screens.
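
The cross-referencing described above can be pictured as an inverted index
keyed by dictionary form. The following is only a minimal sketch of that
idea, not the Literature Assistant's actual database design, which the
announcement does not describe. The names LEMMAS, CORPUS, build_index, and
concordance are hypothetical, and the tiny lemma table merely stands in for
the verb database.

```python
import re
from collections import defaultdict

# Hypothetical form-to-infinitive table (a stand-in for the verb database).
LEMMAS = {
    "gusta": "gustar",
    "gustan": "gustar",
    "gustaba": "gustar",
    "lee": "leer",
    "leo": "leer",
}

# Hypothetical mini-corpus: (author, title, sentence).
CORPUS = [
    ("Autor A", "Obra 1", "Me gusta el libro."),
    ("Autor B", "Obra 2", "Ella lee el libro."),
    ("Autor B", "Obra 2", "Nos gustaba leer."),
]


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def lemma_of(word):
    """Map a conjugated form to its infinitive; other words map to themselves."""
    word = word.lower()
    return LEMMAS.get(word, word)


def build_index(corpus):
    """Build a mapping from each lemma to the ids of sentences containing it."""
    index = defaultdict(list)
    for sent_id, (_author, _title, sentence) in enumerate(corpus):
        # A set ensures each sentence is listed once per lemma.
        for lemma in {lemma_of(tok) for tok in tokenize(sentence)}:
            index[lemma].append(sent_id)
    return index


def concordance(word, index, corpus):
    """Return every (author, title, sentence) that uses the same word or verb."""
    return [corpus[i] for i in index.get(lemma_of(word), [])]
```

Selecting any form of a verb then retrieves the sentences for all of its
forms: under these assumptions, concordance("gustan", build_index(CORPUS),
CORPUS) returns the sentences containing 'gusta' and 'gustaba', each with
the author and title needed to jump to that work.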

The second screen (Sentence Screen) permits corpus searches. For example,
the query 'gustar(se *)' will find all forms of the reflexive 'se' followed
by any conjugated form of 'gustar'. All sentences (and their title and
author) are shown that meet the search criteria. A special feature
(Jot-a-Note) is provided which makes it easy to generate a textual
commentary on any item observed on any screen. This output file can then
be processed in any text editor.
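
To make the kind of match performed by a query such as 'gustar(se *)' more
concrete, here is a rough sketch of the matching step only. It does not
parse the program's query syntax, which is not documented in this
announcement. It assumes that "all forms of the reflexive 'se'" means the
pronouns me, te, se, nos, and os, and the names REFLEXIVE_PRONOUNS, LEMMAS,
matches_pronoun_verb, and search are hypothetical.

```python
import re

# Assumption: the pronoun forms covered by "all forms of the reflexive 'se'".
REFLEXIVE_PRONOUNS = {"me", "te", "se", "nos", "os"}

# Hypothetical form-to-infinitive table (a stand-in for the verb database).
LEMMAS = {
    "gusta": "gustar",
    "gustan": "gustar",
    "gustó": "gustar",
    "gustaba": "gustar",
}


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def matches_pronoun_verb(sentence, infinitive):
    """True if a pronoun is immediately followed by a form of the verb."""
    tokens = tokenize(sentence)
    for first, second in zip(tokens, tokens[1:]):
        if first in REFLEXIVE_PRONOUNS and LEMMAS.get(second) == infinitive:
            return True
    return False


def search(corpus, infinitive):
    """Return the (author, title, sentence) entries that satisfy the query."""
    return [
        (author, title, sentence)
        for author, title, sentence in corpus
        if matches_pronoun_verb(sentence, infinitive)
    ]
```

This sketch scans every sentence; a faster variant could first narrow the
candidates through a lemma index and apply the adjacency test only to those.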

It is immediately clear that our Parser/Tagger is only 90-95% accurate at
this point, and that our Dictionary is too small to do proper justice to
these kinds of texts. Nonetheless, we believe that this is an interesting
approach not only to corpus linguistics, but also to making Spanish
literature more accessible and interactive.

Linguistic Field(s): Text/Corpus Linguistics
 