UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

UAM CorpusTool: Text Annotation for the 21st Century...and it's free!

http://www.wagsoft.com/CorpusTool/

The UAM CorpusTool is a state-of-the-art environment for annotation of text corpora. So, whether you are annotating a corpus as part of a linguistic study, or building a training set for use in statistical language processing, this is the tool for you.

Download here: http://www.wagsoft.com/cgi-bin/getCorpusTool.cgi

If it doesn't output the interlinked coding result to an XML file, you are confined to the program itself.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

UAM CorpusTool is quite easy to use. However, comparatively speaking, MMAX2 is much better at handling multi-layered, inter-linked and richly annotated data.

MMAX Annotation Tool
http://www.corpus4u.org/showthread.php?t=1621

[GOOD NEWS] MMAX Annotation Tool now FREE
http://www.corpus4u.org/showthread.php?t=3079

MMAX2 works fine, but it's not clear 1) how to build sub-categories (e.g. nominal_semantics -> human/animate/object); 2) how to export the result to an XML file.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

What makes people less confident about using MMAX is not knowing how to prepare their annotation schemes and how to extract/query the annotation results. If you take a while to look at its manual and try it out with your own data, you'll find out how to make it a handy tool.

Question 1: how to prepare an annotation scheme of hierarchical categories/features for use with MMAX.

Answer: Simple XML code in the scheme file defines the relation between nodes, for example:

<attribute id="level_5" name="coref_type" text="Is the expression an anaphor, bridging expression, or none of both?">
<value id="value_21" name = "none"/>
<value id="value_22" name = "Ident"/>
<value id="value_26" name = "anaphoric" next="anaphoric_type"/>
</attribute>

<attribute id="anaphoric_type" name="anaphoric_type" text="Sub-categories of relation type between the anaphor and its antecedent">
<value id="value_28" name = "none"/>
<value id="value_29" name = "direct"/>
<value id="value_30" name = "pronominal"/>
<value id="value_31" name = "IS-A"/>
<value id="value_32" name = "other"/>
</attribute>


This defines a tree structure of coreference types:
None
Ident
Anaphoric


Under Anaphoric, you'll have five sub-categories:
none
direct
pronominal
IS-A
other
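
To check what hierarchy a hand-edited scheme actually defines, the next= links can be followed with a few lines of Python. This is only a sketch: the <annotationscheme> wrapper element, the trimmed text= attributes and the function name scheme_tree are assumptions made for the example, and next= is assumed to point at the id of another attribute, as it does in the scheme above.

```python
import xml.etree.ElementTree as ET

# The two attributes from the post, wrapped in a root element so that the
# fragment parses as one document. The wrapper name is an assumption.
SCHEME = """
<annotationscheme>
<attribute id="level_5" name="coref_type">
<value id="value_21" name="none"/>
<value id="value_22" name="Ident"/>
<value id="value_26" name="anaphoric" next="anaphoric_type"/>
</attribute>
<attribute id="anaphoric_type" name="anaphoric_type">
<value id="value_28" name="none"/>
<value id="value_29" name="direct"/>
<value id="value_30" name="pronominal"/>
<value id="value_31" name="IS-A"/>
<value id="value_32" name="other"/>
</attribute>
</annotationscheme>
"""


def scheme_tree(xml_text, top_id):
    """Return indented value names for attribute `top_id`, following next= links."""
    root = ET.fromstring(xml_text)
    by_id = {attr.get("id"): attr for attr in root.iter("attribute")}
    lines = []

    def walk(attr_id, depth, seen):
        if attr_id in seen or attr_id not in by_id:
            return  # guard against cycles and dangling next= targets
        for value in by_id[attr_id].findall("value"):
            lines.append("  " * depth + value.get("name"))
            target = value.get("next")
            if target:
                walk(target, depth + 1, seen | {attr_id})

    walk(top_id, 0, frozenset())
    return lines


print("\n".join(scheme_tree(SCHEME, "level_5")))
```

The five values of anaphoric_type are printed indented under "anaphoric", while "none" and "Ident" stay at the top level.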



Question 2: how to export/query the annotation result

Answer: The result can be easily converted from MMAX markables to standard XCES XML files, which can then be indexed with Xaira or other corpus tools for querying. For advanced queries, an in-house, tailor-made query package works better. I suggest trying our SCoRE Corpus online query package to get some ideas. Here is how to get access to it:

1. Web: http://score.crpp.nie.edu.sg/score/corpora.htm
2. Click: Corpus of Classroom Interactions;
3. Register online (if you haven't registered with us before) to get a user ID and password immediately;
4. Log in to Corpus of Classroom Interactions;
5. Click Continue to query;
6. Choose DISCOURSE QUERY, simple query;
7. Select English, Continue to query;
8. Select some (sub-)categories to submit your query.

A gentle reminder: this is strictly for demo purposes only!
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

On Question 1 (hierarchical schemes): the Wizard doesn't provide this option. You can keep adding "levels", but they are all parallel levels rather than hierarchical.

As a text file this is easy to modify, but where is this info stored?


On Question 2 (exporting the result): what I need is the markables in XML format, and again this is nowhere to be found. There is no menu function to export it anywhere in the program. It seems one needs a style sheet and has to export from the command console? Post-indexing is a different matter.

On the SCoRE online query package: that's a very nice application, but for general purposes the user would want the basic XML file of the coding results as a first step. I wouldn't worry too much about the query aspect at this stage.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Is there a manual for this program? I saw a Quick Start guide by the author, but that file stops short of saying how to make the categories hierarchical and how to export the coded results in XML format. Any pointers would be appreciated.

(PS: Since with UAM CorpusTool export is not an option, I don't consider it to be that useful, even though it is much more intuitive.)
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

The current MMAX Project Wizard just sucks; you should not waste your time on it, and that's why we developed our own scheme designer instead. It doesn't matter if you don't have the scheme designer: you can still open the scheme file in the Scheme folder with a text editor and code it manually. As my earlier example shows, simply adding next="xxxxxx" to a value makes the scheme hierarchical (see <value id="value_26" name = "anaphoric" next="anaphoric_type"/>).

As far as exporting XML is concerned, MMAX has no such feature, because many layers may be annotated over the same basedata, and even within a single layer annotations can be embedded in one another, which standard XML has no good way to handle (let's discuss this in another thread; I'll be sending a paper on it to the AAACL conference soon). Basically, the Basedata folder holds the word list file, and the Markables folder holds a set of markable files (one per scheme, if you have several). Each annotated markable file is structured as below:

<?xml version="1.0"?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/coref">
<markable id="markable_404" span="word_3112..word_3113" refer_to="empty" coref_chain="set_262" />
<markable id="markable_266" span="word_2371..word_2372" coref_type="ident" refer_to="markable_145" minimal="levels" coref_chain="set_325" />
<markable id="markable_527" span="word_3584..word_3587" coref_type="ident" refer_to="empty" minimal="upregulation" coref_chain="set_401" />
........
</markables>
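
Since each markable is just an XML element with a span and a few attributes, a first "query" can be done with a short script before any query package is involved. The following is a sketch only, using the three markables shown above; the helper names local_name, expand_span and chains are made up for the example, and it assumes word ids always have the form word_N.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MARKABLES = """
<markables xmlns="www.eml.org/NameSpaces/coref">
<markable id="markable_404" span="word_3112..word_3113" refer_to="empty" coref_chain="set_262" />
<markable id="markable_266" span="word_2371..word_2372" coref_type="ident" refer_to="markable_145" minimal="levels" coref_chain="set_325" />
<markable id="markable_527" span="word_3584..word_3587" coref_type="ident" refer_to="empty" minimal="upregulation" coref_chain="set_401" />
</markables>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def expand_span(span):
    """'word_3..word_5' -> ['word_3', 'word_4', 'word_5']; 'word_7' -> ['word_7']."""
    ids = []
    for part in span.split(","):  # comma-separated parts are expanded one by one
        first, _, last = part.partition("..")
        low = int(first.rsplit("_", 1)[1])
        high = int(last.rsplit("_", 1)[1]) if last else low
        ids.extend(f"word_{i}" for i in range(low, high + 1))
    return ids


def chains(xml_text):
    """Group markables by coref_chain: {chain id: [(markable id, word ids), ...]}."""
    grouped = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) == "markable":
            entry = (element.get("id"), expand_span(element.get("span")))
            grouped[element.get("coref_chain")].append(entry)
    return dict(grouped)


for chain_id, members in chains(MARKABLES).items():
    print(chain_id, members)
```

The word ids returned here are the keys to look up in the word list file in the Basedata folder.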


The word spans in each markable file map onto the word list in the Basedata folder (this is the advantage of stand-off annotation). With a converter (not bundled with MMAX; we've developed our own) we can easily get a standard XML result like the one below:


<?xml version="1.0" encoding="UTF-8"?>
<?xml:stylesheet type="text/xsl" href="coref-table.xsl"?>
<DOC>
<articleinfo>
<bibliomisc>MEDLINE:pMC_1064895</bibliomisc>
</articleinfo>
<s> Increased <COREF ID="1" MIN="production"> interleukin-17 production </COREF> via a phosphoinositide 3-kinase/Akt and nuclear factor κB-dependent pathway in <COREF ID="2" MIN="patients"> patients with <COREF ID="3"> rheumatoid arthritis </COREF> </COREF> </s>
<s> Inflammatory mediators have been recognized as being important in <COREF ID="4" MIN="pathogenesis"> the pathogenesis of <COREF ID="5" REF="3" TYPE="ident"> rheumatoid arthritis (RA) </COREF> </COREF> . <COREF ID="6"> Interleukin (IL)-17 </COREF> is an important regulator of immune and inflammatory responses, including the induction of <COREF ID="7" MIN="cytokines"> proinflammatory cytokines </COREF> and <COREF ID="8" MIN="resorption"> osteoclastic bone resorption </COREF> .
......
</DOC>
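
For readers without such a converter, the basic idea can be sketched in Python. This is not the converter mentioned above, only a minimal illustration under stated assumptions: the word list is assumed to be a <words> file with <word id="word_N"> elements, the sample words and markables are invented, only contiguous spans are handled, and markables are assumed to nest properly rather than cross. The function name to_inline is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Invented sample data; the <words>/<word> layout is an assumption.
WORDS = """
<words>
<word id="word_1">patients</word>
<word id="word_2">with</word>
<word id="word_3">rheumatoid</word>
<word id="word_4">arthritis</word>
</words>
"""

MARKABLES = """
<markables xmlns="www.eml.org/NameSpaces/coref">
<markable id="markable_1" span="word_1..word_4" minimal="patients" />
<markable id="markable_2" span="word_3..word_4" />
</markables>
"""


def word_number(word_id):
    """'word_12' -> 12"""
    return int(word_id.rsplit("_", 1)[1])


def to_inline(words_xml, markables_xml):
    """Merge stand-off markables into one inline <s> element (contiguous spans only)."""
    words = [(word_number(w.get("id")), w.text or "")
             for w in ET.fromstring(words_xml).iter("word")]
    opens, closes = {}, {}
    for element in ET.fromstring(markables_xml).iter():
        if not element.tag.endswith("markable"):
            continue  # skips the <markables> root element
        first, _, last = element.get("span").partition("..")
        start, end = word_number(first), word_number(last or first)
        attrs = f" ID={quoteattr(element.get('id'))}"
        if element.get("minimal"):
            attrs += f" MIN={quoteattr(element.get('minimal'))}"
        opens.setdefault(start, []).append((end, f"<COREF{attrs}>"))
        closes[end] = closes.get(end, 0) + 1
    out = []
    for number, text in words:
        # Longest markable first, so that outer elements open before inner ones.
        for _, tag in sorted(opens.get(number, []), key=lambda item: -item[0]):
            out.append(tag)
        out.append(escape(text))
        out.extend(["</COREF>"] * closes.get(number, 0))
    return "<s> " + " ".join(out) + " </s>"


print(to_inline(WORDS, MARKABLES))
```

If two markables cross instead of nesting, the output is still well-formed but no longer represents the original spans faithfully, which is exactly the case stand-off annotation is meant to handle.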


For simple annotation of just a few files, it's surely too early to talk about how to query the annotation results. Once more than a dozen files have been annotated, a query package is definitely needed.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Can UAM CorpusTool process Chinese corpora?
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

On opening the scheme file in a text editor and adding next="xxxxxx" to make it hierarchical: that's easy enough.

On MMAX having no XML export feature, so that a separate converter (not bundled with MMAX) is needed: I don't understand why such a necessary final step is not provided by the programmer when you develop such a useful annotation program.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

MMAX is designed for standoff annotation, which uses standoff pointers to link all the markables to the same basedata, and it is also designed for multi-layer or multidimensional annotation. Suppose one wants to annotate the same text for different linguistic features at the same time, for instance,

Word level: POS, Semantic, etc.
Phrase/sentence level: sentence type, clause complex, theme, rheme, etc.
Discourse level: IRF, Topic Related Sets, Move structures, etc.

It would be very difficult for a single standard XML file to mark up all of this annotated information. For a better idea of the problem and its solutions, I suggest reading these two pages:

A brief read (please note points 2 and 4):
Multi-dimensional Markup Frequently Asked Questions
http://ilps.science.uva.nl/nlpxml2006/faq.html

A larger scale example of this problem can be found in Durusau & O'Donnell's XML Europe 2002 paper: Concurrent Markup for XML Documents
http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-03-07/03-03-07.html
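
The difficulty described above can be made concrete with a small check. Two layers over the same words may contain spans that overlap without one containing the other, and such spans cannot both be ordinary elements in a single XML tree. The sketch below uses invented word-index spans, and the function name crossing_pairs is made up for the example.

```python
def crossing_pairs(layer_a, layer_b):
    """Return pairs of spans (one from each layer) that overlap without nesting."""
    pairs = []
    for a_start, a_end in layer_a:
        for b_start, b_end in layer_b:
            overlap = a_start <= b_end and b_start <= a_end
            a_in_b = b_start <= a_start and a_end <= b_end
            b_in_a = a_start <= b_start and b_end <= a_end
            if overlap and not (a_in_b or b_in_a):
                pairs.append(((a_start, a_end), (b_start, b_end)))
    return pairs


# Hypothetical inclusive word-index spans over the same ten-word basedata.
clauses = [(1, 6), (7, 10)]   # phrase/sentence level
topic_sets = [(4, 8)]         # discourse level

print(crossing_pairs(clauses, topic_sets))
```

Both clauses cross the topic span here, so no single nesting of elements can hold the two layers. Keeping each layer in its own markable file avoids the problem.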

Hope this is of help.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Again that's fair enough. But the fact that you have to design a program to put them together for indexing and further processing shows that 1) annotating in separate tracks is never the real end of any annotation process; and 2) putting different levels together is something doable.

Anyway, thanks for the clarification and helpful information.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Standoff annotation is relatively new, arising from the limitations of standard XML, and how to index and query standoff-annotated data has recently become a very active research topic.
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

Does anyone have an example of Chinese text annotated with it that they could share with everyone? Thanks!
 
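Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

To use MMAX, it seems you need a fairly solid grounding in XML. Would people with a computer science background find this annotation software easier to use?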
 
Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

The latest version, 2.4.2, is very unstable... Files that I can open with 2.4.0 won't open in 2.4.2; the program just hangs...
 
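Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

1. <segment features='df' state='active'>*******</segment>

Every segment tag carries state='active', and the value is the same for all of them. Isn't this attribute redundant?
2. Text that has already been marked up, e.g. text that already contains XML <> tags, cannot be annotated again with this software... My source text contains some <> tags, and after annotating it with this tool the original <> were all replaced with &lt;/p&gt; ... The options also only allow plain text to be selected...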
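
The replacement described in point 2 looks like ordinary XML escaping of input that is treated as plain text. Whether UAM CorpusTool does exactly this internally is an assumption, but the effect, and the fact that it is reversible, can be reproduced with Python's standard library:

```python
from xml.sax.saxutils import escape, unescape

# A made-up source line that already contains markup.
original = "<p>Some text</p>"

# Treated as plain text, the angle brackets are escaped before being stored.
stored = escape(original)
print(stored)

# The original markup is not lost; unescaping brings it back.
restored = unescape(stored)
print(restored)
```

If that assumption holds, one workaround would be to unescape the exported text afterwards, or to strip the existing tags before annotation and keep them in a separate file.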
 
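Re: UAM CorpusTool: Text Annotation for the 21st Century...and it's free

I looked into MMAX today (thanks to laolong for the tireless recommendations) and found that its GUI offers no conversion from standoff annotation to inline annotation. Judging from the samples it ships with, one possible solution would be to write your own XSLT (making full use of the files under /org/eml/MMAX2/discourse/ in the MMAX2.jar package, especially the API provided by MMAX2DiscourseLoader.class).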
 