Building a computer-assisted tagging tool with Word: no programming needed

wzli

At Dr Xu Jiajin's suggestion, I am posting the following to share. I hope it helps a little.
Building a computer-assisted tagging tool with MS Word
Li Wenzhong (李文中)

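1. Tagging scheme and tagset. Suppose you have already prepared an annotation scheme (it may be based on your own theoretical framework, or it may come from a pilot study of the corpus), consisting of tag names and their corresponding codes, e.g. [fm1] for a spelling error. The codes can be elaborate or very simple, depending on your needs. The finished annotation scheme is called a "tagset". It is best to carry out a pilot analysis before designing it, to do trial tagging once it is designed, and to start formal tagging only after confirming that it works. In addition, the meaning and format of each code, the problems encountered and their solutions, and every revision and discussion should be recorded in writing as you go (documentation) for later reference. Finally, every technical detail of the tagging stage, such as file format, tag position, examples, orthography, symbols, and file names, must be strictly specified, so that a single tagger stays consistent over time and multiple taggers stay consistent with one another. This is especially important for team work. If you have designed many codes, you may need to group them into a few broad categories, such as morphological errors, verb errors, noun errors, and syntactic errors. The purpose of a computer-assisted tagging tool is to avoid mistakes in manual typing and to reduce the burden of memorising and looking up codes. If you have designed 50 codes, each with its own meaning, remembering them all is very troublesome.
The goal is that, when entering tags, you do not have to memorise the codes and only need to understand what they mean; nor do you have to type each code by hand every time. To enter a spelling-error tag, for example, you need not remember the code [fm1]; you only need to know that there is a "spelling error" tag and let the computer insert it automatically.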
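For readers who think in code, the idea of picking a tag by its meaning and letting the machine supply the code can be sketched as a simple lookup. This is only an illustration outside Word, not part of the tool described in this post. Only [fm1] (spelling error) comes from the text above; the other names, codes, categories, and the sample sentence are invented, and `code_for` and `insert_tag` are hypothetical helpers.

```python
# Hypothetical tagset grouped by category. Only "[fm1]" (spelling error)
# is taken from the post; every other entry is invented for illustration.
TAGSET = {
    "Morphological errors": {
        "spelling error": "[fm1]",
        "word formation error": "[fm2]",
    },
    "Verb errors": {
        "tense error": "[vp1]",
    },
}


def code_for(name: str) -> str:
    """Return the code for a human-readable tag name, whatever its category."""
    for entries in TAGSET.values():
        if name in entries:
            return entries[name]
    raise KeyError(f"unknown tag name: {name}")


def insert_tag(text: str, position: int, name: str) -> str:
    """Insert the code for `name` into `text` at character offset `position`."""
    return text[:position] + code_for(name) + text[position:]


tagged = insert_tag("I recieve letters.", 9, "spelling error")
print(tagged)  # I recieve[fm1] letters.
```

The annotator only ever deals with the name "spelling error"; the code itself is never typed by hand.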
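2. Procedure. Open MS Word and click View → AutoText (视图→自动图文集) in the menu bar to select it; the AutoText toolbar should now appear in Word.
3. Type the code you want to insert, highlight it, then click AutoText → New and type a name for the code. It is best to use a Chinese name whose meaning is obvious; "拼写错误" (spelling error), for example, is easier to understand than "fm1". To make your entries easier to find, it is best to delete everything that was already in AutoText.
4. Repeat step 3 until every code has been made into an AutoText entry.
5. Click Tools → Customize → Toolbars → New (工具→自定义→工具栏→新建). A new toolbar appears in the window; you can name it "我的标记栏" (My Tag Bar).
6. Click Commands → New Menu (命令→新菜单), drag New Menu from the right-hand pane onto the new toolbar, double-click the new menu, and enter your category name in the Name box, e.g. "Morphological errors". This name does not correspond to any code; it only serves as a category.
7. Repeat step 6 until every category has an empty menu on the new toolbar.
8. Click Commands → AutoText (命令→自动图文集), find the tag names you created in the right-hand pane, and drag each one into the menu for its category.
9. Repeat step 8 until all the tag codes have been placed in the menus.
10. Close the Customize dialog. The menus are now ready to use. When you do not need the toolbar, untick it under View → Toolbars (视图→工具栏).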

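Once the tagging toolbar has been made, it is saved automatically in the template file normal.dot. To share the tool with others, simply copy that template file to the corresponding location in their Word folder, replacing the original template file.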
 
VIPs are VIPs, not aliens.
In fact, many VIPs, not aliens, have landed on this soil. You just don't know it.
Their participation has made our discussions more interesting and insightful, but they all come to discuss with us on an equal footing. There is no need to panic or be nervous.
 
I edited your post title so that it displays on one line in the updated-posts list on the front page of corpus4u. Sorry for not telling you beforehand.
 
Attached below is the Word tagger toolbar used by wzli for CLEC.

2005080223403728.jpg
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

This is a great tool.

A couple of questions, though.

1) What does the final result look like?

Original:

我每天看新闻。

After tagging:

我每天看新闻。动宾结构 ???
(动宾结构 = verb-object construction)

One would imagine that some sort of hypertext style marking needs to
be used to differentiate the text from the tag, for example:

我每天看新闻。<动宾结构> or: <SYN 动宾结构 /SYN>

which can be done rather easily with a modification of the tag, from

动宾结构 --> <动宾结构> or <SYN 动宾结构 /SYN>

However, a more serious problem is the range indication: how do we
know that this tag covers 看新闻 rather than, say, 我每天看? That is,

2) How do we mark up a stretch of text rather than simply insert a string of
symbols? An example of this would be:

我每天<SYN 动宾结构>看新闻</SYN 动宾结构>。

where the opening and closing brackets indicate a domain.

I wonder if Word is capable of inserting both the opening and the closing sequence
when the user highlights a stretch of text. Other text processors, such as NoteTab,
can do this rather easily. If anyone is interested I'll make a post on NoteTab.
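As a rough illustration of the range marking asked about above, here is a minimal Python sketch that encloses a selected span in an opening and a closing sequence. It works on a plain string outside Word and says nothing about how Word or NoteTab would do it; `wrap_span` and its parameters are hypothetical names, and the tag format simply copies the example given in this post.

```python
def wrap_span(text: str, start: int, end: int, tag: str, label: str) -> str:
    """Enclose text[start:end] in an opening and a closing tag."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("span out of range")
    opening = f"<{tag} {label}>"
    closing = f"</{tag} {label}>"
    return text[:start] + opening + text[start:end] + closing + text[end:]


sentence = "我每天看新闻。"
start = sentence.index("看新闻")  # stands in for the user's highlighted selection
tagged = wrap_span(sentence, start, start + len("看新闻"), "SYN", "动宾结构")
print(tagged)  # 我每天<SYN 动宾结构>看新闻</SYN 动宾结构>。
```

The start and end offsets play the role of the highlighted selection, so the domain of the tag is explicit.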
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

Quoting xujiajin (2005-8-2 23:38:48):
Attached below is the Word tagger toolbar used by wzli for CLEC.

2005080223403728.jpg


This means you have to annotate your corpus manually?
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

In reply to 动态语法's post of 2005-8-3 0:04:17:


That depends on the format in which you save your corpus files. You will see the tags mixed in with the text if you save your files as plain text. The tags will be filtered out by the concordancer if you save your corpus files as XML.
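The kind of filtering meant here can be sketched in a few lines of Python. This is only an assumption about what a concordancer does when it hides markup, not the behaviour of any particular program; `TAG_PATTERN` and `running_text` are hypothetical names.

```python
import re

# Matches anything between angle brackets, e.g. <SYN> or </SYN>.
TAG_PATTERN = re.compile(r"<[^<>]+>")


def running_text(tagged: str) -> str:
    """Return the text with all angle-bracket tags removed."""
    return TAG_PATTERN.sub("", tagged)


print(running_text("我每天<SYN>看新闻</SYN>。"))  # 我每天看新闻。
```

Note that this only works when the tags are delimited: a bare label typed straight into the text has no brackets to match and is left where it is.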
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

Quoting oscar3 (2005-8-3 0:17:47):
This means you have to annotate your corpus manually?

Unfortunately no machines can do this sort of thinking and understanding for humans at the moment.
 
Even if the files are saved as XML, there is still the problem of start/end tags.
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

You mean even if you have something like 我每天看新闻。动宾结构
XML will filter the tag out?


Quoting oscar3 (2005-8-3 0:26:03):

That depends on the format in which you save your corpus files. You will see the tags mixed in with the text if you save your files as plain text. The tags will be filtered out by the concordancer if you save your corpus files as XML.
 
No. Here is the XML syntax:

An XML element with start/end tags: <tag>XXX</tag>

An XML element with start/end tags and an index: <tag type="1">XXX</tag>

An empty XML element: <tag/>

Can the AutoText tool add the start and end tags so that they enclose XXX? If not, it appears easier to use empty XML elements.
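To see how the three forms above behave once parsed, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element names `s`, `tag`, and `mark` and the sample content are placeholders, not part of any real annotation scheme.

```python
import xml.etree.ElementTree as ET

# One element with start/end tags and an attribute, plus one empty element.
# The element names "s", "tag" and "mark" are placeholders only.
doc = '<s>AAA <tag type="1">XXX</tag> BBB<mark/> CCC</s>'
root = ET.fromstring(doc)

paired = root.find("tag")
print(paired.text, paired.get("type"))  # XXX 1

empty = root.find("mark")
print(empty.text)  # None: an empty element marks a point, not a span

# itertext() yields only the character data, so the markup is dropped.
print("".join(root.itertext()))  # AAA XXX BBB CCC
```

The paired element carries its span with it, whereas the empty element only records a position, which is why it is easier to insert but says nothing about the range it covers.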
 
Such a tool (my word-tagger) is useful for manual tagging, where only the human being knows what to insert and where. POS tagging and parsing, by contrast, can be done automatically by software. If you need to concentrate on certain specific features and want to dig them out of the text, you will probably have to 'seek and mark' them for later concordancing and frequency counting.
If the computer does not know which segment of the text is 'dongbin jiegou' (a verb-object construction), it can do nothing unless you tell it where to insert the opening tag and where to insert the closing one.
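Once features have been marked by hand, the frequency counting mentioned above is straightforward. A minimal sketch, assuming bracketed codes of the [fm1] kind: only [fm1] comes from this thread, while [vp1], the pattern, and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

# Bracketed codes such as [fm1]. Only [fm1] is taken from the thread;
# [vp1] and the sample sentences are made up.
CODE_PATTERN = re.compile(r"\[[a-z]+\d+\]")

sample = "I recieve[fm1] letters. He go[vp1] home. They beleive[fm1] it."
counts = Counter(CODE_PATTERN.findall(sample))
print(counts.most_common())  # [('[fm1]', 2), ('[vp1]', 1)]
```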
 
This kind of tool is particularly useful in tagging learner errors, as the machine is in no way able to recognise learner errors automatically (except spelling errors). Such tools can provide an interactive human-machine interface to speed up the semi-automatic tagging process.

As far as English is concerned, structures like the V-N constructions can be extracted quite reliably using POS tagged data.
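A crude sketch of what such extraction might look like, assuming the text has already been POS tagged: the tag labels, the toy sentence, the `verb_noun_pairs` helper, and the window size are all illustrative assumptions, and a real study would use a proper tagger and a more careful pattern.

```python
# Hand-tagged toy sentence as (word, POS) pairs; the tag labels are illustrative.
tagged = [("I", "PRON"), ("watch", "VERB"), ("the", "DET"), ("news", "NOUN"),
          ("every", "DET"), ("day", "NOUN")]


def verb_noun_pairs(tokens, max_gap=2):
    """Pair each verb with the first noun that follows it with at most
    max_gap tokens in between."""
    pairs = []
    for i, (word, pos) in enumerate(tokens):
        if pos != "VERB":
            continue
        for later_word, later_pos in tokens[i + 1 : i + 2 + max_gap]:
            if later_pos == "NOUN":
                pairs.append((word, later_word))
                break
    return pairs


print(verb_noun_pairs(tagged))  # [('watch', 'news')]
```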
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

In reply to 动态语法's post of 2005-8-3 0:04:17:

Please do post about NoteTab.
 
wzli, would you please share your tagger with corpus4u users here, perhaps in a simplified demo form or as a template, so as to protect your property rights?
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

I mean that the concordancer will filter out the tags, and you will read running text instead of text mixed with various tags.
By the way, it is really enjoyable to have discussions with VIPs.



Quoting 动态语法 (2005-8-3 0:29:50):
You mean even if you have something like 我每天看新闻。动宾结构
XML will filter the tag out?


Quoting oscar3 (2005-8-3 0:26:03):

That depends on the format in which you save your corpus files. You will see the tags mixed in with the text if you save your files as plain text. The tags will be filtered out by the concordancer if you save your corpus files as XML.

[This post was edited by the author on 2005-08-03 at 09:13:18]
 
Re: Building a computer-assisted tagging tool with Word: no programming needed

Quoting xiaoz (2005-8-3 5:11:42):
This kind of tool is particularly useful in tagging learner errors, as the machine is in no way able to recognise learner errors automatically (except spelling errors). Such tools can provide an interactive human-machine interface to speed up the semi-automatic tagging process.

Agreed. It should be useful for many purposes.

As far as English is concerned, structures like the V-N constructions can be extracted quite reliably using POS tagged data.

The V-N structure is just an example (maybe a bad one) to show the domain issue.
One can easily come up with many cases that raise domain issues (some quick
ones include alienable/inalienable possession,
double modifiers, coordinated objects, etc.), but whether handling them should be
required of this convenience tool is another question.
 