How do I remove line numbers and markup when searching the LOB and BROWN corpora?

frankfrank1985 · 2008-04-30

Why do the search results contain strings of letters and digits such as L07 0930?

frankfrank1985 · 2008-04-30

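Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

Is nobody going to answer? This is urgent... could an expert please help?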

xiaoz · 2008-04-30

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

Please be specific in your question: are you using an online concordancer, or are you using the corpora on your own machine, where you have control over the text?

frankfrank1985 · 2008-04-30

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

I am using WordSmith 4.0, but after concordancing, all the line numbers are retained, such as L040. How can I remove all these nuisances? Is there any other software for concordancing LOB and BROWN?

laohong · 2008-04-30

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

They are not nuisances. If you don't want them there, simply Find & Replace them to wipe them off before doing concordances with WordSmith. Here is how to Find & Replace them:

1. You need EditPlus to help you with this. You can get an evaluation version at: http://www.editplus.com.

2. Open the files (e.g. all the 15 LOB files) with EditPlus (you'd better back up your files first);

3. In the menu, click Search, Replace, type ^[a-z0-9]+[ ]+[0-9]+[ ] in Find what, and leave it empty for Replace with;

4. Check the option Regular expression and All open files, and click Replace all to get all the files ready for you to do your desired "nuisances-free" concordances.

The same applies to the BROWN corpus.

Good luck!
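
For readers who would rather script this than use an editor, here is a minimal Python sketch of the same Find & Replace. It assumes every line of the file begins with a reference such as "L07 0930 "; the sample lines and the function name `strip_line_refs` are made up for illustration and are not taken from the corpora. Note that a text line without a reference that happens to begin with a word followed by a number would also be trimmed, so check the output on a backup copy first.

```python
import re

# The pattern from step 3. MULTILINE makes ^ match at the start of every
# line; IGNORECASE mimics a search with "Case Sensitive" unchecked.
LINE_REF = re.compile(r"^[a-z0-9]+[ ]+[0-9]+[ ]", re.IGNORECASE | re.MULTILINE)


def strip_line_refs(text: str) -> str:
    """Delete the leading reference (e.g. 'L07 0930 ') from every line."""
    return LINE_REF.sub("", text)


# Hypothetical sample lines, invented for this example.
sample = "L07 0930 first made-up line\nL07 0940 second made-up line\n"
cleaned = strip_line_refs(sample)
print(cleaned)
```

To process real files, the text could be read with `pathlib.Path.read_text()` and the cleaned result written to a new file, leaving the originals untouched.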

清风出袖 · 2008-05-01

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

laohong wrote:
They are not nuisances. If you don't want them there, simply Find & Replace them to wipe them off before doing concordances with WordSmith. Here is how to Find & Replace them:

1. You need EditPlus to help you with this. You can get an evaluation version at: http://www.editplus.com.

2. Open the files (e.g. all the 15 LOB files) with EditPlus (you'd better back up your files first);

3. In the menu, click Search, Replace, type ^[a-z0-9]+[ ]+[0-9]+[ ] in Find what, and leave it empty for Replace with;

4. Check the option Regular expression and All open files, and click Replace all to get all the files ready for you to do your desired "nuisances-free" concordances.

The same applies to the BROWN corpus.

Good luck!

Thanks a lot, laohong and Dr. Xiao! You two are always the first to come to the rescue of every c-pal with detailed explanations and professional expertise! Happy Labor Day holiday to you both and to the other administrators!

miaoyong@hotmail.com · 2008-05-01

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

Maybe LOB and Brown are not "clean" material: many tags have already been inserted. You could search online and download software developed for LOB; such tools can search these texts effectively.
The same goes for Brown.

frankfrank1985 · 2008-05-02

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

Thank you so much. Is there software specifically designed for LOB and Brown?

frankfrank1985 · 2008-05-02

Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

^[a-z0-9]+[ ]+[0-9]+[ ] seems to have a problem; the results are not right.

oscar3 · 2008-05-02

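Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

frankfrank1985 wrote:
^[a-z0-9]+[ ]+[0-9]+[ ] seems to have a problem; the results are not right.

Which tool did you try it with? I checked, and it works fine in EditPlus 2.31.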

maggie0153 · 2008-05-03

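Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

A question: why doesn't EditPlus support [*] and <*>? Doesn't * stand for any character? And * is part of regular expressions too, isn't it?
What I mean is that I want to remove all the [*] and <*> tags. How should I write the expression? I typed [*], <*> and got no results.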

laohong · 2008-05-03

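Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

maggie0153 wrote:
A question: why doesn't EditPlus support [*] and <*>? Doesn't * stand for any character? And * is part of regular expressions too, isn't it?
What I mean is that I want to remove all the [*] and <*> tags. How should I write the expression? I typed [*], <*> and got no results.

Please read the EditPlus help file.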
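
The likely source of the confusion is that * means "any run of characters" in file-name wildcards, whereas in a regular expression it means "zero or more of the preceding item". The sketch below illustrates this with Python's `re` module rather than EditPlus, whose dialect may differ in details; the sample text is made up.

```python
import re

# Hypothetical text with one tag of each kind and a literal asterisk.
text = "<p> hello [tag] a*b"

# As a regex, [*] is a character class holding a literal asterisk.
literal_star = re.findall(r"[*]", text)

# As a regex, <*> means "zero or more '<' followed by '>'".
odd_angle = re.findall(r"<*>", text)

# Regex versions of the wildcard idea: an opening bracket, any characters
# other than the brackets themselves, then the closing bracket.
square_tags = re.findall(r"\[[^\[\]]*\]", text)
angle_tags = re.findall(r"<[^<>]*>", text)

print(literal_star, odd_angle, square_tags, angle_tags)
```

Neither of the wildcard-style patterns finds the tags; only the last two expressions match "[tag]" and "<p>".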

maggie0153 · 2008-05-03

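Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

frankfrank1985 wrote:
^[a-z0-9]+[ ]+[0-9]+[ ] seems to have a problem; the results are not right.

I think that if your corpus contains capital letters, the beginning should be written like this, right?

^[a-zA-Z0-9]...........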

maggie0153 · 2008-05-03

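Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

laohong wrote:
Please read the EditPlus help file.

laohong, you're here! I'll go and read it right away. But could you show me how to write the expression to remove everything like [] and <>? I spent ages on it yesterday without success. Thanks.

I also found that Brown is fairly easy to handle, because its codes all appear at the start of a paragraph! But what about something like CLEC, where codes appear in the middle of a paragraph?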

maggie0153 · 2008-05-03

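Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

Actually, many concordancers come with a filter function, and the filtered results contain no codes. Can those results be saved back to a txt file?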

laohong · 2008-05-03

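Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

maggie0153 wrote:
I think that if your corpus contains capital letters, the beginning should be written like this, right?

^[a-zA-Z0-9]...........

That is unnecessary unless you select "Case Sensitive".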
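
The same point can be sketched in Python, where searches are case-sensitive by default, so the equivalent of leaving "Case Sensitive" unchecked is the `re.IGNORECASE` flag. The sample line is made up for illustration.

```python
import re

pattern = r"^[a-z0-9]+[ ]+[0-9]+[ ]"
line = "L07 0930 made-up text"  # hypothetical line with an upper-case prefix

# Case-sensitive search: [a-z0-9] does not match "L", so nothing is removed.
case_sensitive = re.sub(pattern, "", line)

# Case-insensitive search: the original pattern is enough.
case_insensitive = re.sub(pattern, "", line, flags=re.IGNORECASE)

# Alternative: keep the search case-sensitive but widen the character class.
explicit_class = re.sub(r"^[a-zA-Z0-9]+[ ]+[0-9]+[ ]", "", line)

print(repr(case_sensitive), repr(case_insensitive), repr(explicit_class))
```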

laohong · 2008-05-03

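Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

maggie0153 wrote:
laohong, you're here! I'll go and read it right away. But could you show me how to write the expression to remove everything like [] and <>? I spent ages on it yesterday without success. Thanks.

I also found that Brown is fairly easy to handle, because its codes all appear at the start of a paragraph! But what about something like CLEC, where codes appear in the middle of a paragraph?

The ^ in front of [a-z0-9]+[ ]+[0-9]+[ ] is there to match only strings that fit the expression at the start of a line (paragraph); if you remove the ^, the expression matches such strings anywhere in the text.

To remove all [] tags together with their contents, you can try:
If the [] contain only letters, no spaces: \[[a-z]+\]
If the [] contain both letters and spaces: \[[a-z ]+\]
If the [] contain both letters and digits: \[[a-z0-9]+\]
...
Other symbols can be handled in the same way; for details, please read the regular-expression section of the EditPlus help.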
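
As a rough Python sketch of these unanchored patterns: the example below merges the three [] variants into one character class and handles <> tags in the same spirit. The sample sentence, the tag names in it, and the function name `strip_tags` are invented for illustration; real tags that contain other characters (punctuation, for instance) would need a wider character class.

```python
import re

# No ^ here, so the patterns match tags anywhere in a line.
SQUARE_TAG = re.compile(r"\[[a-z0-9 ]+\]", re.IGNORECASE)
# Assumption: an angle tag is '<', anything except angle brackets, then '>'.
ANGLE_TAG = re.compile(r"<[^<>]+>")
EXTRA_SPACES = re.compile(r" {2,}")


def strip_tags(text: str) -> str:
    """Remove [..] and <..> tags, then collapse the spaces left behind."""
    text = SQUARE_TAG.sub("", text)
    text = ANGLE_TAG.sub("", text)
    return EXTRA_SPACES.sub(" ", text)


# Hypothetical sentence with tags in the middle of the line.
sample = "He [wd3] go to school <p> yesterday."
print(strip_tags(sample))
```

Collapsing the doubled spaces is a design choice, not part of the original advice; drop that step if the exact spacing of the corpus matters.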

laohong · 2008-05-03

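Re: How do I remove line numbers and markup when searching the LOB and BROWN corpora?

maggie0153 wrote:
Actually, many concordancers come with a filter function, and the filtered results contain no codes. Can those results be saved back to a txt file?

It is unrealistic to expect any one piece of software to solve all your problems for you. Learning to do some basic text processing yourself is a must.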
