How to remove tags at one go?

xujiajin · 2005-08-15

Sometimes POS-tagged data affect the concordancing, or in other cases, we need to apply another annotation scheme to the corpus; therefore, we want the texts to be clean and raw.

I tried “<*>” (with wildcards enabled) in MS Word's “Replace”, but this left me with blank documents.

Any idea?

xiaoz · 2005-08-15

Concordancers like WordSmith allow users to ignore such tags.

xujiajin · 2005-08-15

Thanks. But how can I revert the tagged corpus to raw texts?

xiaoz · 2005-08-15

Once removed, these annotation or metadata tags cannot be recovered easily, so make sure to copy the data before this operation.

If you decide to remove them, the method suggested by Jiajin is an easy way for those who do not program, though a few lines of code can do the job for a batch of files.
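As an illustration of the "few lines of code" route, here is a minimal Python sketch. It assumes the tags are written as <...> in plain-text files, that the files are UTF-8, and that nothing in angle brackets is real text; the function names and the .bak backup suffix are made up for this example.

```python
import re
import shutil
from pathlib import Path

# One angle-bracket tag such as <NOUN> or </s>; a match never crosses another < or >.
TAG = re.compile(r"<[^<>]+>")

def strip_tags(text):
    """Return text with every <...> tag removed."""
    return TAG.sub("", text)

def strip_folder(folder, pattern="*.txt", encoding="utf-8"):
    """Strip tags from every matching file in folder, keeping a .bak copy of each."""
    count = 0
    for path in sorted(Path(folder).glob(pattern)):
        # Back up first: removed tags cannot be recovered afterwards.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        raw = path.read_text(encoding=encoding)
        path.write_text(strip_tags(raw), encoding=encoding)
        count += 1
    return count
```

Calling strip_folder("my_corpus") would process every .txt file in that (hypothetical) folder. Try it on a copy of the corpus first.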

xusun575 · 2005-08-15

Re: How to remove tags at one stretch?

Quoting xujiajin's post of 2005-8-15 20:26:34:
Sometimes POS-tagged data affect the concordancing, or in other cases, we need to apply another annotation scheme to the corpus; therefore, we want the texts to be clean and raw.

I tried “<*> (wildcard used)” in MS Word “Replace”, but this attempt provided blank documents finally.

Any idea?

MS Word can do a perfect job for your purpose, but there is no such wildcard as "*" in the "Find/Replace" function. Instead, three special codes can be used: "^?" stands for any character, "^$" for any letter and "^#" for any digit. To do the replacement, deal first with the tags containing the most characters, say three, then move on to two and one. For example, start with the three-character tags, if there are any, by typing "<^?^?^?>" in the "Find what" box and replacing it with a blank space.

xujiajin · 2005-08-15

Thanks a lot for the hints.

If ? stands for any single character, then it must include any numeral.
Actually ? works for any Chinese character or any letter.

It is necessary to replace by a white space for English texts but not for Chinese ones.

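This is what Microsoft Office Word Help says about the use of wildcards:
Wildcards for items you want to find and replace
To find:

Any single character
Type ?
For example, s?t finds “sat” and “set”.

Any string of characters
Type *
For example, s*d finds “sad” and “started”.

The beginning of a word
Type <
For example, <(inter) finds “interesting” and “intercept”, but not “splintered”.

The end of a word
Type >
For example, (in)> finds “in” and “within”, but not “interesting”.

One of the specified characters
Type [ ]
For example, w[io]n finds “win” and “won”.

Any single character in the specified range
Type [-]
For example, [r-t]ight finds “right” and “sight”. Ranges must be in ascending order.

Any single character except those in the range inside the brackets
Type [!x-z]
For example, t[!a-m]ck finds “tock” and “tuck”, but not “tack” or “tick”.

Exactly n occurrences of the previous character or expression
Type {n}
For example, fe{2}d finds “feed” but not “fed”.

At least n occurrences of the previous character or expression
Type {n,}
For example, fe{1,}d finds “fed” and “feed”.

From n to m occurrences of the previous character or expression
Type {n,m}
For example, 10{1,3} finds “10”, “100”, and “1000”.

One or more occurrences of the previous character or expression
Type @
For example, lo@t finds “lot” and “loot”.

Notes
You can use parentheses to group wildcards and text and to indicate the order of evaluation. For example, type “<(pre)*(ed)>” to find “presorted” and “prevented”.
You can use the \n wildcard to search for an expression and then replace it with the rearranged expression. For example, type “(Newton) (Christie)” in the “Find what” box and “\2 \1” in the “Replace with” box; Word will find “Newton Christie” and replace it with “Christie Newton”.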
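For readers who know regular expressions, several of these wildcards have rough counterparts in Python's re module. The sketch below is only an approximate mapping made for illustration, not a statement of how Word implements its search; the helper matching_words is a hypothetical name.

```python
import re

def matching_words(regex, words):
    """Return the words in which the regular expression matches somewhere."""
    return [w for w in words if re.search(regex, w)]

# Word wildcard        rough regex counterpart
# s?t                  s.t
# t[!a-m]ck            t[^a-m]ck
# lo@t                 lo+t
# <(inter)             \binter   (\b marks a word boundary)
# (in)>                in\b
print(matching_words(r"t[^a-m]ck", ["tock", "tuck", "tack", "tick"]))
print(matching_words(r"lo+t", ["lot", "loot", "lt"]))
print(matching_words(r"\binter", ["interesting", "intercept", "splintered"]))
print(matching_words(r"in\b", ["in", "within", "interesting"]))

# Rearranging with groups: \1 and \2 refer to the parenthesised parts.
swapped = re.sub(r"(Newton) (Christie)", r"\2 \1", "Newton Christie")
print(swapped)  # Christie Newton
```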

xujiajin · 2005-08-15

Cont'd
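Search using wildcards
Find and replace using wildcards

For example, you can use the asterisk (*) wildcard to search for a string of characters (“s*d” finds “sad” and “started”).

On the Edit menu, click Find or Replace.
If you do not see the “Use wildcards” check box, click More.
Select the “Use wildcards” check box.
Enter a wildcard in the “Find what” box. Do one of the following:
To choose a wildcard from a list, click Special, click the wildcard you want, and then type any additional text in the “Find what” box.
Type a wildcard directly in the “Find what” box.
If you want to replace the item, type the replacement in the “Replace with” box.
Click Find Next, Replace, or Replace All.
Press Esc to cancel a search in progress.

Notes

When the “Use wildcards” check box is selected, Word finds only text that exactly matches the specified text (note that the “Match case” and “Find whole words only” check boxes are dimmed, indicating that these options are automatically on and cannot be turned off).
To search for a character that is defined as a wildcard, type a backslash (\) before the character. For example, type “\?” to find a question mark.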

xujiajin · 2005-08-15

Cont'd
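Codes that work in the “Find what” or “Replace with” box
To specify:
Paragraph mark
Type ^p (does not work in the “Find what” box when the “Use wildcards” check box is selected) or type ^13
Tab character
Type ^t or type ^9
ASCII character
Type ^nnn, where nnn is the character code
ANSI character
Type ^0nnn, where nnn is the character code
Em dash
Type ^+
En dash
Type ^=
Caret character
Type ^^
Manual line break
Type ^l or type ^11
Column break
Type ^n or type ^14
Page or section break
Type ^12 (when replacing, inserts a page break)
Manual page break
Type ^m (also finds or replaces section breaks when the “Use wildcards” check box is selected)
Nonbreaking space
Type ^s
Nonbreaking hyphen
Type ^~
Optional hyphen
Type ^-
Codes that work only in the “Find what” box (when the “Use wildcards” check box is selected)
Picture or graphic (inline only)
Type ^g
Codes that work only in the “Find what” box (when the “Use wildcards” check box is cleared)
Any character
Type ^?
Any digit
Type ^#
Any letter
Type ^$
Unicode character
Type ^Unnnn, where nnnn is the character code
Picture or graphic (inline only)
Type ^1
Footnote mark
Type ^f or type ^2
Endnote mark
Type ^e
Field
Type ^d
Opening field brace (when field codes are visible)
Type ^19
Closing field brace (when field codes are visible)
Type ^21
Comment
Type ^a or type ^5
Section break
Type ^b
Em space (Unicode)
Type ^u8195
En space (Unicode)
Type ^u8194
White space
Type ^w (any combination of regular spaces, nonbreaking spaces, and tab characters)
Codes that work only in the “Replace with” box
Windows Clipboard contents
Type ^c
Contents of the “Find what” box
Type ^&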

xusun575 · 2005-08-16

Aha! It's great that you showed us the combined use of both "wildcards" and "special characters". You are right that "*" works when "Use wildcards" (使用通配符) is checked. I am only really comfortable with "Special" (特殊字符) when I use the Find/Replace function of MS Word, and I have never tried combining the two. Thank you for the enlightenment, which will perfect my Find/Replace skills. Hopefully, in your case, "Special" alone will do the job.

xujiajin · 2005-08-16

The "Special" list (特殊字符 in the Chinese version) is actually a subset of the so-called code-based search strategies.
http://www.corpus4u.org/upload/forum/2005081600563760.doc

xiaoz · 2005-08-16

It appears that there is a lot to exploit in MS Word and other Office applications like Excel.

xujiajin · 2005-08-16

Excel is a handy tool for statistics and leaves much room to explore corpus-wise. dzhigner seems to be the most authoritative expert on this forum.

动态语法 · 2005-08-16

Re: How to remove tags at one stretch?

Warning:
1) It's always a good idea to make a backup copy of your files!
2) You must make sure that whatever is in <> [] or ( ) are actually
tags and NOT real text.

------

If your tags are marked in this style:

<tag> my text </tag>

To get rid of the tags altogether, under MS Word, select Use Wildcards, then

Find what:

\<*\>

Replace with:

(do nothing)

(See attached screen capture.)

If your tags are marked in the styles of [] or ():

Find what: \[*\] or \(*\)

Replace with:
(do nothing)
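For those working outside Word, the same three removals can be sketched with Python regular expressions. This is an assumed equivalent written for illustration, not a description of Word's own matching; the names PATTERNS and remove_tags are hypothetical. As in the warning above, it presumes everything inside the brackets is a tag and not real text.

```python
import re

# One compiled pattern per bracket style; a match never contains a nested bracket.
PATTERNS = {
    "angle": re.compile(r"<[^<>]*>"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "round": re.compile(r"\([^()]*\)"),
}

def remove_tags(text, style="angle"):
    """Delete every tag of the chosen bracket style (replace with nothing)."""
    return PATTERNS[style].sub("", text)

print(remove_tags("<tag> my text </tag>"))            # prints " my text " without the quotes
print(remove_tags("the[DET] book[NOUN]", "square"))   # the book
print(remove_tags("the(DET) book(NOUN)", "round"))    # the book
```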

xujiajin · 2005-08-16

Timely warnings.

Another reminder (or suggestion):
Use ? instead of the asterisk.
For instance, "Find what: <*>" will remove any string between the two brackets.
e.g.:
I have<MOD> a book, and the<DET> book is bought at the flee market<NOUN>.

"Find what: <*>" does not simply remove <MOD>, <DET> and <NOUN>; it removes "<MOD> a book, and the<DET> book is bought at the flee market<NOUN>.", and as a result only "I" (you yourself) survives the removal.

动态语法 · 2005-08-16

Re: How to remove tags at one stretch?

Quoting xujiajin's post of 2005-8-16 1:42:08:

"Find what: <*>" does not simply remove <MOD>, <DET> and <NOUN>; it removes "<MOD> a book, and the<DET> book is bought at the flee market<NOUN>.", and as a result only "I" (you yourself) survives the removal.

Really? That didn't happen to me. My result was the plain text without any tags:

I have a book, and the book is bought at the flee market.
"Find word: " does not simply , and , it removes " a book, and the book is bought at the flee market.", and as a result you only get "I".

Did you check anything else?

动态语法 · 2005-08-16

Re: How to remove tags at one stretch?

On the other hand, if I use this pattern,

\<?\>

it will only find (and replace) tags like

<x>

where one character is inside <>, which is not always desirable.

动态语法 · 2005-08-16

Re: How to remove tags at one stretch?

Sorry, I misunderstood your post. You were talking about
the pattern: <*>, not \<*\>.

At any rate it's always a good idea to
experiment with a junk file before attempting anything drastic.
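In the same spirit of experimenting on throwaway text first, here is a small Python check of how much a pattern swallows. It illustrates an analogous over-matching pitfall in regular expressions (greedy matching); it is not a model of Word's behaviour, where, per the help text quoted earlier, an unescaped < or > means the start or end of a word. The sample sentence is a shortened, made-up variant of the one in this thread.

```python
import re

sample = "I have<MOD> a book, and the<DET> book is bought at the market<NOUN>."

greedy = re.sub(r"<.*>", "", sample)        # one match from the first < to the last >
lazy = re.sub(r"<.*?>", "", sample)         # each match stops at the nearest >
bounded = re.sub(r"<[^<>]*>", "", sample)   # a match can never contain < or >

print(greedy)   # I have.
print(lazy)     # I have a book, and the book is bought at the market.
print(bounded)  # I have a book, and the book is bought at the market.
```

The bounded form is usually the safest of the three, because a stray < without a closing > cannot make it run across the rest of the line.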

xujiajin · 2005-08-16

The following step-by-step removal works reliably without wildcards (do not check "Use wildcards"):
<^?> one character is inside <>

<^?^?> two characters are inside <>

<^?^?^?> three characters are inside <>

.
.
.
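The step-by-step idea can be emulated in Python to see how it behaves. This is a sketch under assumptions: "." stands in for Word's ^? (any single character), and strip_fixed_length and max_len are hypothetical names. It goes from the shortest tags to the longest, because in this emulation a long any-character pattern can stretch from one short tag to the next and take the text in between with it.

```python
import re

def strip_fixed_length(text, max_len=5):
    """Remove <...> tags whose content is 1..max_len characters, shortest first."""
    for n in range(1, max_len + 1):
        # "<", exactly n arbitrary characters, then ">"
        text = re.sub(r"<.{%d}>" % n, "", text)
    return text

print(strip_fixed_length("I have<MOD> a book, and the<DET> book<NN>.", max_len=3))
# I have a book, and the book.

# Starting with a long pattern instead: six arbitrary characters span both tags.
print(repr(re.sub(r"<.{6}>", "", "<A> b<C>")))  # ''
```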
