[Original] Some new ideas for corpus analysis

wzli · 2006-07-21

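At the July 2006 Beijing workshop on "The Application of Corpora in Foreign Language Teaching and Research", several new research ideas were worth noting:

1) Dr. Liang Maocheng (梁茂成) proposed generating a word cluster list from a given text and then using that list to run a batch search over other texts. Each concordance line then carries the path of the file it came from. After the list of file paths is processed in Excel and read into SPSS for a frequency count, we get the frequency of a given word cluster in each individual text. If other parameters for those texts are entered as well, correlation analysis or difference tests can be run. A more advanced approach of Dr. Liang's is to extract only the grammatical patterns of these clusters, e.g. N + N + V + Adjective + N, which makes the analysis more meaningful.

2) Gao Chao (高超) proposed this idea while tutoring participants: count the collocates that occur at each position within the span of a search word, and you get a V-shaped distribution. That is, the closer a position is to the search word, the more limited the set of collocates, while positions farther from the search word allow a freer choice of collocates. Following this idea, the distribution can be laid out as a matrix and used to test which words are the core words of a collocation. Previously we only looked at the word frequencies at each position and the collocational strength between each word and the search word, without making full use of the Patterns function in WordSmith.

3) Concordance lines are good for observing a word's collocations and usage, but they are not fine-grained enough for extracting data. Instead, sort the collocates list by one column and specify the columns to keep when saving the file. For example, to extract the various forms of a search word or cluster together with their frequencies, sort the collocates column in descending order.

4) Use PowerGREP to search for a specific pattern, for example a verb and the first noun that follows it; the non-greedy quantifier *? in regular expressions can be used to control the match.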
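Point 2 can be sketched in a few lines of Python. This is only a minimal illustration and not WordSmith's Patterns function: the function name `collocate_types_by_position`, the crude tokenizer, and the toy sentence are all assumptions made for the example. It counts how many distinct word types occur at each position around a search word.

```python
import re
from collections import defaultdict

def collocate_types_by_position(text, node, span=4):
    """Count distinct collocate types at each offset within `span` of `node`.

    Hypothetical helper for illustration; tokenization is deliberately crude.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    types_at = defaultdict(set)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                types_at[offset].add(tokens[j])
    # One count per position: negative offsets are left of the node word.
    return {o: len(types_at[o]) for o in range(-span, span + 1) if o != 0}

# Toy text, invented for this example.
sample = ("we met at the end of the day and she left at the end of the week "
          "but he stayed until the end of the month")

for offset, n_types in collocate_types_by_position(sample, "end", span=3).items():
    print(f"{offset:+d}: {n_types} collocate type(s)")
```

On the toy sentence the counts are 3, 2, 1 to the left of "end" and 1, 1, 3 to the right, i.e. the V shape described in point 2. On a real corpus the counts would need to be read against the number of hits, since more hits allow more types at every position.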

tiger · 2006-07-21

Insightful suggestions indeed.

动态语法 · 2006-07-22

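Re: [Original] Some new ideas for corpus analysis

Quoting wzli (posted 2006-7-21 22:58:55):
At the July 2006 Beijing workshop on "The Application of Corpora in Foreign Language Teaching and Research", several new research ideas were worth noting:

1) Dr. Liang Maocheng (梁茂成) proposed generating a word cluster list from a given text and then using that list to run a batch search over other texts. Each concordance line then carries the path of the file it came from. After the list of file paths is processed in Excel and read into SPSS for a frequency count, we get the frequency of a given word cluster in each individual text. If other parameters for those texts are entered as well, correlation analysis or difference tests can be run. A more advanced approach of Dr. Liang's is to extract only the grammatical patterns of these clusters, e.g. N + N + V + Adjective + N, which makes the analysis more meaningful.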

Interesting thought, but doesn't WST's Concordance Plot already have this function?
It gives out info about the total no. of words in the texts, total no. of hits in each text,
as well as per thousand hit rate, etc.

wzli · 2006-07-22

Sure, we could obtain the frequency data and even the normalized frequency from the plot. But I really feel Liang's way of doing it is also quite neat, particularly the Excel part.

xujiajin · 2006-07-22

Mr. 动态语法, actually 梁's method goes beyond a mere count of hits, which shows the occurrences of a single search word. If we want to know the hits of a batch of search words in one specific text (pointing to a writer/speaker/learner), we can hardly count them ourselves when we have thousands of files.

In this case, we copy the file names (each file name actually refers to a small piece of writing) to SPSS, which counts the occurrences automatically. We can then go one step further and compare the occurrences with learners' scores (i.e. their writing proficiency) via correlation and other statistical means.
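For readers without SPSS, the same two steps can be sketched in plain Python. This is a rough illustration under stated assumptions, not the procedure shown at the workshop: the file names, the scores, and the helper names `hits_per_file` and `pearson_r` are invented for the example. It assumes the concordancer exported one file name per hit.

```python
from collections import Counter
from math import sqrt

# Hypothetical export: one file name per concordance hit.
hit_files = ["s01.txt", "s02.txt", "s01.txt", "s03.txt", "s01.txt", "s03.txt"]

# Hypothetical writing scores, one per learner text.
scores = {"s01.txt": 12.0, "s02.txt": 9.0, "s03.txt": 10.0, "s04.txt": 6.0}

def hits_per_file(hit_files, all_files):
    """Frequency of hits per file, keeping files that have zero hits."""
    counts = Counter(hit_files)
    return {name: counts.get(name, 0) for name in all_files}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)

files = sorted(scores)
counts = hits_per_file(hit_files, files)
r = pearson_r([counts[f] for f in files], [scores[f] for f in files])
print(counts)
print(f"r = {r:.3f}")
```

One detail worth noting: a frequency count over the hit list alone never shows texts with zero hits (s04.txt here), so the full file list has to be supplied separately before correlating.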

xudekuan · 2006-07-22

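Re: [Original] Some new ideas for corpus analysis

Quoting wzli (posted 2006-7-21 22:58:55):
At the July 2006 Beijing workshop on "The Application of Corpora in Foreign Language Teaching and Research", several new research ideas were worth noting:

1) Dr. Liang Maocheng (梁茂成) proposed generating a word cluster list from a given text and then using that list to run a batch search over other texts. Each concordance line then carries the path of the file it came from. After the list of file paths is processed in Excel and read into SPSS for a frequency count, we get the frequency of a given word cluster in each individual text. If other parameters for those texts are entered as well, correlation analysis or difference tests can be run. A more advanced approach of Dr. Liang's is to extract only the grammatical patterns of these clusters, e.g. N + N + V + Adjective + N, which makes the analysis more meaningful.

How do you extract N + N + V + Adjective + N with WordSmith?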

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Quoting xujiajin (posted 2006-7-22 9:51:27):
Mr. 动态语法, actually 梁's method goes beyond a mere count of hits, which shows the occurrences of a single search word. If we want to know the hits of a batch of search words in one specific text (pointing to a writer/speaker/learner), we can hardly count them ourselves when we have thousands of files. In this case, we copy the file names (each file name actually refers to a small piece of writing) to SPSS, which counts the occurrences automatically. We can then go one step further and compare the occurrences with learners' scores (i.e. their writing proficiency) via correlation and other statistical means.

Yes, that can be done with WST's Concordance Plot function - see the attached screen
capture. If we are dealing with one file and multiple search words, it's actually simpler
than multiple files and multiple search words. The latter case is the one that cannot be
easily presented in WST. And for that WST just gives the counts for each file and for
each search item.

---

Does this show what you need to know?:

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Here is a plot of multiple clusters in multiple files, with counts for each cluster-file pairing:

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

What I am trying to show is that PART of the operations mentioned in the first post can
be done with WST alone. If one wants something more sophisticated one can always
go with the SPSS approach.

Re exporting to Excel, that's indeed a convenient way to get the clusters out of the
WST result. But other programs can get you the same result. For example, using MS
Word's table function we can get the clusters in a separate column and get them off of
the table. Also, another, albeit crude, way of getting the clusters off is to use the Block
Selection function that comes with many word/text processors (e.g. in MS Word - ALT +
Selection, Textpad, etc.).

xujiajin · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Does this show what you need to know?:

In your screen capture, it does not give us the aggregated number (15) of IF IT IS, WHEN I WAS, THAT # # and # # THE in the file sbc046.txt.

xujiajin · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Quoting 动态语法 (posted 2006-7-22 14:08:04):
What I am trying to show is that PART of the operations mentioned in the first post can
be done with WST alone. If one wants something more sophisticated one can always
go with the SPSS approach.

Re exporting to Excel, that's indeed a convenient way to get the clusters out of the
WST result. But other programs can get you the same result. For example, using MS
Word's table function we can get the clusters in a separate column and get them off of
the table. Also, another, albeit crude, way of getting the clusters off is to use the Block
Selection function that comes with many word/text processors (e.g. in MS Word - ALT +
Selection, Textpad, etc.).

You are right.
But one thing has not been solved in the two images you posted above.

The clusters can be copied to Word or Notepad in many ways, but it is not easy to get the frequencies of a group of clusters (say, by the end of, at the end of, in the end, etc.) in each of sixty files.

In other words, we want to know how many three- and four-word clusters are used in the first text (actually the first student's essay), then how many occurrences of the clusters appear in the second student's essay, and so on.
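To make the target concrete, here is a rough Python sketch of the table I mean: one row per essay, one column per cluster, plus a row total. The essay names, the texts, and the function name `cluster_counts_per_text` are made up for illustration, and plain string matching stands in for whatever the concordancer does.

```python
import re

def cluster_counts_per_text(texts, clusters):
    """Count each cluster in each text and add a per-text total.

    `texts` maps a (hypothetical) file name to its content.
    """
    patterns = {
        c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE)
        for c in clusters
    }
    table = {}
    for name, text in texts.items():
        # Collapse line breaks so a cluster split across lines still matches.
        flat = " ".join(text.split())
        row = {c: len(p.findall(flat)) for c, p in patterns.items()}
        row["TOTAL"] = sum(row.values())
        table[name] = row
    return table

clusters = ["by the end of", "at the end of", "in the end"]

# Two invented essays standing in for the sixty learner files.
texts = {
    "essay_01.txt": "By the end of the term I was tired. In the end I passed.",
    "essay_02.txt": "At the end of the day, and at the\nend of the week, nothing changed.",
}

for name, row in cluster_counts_per_text(texts, clusters).items():
    print(name, row)
```

Each row keeps the individual cluster counts as well as the total, so the per-essay aggregate and the cluster-file pairings are both available from the same table.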

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Quoting xujiajin (posted 2006-7-22 14:30:56):

Does this show what you need to know?:

In your screen capture, it does not give us the aggregated number (15) of IF IT IS, WHEN I WAS, THAT # # and # # THE in the file sbc046.txt.

Okay, got it. Two comments:

1) Personally I don't believe that's a very useful measure to have. Consider why we
would want to collapse the results of Cluster A=2, Cluster B=50, Cluster C=6, etc. Maybe
sometimes it's going to be useful (e.g. when the number of cluster types is small),
but if we have something like 200 cluster types to work with, it will blur everything too
much to be meaningful. At least the latter case is what scares me. But that's just my humble opinion.

2) The aggregate can also be obtained by exporting the whole thing to Excel and
summing the 'hits' column.

xujiajin · 2006-07-22

The attached PPT might give you a better idea about the usefulness of the aggregate in the acquisitional study of learner English.
http://forum.corpus4u.org/upload/forum/2006072215291436.ppt

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Regardless, the aggregate for individual clusters is actually given by WST when it first
generates the clusters. That is, for aggregates one doesn't even need the plot function.
If you need a grand total Excel can do it for you.

All this assumes that individual cluster-file pairing is not your concern.

动态语法 · 2006-07-22

Re: [Original] Some new ideas for corpus analysis

Quoting xujiajin (posted 2006-7-22 15:28:49):
The attached PPT might give you a better idea about the usefulness of the aggregate in the acquisitional study of learner English.
http://forum.corpus4u.org/upload/forum/2006072215291436.ppt

That's definitely an interesting area to look into. At the same time I'd be more
interested to find out what specific clusters, how many
types and with what kind of frequencies are being used for comparison.
Maybe that's the thinking behind this initial exercise?

xujiajin · 2006-08-19

I recently attended a series of lectures on qualitative research, and I think it is essential to combine quantitative and qualitative research in corpus studies.

清风出袖 · 2006-08-19

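Re: Re: [Original] Some new ideas for corpus analysis

Posted by xujiajin:
I recently attended a series of lectures on qualitative research, and I think it is essential to combine quantitative and qualitative research in corpus studies.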

Could you please take the trouble to upload the slides from the lecture series you mentioned, if there are any? Thanks a lot, Dr. Xu Jiajin!

xujiajin · 2006-08-19

I don't have the slides. I guess FLTRP will put them on its web site soon. I'll add a pointer here once they have uploaded them.

xiaoz · 2006-08-20

Absolutely! I have always advocated - and demonstrated in my own corpus research - that quantitative and qualitative analysis should be combined. And so should theory-driven and data-driven approaches.

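Posted by xujiajin:
I recently attended a series of lectures on qualitative research, and I think it is essential to combine quantitative and qualitative research in corpus studies.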

xujiajin · 2006-08-20

Canonical qualitative (linguistic) research takes as its main concern the language user's retrospective and reflective thinking about his or her language use process, rather than the language product made available to corpus researchers.

Theory-driven research is another way of qualitative thinking, a deductive way, in my view.
