[Help] Another question about computing factor scores

stream

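Dr. Xiao, I have two questions about computing factor scores. 1. I read the part about computing mean factor scores at the end of your article "Two Approaches to Genre Analysis". In the note there, you say that the mean factor score is computed by adding up the absolute values of the loadings rather than adding the signed values directly. I would like to confirm this.
2. Biber's book says the data should be standardized before factor scores are computed. E.g. the frequencies of private verbs and public verbs are 113 and 124, which become 2.4 and 4.2 after standardization, so factor score = 2.4 + 4.2. But I found that factor scores computed this way differ from those produced by the regression method in the SPSS factor score dialog. (Whether or not the raw data are standardized, the SPSS results differ from Biber's.) Could you advise me: should I follow Biber and simply add up the standardized scores, or should I use normalized data in SPSS and select the regression method in the factor score dialog? Thank you!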
 
1) I don't know why you came away with this impression from the note, which is shown below. What I mean is just the opposite: the sum should take account of the plus/minus signs.

The dimension score of a text is computed by adding together the factor score
of each feature with a positive loading and then subtracting the factor score of each
feature, if any, with a negative loading. For example, suppose for the genre of academic
prose the mean factor scores of the four features with positive weights on dimension
3 are -0.57, +0.53, +0.51, and +0.60, while those for features with negative
weights are -0.44, -0.43, and -0.51. The dimension score of dimension 3 for
academic prose would therefore be +2.45:
-0.57 + 0.53 + 0.51 + 0.60 - (-0.44) - (-0.43) - (-0.51) = 2.45.
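
The arithmetic in this note can be checked with a few lines of Python. This is only a sketch: the function name `dimension_score` and the two list names are made up for illustration, and the seven values are the ones given in the note.

```python
# Mean factor scores from the note (academic prose, dimension 3).
# Features with positive loadings; note that a score itself may be negative.
positive_loading_scores = [-0.57, 0.53, 0.51, 0.60]
# Features with negative loadings.
negative_loading_scores = [-0.44, -0.43, -0.51]


def dimension_score(pos, neg):
    """Add the scores of positive-loading features, then subtract
    the scores of negative-loading features, keeping every sign."""
    return sum(pos) - sum(neg)


score = dimension_score(positive_loading_scores, negative_loading_scores)
print(round(score, 2))  # 2.45
```

The sign of a score and the sign of a loading are separate things here: -0.57 is a negative score on a feature whose loading is positive, so it is added as it stands.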

2) If you have followed the advice in my previous postings by copying and pasting the per 1000 word frequencies from the dispersion plot in WordSmith, you already have the standardised frequencies in SPSS.
 
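Yes, I already have the normalized frequencies from the dispersion plot in WordSmith. But the point is that normalization is not the same as standardization. "All normalized frequencies are standardized to a mean of 0.0 and a standard deviation of 1.0 before the factor scores were computed." (Biber, 1988, p. 95)

As to "-0.57 + 0.53 + 0.51 + 0.60 - (-0.44) - (-0.43) - (-0.51) = 2.45":
My question is: -0.57 belongs to a linguistic feature with a negative loading, so why is it grouped with the positive values? And isn't subtracting a negative value the same as adding its absolute value?
On the mean factor score: suppose a genre contains only three texts, whose factor scores are 2.4, 4.2 and -1.5. Should the mean factor score of these three texts be 2.4 + 4.2 - (-1.5) = 8.1, or 2.4 + 4.2 - 1.5 = 5.1?
On computing the factor score for one dimension: suppose dimension 1 has three linguistic features, private verbs, public verbs and that-clauses, whose standardized scores are 1.2, 2.2 and -3.6. Is the factor score for this dimension then 1.2 + 2.2 - (-3.6) = 7? All this calculating is making my head spin. Dr. Xiao, could you take another look for me? Thank you!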
 
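Biber's method of computing factor scores is not precise enough. Features with factor loadings below 0.35 should not be excluded. Nor should the standardized scores be multiplied directly by the factor loadings; they should be multiplied by the factor score coefficients instead.
Statistically speaking, the method offered in SPSS is the more reasonable one. But if you want to use Biber's existing data, you should still follow his method; otherwise the results are not comparable. Negative values should not be subtracted when computing factor scores. That is:
"their factor scores are 2.4, 4.2 and -1.5, so the mean factor score of these three texts should be 2.4 + 4.2 - 1.5 = 5.1".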
 
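Thank you, yinghuang, for the reply. I now know how to compute the mean factor score. I am borrowing Biber's existing model, i.e. the different linguistic features in his 7 dimensions, but I am not using his data. Should I use Biber's method or the SPSS method for computing factor scores? Thank you!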
 
Re: [Help] Another question about computing factor scores

1) The standardisation referred to by Biber in your citation is actually done when the min, max, range and std dev are computed.
2) A linguistic feature can carry either a positive or negative value on a dimension. My note in that paper relates to the computation of the factor score of a text, i.e. the sum of scores for each linguistic feature. It is not the factor score of a genre, i.e. the mean of scores for all samples in that genre.
3) Half a chapter (pp. 287-307) in Unit C5 of "Corpus-based Language Studies: An advanced resource book" (Routledge, 2006) introduces how to compute factor scores using WordSmith and SPSS step by step, with text and screen shots. I would advise you to read that part.

Quoting stream, 2006-4-28 10:40:30 (post shown above).
 
Re: [Help] Another question about computing factor scores

Quoting stream, 2006-4-29 16:21:17 (post shown above).

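If you use Biber's method to compute factor scores, just follow his procedure, i.e. consider only loadings greater than 0.35. I mean an absolute value greater than 0.35, regardless of the actual sign. If you follow the SPSS algorithm instead, I think that is fine too. SPSS multiplies all the factor score coefficients (please note: not the factor loadings) by the standardized values. But the results will certainly differ from those of Biber's method. Also, I once came across a PhD dissertation abstract on Google; the gist was that the dissertation redid the factor analysis of Biber's model on corpus data. The author found that the dimensions Biber obtained are not the only solution and that other interpretations exist. I believe the dissertation has been published, but I cannot find the reference now.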
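
To make the contrast concrete, here is a rough Python sketch of the two calculations for a single text. All numbers are invented for illustration (they come from neither Biber nor SPSS), and the names `biber_style_score` and `coefficient_weighted_score` are hypothetical. The second function only mimics the idea of weighting every standardized value by a factor score coefficient; it does not reproduce how SPSS estimates those coefficients.

```python
# Hypothetical standardized scores of one text on three features.
z_scores = {"feature_a": 1.0, "feature_b": -0.5, "feature_c": 2.0}
# Hypothetical factor loadings of the same features on one dimension.
loadings = {"feature_a": 0.8, "feature_b": -0.6, "feature_c": 0.2}
# Hypothetical factor score coefficients (NOT the loadings).
coefficients = {"feature_a": 0.4, "feature_b": -0.3, "feature_c": 0.1}

CUTOFF = 0.35  # absolute loading threshold mentioned in the post


def biber_style_score(z, loadings, cutoff=CUTOFF):
    """Keep features with |loading| > cutoff; add the standardized score
    for positive loadings, subtract it for negative loadings."""
    total = 0.0
    for feature, loading in loadings.items():
        if abs(loading) > cutoff:
            total += z[feature] if loading > 0 else -z[feature]
    return total


def coefficient_weighted_score(z, coefficients):
    """Weight every standardized score by its factor score coefficient."""
    return sum(coefficients[f] * z[f] for f in z)


print(biber_style_score(z_scores, loadings))               # 1.5
print(round(coefficient_weighted_score(z_scores, coefficients), 2))  # 0.75
```

With these made-up inputs the two methods give 1.5 and 0.75, which illustrates why scores from the two procedures cannot be compared directly.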
 
Dr. Xiao, you have kindly provided the online information about the book "Corpus-based Language Studies: An advanced resource book", but how can I get the book in China? Do you know its publication details in China?
 
Re: [Help] Another question about computing factor scores

I have sent the relevant sections to your 126 email account.
 
Dr. Xiao, I have read the relevant section you sent me. You say in this section that readers can download the program for factor computation from your companion website. What is the website? Could you kindly point me to it? Thanks a lot!
 
You can download the program at http://www.ling.lancs.ac.uk/corplang/cbls/resources.asp

You will need to copy and paste the left panel of the WordSmith dispersion plot window into a text file named datafile.txt

Also you will need to modify the program (using Notepad) to replace 14 with the number of files in your corpus (as indicated in wordsmith dispersion window).

And you will need to install Perl.
 
Re: [Help] Another question about computing factor scores

2) If you have followed the advice in my previous postings by copying and pasting the per 1000 word frequencies from the dispersion plot in WordSmith, you already have the standardised frequencies in SPSS.

Yeah. I have the same problem as stream. The standardized freqs are not freqs per 1000 words according to Biber's algorithm (Biber 1988: 94, note 4 at the bottom). At least, the per 1000 word freqs cannot be used directly for calculating factor scores.
 
Re: [Help] Another question about computing factor scores

After re-reading the relevant couple of pages in Biber (1988), I came up with the following computational method:

1. Standardized score = (raw freq - mean freq)/SD

2. The factor score of a text = standardized score 1 + standardized score 2 + standardized score 3 ... standardized score n.
(Note: 1 to n refer to the number of features in each of six or seven dimensions.)

3. The mean score of a dimension = sum of factor scores of individual texts / number of texts
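
The three steps can be sketched in Python as follows. Treat it as an illustration under stated assumptions: the frequencies and loading signs are invented, the names `standardize` and `text_score` are hypothetical, and the sample standard deviation is used because the posts do not say whether the sample or population formula applies. Step 2 is written as a signed sum, following the note quoted earlier in the thread that scores of features with negative loadings are subtracted.

```python
from statistics import mean, stdev

# Hypothetical per-1000-word frequencies: one dict per text, one key per feature.
freqs = {
    "text_a": {"private_verbs": 10.0, "public_verbs": 4.0, "nouns": 200.0},
    "text_b": {"private_verbs": 20.0, "public_verbs": 6.0, "nouns": 180.0},
    "text_c": {"private_verbs": 30.0, "public_verbs": 8.0, "nouns": 160.0},
}
# Hypothetical loading signs on one dimension (+1 positive, -1 negative).
signs = {"private_verbs": 1, "public_verbs": 1, "nouns": -1}


def standardize(freqs):
    """Step 1: standardized score = (freq - mean freq) / SD, per feature."""
    features = list(next(iter(freqs.values())))
    stats = {}
    for f in features:
        values = [row[f] for row in freqs.values()]
        stats[f] = (mean(values), stdev(values))  # sample SD: an assumption
    return {
        text: {f: (row[f] - stats[f][0]) / stats[f][1] for f in features}
        for text, row in freqs.items()
    }


def text_score(z_row, signs):
    """Step 2: signed sum of the standardized scores of one text."""
    return sum(signs[f] * z for f, z in z_row.items())


z = standardize(freqs)
scores = {text: text_score(row, signs) for text, row in z.items()}
# Step 3: mean score = sum of the texts' scores / number of texts.
genre_mean = mean(list(scores.values()))
print(scores)
print(genre_mean)
```

In this toy data set text_a scores -3, text_b scores 0 and text_c scores 3, so the mean over the three texts is 0.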
 
Re: [Help] Another question about computing factor scores

Yes, I used the same formula (see Corpus-Based Language Studies p. 303) as the one you cite from Biber. The normalised frequencies (per 1000 words in this case) are used to compute basic statistics such as the mean and the std dev, which are used in turn to compute the factor score of a particular feature. I have the impression that I took a different approach from Biber's (i.e. per linguistic feature for all texts vs. per text for all features), though I cannot remember exactly, and that it produced the same final results.


 
Re: [Help] Another question about computing factor scores

Biber (1988: 75) also used frequencies of linguistic features that were normalized to a 1000-word basis.

 
Re: [Help] Another question about computing factor scores

Yes, Biber did, but that was not for factor score calculation (cf. p. 94, footnote 4). Factor scores actually express how many standard deviations a particular feature or dimension lies from the mean.

Normalized freqs were only for a general and rough comparison of potentially important linguistic features identified from previous literature (p. 76), which were prepared for factor analysis.

Put simply, normalized freqs were used for dimension identification (factor analysis).

Factor scores were used for register variation study.

P. 96:
This table [Table 4.5] does not enable characterization of particular genres, but it provides an assessment of the overall distribution of particular features in English texts. Some features occur very frequently, for example, nouns with a mean of 180 per 1,000 words; other features occur very infrequently, for example, causative adverbial subordinators with a mean of 1 per 1,000 words.

Factor score computing procedures can be found on pages 94-95.
 
Re: [Help] Another question about computing factor scores

Yes I see what you mean, but that is exactly what I mean by the first formula given on p. 303 in Corpus-Based Language Studies.

Biber's (1988: 94) specific example

113 = (2.4 x 30.4) + 40.1

can be translated into the following more general formula:

frequency_per1000 = (factor_score x std_dev) + mean

which can be reformulated as:

factor_score = (frequency_per1000 - mean)/std_dev

This is exactly my formula given on p.303 of the CBLS book - I have used some posh symbols to express this same idea.
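
As a quick check, the two forms of the formula can be run against the numbers in Biber's example (113, 2.4, 30.4 and 40.1). The function names below are hypothetical and chosen only for this sketch.

```python
def standard_score(freq_per_1000, mean, std_dev):
    """factor_score = (frequency_per1000 - mean) / std_dev"""
    return (freq_per_1000 - mean) / std_dev


def freq_from_score(score, mean, std_dev):
    """frequency_per1000 = (factor_score x std_dev) + mean"""
    return score * std_dev + mean


# Values from Biber's (1988: 94) example as cited in this thread.
z = standard_score(113, mean=40.1, std_dev=30.4)
print(round(z, 1))                                 # 2.4
print(round(freq_from_score(2.4, 40.1, 30.4)))     # 113
```

The round trip only matches after rounding, since 2.4 is itself a rounded value: (2.4 x 30.4) + 40.1 is 113.06, not exactly 113.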
 
Re: [Help] Another question about computing factor scores

Pages 94-95 do not state outright that 113 is a normalized frequency, although normalization is mentioned on page 75. I don't think factor analysis (the step prior to dimension analysis) has a direct link to factor score calculation (step two).

I am still not convinced that 113 here is a normalized freq.

To settle the matter, I searched the K:6 text in LOB for past tense verbs. Quite surprisingly, the count turned out to be 184 (greater than the max value of 119 in Table 4.5 on page 77) out of 2002 words. In this case, the normalized freq should be 92, not 113.

Things get a lot more complicated at this point.

I was thinking we used different taggers, therefore we got different past tense verbs.

Well, things get stuck here.

Another reason I presume that 113 is a raw freq is that in Biber (1988) all freqs, normalized and raw alike, are given to one decimal place, whereas all the values in the following example equation

(113 + 124 + 30 + 14 + 5 + 3) = 289 (p. 94)

are integers. This can hardly be a coincidence.
 

Attachments

  • Factor score from Variations across speech and writing-Biber 1988.pdf (237.8 KB)
  • K6_TAG.txt (19.1 KB)
  • K6.txt (10.7 KB)

Re: [Help] Another question about computing factor scores

There is no doubt that Biber has used normalised frequencies (per 1000 words).

1) In his example 113 = (2.4 x 30.4) + 40.1, 40.1 is the mean, which is based on per 1000 word frequencies, and the std dev is also based on normalised frequencies. What would be the point of subtracting a normalised mean from a raw frequency?

2) CLAWS tags past tense forms differently for different kinds of verbs: lexical verbs, have (had), do (did), and be (was and were). When I included all of these in my search of the text you provided, the total was exactly 226, meaning 113 per 1000 words.

Of course, Biber did not use CLAWS to tag the text; it appears that his tagger is also very reliable.
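
The two competing counts can be normalised with a one-line calculation. The sketch below assumes the 2002-word length reported for the K6 text earlier in the thread; the function name `per_thousand` is made up for illustration.

```python
def per_thousand(count, total_words):
    """Normalise a raw count to a frequency per 1000 words."""
    return count / total_words * 1000


TEXT_LENGTH = 2002  # word count of the K6 text as reported in the thread

print(round(per_thousand(226, TEXT_LENGTH)))  # 113 (all past tense tags)
print(round(per_thousand(184, TEXT_LENGTH)))  # 92  (narrower search)
```

Only the count of 226 reproduces the figure of 113 in Biber's example, which is consistent with 113 being a per 1000 word frequency.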
 