R: Fisher's Exact Test Script
1. Why Fisher's Exact Test?
Because the Chi-squared Test is not very accurate when an expected frequency in the table is less than 5.
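As a rough illustration of that rule of thumb, the sketch below computes the expected frequencies for one 2 x 2 table and checks whether any of them falls below 5. The counts are made up for the example and are not from any real corpus.

```r
# Hypothetical 2 x 2 table of observed frequencies (made-up numbers).
observed <- matrix(c(3, 12,
                     9, 976), nrow = 2, byrow = TRUE)

# Expected frequency of each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# TRUE here: the first cell's expected frequency is 15 * 12 / 1000 = 0.18
low_expected <- any(expected < 5)

# With such small expected counts, the exact test is the safer choice.
p_exact <- fisher.test(observed)$p.value
```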
2. How to use the script compute_fisher.r?
It's very easy. Just copy all the code into R and change the first line, setwd(), to the directory where you put contingency_table.txt.
3. What's contingency_table.txt?
It's a file for your data. It has 5 fields separated by tabs:
(1) Item is the linguistic unit you are studying. It can be a word, an n-gram, a grammatical structure, etc.
(2) O11, O12, O21 and O22 are the observed frequencies in the contingency table. O11 means first row, first column; O12 means first row, second column; and so on.
If you are familiar with Chi-squared Test, there will be no problem understanding this format.
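To make the format concrete, here is a minimal sketch that writes a two-item example file and reads it back. The header names (Item, O11, O12, O21, O22) and the counts are assumptions for illustration; check them against your own file and against what compute_fisher.r expects.

```r
# Write a tiny tab-separated example file to a temporary directory.
# Column names and numbers are made up for illustration.
path <- file.path(tempdir(), "contingency_table.txt")
example <- data.frame(
  Item = c("of the", "I think"),
  O11  = c(120, 8),
  O12  = c(9880, 9992),
  O21  = c(95, 30),
  O22  = c(9905, 9970)
)
write.table(example, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back: one row per linguistic unit, five tab-separated fields.
tab <- read.delim(path, stringsAsFactors = FALSE)

# Rebuild the 2 x 2 table for the first item (filled row by row).
m <- matrix(unlist(tab[1, c("O11", "O12", "O21", "O22")]),
            nrow = 2, byrow = TRUE)
```

Here m[1, 2] holds O12 (first row, second column) and m[2, 1] holds O21.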
4. Where's the result and how to interpret it?
The result is written to fisher_stat.txt.
In it you will find two fields added to the original contingency table: odds_ratio and p_value.
The smaller the p_value, the stronger the evidence that the difference in frequency is not due to chance.
If odds_ratio > 1, the linguistic unit is overused in the first row of the contingency table. If odds_ratio < 1, it is underused.
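The following is a sketch of how the two extra fields could be computed with R's built-in fisher.test(). It is not the attached script itself; the helper name run_fisher, the column names and the counts are assumptions for illustration.

```r
# Made-up input with the same five fields as contingency_table.txt.
tab <- data.frame(
  Item = c("of the", "I think"),
  O11  = c(120, 8),
  O12  = c(9880, 9992),
  O21  = c(95, 30),
  O22  = c(9905, 9970)
)

# Hypothetical helper: run Fisher's Exact Test on one 2 x 2 table.
run_fisher <- function(o11, o12, o21, o22) {
  res <- fisher.test(matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE))
  c(odds_ratio = unname(res$estimate), p_value = res$p.value)
}

# Apply it to every row and append the two new fields.
stats  <- t(mapply(run_fisher, tab$O11, tab$O12, tab$O21, tab$O22))
result <- cbind(tab, stats)

# Save in the same tab-separated layout, here to a temporary directory.
out_path <- file.path(tempdir(), "fisher_stat.txt")
write.table(result, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

Note that fisher.test() reports the conditional maximum likelihood estimate of the odds ratio, which can differ slightly from the simple cross-product (O11 * O22) / (O12 * O21).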
5. Can this script be used for contingency tables other than 2 x 2?
NO. For contingency tables with more than two variables, consider using Logistic Regression or a Loglinear Model.
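For readers who want a starting point, the sketch below fits a loglinear model to a three-variable table using a Poisson glm(). The variable names (corpus, genre, form) and all counts are invented for illustration, and this is only one possible way to set up such a model.

```r
# Made-up counts for a 2 x 2 x 2 table (three categorical variables).
d <- expand.grid(
  corpus = c("A", "B"),
  genre  = c("spoken", "written"),
  form   = c("x", "y")
)
d$count <- c(120, 95, 80, 130, 60, 70, 90, 55)

# Loglinear model with all two-way interactions but no three-way term.
fit <- glm(count ~ (corpus + genre + form)^2, family = poisson, data = d)

# A large residual deviance (small p) suggests the three-way
# interaction is needed to describe the table.
p_fit <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```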