Extracting texts from the BNC by age group

laohong

Administrator
Staff member
Originally posted by johnli:
Dear Sir or Madam,
I tried to use Xaira to search for texts according to the authors' or the speakers' age group. I opened the Xaira client and clicked the "XML query" button, yet I got only one result, "<age>46</age>". I am really confused and was wondering if you could help me with this. I want to extract the texts in the BNC according to the authors' and the speakers' age groups, then make a word list for each age group and compute the word frequencies for each group.
By the way, there is no search function in Xaira's help manual.

Also, how can I post questions to your forum for discussion?

Thanks for your time and consideration. I look forward to your reply.

Yours sincerely,
John Lee
 
Re: Extracting texts from the BNC by age group

Why only one result of <age>46</age>?
Before you submit your query in the Xaira client, click View in the menu, choose Preference, and check Concordance. You will then see more results displayed for your query.

Wordlist of Age Groups in BNC
For your purpose of extracting texts from the BNC to generate word lists, I would suggest using BNC Indexer to get the list of files first, and then building the word lists with WordSmith or other corpus tools. BNC Indexer is available at: http://webdeptos.uma.es/filifa/personal/amoreno/indexer/
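
For readers without BNC Indexer at hand, the "get the list of files first" step can be roughly approximated in Python. This is only a sketch, not a replacement for BNC Indexer: it assumes the BNC XML edition, where each file declares its speakers as <person ageGroup="..."> elements with no XML namespace, and the function names are hypothetical. It selects whole files, so a selected file may still contain speakers from other age groups.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def declares_age_group(root, group):
    """True if any <person> declared in this document has the given ageGroup."""
    return any(p.get("ageGroup") == group for p in root.iter("person"))


def files_with_age_group(folder, group):
    """List the .xml files under `folder` that declare a speaker in `group`."""
    matches = []
    for path in sorted(Path(folder).rglob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if declares_age_group(root, group):
            matches.append(path)
    return matches
```

For example, files_with_age_group("Texts", "Ag0") would return the paths whose header lists at least one speaker under 15, where "Texts" stands for wherever the corpus files are stored.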

Discussion in the forum
You are welcome to post any questions, comments, or ideas here.
 
Re: Extracting texts from the BNC by age group

I can't get the function for grouping texts by different settings, e.g. age groups, to work. According to the website you sent me, http://www.lexically.net/downloads/version4/handling_bnc/processing_dave_lees_class_cod.htm
there are class codes for different kinds of text files. However, these give only general information about a whole text; age-group information is not in this list. Maybe the BNC XML edition is different, since I don't have the BNC World edition at hand. Yet within each file of BNC XML there are utterances by speakers of different age groups.


First, http://www.lexically.net/downloads/version5/HTML/index.html?corpus_corruption_overview.htm this website offers an example of searching for information based on age group. However, the BNC XML edition uses different codes for the age groups. There are six age groups in the BNC XML edition, plus a code for unknown: Ag0 (under 15), Ag1 (15-24), Ag2 (25-34), Ag3 (35-44), Ag4 (45-59), Ag5 (over 59), and X (unknown). Moreover, spoken texts are not organized by age group. Opening \F\F7\F76.xml in the BNC XML text folder shows this: one text file contains sentences by speakers of different age groups. In this situation, we cannot use "Only if containing" in the tag settings by simply entering "Ag0" in order to extract the words, or the part of the file, spoken by a certain age group.
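
As a quick reference, the codes listed above can be written as a small lookup table. The sketch below is in Python; the helper age_to_group is hypothetical and simply applies the age bands quoted in this post, so it is not an official BNC utility.

```python
# Age-group codes of the BNC XML edition, as listed in the post above.
AGE_GROUPS = {
    "Ag0": "under 15",
    "Ag1": "15-24",
    "Ag2": "25-34",
    "Ag3": "35-44",
    "Ag4": "45-59",
    "Ag5": "over 59",
    "X": "unknown",
}

# Upper bound (inclusive) of each band, in ascending order.
_UPPER_BOUNDS = [(14, "Ag0"), (24, "Ag1"), (34, "Ag2"), (44, "Ag3"), (59, "Ag4")]


def age_to_group(age):
    """Map an age in years to its code; None (age not recorded) maps to "X"."""
    if age is None:
        return "X"
    for upper, code in _UPPER_BOUNDS:
        if age <= upper:
            return code
    return "Ag5"
```

With these bands, a speaker recorded as 46 falls in Ag4 and a speaker recorded as 14 falls in Ag0.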

 
Re: Extracting texts from the BNC by age group

Second, the structure of the BNC XML edition makes it difficult for WordSmith to extract the words spoken by different age groups. Each XML file first declares its speakers, of various ages, in the header, and each speaker has a unique xml:id. Each of their utterances then starts with <u who="xml:id">, as in the excerpt that follows. In this case, how is it possible to extract the utterances by speakers of Ag0, Ag1, etc. separately?

An example is quoted as follows:
<particDesc n="C18">
  <person ageGroup="Ag4" xml:id="PS1L1" role="unspecified" sex="m" soc="AB" dialect="XHC" firstLang="EN-GBR" educ="X">
    <age>46</age>
    <persName>Andrew</persName>
    <occupation>teacher</occupation>
  </person>
  <person ageGroup="Ag0" xml:id="PS1L2" role="unspecified" sex="f" soc="DE" dialect="XLO" firstLang="EN-GBR" educ="X">
    <age>14</age>
    <persName>Gillian</persName>
    <occupation>student</occupation>
  </person>
  <person ageGroup="X" dialect="NONE" n="W0000" role="other" sex="u" soc="UU" xml:id="F76PSUNK">
    <persName>Unknown speaker</persName>
  </person>
  <person ageGroup="X" dialect="NONE" n="W000M" role="other" sex="u" soc="UU" xml:id="F76PSUGP">
    <persName>Group of unknown speakers</persName>
  </person>

<stext type="OTHERSP">
  <u who="F76PSUNK">
    <s n="4">
      <w c5="VHB" hw="have" pos="VERB">Have</w>
      <w c5="XX0" hw="not" pos="ADV">n't</w>
      <w c5="VVN" hw="get" pos="VERB">got</w>
      <w c5="DT0" hw="any" pos="ADJ">any</w>
      <c c5="PUN">.</c>
    </s>
  </u>
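
One possible way around this, outside WordSmith, is a short Python script that first builds a table from each speaker's xml:id to the ageGroup attribute and then counts the <w> tokens of every <u> under the age group of its who speaker. The sketch below assumes the structure shown in the excerpt and no XML namespace on the BNC elements. The function name and the SAMPLE document are made up for illustration: SAMPLE reuses the speakers and the one utterance quoted above, and its second utterance is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def wordlists_by_age_group(root):
    """Return {ageGroup: Counter of lower-cased word forms} for one document."""
    # Step 1: speaker id -> age group, from the <person> declarations.
    speaker_group = {
        person.get(XML_ID): person.get("ageGroup", "X")
        for person in root.iter("person")
    }
    # Step 2: count the <w> tokens inside each <u> under the speaker's group.
    counts = defaultdict(Counter)
    for u in root.iter("u"):
        group = speaker_group.get(u.get("who"), "X")  # undeclared speaker -> "X"
        for w in u.iter("w"):
            form = (w.text or "").strip().lower()  # drop surrounding whitespace
            if form:
                counts[group][form] += 1
    return dict(counts)


# Made-up miniature document for illustration only (not a real BNC file).
SAMPLE = """<bncDoc>
  <particDesc n="C18">
    <person ageGroup="Ag4" xml:id="PS1L1"><age>46</age></person>
    <person ageGroup="Ag0" xml:id="PS1L2"><age>14</age></person>
    <person ageGroup="X" xml:id="F76PSUNK"/>
  </particDesc>
  <stext type="OTHERSP">
    <u who="F76PSUNK"><s n="4"><w c5="VHB" hw="have" pos="VERB">Have</w><w c5="XX0" hw="not" pos="ADV">n't </w><w c5="VVN" hw="get" pos="VERB">got </w><w c5="DT0" hw="any" pos="ADJ">any</w><c c5="PUN">.</c></s></u>
    <u who="PS1L2"><s n="5"><w c5="ITJ" hw="yes" pos="INTERJ">Yes</w></s></u>
  </stext>
</bncDoc>"""

if __name__ == "__main__":
    for group, counter in sorted(wordlists_by_age_group(ET.fromstring(SAMPLE)).items()):
        print(group, counter.most_common())
```

For a real corpus file, the root would come from ET.parse(path).getroot() instead of ET.fromstring(SAMPLE), and the per-file counters could be summed across files to obtain one frequency list per age group. Punctuation (<c> elements) is left out on purpose; counting the hw attribute instead of the text would give lemma frequencies.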










 