[Help] doc2txt: how to keep the formatting of spoken-transcription symbols unchanged during conversion

tiger

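Senior Member
I am transcribing some speech recordings. Because Notepad cannot keep transcription symbols such as those for interruptions and overlaps aligned, I have had to do the work in Word. The problem is that concordancing software such as Concordance and ConcApp can only search Notepad (plain-text) documents, and when I convert the transcribed Word documents to Notepad, the transcription symbols all shift position and no longer line up. How can I solve this? Thank you.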
 
For speech transcription, I suggest using Transcriber, a tool for segmenting, labeling and transcribing speech.
http://trans.sourceforge.net/en/presentation.php
 
Re: [Help] How to keep the formatting of spoken-transcription symbols unchanged when converting between Word and Notepad

There are different types of transcription systems. If you rely on the word processor and its symbols for transcription notation, there are bound to be problems. If you use a markup language of some sort (hence plain text), this wouldn't be a problem.

That said, even if your converted text becomes "messed up" it doesn't really matter as far as searches go.
 
Re: my solution!

One way to solve your problem is to standardize those non-utterance tags and index them. Here is what I'm doing with my classroom discourse project:

An example transcript with event marks:

Trn0001 Class ##
Trn0002 RA Oh I see.
Trn0003 Class ##
Trn0004 RA Good morning class, ( ) good morning class.
Trn0005 Class *CHORUS* Good morning Miss XXX.
Trn0006 Class ##
Trn0007 Class *CHORUS* And Madam YYY.
Trn0008 Teacher Good morning, boys.
Trn0009 Class Good morning, Madam YYY.
Trn0010 Teacher Sit down.
Trn0011 Class ##
Trn0012 RA Spare chair ( ) .
Trn0013 Teacher Yes.
Trn0014 Class ##
Trn0015 Teacher Help me to push this chair.
Trn0016 Class ##
Trn0017 Teacher Why don't you, plug it in there?
Trn0018 Class ##
Trn0019 Teacher Sit down.
Trn0020 Class ##
Trn0021 Teacher Okay now, take out your English textbook.
Trn0022 Class ##
.......


Here is the event index list:
Event   Symbol     Explanation
ENT001  %%         Background conversation that is inaudible
ENT002  ##         Background noise
ENT003  *CHORUS*   Choral voices
ENT004  $          Laughter
ENT005  $$         Extended Laughter
ENT006  [$]        Laughter Quality
ENT007  [V]        Verbatim Reading
ENT008  (O)        May or may not be talk
ENT009  ( )        Ungotten talk
......


In this way, the transcript for corpus analysis will become:
Trn0001 spk3 ENT002.
Trn0002 spk2 Oh I see.
Trn0003 spk3 ENT002.
Trn0004 spk2 Good morning class, ENT009 good morning class.
Trn0005 spk3 ENT003 Good morning Miss XXX.
Trn0006 spk3 ENT002.
Trn0007 spk3 ENT003 And Madam YYY.
Trn0008 spk1 Good morning, boys.
Trn0009 spk3 Good morning, Madam YYY.
Trn0010 spk1 Sit down.
Trn0011 spk3 ENT002.
Trn0012 spk2 Spare chair ENT009 .
Trn0013 spk1 Yes.
Trn0014 spk3 ENT002.
Trn0015 spk1 Help me to push this chair.
Trn0016 spk3 ENT002.
Trn0017 spk1 Why don't you, plug it in there?
Trn0018 spk3 ENT002.
Trn0019 spk1 Sit down.
Trn0020 spk3 ENT002.
Trn0021 spk1 Okay now, take out your English textbook.
Trn0022 spk3 ENT002.
........


For ease of reading: the example transcripts and the event index above are actually laid out in three columns. For the conversion itself, you may want to use a Perl script to do the indexing and conversion in one go. Hope this is of help to you. Good luck!
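As a rough illustration of the one-pass conversion described above, here is a minimal sketch in Python rather than Perl. It is not the script used in the project. It assumes each turn is one line with three whitespace-separated fields (turn ID, speaker, utterance), that the laughter symbols are a plain `$` and `$$`, and that the speaker codes (Teacher = spk1, RA = spk2, Class = spk3) are as inferred from the example output. The names `EVENT_INDEX`, `SPEAKER_INDEX` and `convert_line` are made up for this sketch.

```python
import re

# Event index copied from the list above: symbol -> event code.
EVENT_INDEX = {
    "%%": "ENT001",
    "##": "ENT002",
    "*CHORUS*": "ENT003",
    "$": "ENT004",
    "$$": "ENT005",
    "[$]": "ENT006",
    "[V]": "ENT007",
    "(O)": "ENT008",
    "( )": "ENT009",
}

# Speaker codes inferred from the converted example (an assumption).
SPEAKER_INDEX = {"Teacher": "spk1", "RA": "spk2", "Class": "spk3"}

# Try longer symbols first so that "$$" and "[$]" are not read as "$".
EVENT_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(EVENT_INDEX, key=len, reverse=True))
)


def convert_line(line):
    """Replace the speaker label and event symbols in one transcript line."""
    parts = line.split(None, 2)
    if len(parts) < 3:
        # Not a "turn ID / speaker / utterance" line; leave it untouched.
        return line
    turn_id, speaker, utterance = parts
    speaker = SPEAKER_INDEX.get(speaker, speaker)
    utterance = EVENT_PATTERN.sub(lambda m: EVENT_INDEX[m.group(0)], utterance.strip())
    # Assumption: event-only turns get a closing full stop, as in the example.
    if not utterance.endswith((".", "?", "!")):
        utterance += "."
    return f"{turn_id} {speaker} {utterance}"


if __name__ == "__main__":
    for raw in ["Trn0001 Class ##", "Trn0012 RA Spare chair ( ) ."]:
        print(convert_line(raw))
```

To convert a whole file, read it line by line and write out `convert_line(line.rstrip("\n"))` for each line. The result is plain text, so a concordancer can search for a code such as ENT002 directly.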
 
Re: [Help] How to keep the formatting of spoken-transcription symbols unchanged when converting between Word and Notepad

Have you read this book?

Talking data. Transcription and coding in discourse research

* Editors: Edwards, Jane A. & Lampert, Martin D.
* Hillsdale, N.J. : Erlbaum 1993 - 325 p.
* ISBN: 0-8058-0349-1
 
Yes, it is a very good book in this area. Follow the link below and you can read the book online. It's rather dated on the technical side, as it was published more than 13 years ago, at a time when only a few people knew what a computer was.

Here is the link for reading it online (you may need to copy the whole line from "http" to "jsp" and paste it into your web browser to open it):

http://www.questia.com/library/book/talking-data-transcription-and-coding-in-discourse-research-by-jane-a-edwards-martin-d-lampert.jsp
 
Re: [Help] doc2txt: how to keep the formatting of spoken-transcription symbols unchanged during conversion

Some of the pages cannot be displayed there.
 