I'm working with some transcriptions and I've struggled with their normalization. Some of them have square brackets within other square brackets to specify the different noises / sound events that can be found when listening to the corresponding audio file. This is an example of one file's line:
U012_W038 [other_speech_adult: [laughter] yeah you can you can read [undefined] tomorrow] [other_speech_adult: are you recording me now] this is annoying eh [noise] [noise_bkgspeech/]
In every line the format corresponds to
<audio file reference> <transcription>
My ideal output would look something like this:
U012_W038 yeah you can you can read tomorrow are you recording me now this is annoying eh
I tried to solve this problem using sed, but I wouldn't mind trying perl or any other text processing tool. My closest attempt so far is:
sed 's/\[[^]]*]//g'
Do you think there is a way to solve this programmatically, or does it have to be checked manually?
Thanks in advance!
Based on your example, something like this should work:
perl -pe 's/\[[a-z_]+:|\[[a-z_\/]+\]|\]//g' file
This can be easily expressed in sed too, but the regex variations differ between dialects. If you have sed -E or sed -r, you could probably use this regex verbatim.
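This version only removes brackets that are properly paired and collapses repeated spaces, so any non-paired bracket is left in the output where it is easy to detect:
perl -pe 's/\[[^]:]+\]//g;s/\[[^]:]*:([^]:]+)*\]/\1/g;s/ +/ /g' file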
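If the annotations can nest deeper than in your example, a small parser may be easier to maintain than a chain of regexes. Below is a minimal Python sketch of that idea, not a tested tool. The function name strip_nested is hypothetical, and it assumes that a bracket containing a colon wraps spoken text that should be kept (everything after the first colon), that a bracket without a colon is a pure sound event to drop, and that the spoken text itself never contains a colon.

```python
from typing import Tuple


def strip_nested(line: str) -> Tuple[str, bool]:
    """Return (cleaned_text, balanced) for one transcription line.

    Hypothetical helper: "[label: text]" keeps text, "[label]" is dropped.
    """
    stack = [[]]      # one character buffer per open bracket; stack[0] is top level
    balanced = True
    for ch in line:
        if ch == "[":
            stack.append([])               # start collecting a new annotation
        elif ch == "]":
            if len(stack) == 1:
                balanced = False           # closing bracket without an opener
                continue
            inner = "".join(stack.pop())
            _label, sep, content = inner.partition(":")
            if sep:                        # "[label: text]" -> keep the text
                stack[-1].append(content)
            # no colon: a sound event such as "[noise]", so drop it
        else:
            stack[-1].append(ch)
    if len(stack) > 1:
        balanced = False                   # an opener was never closed; its text is dropped
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(stack[0]).split()), balanced


sample = ("U012_W038 [other_speech_adult: [laughter] yeah you can you can read "
          "[undefined] tomorrow] [other_speech_adult: are you recording me now] "
          "this is annoying eh [noise] [noise_bkgspeech/]")
text, ok = strip_nested(sample)
print(text)
print("balanced:", ok)
```

The second return value is False for lines with unmatched brackets, so only those lines would need checking by hand.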
You can use this command if your file is named audio:
sed 's/\[\([^]]*:\)//g; s/\[[^]]*]//g; s/]//g; s/  */ /g' audio
On your example, this gave me:
U012_W038 yeah you can you can read tomorrow are you recording me now this is annoying eh
Step by step, this command does the following:
sed 's/\[\([^]]*:\)//g' : deletes everything between [ and : (inclusive).
sed 's/\[[^]]*]//g' : deletes everything between [ and ] (inclusive).
sed 's/]//g' : deletes the remaining ].
sed 's/  */ /g' : collapses each run of consecutive blanks into a single blank.
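Since you said you would not mind another tool, the same four steps can be sketched in Python with the standard re module. This is only an illustration of the steps above. The function name strip_annotations is made up for the example, and unlike the sed command it also trims leading and trailing blanks.

```python
import re


def strip_annotations(line: str) -> str:
    """Apply the four substitutions described above to one line (hypothetical helper)."""
    # Step 1: delete the opening "[label:" of annotations that wrap spoken text.
    line = re.sub(r"\[[^\]]*:", "", line)
    # Step 2: delete complete "[...]" sound events such as "[noise]".
    line = re.sub(r"\[[^\]]*\]", "", line)
    # Step 3: delete the closing brackets left over from step 1.
    line = line.replace("]", "")
    # Step 4: collapse runs of blanks into one, then trim the ends
    # (the trimming is an addition; the sed command does not do it).
    return re.sub(r" {2,}", " ", line).strip()


sample = ("U012_W038 [other_speech_adult: [laughter] yeah you can you can read "
          "[undefined] tomorrow] [other_speech_adult: are you recording me now] "
          "this is annoying eh [noise] [noise_bkgspeech/]")
print(strip_annotations(sample))
```

To process a whole file, you could call the function on each line read from it.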