
Regex with multiple square brackets

I'm working with some transcriptions and I've struggled with their normalization. Some of them have square brackets within other square brackets to specify the different noises / sound events that can be found when listening to the corresponding audio file. This is an example of one file's line:

U012_W038 [other_speech_adult: [laughter] yeah you can you can read [undefined] tomorrow] [other_speech_adult: are you recording me now] this is annoying eh [noise] [noise_bkgspeech/]

In every line the format corresponds to

<audio file reference> <transcription>

My ideal output would be:

  1. Keep the text that is not enclosed in any square brackets, e.g. "this is annoying eh".
  2. Extract the text inside square brackets only if a ":" is found. The text to keep is whatever follows the colon, e.g. "yeah you can you can read".

The output should look something similar to this:

U012_W038 yeah you can you can read tomorrow are you recording me now this is annoying eh

I tried to solve this problem using sed, but I wouldn't mind trying perl or any other text processing tool. My closest attempt so far is:

sed 's/\[[^]]*]//g'

Do you think there's a way to solve this programmatically, or does it have to be checked manually?

Thanks in advance!

Based on your example, something like

perl -pe 's/\[[a-z_]+:|\[[a-z_\/]+\]|\]//g' file

This can easily be expressed in sed too, but regex syntax differs between sed dialects. If you have sed -E or sed -r, you could probably use this regex verbatim.
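As a rough illustration of the sed -E route, here is a minimal sketch. It assumes a sed that accepts -E (GNU and BSD sed both do), uses # as the delimiter so the slash needs no escaping, and adds two clean-up substitutions for whitespace that are not part of the perl one-liner. The function name strip_tags is hypothetical.

```shell
#!/bin/sh
# Sketch only: the perl alternation rewritten for sed -E (extended regex).
# strip_tags is a made-up helper name; it filters lines read from stdin.
strip_tags() {
    sed -E \
        -e 's#\[[a-z_]+:|\[[a-z_/]+\]|\]##g' \
        -e 's# +# #g' \
        -e 's# $##'
    # 1st -e: drop "[tag:" openers, whole "[tag]" markers, and stray "]"
    # 2nd -e: squeeze runs of spaces left behind by the deletions
    # 3rd -e: drop a single trailing space, if any
}

line='U012_W038 [other_speech_adult: [laughter] yeah you can you can read [undefined] tomorrow] [other_speech_adult: are you recording me now] this is annoying eh [noise] [noise_bkgspeech/]'
printf '%s\n' "$line" | strip_tags
```

On the sample line from the question this should print the requested output, "U012_W038 yeah you can you can read tomorrow are you recording me now this is annoying eh". Tag names containing characters other than a-z, _ and / would need a wider character class.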

This version

perl -pe 's/\[[^]:]+\]//g;s/\[[^]:]*:([^]:]+)*\]/\1/g;s/ +/ /g' file

leaves any unpaired brackets in the output, so they can be detected.
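One way to act on that, sketched here with sed -E rather than perl: run the same three substitutions, then check whether any bracket survived. The capture group is simplified to ([^]:]*), and the names clean_line and has_stray_bracket, as well as the two sample lines, are made up for illustration.

```shell
#!/bin/sh
# Sketch only: three-step clean-up, then a test for leftover brackets.
# clean_line and has_stray_bracket are hypothetical helper names.
clean_line() {
    sed -E \
        -e 's/\[[^]:]+\]//g' \
        -e 's/\[[^]:]*:([^]:]*)\]/\1/g' \
        -e 's/ +/ /g'
    # 1st -e: remove bracketed markers that contain no colon, e.g. [noise]
    # 2nd -e: replace "[tag: text]" with just " text"
    # 3rd -e: squeeze runs of spaces
}

has_stray_bracket() {
    # Succeeds when the given text still contains a [ or a ].
    printf '%s\n' "$1" | grep -q '[][]'
}

good='U1 [tag: [noise] hello there] bye'    # brackets are balanced
bad='U2 [tag: hello there bye [noise]'      # closing bracket is missing

cleaned_good=$(printf '%s\n' "$good" | clean_line)
cleaned_bad=$(printf '%s\n' "$bad" | clean_line)

for text in "$cleaned_good" "$cleaned_bad"; do
    if has_stray_bracket "$text"; then
        printf 'CHECK: %s\n' "$text"
    else
        printf 'OK:    %s\n' "$text"
    fi
done
```

Lines reported as CHECK are the ones worth inspecting by hand, since a bracket that survives the substitutions had no matching partner.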

You can use this command if your file is named audio:

sed 's/\[\([^]]*:\)//g; s/\[[^]]*]//g; s/]//g; s/  */ /g' audio

On your example, this gave me:

U012_W038 yeah you can you can read tomorrow are you recording me now this is annoying eh

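Step by step, this command does the following:

  • sed 's/\[\([^]]*:\)//g' : deletes everything from [ up to and including the following : (the tag name and its colon).

  • sed 's/\[[^]]*]//g' : deletes everything from [ to the next ], both included (tags without a colon, such as [noise]).

  • sed 's/]//g' : deletes the remaining ] characters.

  • sed 's/  */ /g' : collapses each run of consecutive spaces into a single space.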
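To watch the steps above take effect, the four substitutions can be run one at a time on the sample line. This is only a sketch for inspection; the variable names step1 to step4 are arbitrary.

```shell
#!/bin/sh
# Sketch only: apply each substitution separately and print the
# intermediate results so the effect of every step is visible.
line='U012_W038 [other_speech_adult: [laughter] yeah you can you can read [undefined] tomorrow] [other_speech_adult: are you recording me now] this is annoying eh [noise] [noise_bkgspeech/]'

step1=$(printf '%s\n' "$line"  | sed 's/\[\([^]]*:\)//g')  # drop "[tag:" openers
step2=$(printf '%s\n' "$step1" | sed 's/\[[^]]*]//g')      # drop "[tag]" markers
step3=$(printf '%s\n' "$step2" | sed 's/]//g')             # drop leftover "]"
step4=$(printf '%s\n' "$step3" | sed 's/  */ /g')          # squeeze spaces

# Each value is wrapped in <> so leading and trailing spaces are visible.
printf '1: <%s>\n2: <%s>\n3: <%s>\n4: <%s>\n' "$step1" "$step2" "$step3" "$step4"
```

Note that the final result keeps one trailing space, left over from the deleted [noise] tags at the end of the line; appending s/ $// to the sed script removes it if that matters.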
