How to match numbers on one list (e.g. 2 and 3) with the approximate sum on another list (e.g. 5)?

I am trying to match some audio files to some written passages of text.

I started with a single audio file of someone reading the typed passage. Then I split the audio file at every period of silence with sox, and similarly split the typed text so that each sentence is on its own line.

However, the audio splits did not fall exactly at the end of every sentence; they occurred wherever the speaker paused. I need to create a list of which audio files correspond to which typed sentences, e.g.:

0001.wav This is a sentence.
0002.wav This is another sentence.

Note that sometimes two or more audio files correspond to a single sentence, e.g.:

  • 0001.wav ("this is a") + 0002.wav ("sentence") = "This is a sentence."

To help with matching the texts, I've used software to count the syllables in the audio and count the syllables in the typed text.

I have two files with this data. The first, "sentences.txt", is a list of all of the sentences from the text, presented one per line, with their syllable count, eg:

5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

I can remove the sentence data with awk -F" " '{ print $1 }' sentences.txt to get this syllables_in_text.txt:

5
7
8
9

The second file, syllables_in_audio.txt, lists the audio files in the same order, with approximate syllable counts. The counts are sometimes a little lower than the actual number in the text, because the syllable-counting software is not perfect:

0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3

How can I print a list ("output.txt") of audio files such that the audio file names appear on the same line as the corresponding sentences in "sentences.txt", e.g.:

0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0008.wav

Below is a table showing how the two files line up when placed side by side. Files "0001.wav" and "0002.wav" are both needed to make the sentence "This is a sentence." These file names are listed on line 1 of "output.txt", while the corresponding sentence is on line 1 of "sentences.txt":

Contents of "output.txt":    | Contents of "sentences.txt":
0001.wav 0002.wav            | 5 This is a sentence.
0003.wav 0004.wav            | 7 This is another sentence.
0005.wav                     | 8 This is yet another sentence.
0006.wav 0007.wav 0008.wav   | 9 This is still yet another sentence.

You can create an awk script as follows. Pseudocode:

BEGIN { 
        init counter=0
        read your first file (syllables_in_text.txt) with getline till the end (while...)
            counter++
            store its value in firstfile[counter]
        # when you have finished reading your first file
        init another_counter=1 and sum=0
        read your second file (syllables_in_audio.txt) with getline till the end (while...)
            # $1 is the file name, $2 its approximate syllable count
            if o[another_counter] is not empty, sentences remain, and
               sum + $2 is no closer to firstfile[another_counter] than sum is
                 another_counter++
                 sum = 0
            append $1 like o[another_counter]=o[another_counter] " " $1
            sum += $2
        finally loop over the o array in index order (1 to another_counter)
            print its contents after removing the leading space
}

But there are other solutions as well...
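
One such approach, consistent with the sample data in the question, is a greedy pass that keeps adding audio files to the current sentence for as long as each one brings the running syllable total closer to that sentence's count. The following is only a minimal sketch under stated assumptions: the function name match_audio is hypothetical, the "closer to the target" rule and the choice to give all leftover files to the last sentence are assumptions rather than anything the question specifies, and it uses the two-file NR==FNR idiom instead of getline.

```shell
#!/bin/sh
# match_audio TEXT_COUNTS AUDIO_COUNTS
# Hypothetical helper (name and matching rule are assumptions).
# TEXT_COUNTS:  one syllable count per sentence, one per line.
# AUDIO_COUNTS: "<file name> <approximate syllable count>" per line.
match_audio() {
    awk '
        function abs(x) { return x < 0 ? -x : x }

        BEGIN { s = 1 }                     # index of the current sentence

        # First file: remember the target count of every sentence.
        NR == FNR { if (NF) target[++n] = $1; next }

        # Skip malformed lines in the second file.
        NF < 2 { next }

        {
            # Start the next sentence when the current one already has a
            # file and this file would not bring the running total closer
            # to the target. The last sentence absorbs whatever is left.
            if (out[s] != "" && s < n &&
                abs(sum + $2 - target[s]) >= abs(sum - target[s])) {
                s++
                sum = 0
            }
            out[s] = (out[s] == "" ? $1 : out[s] " " $1)
            sum += $2
        }

        # One output line per sentence, in the original order.
        END { for (i = 1; i <= n; i++) print out[i] }
    ' "$1" "$2"
}
```

With the sample files from the question, match_audio syllables_in_text.txt syllables_in_audio.txt > output.txt groups the files as 0001+0002, 0003+0004, 0005 and 0006+0007+0008. Because the counts are approximate, a greedy rule like this can drift on real data, so the result is worth spot-checking by ear.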

Can you explain the rule for matching numbers on one list (2 and 3) with a number on the other list (5)?

I made up the sample below to get started; please correct me if it is wrong.

$ cat sentences.txt
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

$ cat syllables_in_audio.txt
0001.wav 5
0002.wav 5
0003.wav 7
0004.wav 7
0005.wav 8
0006.wav 9
0007.wav 9
0008.wav 9

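With this data, where each audio file carries exactly the syllable count of its sentence, you can run the following awk command to get the output (the order array keeps the sentences in file order, since for (i in a) does not guarantee any order):

awk 'NR==FNR{a[$1]=$0;order[++n]=$1;next}{b[$2]=b[$2]==""?$1:b[$2] FS $1}END{for (i=1;i<=n;i++) printf "%-40s|%s\n", b[order[i]], a[order[i]]}' sentences.txt syllables_in_audio.txt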

Result:

0001.wav 0002.wav                       |5 This is a sentence.
0003.wav 0004.wav                       |7 This is another sentence.
0005.wav                                |8 This is yet another sentence.
0006.wav 0007.wav 0008.wav              |9 This is still yet another sentence.
