
How to match numbers on one list (e.g. 2 and 3) with the approximate sum on another list (e.g. 5)?

I am trying to match some audio files to some written passages of text.

I started with a single audio file of someone reading the typed passage. Then I split the audio file at every period of silence with sox, and similarly split the typed text so that each sentence is on its own line.

However, the splits did not fall exactly at every full stop; they occurred wherever the speaker paused. I need to create a list of which audio files correspond to which typed sentences, e.g.:

0001.wav This is a sentence.
0002.wav This is another sentence.

Note that sometimes two or more audio files correspond to a single sentence, e.g.:

  • 0001.wav ("this is a") + 0002.wav ("sentence") = "This is a sentence."

To help with matching the texts, I've used software to count the syllables in the audio and count the syllables in the typed text.

I have two files with this data. The first, "sentences.txt", is a list of all of the sentences from the text, presented one per line, with their syllable count, eg:

5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

I can remove the sentence data with awk '{ print $1 }' sentences.txt to get this syllables_in_text.txt:

5
7
8
9

The second file, syllables_in_audio.txt, lists the audio files in the same order with approximate syllable counts. The counts are sometimes a little lower than the actual number in the text, because the syllable-counting software is not perfect:

0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3

How can I print a list ("output.txt") of audio files such that each line holds the audio filenames for the sentence on the same line of "sentences.txt", e.g.:

0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0008.wav

Below is a table showing how the two files line up when placed side by side. Files "0001.wav" and "0002.wav" are both needed to make the sentence "This is a sentence." These filenames are listed on line 1 of "output.txt", while the corresponding sentence is on line 1 of "sentences.txt":

Contents of "output.txt":    | Contents of "sentences.txt":
0001.wav 0002.wav            | 5 This is a sentence.
0003.wav 0004.wav            | 7 This is another sentence.
0005.wav                     | 8 This is yet another sentence.
0006.wav 0007.wav 0008.wav   | 9 This is still yet another sentence.

You can create an awk script as follows. Pseudocode:

        init counter=0
        read your first file (syllables_in_text.txt) with getline till the end (while...)
            increment counter and store the value in firstfile[counter]
        # when you have finished reading your first file
        init another_counter=1 and sum=0
        read your second file (syllables_in_audio.txt) with getline till the end (while...)
            if o[another_counter] is not empty and sum + $2 (second col from your file) is no closer to firstfile[another_counter] than sum
                 increment another_counter and reset sum=0
            append $1 to the current line: o[another_counter]=o[another_counter] " " $1
            add $2 to sum
        finally loop over the o array in index order
            print its contents after removing the leading space
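
One way to turn the pseudocode above into something runnable is sketched below. It is a sketch under stated assumptions, not a definitive solution: the helper name match_audio is hypothetical, and the grouping rule (keep appending audio files to the current sentence while that moves the running syllable sum strictly closer to the sentence count, and give everything left over to the last sentence) is an assumed heuristic, chosen because the audio counts are only approximate. It reads sentences.txt directly and uses the first column as the syllable count.

```shell
#!/bin/sh
# match_audio SENTENCES_FILE AUDIO_FILE
# Hypothetical helper: prints one line of audio filenames per sentence.
# Assumes both files are in reading order and SENTENCES_FILE is not empty.
match_audio() {
    awk '
        BEGIN { s = 1 }    # index of the sentence currently being filled

        # First file: remember the syllable count of each sentence, in order.
        NR == FNR { target[++n] = $1; next }

        # Second file: one audio file and its approximate count per line.
        {
            before = target[s] - sum;        if (before < 0) before = -before
            after  = target[s] - (sum + $2); if (after  < 0) after  = -after

            # Assumed rule: move on to the next sentence when this file would
            # not bring the running sum strictly closer to the target.
            # The last sentence takes every remaining file.
            if (out[s] != "" && after >= before && s < n) { s++; sum = 0 }

            out[s] = (out[s] == "" ? "" : out[s] " ") $1
            sum += $2
        }

        END { for (i = 1; i <= n; i++) print out[i] }
    ' "$1" "$2"
}
```

With the sample data from the question, match_audio sentences.txt syllables_in_audio.txt > output.txt groups the files as 0001+0002, 0003+0004, 0005 and 0006+0007+0008. Because the rule is greedy, one badly miscounted file can shift every later line, so the result is worth spot-checking.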

But there are other solutions as well...

Can you explain the rule for matching numbers on one list (2 and 3) to a number on the other list (5)?

I made up a sample to get started; please correct me if it is wrong.

$ cat sentences.txt
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

$ cat syllables_in_audio.txt
0001.wav 5
0002.wav 5
0003.wav 7
0004.wav 7
0005.wav 8
0006.wav 9
0007.wav 9
0008.wav 9

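With that data, where every audio file carries the exact syllable count of its sentence, this awk command produces the output below. The k array keeps the sentences in file order, because the order of for (i in a) is unspecified in awk:

awk 'NR==FNR{a[$1]=$0;k[++n]=$1;next}{b[$2]=(b[$2]==""?$1:b[$2] FS $1)}END{for(j=1;j<=n;j++)printf "%-40s|%s\n",b[k[j]],a[k[j]]}' sentences.txt syllables_in_audio.txt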


0001.wav 0002.wav                       |5 This is a sentence.
0003.wav 0004.wav                       |7 This is another sentence.
0005.wav                                |8 This is yet another sentence.
0006.wav 0007.wav 0008.wav              |9 This is still yet another sentence.
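
The same side-by-side layout can also be built from an "output.txt" produced separately, without relying on exact syllable counts as keys. The following is a minimal sketch; the helper name side_by_side is hypothetical, and it assumes both files have the same number of lines in matching order.

```shell
#!/bin/sh
# side_by_side OUTPUT_FILE SENTENCES_FILE
# Hypothetical helper: pairs line N of the first file with line N of the second.
side_by_side() {
    awk '
        # First file: store each line of grouped audio filenames by line number.
        NR == FNR { files[FNR] = $0; next }
        # Second file: print the stored filenames, padded to 40 columns,
        # next to the sentence on the same line number.
        { printf "%-40s|%s\n", files[FNR], $0 }
    ' "$1" "$2"
}
```

For example, side_by_side output.txt sentences.txt prints each group of filenames next to its sentence, in the same format as the listing above.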
