
Find index of subset of list containing strings

I am working on NLP in Python. I have converted audio files to text, found the time offset of each spoken word, and stored the words in wordlist and the times in timelist.

I have three lists: strlist, wordlist, and timelist. strlist contains a phrase, for example:

strlist = ["in", "the", "family"]

wordlist contains a paragraph or a few sentences:

wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]

timelist contains the time value for each word stored in wordlist, for example:

timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

I want to know whether the phrase in strlist (a few words) is present in wordlist. If it is, I want to look up the time values stored in timelist for those words.

import io
from pathlib import Path

from google.cloud import speech_v1 as speech
from google.oauth2 import service_account

# Load the service-account credentials for the Speech-to-Text client
credentials = service_account.Credentials.from_service_account_file(
    'proven-mystery-310205-f04fb2ab3d69.json')

phrase = 'in my family'  # phrase to search for
strlist = phrase.split(" ")
timelist = []
wordlist = []
strlist.append("")
for i in strlist:
    print(i)

speech_file = Path("C:/Users/Tani/PycharmProjects/pythonProject/t.wav")
print("Start")

print("checking credentials")
client = speech.SpeechClient(credentials=credentials)
print("Checked")

with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()
print("audio file read")

audio = speech.RecognitionAudio(content=content)

print("config start")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    language_code='en-US',
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    enable_word_time_offsets=True)

print("Recognizing:")
response = client.recognize(config=config, audio=audio)
print("Recognized")

# Collect every recognized word together with its start time
for result in response.results:
    alternative = result.alternatives[0]
    # print('Transcript: {}'.format(alternative.transcript))
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        wordlist.append(word)
        timelist.append(start_time.seconds)

print(phrase)
for a, b in zip(wordlist, timelist):
    print('Word: {}, time: {}'.format(a, b))

print("findout time")
# Look up the time of each phrase word individually
for s in strlist:
    if s in wordlist:
        position = wordlist.index(s)
        time_s = timelist[position]
        print(f"Word: '{s}', Time: {time_s}")

Here is code that does the job. It can be improved, of course:

strlist = ["in", "the", "family"]
wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

for s in strlist:
    if s in wordlist:
        position = wordlist.index(s)
        time_s = timelist[position]
        print(f"Word: '{s}', Time: {time_s}")

And the output is:

Word: 'in', Time: 8
Word: 'the', Time: 9
Word: 'family', Time: 5
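
Note that the loop above looks up each word on its own, so it would also report times when the words are scattered through wordlist rather than spoken as one phrase. If the phrase has to match as a consecutive run, a sliding-window comparison is one way to check that. This is only a sketch: the helper name find_phrase is made up for illustration, and it returns the first match only.

```python
def find_phrase(phrase_words, words):
    """Return the start index of the first consecutive match, or -1."""
    n = len(phrase_words)
    if n == 0:
        return -1
    # Compare each window of length n against the phrase
    for start in range(len(words) - n + 1):
        if words[start:start + n] == phrase_words:
            return start
    return -1


strlist = ["in", "the", "family"]
wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

start = find_phrase(strlist, wordlist)
if start != -1:
    # The matching times sit at the same positions as the matched words
    phrase_times = timelist[start:start + len(strlist)]
    for word, time_s in zip(strlist, phrase_times):
        print(f"Word: '{word}', Time: {time_s}")
else:
    phrase_times = []
    print("Phrase not found")
```

With the sample lists this prints the same three lines as above, because the phrase starts at index 6.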

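Here is another version that produces the same results, but only if wordlist has no repeated words (the dictionary keeps only the last time seen for each word) and every word in strlist is present (otherwise the lookup raises a KeyError):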

strlist = ["in", "the", "family"]
wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

# Map each word to its time; named word_times to avoid shadowing the built-in map
word_times = {word: time for word, time in zip(wordlist, timelist)}
for s in strlist:
    print(f"Word: '{s}', Time: {word_times[s]}")

Feel free to test both.
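
If the transcript can contain repeated words, one option is to keep every time for each word instead of a single one. The sketch below uses collections.defaultdict for that. The sample data is made up to include repeats, and times_by_word is just an illustrative name.

```python
from collections import defaultdict

# Made-up sample data in which "in" and "the" each occur twice
wordlist = ["in", "the", "house", "in", "the", "family"]
timelist = [1, 2, 3, 7, 8, 9]

# Collect all times for each word, in order of appearance
times_by_word = defaultdict(list)
for word, time_s in zip(wordlist, timelist):
    times_by_word[word].append(time_s)

for s in ["in", "the", "family", "garden"]:
    # .get avoids a KeyError and does not add missing words to the dict
    print(f"Word: '{s}', Times: {times_by_word.get(s, [])}")
```

A word that never occurs, such as "garden" here, simply gets an empty list.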


 