How to compare strings that do not match exactly
I need to compare two output strings, namely the original transcription and the transcription produced by a Speech-to-Text service. Numbers are often written either as digits or as words, e.g. "4" or "four". How can I compare the strings while taking these different ways of transcribing into account?
So far I have only converted both strings to lowercase and split them into words, using a space as the separator.
# Read the two files and store them in s1_raw and s2_raw
with open('original.txt', 'r') as f:
    s1_raw = f.read()
with open('comparison.txt', 'r') as f:
    s2_raw = f.read()
# Convert all letters to lowercase
s1 = s1_raw.lower()
s2 = s2_raw.lower()
# Split the texts on spaces to get lists of words
s1_set = s1.split(' ')
s2_set = s2.split(' ')
# Used later for the confidence calculation
count1 = len(s1_set)
count2 = 0
# Compare word by word; zip() stops at the end of the shorter list,
# which prevents running out of indices
for w1, w2 in zip(s1_set, s2_set):
    if w1 == w2:
        count2 += 1
# Confidence level = correct words divided by total words
confidence = count2 / count1
# Print out the result
print('The confidence level of this service is {:.2f}%'.format(confidence*100))
I want to measure the transcription accuracy for several *.txt files and account for the different ways in which the various Speech-to-Text services transcribe.
You have to normalize the text before comparing it. First decide whether "four" or "4" is your canonical form, and convert all strings to that form.
For example, if "four" is the canonical form, then write code to replace "1" with "one", "213" with "two hundred and thirteen", and so on, and do the comparison with these.
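A minimal sketch of that replacement, assuming English text and whole numbers from 0 to 999 only. The helper names number_to_words and digits_to_words are made up for this illustration, and the spelling conventions (no hyphen in "twenty one", "and" after "hundred") are assumptions that would need to match what your transcripts actually contain:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is handled in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " and " + number_to_words(rest)


def digits_to_words(text):
    """Replace each standalone run of 1 to 3 digits with its spelled-out form."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


print(digits_to_words("i have 4 apples and 213 pears"))
```

Longer digit runs such as years are left untouched by the pattern, and decimals or ordinals are not handled at all, so a real normalizer would need more rules than this.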
Actually, I think it is better to normalize to "4" rather than "four", since in some languages there can be more than one way to express a number in words. By preferring "4", all equivalent transcriptions can be normalized to one single form.
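As a rough sketch of normalizing towards digits, assuming English number words below 1000: the functions parse_below_100, parse_number, words_to_digits and normalize are hypothetical names invented here, not part of any library. Hyphens are split so that "twenty-one" and "twenty one" end up the same, and an optional "and" after "hundred" is accepted only when a number follows it.

```python
UNITS = {w: i for i, w in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
     "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
     "sixteen", "seventeen", "eighteen", "nineteen"])}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def parse_below_100(words, i):
    """Return (value, tokens_used) for a number below 100 starting at words[i]."""
    if i >= len(words):
        return 0, 0
    w = words[i]
    if w in TENS:
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt in UNITS and 1 <= UNITS[nxt] <= 9:
            return TENS[w] + UNITS[nxt], 2
        return TENS[w], 1
    if w in UNITS:
        return UNITS[w], 1
    return 0, 0


def parse_number(words, i):
    """Return (value, tokens_used) for a number below 1000 starting at words[i]."""
    if (i + 1 < len(words) and words[i] in UNITS
            and 1 <= UNITS[words[i]] <= 9 and words[i + 1] == "hundred"):
        value = UNITS[words[i]] * 100
        j = i + 2
        if j < len(words) and words[j] == "and":
            j += 1  # tentatively skip "and"; kept only if a number follows
        rest, rest_used = parse_below_100(words, j)
        if rest_used:
            return value + rest, (j - i) + rest_used
        return value, 2
    return parse_below_100(words, i)


def words_to_digits(words):
    """Collapse runs of number words in a word list into digit strings."""
    out = []
    i = 0
    while i < len(words):
        value, used = parse_number(words, i)
        if used:
            out.append(str(value))
            i += used
        else:
            out.append(words[i])
            i += 1
    return out


def normalize(text):
    """Lowercase, split hyphenated words, tokenize on whitespace, convert numbers."""
    return words_to_digits(text.lower().replace("-", " ").split())


print(normalize("I have Four apples") == normalize("i have 4 apples"))
```

Note that a word like "one" used as a pronoun is converted as well; since both transcripts pass through the same function, that affects both sides alike and the comparison stays consistent.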
Thanks @Michael Veksler. I have now tried the NLTK library to split the string into word lists more efficiently. I also tried looking up synonyms of each word and comparing whether the synonyms match. This still doesn't really solve the task, so I wonder what else I could try.
I use these two imports from NLTK:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
Splitting the words is as easy as:
s1_set = word_tokenize(list1)
Now I look up synonyms of each word and take the first one found, appending it to a list named "wl1". I check first whether any synonym was found, since this is not always the case.
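wl1 = []
# Loop over every word (range(0, len(s1_set)-1) would skip the last one)
for word in s1_set:
    # Look up the WordNet synsets of the word
    t1 = wordnet.synsets(word)
    # Ensure t1 isn't empty
    if t1:
        # Take the first lemma name of the first synset
        wl1.append(t1[0].lemmas()[0].name())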
Then I compare word by word again, as in my first post above. This method isn't a satisfying solution to my problem either. Can anyone think of a better method?