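How to find character offsets in texts using Python
My goal is to identify the matching strings in two aligned text documents, and then find the position of the starting character of each matching string in each document.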
doc1=['the boy is sleeping', 'in the class', 'not at home']
doc2=['the girl is reading', 'in the class', 'a serious student']
My attempt:
# find matching string(s) that exist in both document list:
matchstring=[x for x in doc1 if x in doc2]
Output: matchstring=['in the class']
The problem now is to find the character offsets of the matching string in doc1 and doc2 (excluding punctuation, including spaces).
Desired result:
Position of starting character for matching string in doc1=20
Position of starting character for matching string in doc2=20
Any ideas on this text alignment problem? Thanks.
Hey, try this:
doc1=['the boy is sleeping', 'in the class', 'not at home']
doc2=['the girl is reading', 'in the class', 'a serious student']
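# strings present in both documents (set intersection; order is not preserved)
matches = set(doc1) & set(doc2)
# concatenate each document with no separator so offsets count every character
text1, text2 = ''.join(doc1), ''.join(doc2)
for m in matches:
    # str.find returns a 0-based index (-1 if absent), so add 1 for a 1-based position
    print("Position of starting character for matching string in doc1=%d" % (text1.find(m) + 1))
    print("Position of starting character for matching string in doc2=%d" % (text2.find(m) + 1))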
It gives exactly the result you expect (20 for both documents)!
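One caveat with searching the joined text: str.find returns the first place the characters occur, which could be inside an earlier segment or straddle a segment boundary (for example, 'in the class' is also found inside 'win the class'). A minimal sketch of an alternative, assuming segments are concatenated with no separator as in the answer above, is to accumulate segment lengths instead of searching. The helper names segment_offsets and matching_offsets and their parameters are hypothetical, introduced here only for illustration; the strip_punctuation option is one possible reading of "excluding punctuation" in the question.

```python
import string

def segment_offsets(segments, one_based=True, strip_punctuation=False):
    """Map each segment to the offset of its first character in the joined text."""
    # Assumption: segments are concatenated with no separator,
    # which is what ''.join(doc) does in the answer above.
    table = str.maketrans('', '', string.punctuation)
    offsets = {}
    position = 1 if one_based else 0
    for segment in segments:
        if strip_punctuation:
            # Drop ASCII punctuation so it does not count toward offsets.
            segment = segment.translate(table)
        # Keep only the first occurrence of a repeated segment.
        offsets.setdefault(segment, position)
        position += len(segment)
    return offsets

def matching_offsets(doc1, doc2, **kwargs):
    """Return {segment: (offset_in_doc1, offset_in_doc2)} for shared segments."""
    offsets1 = segment_offsets(doc1, **kwargs)
    offsets2 = segment_offsets(doc2, **kwargs)
    # Iterating over offsets1 keeps the order in which segments appear in doc1.
    return {s: (offsets1[s], offsets2[s]) for s in offsets1 if s in offsets2}

doc1 = ['the boy is sleeping', 'in the class', 'not at home']
doc2 = ['the girl is reading', 'in the class', 'a serious student']
result = matching_offsets(doc1, doc2)
print(result)  # {'in the class': (20, 20)}
```

Because offsets come from running totals of whole segments, a match is only ever reported at a segment boundary, and several shared segments are handled without concatenating them into one search string.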