Correct offset of characters with extra white spaces, newlines, etc.
I am trying to implement a simple solution for correcting offsets shifted by 'junk' characters, for example:
string_1 = "London is the capital of the UK"
-> char locations: ("capital", 14, 21) and ("UK", 29, 31)
However, in the presence of newlines etc., the character locations change:
string_2 = "London is the\n\ncapital of the\n UK"
-> char locations: ("capital", 15, 22) and ("UK", 31, 33)
Moreover, the string may contain any number of newlines and other artefacts, as well as any number of keywords.
My question is: given a text with extra characters (ASCII, newlines, etc.) and some cleaning function, how can I adjust the locations of certain keywords in the cleaned text?
string_2 = cleaning_txt(string_1)
-> ("capital", 14, 21) --> ("capital", 15, 22)
-> ("UK", 29, 31) --> ("UK", 31, 33)
str.find works fine for this purpose:
string_1 = "London is the capital of the UK"
string_2 = "London is the\n\ncapital of the UK"
def find_pos(s, match):
    # str.find returns the index of the first occurrence, or -1 if absent
    pos = s.find(match)
    return (pos, pos + len(match))
match = 'capital'
find_pos(string_1, match)
# (14, 21)
find_pos(string_2, match)
# (15, 22)
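Since str.find reports only the first occurrence and the question mentions any number of keywords, here is a minimal sketch that collects every occurrence. The helper name find_all_pos is hypothetical (it is not part of the answer above), and it relies on re.finditer, so overlapping occurrences are not reported.

```python
import re

def find_all_pos(s, match):
    """Return a (start, end) tuple for every non-overlapping occurrence of match in s."""
    # re.escape makes the keyword match literally rather than as a regex pattern
    return [m.span() for m in re.finditer(re.escape(match), s)]

string_1 = "London is the capital of the UK"
positions = find_all_pos(string_1, "the")
print(positions)
```

A keyword that does not occur yields an empty list, which avoids the -1 sentinel that str.find returns.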
You can clean your string this way:
x = "London is the\n\ncapital of the UK"
x = x.replace('\n','')
# collapse runs of spaces: replace two spaces with one until none remain
while '  ' in x:
    x = x.replace('  ', ' ')
print(x)
Gives:
London is thecapital of the UK
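Cleaning alone discards the information needed to translate offsets between the two strings. One possible approach, sketched here under assumptions, is to record the raw index of every character that survives cleaning. The names clean_with_map and to_raw_span are hypothetical, and the cleaning rule (collapse each whitespace run into a single space) is swapped in for illustration; it differs from the replace-based cleaning above, which deletes newlines outright.

```python
def clean_with_map(raw):
    """Collapse each whitespace run to one space.

    Returns the cleaned text and a list giving, for every cleaned
    character, its index in the raw string.
    """
    cleaned, index_map = [], []
    prev_space = False
    for i, ch in enumerate(raw):
        if ch.isspace():
            if prev_space:
                continue  # drop the rest of a whitespace run
            ch = " "
            prev_space = True
        else:
            prev_space = False
        cleaned.append(ch)
        index_map.append(i)
    return "".join(cleaned), index_map

def to_raw_span(index_map, start, end):
    """Map a non-empty half-open span in the cleaned text back to the raw text."""
    return index_map[start], index_map[end - 1] + 1

raw = "London is the\n\ncapital of the\n UK"
cleaned, index_map = clean_with_map(raw)
for keyword in ("capital", "UK"):
    start = cleaned.find(keyword)
    print(keyword, (start, start + len(keyword)),
          "-->", to_raw_span(index_map, start, start + len(keyword)))
```

Because the map is built during cleaning, it works for any number of keywords and any amount of extra whitespace; a different cleaning rule would only need to append to index_map whenever it keeps a character.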