
Python Regex Find Word with Random White Space Mixed in

How do you write a regular expression to match a specific word in a string when whitespace has been inserted at random places?

I've got a string that was extracted from a PDF document with a table structure. As a consequence of that structure, the extracted string contains randomly inserted newlines and whitespace. The specific words and phrases I'm looking for are there, with all characters in the correct order, but chopped up randomly by whitespace. For example: "sta ck over flow".

The content of the PDF document was extracted with PyPDF2, as this is the only option available in my company's Python library.

I know that I can write a specific pattern for this that allows optional whitespace after every character, but there must be a better way of searching for it.

Here's an example of what I've been trying to do.

import re

my_string = "find the ans weron sta ck over flow"
# Allow optional whitespace after every character of the target word, and so on for other words
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)

Any suggestions?

The best you can probably do here is to strip all whitespace and then search for the target string inside the stripped text:

import re

my_string = "find the ans weron sta ck over flow"
# Remove every run of whitespace, then do a plain substring check
my_string = re.sub(r'\s+', '', my_string)
if 'stack' in my_string:
    print("MATCH")

The reason I say "best" above is that, in general, you won't know whether a space is an actual word boundary or just random whitespace that has been inserted. So the most you can really do is find your target as a substring of the stripped text. Note that the input text 'rust acknowledge' would now also match stack.
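To make that trade-off concrete, here is a minimal sketch of the strip-then-search approach wrapped in a helper. The function name `contains_ignoring_whitespace` is hypothetical and only used for illustration. It also strips whitespace from the target, so multi-word phrases can be passed in.

```python
import re

def contains_ignoring_whitespace(text, target):
    """Hypothetical helper: True if target occurs in text once all whitespace is removed."""
    stripped_text = re.sub(r'\s+', '', text)
    stripped_target = re.sub(r'\s+', '', target)
    return stripped_target in stripped_text

# The intended hit from the question
print(contains_ignoring_whitespace("find the ans weron sta ck over flow", "stack"))  # True

# The false positive described above: "rust acknowledge" contains s-t-a-c-k across a real word boundary
print(contains_ignoring_whitespace("rust acknowledge", "stack"))  # True
```

Because real word boundaries are discarded along with the spurious ones, this check cannot tell the two cases apart.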

Actually, what you're doing is the best way. The only addition I can suggest is to construct such a regexp dynamically from a word:

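import re

word = "stack"
# Join the characters with \s* so whitespace may appear between any two of them;
# re.escape keeps characters such as "." or "+" literal
regexp = r'\s*'.join(map(re.escape, word))
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)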
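The same idea can be sketched as a reusable function that also reports where the match sits in the original, unstripped string. The names `build_spaced_pattern` and `find_spaced` are hypothetical. Dropping whitespace from the phrase itself is an assumption made here so that a phrase like "stack overflow" can be searched for as well.

```python
import re

def build_spaced_pattern(phrase):
    """Hypothetical helper: compile a regex allowing whitespace between any two characters."""
    # Ignore whitespace in the phrase itself, since \s* already covers every gap
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r'\s*'.join(chars))

def find_spaced(phrase, text):
    """Return the first match object for phrase in text, or None if there is none."""
    return build_spaced_pattern(phrase).search(text)

my_string = "find the ans weron sta ck over flow"

match = find_spaced("stack", my_string)
if match:
    print(repr(match.group()), match.span())

match = find_spaced("stack overflow", my_string)
if match:
    print(repr(match.group()))  # 'sta ck over flow'
```

Unlike stripping the whole text first, this keeps the original string intact, so the match offsets can be used to cut out or replace the exact span. It still matches across real word boundaries, so 'rust acknowledge' would match "stack" here too.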

