
“Pair words Extractor” - Regular Expression

I am using a regular expression in Python to extract "and" words, meaning pairs of words that are separated by "and".

For example:

  • banking and finance
  • profit and loss

Effort so far:

import re

# Any token containing "and", plus one optional word on either side
regex = re.compile(r'(?:\S+\s)?\S*and\S*(?:\s\S+)?')
with open("sample.txt", "r") as read, open("write.txt", "w") as f:
    for line in read:
        for word in regex.findall(line):
            f.write(word + '\n')

This code seems to work well, but it also finds "and" inside words such as "commands".
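To make the problem concrete, here is a minimal sketch that runs the same pattern on the sample sentence given under "Input" further down, instead of reading a file (the variable names are only illustrative). The \S*and\S* part accepts any token that merely contains the letters "and", which is why "commands" is picked up:

```python
import re

text = ("This is a sample text to find profit and loss. "
        "It should also find banking and finance. "
        "But it should not find commands.")

# Original pattern: optional word, any token containing "and", optional word
regex = re.compile(r'(?:\S+\s)?\S*and\S*(?:\s\S+)?')

matches = regex.findall(text)
print(matches)
# ['profit and loss.', 'banking and finance.', 'find commands.']
```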

So I used this regular expression:

regex = re.compile(r'[a-zA-Z]+\s?\S*and\S*\s+[a-zA-Z]+')

This works well on a website, but in Python it returns only the word "and", without the preceding and following words.

My intention is to find words separated by "and" inside a document.

Input

This is a sample text to find profit and loss. It should also find banking and finance. But it should not find commands.

Current output

  • profit and loss.
  • banking and finance.
  • find commands.

Expected output

  • profit and loss
  • banking and finance

You're making this more complicated than it needs to be. Just use the following regex:

\S+\sand\s\S+


The issue was the \S* you added around the "and". That matches any number of non-whitespace characters around the "and", so it also matches words like "brandy".
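A small sketch of that difference, using the sample sentence from the question (the names loose, strict and cleaned are just for illustration). Note that \S+ also takes in the trailing full stop, so the final rstrip step is one possible clean-up that I am assuming here, not something the regex itself does:

```python
import re

text = ("This is a sample text to find profit and loss. "
        "It should also find banking and finance. "
        "But it should not find commands.")

loose = re.compile(r'\S*and\S*')       # "and" anywhere inside a token
strict = re.compile(r'\S+\sand\s\S+')  # "and" as its own whitespace-separated word

loose_hits = loose.findall("a glass of brandy")    # ['brandy']
strict_hits = strict.findall("a glass of brandy")  # []

pairs = strict.findall(text)
# ['profit and loss.', 'banking and finance.']

# Assumed clean-up step: drop punctuation that \S+ swallowed at the end
cleaned = [p.rstrip('.,;:!?') for p in pairs]
print(cleaned)
# ['profit and loss', 'banking and finance']
```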

You could try this:

\w+(?=\sand\s)|(?<=\sand\s)\w+

Which is:

  • Some word ( \w+ ) matched only where it precedes \sand\s, using a positive lookahead assertion, OR
  • Some word ( \w+ ) matched only where it follows \sand\s, using a positive lookbehind assertion

The positive lookbehind needs a fixed-length pattern, so you can't write (?<=\s+and\s+); this solution therefore assumes that all the spacing is single spaces.
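As a rough illustration with Python's built-in re module (the variable names are made up for this sketch), the lookaround pattern returns the individual words rather than whole phrases, and the variable-width lookbehind is rejected when the pattern is compiled:

```python
import re

text = ("This is a sample text to find profit and loss. "
        "It should also find banking and finance. "
        "But it should not find commands.")

# Words directly before or after a single-spaced " and "
lookaround = re.compile(r'\w+(?=\sand\s)|(?<=\sand\s)\w+')
words = lookaround.findall(text)
print(words)
# ['profit', 'loss', 'banking', 'finance']

# A lookbehind containing \s+ has no fixed width, so re refuses to compile it
try:
    re.compile(r'(?<=\s+and\s+)\w+')
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True
print(lookbehind_rejected)
# True
```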

Tested at regex101.com

Edit

Further to the update in the question, to get "something and something else" as a three-word phrase you can try:

\w+(?:\s+and\s+)\w+

Tested with this output:

[Image: screenshot of the test output]
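For reference, a minimal sketch of how that last pattern could be used from Python. The helper name extract_pairs is hypothetical, and the text is the sample input from the question rather than a file:

```python
import re

# Word, "and" surrounded by one or more whitespace characters, word
PAIR = re.compile(r'\w+(?:\s+and\s+)\w+')

def extract_pairs(text):
    """Return every 'word and word' phrase found in text."""
    return PAIR.findall(text)

text = ("This is a sample text to find profit and loss. "
        "It should also find banking and finance. "
        "But it should not find commands.")

pairs = extract_pairs(text)
print(pairs)
# ['profit and loss', 'banking and finance']

# Runs of spaces are tolerated and kept exactly as they appear
spaced = extract_pairs("research  and   development")
print(spaced)
# ['research  and   development']
```

Because \w+ stops at punctuation, the trailing full stops are not included. To write the results to a file as in the question, call the helper on each line and write every returned phrase followed by a newline.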
