Python Regex sentence filtering

I'm trying to filter the following sentence

'I'm using C++ in high-tech applications!', said peter (in a confident way)

into its individual words to get

I'm using C++ in high-tech applications said peter in a confident way

What I have so far is:

import re
import string
parsing=re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*",text)
' '.join(w for w in parsing if w not in string.punctuation)

However, this produces:

I'm using C in high-tech applications said peter in a confident way

So 'C++' incorrectly turns into 'C' because '+' is in string.punctuation. Is there any way I can modify the regex so that the '+' characters are not tokenized separately? Any alternative method to get the desired output would also be welcome, thanks!

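Just use (?:\w|\+) instead of \w. This matches both word characters and the plus sign. The group is written as non-capturing because re.findall returns only the group contents, not the whole match, when the pattern contains a capturing group.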

Alternatively, you could use [a-zA-Z+] or, ideally, [\w+], as suggested by Kyle Strand.
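
As a rough check of the spellings mentioned here, the sketch below runs each of them through re.findall on a short made-up string (the string and variable names are only illustrative, not from the question). It also shows why a capturing group such as (\w|\+) is risky with re.findall.

```python
import re

sample = "C++ and C"  # illustrative input, not from the question

# With a capturing group, re.findall returns what the group captured last,
# not the whole match.
capturing = re.findall(r"(\w|\+)+", sample)

# A non-capturing group or a character class returns whole matches.
non_capturing = re.findall(r"(?:\w|\+)+", sample)
char_class = re.findall(r"[\w+]+", sample)

print(capturing)      # ['+', 'd', 'C']
print(non_capturing)  # ['C++', 'and', 'C']
print(char_class)     # ['C++', 'and', 'C']
```

The character class is the shortest form and sidesteps the capturing-group behaviour entirely.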

Similar to C0deH4cker's answer but slightly simpler: replace all instances of \w with [\w+].

>>> parsing=re.findall(r"[\w+]+(?:[-'][\w+]+)*|'|[-.(]+|\S[\w+]*",text)
>>> parsing
["'", "I'm", 'using', 'C++', 'in', 'high-tech', 'applications', '!', "'", ',', 'said', 'peter', '(', 'in', 'a', 'confident', 'way', ')']
>>> ' '.join(w for w in parsing if w not in string.punctuation)
"I'm using C++ in high-tech applications said peter in a confident way"

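If this is needed in more than one place, the [\w+] pattern could be wrapped in a small helper. This is only a sketch: the names TOKEN_RE, PUNCTUATION and extract_words are made up for illustration, and the filter is a deliberate variation on the original one. It drops a token only when every character in it is punctuation, rather than testing whether the token is a substring of string.punctuation.

```python
import re
import string

# The question's pattern, with "+" treated as a word character.
TOKEN_RE = re.compile(r"[\w+]+(?:[-'][\w+]+)*|'|[-.(]+|\S[\w+]*")
PUNCTUATION = set(string.punctuation)

def extract_words(text):
    """Return the tokens of text, dropping those made only of punctuation."""
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]

text = "'I'm using C++ in high-tech applications!', said peter (in a confident way)"
words = extract_words(text)
print(' '.join(words))
# I'm using C++ in high-tech applications said peter in a confident way
```

The per-character test matters for tokens such as "--", which the [-.(]+ branch can produce: '--' in string.punctuation is False, so the substring test would keep that token, while the helper above drops it.
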
Note that your original solution splits "C++" into three distinct tokens, so even excluding + from string.punctuation wouldn't have solved your problem:

>>> parsing=re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*",text)
>>> parsing
["'", "I'm", 'using', 'C', '+', '+', 'in', 'high-tech', 'applications', '!', "'", ',', 'said', 'peter', '(', 'in', 'a', 'confident', 'way', ')']
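
To make that point concrete, here is a small sketch that keeps the original regex and filters against a copy of string.punctuation with "+" removed (the name punctuation_without_plus is only illustrative). The plus signs survive the filter, but as separate tokens.

```python
import re
import string

text = "'I'm using C++ in high-tech applications!', said peter (in a confident way)"

# Original pattern from the question: "+" is not part of a word token.
tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)

# Filter against punctuation with "+" taken out.
punctuation_without_plus = string.punctuation.replace("+", "")
result = ' '.join(w for w in tokens if w not in punctuation_without_plus)

print(result)
# I'm using C + + in high-tech applications said peter in a confident way
```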
