
How to extract a sentence between hyphens or asterisks from a paragraph using regex in Python

import re
line="Hello world -- sam -- , How are you? what are *you* doing?"
pattern=r"(?<=\-|\*)(.*?)(?=\-\*)"
print(re.findall(pattern,line))

The output I get for it is "None". Please help me and explain which pattern I should use so that I get this output:

sam
you

Are you looking for this?

 /[-]{2}\s*(.*?)[-]{2}\s*|[\*]{1}\s*(.*?)[\*]{1}\s*/gm

The token is in capture group 1 for the -- form and in capture group 2 for the * form.

This is a preview: https://regex101.com/r/ms1dxy/5

Details:

1st Alternative [-]{2}\s*(.*?)[-]{2}\s*

[-]{2} match character - exactly 2 times.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times

1st Capturing Group (.*?)

.*? matches any character (except for line terminators) between zero and unlimited times

[-]{2} match character - exactly 2 times.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times


2nd Alternative [\*]{1}\s*(.*?)[\*]{1}\s*

[\*]{1} match character * exactly 1 time.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times

2nd Capturing Group (.*?)

.*? matches any character (except for line terminators) between zero and unlimited times

[\*]{1} match character * exactly 1 time.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times
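
A minimal Python sketch of how this pattern could be used, assuming the regex101 wrapper and flags (/.../gm) are dropped, since re.findall already returns every match. Because the pattern has two capture groups, each result is a 2-tuple in which only one element is filled. The variable names are illustrative.

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"

# Same pattern as above, without the regex101 /.../gm wrapper.
pattern = r"[-]{2}\s*(.*?)[-]{2}\s*|[\*]{1}\s*(.*?)[\*]{1}\s*"

# With two capture groups, findall returns one (group1, group2) tuple per match;
# the group of the alternative that did not take part is an empty string.
pairs = re.findall(pattern, line)

# Keep whichever group is non-empty and strip the whitespace that the lazy
# (.*?) picks up before the closing delimiter.
tokens = [(dashed or starred).strip() for dashed, starred in pairs]
print(tokens)  # ['sam', 'you']
```

The extra strip() is needed because the \s* after the opening delimiter removes leading spaces, but nothing removes the space before the closing --.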

Your question doesn't show enough understanding of the constraints of regex to get a proper answer. However, if regex is new to you, that is perfectly fine. What I am actually trying to say is this:

This would work:

((?:--[\w\s]+--)|(?:\*[\w\s]+\*))

In this one, an arbitrary number of whitespace characters is allowed between the token and the delimiters.
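
As a rough illustration, the pattern can be tried in Python like this. The second input string is a made-up example, not taken from the question. Note that the only capture group wraps the whole alternation, so the delimiters are part of each result.

```python
import re

# Word characters and whitespace between the delimiters.
pattern = re.compile(r"((?:--[\w\s]+--)|(?:\*[\w\s]+\*))")

line = "Hello world -- sam -- , How are you? what are *you* doing?"
matches = pattern.findall(line)
print(matches)  # ['-- sam --', '*you*']

# Hypothetical input: extra spaces around the token are accepted as well.
padded = pattern.findall("a --   padded token   -- b")
print(padded)  # ['--   padded token   --']
```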

... But this regex would work as well, and it would match a different subset of strings (including the one you have provided in your question):

((?:-- \w+ --)|(?:\*\w+\*))

This regex matches precisely the number of spaces you have provided in your example, but would reject other matches that you might have had in mind. That is the unclear part of the example in the question. The strings below would not match the above expression (none of them would):

 "How are you * doing * today?"
 "Do you think --Regular Expressions-- are useful to programmers?"
 "This particular -- #token3 -- has a non-word symbol in it"

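A short check of the claim above, written as a sketch: the strict pattern accepts the line from the question but finds nothing in the three strings listed. The names strict, question_line and rejected are only for illustration.

```python
import re

# Exactly one space on each side for the -- form, none for the * form.
strict = re.compile(r"((?:-- \w+ --)|(?:\*\w+\*))")

question_line = "Hello world -- sam -- , How are you? what are *you* doing?"
rejected = [
    "How are you * doing * today?",
    "Do you think --Regular Expressions-- are useful to programmers?",
    "This particular -- #token3 -- has a non-word symbol in it",
]

accepted_matches = strict.findall(question_line)
print(accepted_matches)  # ['-- sam --', '*you*']

rejected_matches = [strict.findall(text) for text in rejected]
print(rejected_matches)  # [[], [], []]
```
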
This regular expression is probably the most "all-encompassing" solution, but perhaps you have no need to match tokens that contain non-word characters:

((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))

This regex would match any text at all as a token, except text that contains a newline character \n or the token's own delimiter character (- inside a -- token, * inside a * token). For instance, consider these examples:

 "This example -- token has spaces and the $ symbol -- This does match,"
 "This one *here-has-a-few-dashes*. which suits this regex just fine."
 "This example --misses-completely-- because the token contains the delimiter!"

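The three examples can be run through the pattern as follows. This is a sketch with illustrative variable names, and each string is tested on its own.

```python
import re

# Anything except '-' or newline between --, anything except '*' or newline between *.
permissive = re.compile(r"((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))")

examples = [
    "This example -- token has spaces and the $ symbol -- This does match,",
    "This one *here-has-a-few-dashes*. which suits this regex just fine.",
    "This example --misses-completely-- because the token contains the delimiter!",
]

results = [permissive.findall(text) for text in examples]
for found in results:
    print(found)
# ['-- token has spaces and the $ symbol --']
# ['*here-has-a-few-dashes*']
# []
```
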
In short, there are probably dozens of regular-expression variants for the Python code that has been posted, all of which would solve the one example given in this question. Furthermore, some post-processing after the regex match may also be necessary. For example, you might need a string trim function (str.strip() in Python) or a string replace; I, personally, cannot tell. Keep at it.
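
One possible form of that post-processing, sketched in Python under the assumption that the delimiters and surrounding spaces should simply be trimmed from each match:

```python
import re

pattern = re.compile(r"((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))")
line = "Hello world -- sam -- , How are you? what are *you* doing?"

# The matches still carry their delimiters.
raw = pattern.findall(line)
print(raw)  # ['-- sam --', '*you*']

# Trim the delimiter characters first, then any remaining whitespace.
tokens = [match.strip("-*").strip() for match in raw]
print(tokens)  # ['sam', 'you']
```

Moving the capture group inside the delimiters would avoid the trimming step, at the cost of one group per alternative.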

Your pattern does not consume the consecutive delimiter characters to the left and right of the token. This is an incorrect use of lookarounds.
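
As a side note on the pattern from the question (this is one reading of that pattern, not something stated in the answer above): its lookahead (?=\-\*) asks for a hyphen immediately followed by an asterisk, and that sequence never occurs in the sample line. A small sketch showing the effect:

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"

# Pattern exactly as posted in the question.
original = r"(?<=\-|\*)(.*?)(?=\-\*)"

found = re.findall(original, line)
print(found)  # [] (findall returns an empty list, not None, when nothing matches)

# The lookahead needs the literal two-character sequence "-*".
has_sequence = "-*" in line
print(has_sequence)  # False
```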

Use

[-*]+\s*([^\s*-].*?)\s*[-*]+

See proof.

Explanation

--------------------------------------------------------------------------------
  [-*]+                    any character of: '-', '*' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^\s*-]                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '*', '-'
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
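  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [-*]+                    any character of: '-', '*' (1 or more
                           times (matching the most amount possible))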

Python code:

import re
line="Hello world -- sam -- , How are you? what are *you* doing?"
pattern=r"[-*]+\s*([^\s*-].*?)\s*[-*]+"
print(re.findall(pattern,line))

Result:

['sam', 'you']
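
If the pattern is reused, it could be wrapped in a small helper. This is a sketch: the names DELIMITED and extract_delimited are hypothetical, and the second input is a made-up sentence. It also shows a caveat: because [-*]+ accepts a single hyphen, ordinary hyphenated words are treated as delimited text.

```python
import re

# Same pattern as above, compiled once for reuse.
DELIMITED = re.compile(r"[-*]+\s*([^\s*-].*?)\s*[-*]+")

def extract_delimited(text):
    """Return the tokens found between runs of '-' or '*' in text."""
    return DELIMITED.findall(text)

from_question = extract_delimited(
    "Hello world -- sam -- , How are you? what are *you* doing?"
)
print(from_question)  # ['sam', 'you']

# Caveat: single hyphens inside ordinary words also act as delimiters.
hyphenated = extract_delimited("a well-known state-of-the-art idea")
print(hyphenated)  # ['known state', 'the']
```

If hyphenated words can appear in the input, requiring two hyphens (as the earlier patterns do) may be the safer choice.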

