
How to extract a sentence between hyphen or asterisk from a paragraph using regex in python

import re
line="Hello world -- sam -- , How are you? what are *you* doing?"
pattern=r"(?<=\-|\*)(.*?)(?=\-\*)"
print(re.findall(pattern,line))

The output I get for it is "None". Please help me and explain which pattern I should use so that I get this output:

sam
you

Are you looking for this?

 /[-]{2}\s*(.*?)[-]{2}\s*|[\*]{1}\s*(.*?)[\*]{1}\s*/gm

The text you want is in capture group 1 (for the -- form) or capture group 2 (for the * form).

This is a preview https://regex101.com/r/ms1dxy/5

Details:

1st Alternative [-]{2}\s*(.*?)[-]{2}\s*

[-]{2} matches the character - exactly 2 times.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times.

1st Capturing Group (.*?)

.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible (lazy).

[-]{2} matches the character - exactly 2 times.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times.


2nd Alternative [\*]{1}\s*(.*?)[\*]{1}\s*

[\*]{1} matches the character * exactly 1 time.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times.

2nd Capturing Group (.*?)

.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible (lazy).

[\*]{1} matches the character * exactly 1 time.

\s* matches any whitespace character (equal to [\r\n\t\f\v ]) between zero and unlimited times.
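
The /.../gm wrapper above is regex101 notation; in Python the slashes and flags are dropped. The following is a minimal sketch of how this pattern might be used with re.finditer, assuming the sample line from the question (the variable names are illustrative). Because the lazy (.*?) runs up to the closing delimiter, group 1 keeps the space before the closing --, so the sketch strips it.

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"

# Same pattern as above, without the /.../gm wrapper used by regex101.
pattern = re.compile(r"[-]{2}\s*(.*?)[-]{2}\s*|[\*]{1}\s*(.*?)[\*]{1}\s*")

tokens = []
for m in pattern.finditer(line):
    # Group 1 is set by the "--" alternative, group 2 by the "*" alternative.
    raw = m.group(1) if m.group(1) is not None else m.group(2)
    # Drop the whitespace that the lazy group leaves before the closing delimiter.
    tokens.append(raw.strip())

print(tokens)  # ['sam', 'you']
```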

Your question doesn't show enough understanding of the constraints of regex to get a proper answer. However, if regex is new to you, that is perfectly fine. What I am actually trying to say is this:

This would work:

((?:--[\w\s]+--)|(?:\*[\w\s]+\*))

In this one, an arbitrary number of spaces is allowed between the token and the delimiters.
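
As a quick check, here is a small sketch that runs this pattern over the sample line from the question with re.findall (the variable names are illustrative). Note that the single capturing group wraps the whole alternation, so the delimiters are part of each result.

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"

# Word characters and whitespace only, between "--" pairs or "*" pairs.
loose = r"((?:--[\w\s]+--)|(?:\*[\w\s]+\*))"

# findall returns group 1, which here spans the delimiters as well.
matches = re.findall(loose, line)
print(matches)  # ['-- sam --', '*you*']
```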

But this regex would work as well, and it would match a different subset of strings (including the one you provided in your question):

((?:-- \w+ --)|(?:\*\w+\*))

This regex matches precisely the number of spaces you have provided in your example, but would reject other matches that you might have had in mind. That is the unclear part about the example in the question. None of the tokens in the strings below would match the above expression:

 "How are you * doing * today?"
 "Do you think --Regular Expressions-- are useful to programmers?"
 "This particular -- #token3 -- has a non-word symbol in it"

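The claim can be checked with a short sketch, assuming Python's re module and the three strings above (the names strict and samples are illustrative):

```python
import re

# Exactly one space inside "--" pairs, no space inside "*" pairs.
strict = re.compile(r"((?:-- \w+ --)|(?:\*\w+\*))")

samples = [
    "How are you * doing * today?",
    "Do you think --Regular Expressions-- are useful to programmers?",
    "This particular -- #token3 -- has a non-word symbol in it",
]

# None of the three strings contains a token in the exact shape required.
results = [strict.findall(s) for s in samples]
print(results)  # [[], [], []]

# The line from the question does match, delimiters included.
question_line = "Hello world -- sam -- , How are you? what are *you* doing?"
question_matches = strict.findall(question_line)
print(question_matches)  # ['-- sam --', '*you*']
```
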
This regular expression is probably the most all-encompassing solution, but perhaps you have no need to match tokens that contain non-word characters:

((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))

This regex would match any text at all as a token, except text that contains a newline character \n or the delimiter that surrounds it (- for the -- form, * for the * form). For instance, consider these examples:

 "This example -- token has spaces and the $ symbol -- This does match,"
 "This one *here-has-a-few-dashes*. which suits this regex just fine."
 "This example --misses-completely-- because the token contains the delimiter!"

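A small sketch that runs this pattern over the three examples, assuming Python's re module (the names permissive and examples are illustrative):

```python
import re

# Any text between "--" pairs (no "-" or newline inside)
# or between "*" pairs (no "*" or newline inside).
permissive = re.compile(r"((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))")

examples = [
    "This example -- token has spaces and the $ symbol -- This does match,",
    "This one *here-has-a-few-dashes*. which suits this regex just fine.",
    "This example --misses-completely-- because the token contains the delimiter!",
]

found = [permissive.findall(s) for s in examples]
for s, f in zip(examples, found):
    print(f, "<-", s)
```

The first two strings each yield one match (with the delimiters included), and the third yields none, because the dashes inside the token break the -- form.
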
In short, there are probably dozens of regular-expression variants for the Python code that has been posted, all of which would solve the one example given in this question. Furthermore, some post-processing after the regex match might also be necessary. For example, you might need to trim the match (str.strip() in Python) or do a string replace... I, personally, cannot tell. Keep at it.
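
One possible form of that post-processing, sketched under the assumption that the last pattern above is used and that only the outer delimiters and surrounding whitespace should go (the helper name strip_delimiters is hypothetical):

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"
permissive = r"((?:--[^-\n]+--)|(?:\*[^\*\n]+\*))"


def strip_delimiters(match_text):
    """Remove the outer "--" or "*" pair, then trim whitespace."""
    if match_text.startswith("--"):
        inner = match_text[2:-2]
    else:
        inner = match_text[1:-1]
    return inner.strip()


raw = re.findall(permissive, line)
cleaned = [strip_delimiters(m) for m in raw]
print(raw)      # ['-- sam --', '*you*']
print(cleaned)  # ['sam', 'you']
```

Slicing by the known delimiter length is used here instead of str.strip("-*") so that a dash at the edge of a *...* token is not removed by accident.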

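Your lookarounds do not consume the delimiters and whitespace to the left and right of the text. This is the wrong use of lookarounds. In addition, the lookahead (?=\-\*) requires a - immediately followed by a *, a sequence that never occurs in your line.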
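
To see the failure concretely, here is a minimal sketch that runs the pattern from the question unchanged. The second input string is made up purely to show what the lookahead actually requires. Note that re.findall returns an empty list rather than None when nothing matches.

```python
import re

line = "Hello world -- sam -- , How are you? what are *you* doing?"
original = r"(?<=\-|\*)(.*?)(?=\-\*)"

# The lookahead needs the two characters "-*" in a row, which the line lacks.
no_match = re.findall(original, line)
print(no_match)      # []
print("-*" in line)  # False

# Made-up input that does contain "-*": only the text before it is found.
with_sequence = re.findall(original, "a-b-*c")
print(with_sequence)  # ['b']
```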

Use

[-*]+\s*([^\s*-].*?)\s*[-*]+

See proof .

Explanation

--------------------------------------------------------------------------------
  [-*]+                    any character of: '-', '*' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^\s*-]                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '*', '-'
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
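  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [-*]+                    any character of: '-', '*' (1 or more
                           times (matching the most amount possible))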

Python code:

import re
line="Hello world -- sam -- , How are you? what are *you* doing?"
pattern=r"[-*]+\s*([^\s*-].*?)\s*[-*]+"
print(re.findall(pattern,line))

Result:

['sam', 'you']
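
One caveat worth knowing: [-*]+ accepts any run of hyphens and asterisks on either side, so the opening and closing delimiters do not have to be the same. The sketch below shows this on a made-up string, then a hypothetical stricter variant (not part of the answer above) that uses a backreference to require the same delimiter on both sides.

```python
import re

pattern = r"[-*]+\s*([^\s*-].*?)\s*[-*]+"

# A single "-" and a single "*" are accepted as a delimiter pair.
mixed = re.findall(pattern, "a - b * c")
print(mixed)  # ['b']

# Hypothetical stricter variant: group 1 remembers the opening delimiter,
# \1 requires the identical closing one, and the token is group 2.
paired = re.compile(r"(--|\*)\s*(.+?)\s*\1")

line = "Hello world -- sam -- , How are you? what are *you* doing?"
tokens = [m.group(2) for m in paired.finditer(line)]
print(tokens)  # ['sam', 'you']

# The mismatched pair is rejected by the stricter variant.
rejected = paired.findall("a - b * c")
print(rejected)  # []
```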
