
Using re.findall() to search text for multiple criteria [python3.8]

Edited to divert attention from the type of text data I'm using and redirect it to the actual question.

Further edited to note that:

  1. The linked question on the closing notice does not in any way help me with my issue, and in fact went as far as to further confuse me with an abundance of criteria and syntax
  2. I have found a solution to the problem herein and will share it as an answer upon this question reopening so that anyone else with this same issue will have at least a starting point, if not a resolution to this issue.

Situation:

I have written a program that uses the requests module to grab all the text from a website. Since I use the exact same code in a system that does work, this piece is not the issue. I am trying to use re.findall() to grab data in the order it appears. In the system that works, the line I use is

paragraphs = re.findall(r'c1(.*?)c1', str(mytext))

where c1 stands in place of my first set of criteria. I then use a few lines to get rid of what I don't need.

What I've tried:

I've attempted the following pieces of code, and none have worked. The information I've been able to find sadly doesn't address my issue. We could theorise all day as to why a guide for this is scarce, but the fact is that a few hours of Googling got me nowhere.

First attempt:

I tried simply keeping it in-line

re.findall(r'c1(.*?)c1c2(.*?)c2', str(mytext))

where c2 stands in place of my second set of criteria. Unfortunately this returns [], which is useless for me.
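A minimal sketch of why that single pattern can come back empty, assuming the literal strings 'c1' and 'c2' as stand-in delimiters (both sample strings are made up for illustration). The pattern only matches where a c1 block is immediately followed by a c2 block, and when it does match, findall returns one tuple per match rather than a flat list.

```python
import re

pattern = r'c1(.*?)c1c2(.*?)c2'

# Hypothetical sample texts; 'c1' and 'c2' are literal stand-in delimiters.
adjacent = 'c1firstc1c2secondc2'
separated = 'c1firstc1 and then c2secondc2'

adjacent_result = re.findall(pattern, adjacent)
separated_result = re.findall(pattern, separated)

print(adjacent_result)   # [('first', 'second')]
print(separated_result)  # []
```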

Second attempt:

I thought that maybe the way I did this was wrong, so I shuffled it around a bit

re.findall(r'c1(.*?)c1', r'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1'r'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1' or 'c2(.*?)c2', str(mytext))

re.findall(r'c1(.*?)c1' or r'c2(.*?)c2', str(mytext))

The first two gave the same result as my initial attempt. The last two returned only the c1(.*?)c1 matches, which is useful data, but it doesn't contain the c2(.*?)c2 matches at all, let alone in the order they appear in the text.
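The behaviour of these variants follows from how Python evaluates the expressions before re.findall ever sees them. A small sketch, assuming literal 'c1'/'c2' delimiters and a made-up sample string:

```python
import re

# Hypothetical sample text with literal 'c1'/'c2' delimiters.
mytext = 'c2helloc2c1worldc1c2!!!c2'

# Adjacent string literals are concatenated, so this is the same single
# pattern as in the first attempt.
joined = r'c1(.*?)c1' r'c2(.*?)c2'
print(joined == r'c1(.*?)c1c2(.*?)c2')  # True

# `or` between two non-empty strings evaluates to the first one, so
# re.findall never receives the second pattern.
chosen = r'c1(.*?)c1' or r'c2(.*?)c2'
print(chosen)                        # c1(.*?)c1
c1_only = re.findall(chosen, mytext)
print(c1_only)                       # ['world']

# The signature is re.findall(pattern, string, flags=0). With three
# positional arguments, the second pattern lands in `string` and the
# text lands in `flags`, so the text itself is never searched.
```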

Third attempt:

Don't run this code: it crashed my laptop with an infinite loop. I had done some research by this point and discovered the re.search() function.

paragraphs = []
ticker = ''
while ticker != 'None':
    ticker = re.search(r'c1(.*?)c1', str(mytext))
    if (ticker == 'None'):
        ticker = re.search(r'c2(.*?)c2', str(mytext))
    if (ticker != 'None'):
        paragraphs.append(ticker)
print(paragraphs)

Clearly, this was a dumb idea. It tried to fill paragraphs with an endless list of the first c1(.*?)c1 match.
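Two things appear to keep that loop running: re.search returns a Match object or the None singleton, never the string 'None', and each call starts again from the beginning of the text. The following is a sketch of a terminating version, not the approach the question ended up with. It assumes literal 'c1' delimiters and a made-up sample string, and it handles only the c1 criteria.

```python
import re

# Hypothetical sample text with literal 'c1'/'c2' delimiters.
mytext = 'c2helloc2c1worldc1c2!!!c2'

# A Match object (or None) never equals the string 'None'.
ticker = re.search(r'c1(.*?)c1', mytext)
print(ticker != 'None')  # True, whether or not something matched
no_match = re.search(r'c3(.*?)c3', mytext)
print(no_match is None)  # True

# Resume each search where the previous match ended and stop when
# search returns None.
pattern = re.compile(r'c1(.*?)c1')
paragraphs = []
pos = 0
while True:
    match = pattern.search(mytext, pos)
    if match is None:
        break
    paragraphs.append(match.group(1))
    pos = match.end()
print(paragraphs)  # ['world']
```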

Question:

How, if at all, do I use re.findall() to create a list paragraphs that will go through the text in mytext and pick out everything that meets the criteria c1(.*?)c1 and c2(.*?)c2 and place them in the order they appear?

e.g. if the text is (spaces added for clarity; they will not exist in the file)

c2 hello c2 c1 world c1 c2 !!! c2

The program will be

#get the text
#do the re.findall() function and assign to the list paragraphs
print(paragraphs)

And will return

>>>['hello', 'world', '!!!']

As

re.findall(r'c1(.*?)c1c2(.*?)c2', str(mytext))

returns nothing because that single pattern only matches a c1 block immediately followed by a c2 block, try putting or between two separate calls and you'll get your output, like

re.findall(r'c1(.*?)c2', mytext) or re.findall(r'c2(.*?)c3', mytext)
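One caveat worth checking before relying on this: or between two lists returns the first non-empty list, so the results are not merged. A short sketch, assuming the question's c1...c1 and c2...c2 patterns (rather than the delimiters written above) and a made-up sample string:

```python
import re

# Hypothetical sample text with literal 'c1'/'c2' delimiters.
mytext = 'c2helloc2c1worldc1c2!!!c2'

c1_matches = re.findall(r'c1(.*?)c1', mytext)
c2_matches = re.findall(r'c2(.*?)c2', mytext)
print(c1_matches)  # ['world']
print(c2_matches)  # ['hello', '!!!']

# `or` returns its first truthy operand, so the second list is dropped
# whenever the first one is non-empty.
either = c1_matches or c2_matches
print(either)  # ['world']
```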

You may use

[x.group(2) for x in re.finditer(r'(c1|c2)(.*?)\1', mytext, flags=re.S)]

Or, to match the shortest substrings:

[x.group(2) for x in re.finditer(r'(c1|c2)((?:(?!c1|c2).)*?)\1', mytext, flags=re.S)]

The regex matches

  • (c1|c2) - Group 1: c1 or c2
  • (.*?) - Group 2: any 0 or more chars, as few as possible (including line breaks, because of re.S)
  • \1 - the same value as in Group 1.
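The difference between the two patterns only shows up when one kind of block sits inside another. A small comparison, using a made-up string in which a c2 block is nested inside a c1 block (the literal 'c1'/'c2' delimiters are stand-ins):

```python
import re

# Hypothetical text: a c2 block nested inside a c1 block.
nested = 'c1Ac2Bc2Cc1'

lazy = [m.group(2)
        for m in re.finditer(r'(c1|c2)(.*?)\1', nested, flags=re.S)]
tempered = [m.group(2)
            for m in re.finditer(r'(c1|c2)((?:(?!c1|c2).)*?)\1', nested, flags=re.S)]

print(lazy)      # ['Ac2Bc2C']
print(tempered)  # ['B']
```

The first pattern takes everything up to the next matching delimiter, inner delimiters included. The second refuses to step over any delimiter, so only the innermost block matches.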

The for x in re.finditer(r'(c1|c2)(.*?)\1', mytext) loop iterates over all matches, and x.group(2) returns the Group 2 value of each.
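Run against the sample text from the question (with the spaces removed, and treating 'c1' and 'c2' as literal stand-in delimiters), the first pattern can be checked like this:

```python
import re

# Sample text from the question, without the spaces added for clarity.
mytext = 'c2helloc2c1worldc1c2!!!c2'

paragraphs = [m.group(2)
              for m in re.finditer(r'(c1|c2)(.*?)\1', mytext, flags=re.S)]
print(paragraphs)  # ['hello', 'world', '!!!']

# re.findall works with the same pattern, but with two groups it returns
# (delimiter, content) tuples instead of a flat list.
pairs = re.findall(r'(c1|c2)(.*?)\1', mytext, flags=re.S)
print(pairs)  # [('c2', 'hello'), ('c1', 'world'), ('c2', '!!!')]
```

If the real delimiters contain regex metacharacters, passing each one through re.escape() before building the alternation is probably the safer route.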

