edited to divert attention from the type of text data I'm using and redirect attention to the actual question
Further edited to note that:
Situation:
I have written a program that uses the requests module to grab all the text from a website. I use exactly the same code in a system that does work, so that part is not the issue. I am trying to use re.findall() to grab data in the order it appears. In the system that works, the line I use is
paragraphs = re.findall(r'c1(.*?)c1', str(mytext))
where c1 stands in place of my first set of criteria. I then use a few lines to get rid of what I don't need.
What I've tried:
I've attempted the following pieces of code, and none have worked. The information I've been able to find sadly doesn't address my issue. We could theorise all day as to why a guide for this is scarce, but the fact is that a few hours of Googling got me nowhere.
First attempt:
I tried simply keeping it in-line:
re.findall(r'c1(.*?)c1c2(.*?)c2', str(mytext))
where c2 stands in place of my second set of criteria. Unfortunately this returns [], which is useless for me.
Second attempt:
I thought that maybe the way I did this was wrong, so I shuffled it around a bit
re.findall(r'c1(.*?)c1', r'c2(.*?)c2', str(mytext))
re.findall(r'c1(.*?)c1'r'c2(.*?)c2', str(mytext))
re.findall(r'c1(.*?)c1' or 'c2(.*?)c2', str(mytext))
re.findall(r'c1(.*?)c1' or r'c2(.*?)c2', str(mytext))
But the first two gave the same result as my initial attempt. The last two returned only the c1(.*?)c1 matches, which is useful data, but it doesn't contain the c2(.*?)c2 matches at all, let alone in the order they appear in the text.
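For reference, here is a minimal sketch of what Python hands to re.findall() in the adjacent-literal and `or` variants above. It assumes, purely for illustration, that the literal strings c1 and c2 are the delimiters and that mytext is the sample string from the question with the spaces removed.

```python
import re

# Sample text with the spaces removed; 'c1' and 'c2' are used literally
# here as stand-in delimiters (an illustrative assumption).
mytext = "c2helloc2c1worldc1c2!!!c2"

# Adjacent string literals are joined into one string, so this is the
# same pattern as the first attempt.
joined = r'c1(.*?)c1' r'c2(.*?)c2'
assert joined == r'c1(.*?)c1c2(.*?)c2'

# `or` returns its first truthy operand; a non-empty string is truthy,
# so the c2 pattern never reaches re.findall().
chosen = r'c1(.*?)c1' or r'c2(.*?)c2'
assert chosen == r'c1(.*?)c1'

only_c1 = re.findall(chosen, mytext)
print(only_c1)
```

In other words, the `or` happens between two strings before any matching takes place, which would explain why only the c1 data comes back.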
Third attempt:
Don't run this code: it crashed my laptop with an infinite loop. I had done some research by this point and discovered the re.search() function:
paragraphs = []
ticker = ''
while ticker != 'None':
    ticker = re.search(r'c1(.*?)c1', str(mytext))
    if (ticker == 'None'):
        ticker = re.search(r'c2(.*?)c2', str(mytext))
    if (ticker != 'None'):
        paragraphs.append(ticker)
print(paragraphs)
Clearly, this was a dumb idea. It tried to fill paragraphs with an infinite list of the first c1(.*?)c1 match.
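A likely reason the loop never ends: re.search() returns either a match object or the None object, never the string 'None', and every call starts again from the beginning of the text. The sketch below shows one way such a loop could terminate, under the assumption that c1 is a literal delimiter and mytext is the spaceless sample string; the names pattern, found and pos are hypothetical.

```python
import re

mytext = "c2helloc2c1worldc1c2!!!c2"

# A failed search gives the None object, so comparing the result with the
# string 'None' can never be true.
assert re.search(r'c3(.*?)c3', mytext) is None

pattern = re.compile(r'c1(.*?)c1')  # one criterion only, as in the loop above
found = []
pos = 0
while True:
    match = pattern.search(mytext, pos)  # resume where the last match ended
    if match is None:
        break
    found.append(match.group(1))
    pos = match.end()
print(found)
```

This still handles only one criterion at a time, so it does not by itself produce the combined, ordered list the question asks for.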
Question:
How, if at all, do I use re.findall() to create a list paragraphs that goes through the text in mytext, picks out everything that meets the criteria c1(.*?)c1 and c2(.*?)c2, and places the matches in the order they appear?
E.g., if the text is (spaces added for clarity; they will not exist in the file):
c2 hello c2 c1 world c1 c2 !!! c2
The program will be
#get the text
#do the re.findall() function and assign to the list paragraphs
print(paragraphs)
And will return
>>>['hello', 'world', '!!!']
re.findall(r'c1(.*?)c1c2(.*?)c2', str(mytext))
returns nothing because a single pattern like this only matches a c1...c1 block that is immediately followed by a c2...c2 block. Try putting or in between and you should get output, like:
re.findall(r'c1(.*?)c2', mytext) or re.findall(r'c2(.*?)c3', mytext)
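One caveat worth checking with this suggestion: `or` between two lists yields the first non-empty list rather than a merged result. A small sketch, assuming the question's own delimiters (c1...c1 and c2...c2, taken literally) and the spaceless sample text:

```python
import re

mytext = "c2helloc2c1worldc1c2!!!c2"

c1_hits = re.findall(r'c1(.*?)c1', mytext)
c2_hits = re.findall(r'c2(.*?)c2', mytext)

# `or` yields the first non-empty list, so the c2 results are dropped here.
either = c1_hits or c2_hits

# Concatenation keeps everything, but grouped by delimiter rather than
# by position in the text.
both = c1_hits + c2_hits
print(either, both)
```

Neither form preserves the order in which the pieces appear, which is what the question asks for.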
You may use
[x.group(2) for x in re.finditer(r'(c1|c2)(.*?)\1', mytext, flags=re.S)]
See the regex demo. Or, to match the shortest substrings:
[x.group(2) for x in re.finditer(r'(c1|c2)((?:(?!c1|c2).)*?)\1', mytext, flags=re.S)]
The regex matches:
(c1|c2) - Group 1: c1 or c2
(.*?) - Group 2: any 0 or more chars, as few as possible
\1 - the same value as in Group 1.
The for x in re.finditer(r'(c1|c2)(.*?)\1', mytext) loop iterates over all matches, and x.group(2) returns the Group 2 values only.
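As a quick check, the approach can be run against the sample text from the question. This is a sketch that assumes c1 and c2 are the literal delimiters and that the spaces are absent, as the question states; the names pattern and via_findall are only illustrative.

```python
import re

mytext = "c2helloc2c1worldc1c2!!!c2"

# Group 1 captures the opening delimiter; \1 requires the same one to close.
pattern = re.compile(r'(c1|c2)(.*?)\1', flags=re.S)

paragraphs = [m.group(2) for m in pattern.finditer(mytext)]

# With two groups, re.findall() returns (group 1, group 2) tuples, so the
# same list can be built from it by keeping the second element.
via_findall = [body for _delim, body in pattern.findall(mytext)]
print(paragraphs)
```

Because both delimiters live in one pattern, the scan runs left to right once and the matches come out in document order.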