
Given a list of strings, how to locate the position of the first string matching a substring using RegEx?

I have the body of an email parsed into a list of strings (each line is a string). Emails that are replies have a section at the bottom that repeats the prior email, with each line beginning with ">", like this:

Hi Dude,

This is just an example.

On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)

> The prior email text

I'd like to run some text analytics on the message using NLP, but would like to drop the junk at the bottom first. I imagine I want to use the re module to find the following line via regex:

On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)

Once I have the location, I'd slice the list up to that position, but I'm having trouble locating the position of that line. There are probably sexier ways to write this, but here's what I have so far:

pattern = r'\AOn +([A-Z]+[a-z]{2}), +([A-Z]+[a-z]{2}) +([1-31])'
indices = [i for i, x in enumerate(text) if re.search(pattern, x)]

I presume my issue is in my regex pattern (which does appear valid and does match the line in places like https://www.regex101.com/), but I'm stuck there, as indices is returning an empty list []. In the example text provided above, I'd like it to return 4 (the 5th line).

If text is a single string, enumerate(text) iterates over the characters of text, not its lines. Since you want to find the line number, you'll have to iterate over lines. For example, you could split text into individual lines using text.split('\n').

>>> [i for i, x in enumerate(text.split('\n')) if x and re.search(pattern, x)]
[4]
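
To make the difference concrete, here is a minimal sketch that runs the question's comprehension both ways. It assumes text holds the whole sample message as one string (the sample is reconstructed from the question, and the names char_hits and line_hits are only illustrative):

```python
import re

# Pattern copied from the question.
pattern = r'\AOn +([A-Z]+[a-z]{2}), +([A-Z]+[a-z]{2}) +([1-31])'

# Sample message from the question, assumed here to be ONE string.
text = (
    "Hi Dude,\n"
    "\n"
    "This is just an example.\n"
    "\n"
    "On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)\n"
    "\n"
    "> The prior email text\n"
)

# Enumerating the string yields single characters; none can match "On ...".
char_hits = [i for i, x in enumerate(text) if re.search(pattern, x)]

# Enumerating the lines yields whole lines, so the header line is found.
lines = text.splitlines()
line_hits = [i for i, x in enumerate(lines) if re.search(pattern, x)]

print(char_hits)  # []
print(line_hits)  # [4]
```

If text really is already a list of lines, the original comprehension should work unchanged, so it is worth checking type(text) first.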

Considering that you only need to find the first matching line, it's possible to use next and a generator expression like this:

>>> next(i for i, x in enumerate(text.split('\n')) if x and re.search(pattern, x))
4
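
Note that next raises StopIteration when nothing matches, which will happen for emails that are not replies. A hedged sketch of a safer variant follows; the helper name first_match_index is hypothetical, and the pattern is a slightly loosened version of the question's. In the original, [1-31] is a character class that matches a single character 1, 2 or 3, so a date such as "Apr 5" would not match; \d{1,2} is used here instead as an assumption about what was intended.

```python
import re

# Assumed replacement for the question's pattern: \d{1,2} matches any
# one- or two-digit day, whereas [1-31] only matches the characters 1-3.
HEADER_PATTERN = r'\AOn +[A-Z][a-z]{2}, +[A-Z][a-z]{2} +\d{1,2}\b'

def first_match_index(lines, pattern):
    """Return the index of the first line matching pattern, or None."""
    regex = re.compile(pattern)
    # The second argument to next() is returned when the generator is empty.
    return next((i for i, line in enumerate(lines) if regex.search(line)), None)

lines = [
    "Hi Dude,",
    "",
    "This is just an example.",
    "",
    "On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)",
    "",
    "> The prior email text",
]

idx = first_match_index(lines, HEADER_PATTERN)
# Keep everything before the header; keep the whole message if none is found.
body = lines if idx is None else lines[:idx]
print(idx)   # 4
print(body)  # ['Hi Dude,', '', 'This is just an example.', '']
```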

To get the rest of the text you could concatenate the "remainder" of the iterator:

>>> it = enumerate(text.split('\n'))
>>> next(i for i, x in it if x and re.search(pattern, x))
4
>>> '\n'.join(x for _, x in it)
'\n> The prior email text\n'

or alter the regular expression to match the whole line:

>>> match = re.search(r'On +([A-Z]+[a-z]{2}), +([A-Z]+[a-z]{2}) +([1-31]).*?\n', text)
>>> text[match.end():] # Don't forget to check if match is None
'\n> The prior email text\n'
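
Since the goal in the question is to keep the message and drop the quoted part, the same idea can be turned around to slice before the match rather than after it. This is only a sketch: the function name strip_quoted_reply is made up, the pattern uses \d{1,2} in place of [1-31] as an assumption, and re.MULTILINE is used so that ^ matches at the start of each line.

```python
import re

# ^ with re.MULTILINE anchors at the beginning of any line in the string.
HEADER = re.compile(r'^On +[A-Z][a-z]{2}, +[A-Z][a-z]{2} +\d{1,2}\b', re.MULTILINE)

def strip_quoted_reply(text):
    """Return text up to the 'On <day>, <month> <date>' line, if present."""
    match = HEADER.search(text)
    if match is None:
        # Not a reply (or an unrecognized header): leave the text untouched.
        return text
    return text[:match.start()]

text = (
    "Hi Dude,\n"
    "\n"
    "This is just an example.\n"
    "\n"
    "On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)\n"
    "\n"
    "> The prior email text\n"
)

print(repr(strip_quoted_reply(text)))
# 'Hi Dude,\n\nThis is just an example.\n\n'
```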

You could avoid regex altogether, especially if all you need to find is the position of the first > character.

>>> text[text.index('>'):]
'> The prior email text\n'
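
One caveat: str.index raises ValueError when the message contains no > at all. A small sketch using str.find, which returns -1 instead, is shown below (split_at_quote is a hypothetical helper name). It also finds the very first > anywhere in the text, so a > inside the message body would cut the message short.

```python
def split_at_quote(text):
    """Split text into (message, quoted) at the first '>' character."""
    pos = text.find('>')  # -1 when there is no '>' in the text
    if pos == -1:
        return text, ''
    return text[:pos], text[pos:]

text = (
    "Hi Dude,\n"
    "\n"
    "This is just an example.\n"
    "\n"
    "> The prior email text\n"
)

message, quoted = split_at_quote(text)
print(repr(quoted))  # '> The prior email text\n'
```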

I would tackle this problem differently. Iterate over all the lines.

Start with junk_begins = -1
When you see a line starting with > (no need for a regex, just use startswith), set junk_begins to the current line number if junk_begins == -1.
When you see a line that does NOT start with >, set junk_begins back to -1.

After looping through all the lines, junk_begins will hold the line number of the first line after which every line starts with >, or -1 if the message does not end with a block of quoted lines.
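
The steps above can be sketched roughly as follows. The function name find_junk_start is made up for illustration, and the sketch follows the steps literally, so a blank line inside or after the quoted block resets the marker to -1.

```python
def find_junk_start(lines):
    """Return the index where the trailing run of '>' lines begins, or -1."""
    junk_begins = -1
    for i, line in enumerate(lines):
        if line.startswith('>'):
            # Only remember the first line of the current quoted run.
            if junk_begins == -1:
                junk_begins = i
        else:
            # Any non-quoted line means the earlier run was not trailing junk.
            junk_begins = -1
    return junk_begins

lines = [
    "Hi Dude,",
    "",
    "This is just an example.",
    "",
    "On Fri, Apr 1, 2016 at 10:14 AM, Some Dude (somedude@example.com)",
    "",
    "> The prior email text",
]

start = find_junk_start(lines)
kept = lines if start == -1 else lines[:start]
print(start)  # 6
```

Note that this keeps the "On Fri, ..." line, since it does not start with >.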

No regex required. Before you build the list (that is, before you consume the iterator), filter it:

cleaned = [line for line in source if not line.lstrip().startswith(">")]

See if it works.
