
Regular Expressions: Find Names in String using Python

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.

This is my string:

<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'

I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.

I have tried "\\s([A-Za-z0-9]{1}.+?)," and other variations of that regex to get the data I want, but I have had no success. Any help is appreciated.

The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\\s).

Here is another string as an example:

<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>

An alternative approach would be to parse the string with an HTML parser, like lxml.

For example, you can use an XPath expression to find everything between the b tag with the text Carson Daly and the br tag by checking preceding and following siblings:

from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

for html in l:
    tree = fromstring(html)
    results = ''
    # Select every node (element or text) that has the <b>Carson Daly</b> tag
    # as a preceding sibling and a <br> tag as a following sibling.
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            # Element node (e.g. the <a> link): use its text, if it has any.
            results += (element.text or '').strip()
        else:
            # Text node: drop the colon that follows the <b> tag.
            text = element.strip(':')
            if text:
                results += text.strip()

    print(results.split(', '))

It prints:

['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
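Note that the first list still carries the trailing "(R 2/28/14)" note. If that is unwanted, one possible post-processing step is sketched below. It assumes the note is always a single parenthetical at the end of a name, and strip_trailing_note is a hypothetical helper name, not part of lxml:

```python
import re


def strip_trailing_note(name):
    """Remove one trailing parenthetical such as '(R 2/28/14)' (assumed format)."""
    return re.sub(r'\s*\([^)]*\)\s*$', '', name)


# The first list printed above, used here as sample input.
names = ['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
cleaned = [strip_trailing_note(n) for n in names]
print(cleaned)
```

Names without a parenthetical pass through unchanged, so the same helper can be applied to every list.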

If you want to do it with a regex (and with all the usual disclaimers about parsing HTML that way), the following regex works with your strings. Note, however, that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)

<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))

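Basically, with the left alternations (separated by |) we match everything we don't want, namely tag pairs together with their content and any remaining lone tags. The final parentheses on the right then capture what we do want: a name that follows a comma and a space, or a name that is followed by a comma.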

This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
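As a minimal sketch of how the Group 1 retrieval might look in Python, using the two strings from the question: extract_names is a hypothetical helper name, and the pattern is the one given above, unchanged. Matches from the left alternations leave Group 1 empty, so they are filtered out.

```python
import re

# Same pattern as above: the first two alternatives consume tags (and tag
# pairs with their content); only the last alternative fills Group 1.
PATTERN = re.compile(r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))')


def extract_names(text):
    """Return the Group 1 captures, skipping matches where Group 1 is empty."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


samples = [
    """<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>""",
]

for sample in samples:
    print(extract_names(sample))
```

For the first string this yields ['Soko', 'Jacob Escobedo'], and for the second ['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']. Because the pattern requires at least two word characters per name, a single-character name would be missed.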

