
Regular Expressions in Python to extract data

I am trying to extract some contact details from a web page, and I successfully extracted some of the information using Beautiful Soup.

But I can't extract the rest of the data because the HTML is not properly structured, so I am using regular expressions. I have spent the last couple of hours trying to learn regular expressions and I am kind of stuck.

 InstanceBeginEditable name="additional_content" 
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>


                                Mr. Matthew Schultz<br />
<br />
                                    610 Maryhill Drive<br />


                                Green Bay<br />
                                WI<br />
                                United States<br />
                                54303<br />
Contact by email</a><br />
                                Phone (1) 920 429 6158          
                                <hr /><br />

I need to extract:

Mr. Matthew Schultz

610 Maryhill Drive Green Bay WI United States 54303

And the phone number. I tried things I found through Google searches, but none of them work (probably because of my limited knowledge). Here is my latest attempt:

con = ""
for content in contactContent.contents:
    con += str(content)

print con

address = re.search("Mr.\b[a-zA-Z]", con)

print str(address)

Sometimes I get None.

Please help guys!

PS. The content is freely available on the net; no copyright is infringed.

OK, using your data. EDIT: I have embedded the parsing routine inside a function.

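def parse_list(source):
    # Join everything onto one line so the fields can be split on <br />
    lines = ''.join( source.split('\n') )
    # Keep only the text between the </h2> heading and the email link
    lines = lines[ lines.find('</h2>') + len('</h2>') : lines.find('Contact by email') ]
    # Split into fields, trim whitespace and drop the empty ones
    lines = [ line.strip()
              for line in lines.split('<br />')
              if line.strip() != '']
    return lines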

# Parse the page and retrieve contact string from the relevant <div>
con = ''' InstanceBeginEditable name="additional_content" 
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>


                                Mr. Matthew Schultz<br />
<br />
                                    610 Maryhill Drive<br />


                                Green Bay<br />
                                WI<br />
                                United States<br />
                                54303<br />
Contact by email</a><br />
                                Phone (1) 920 429 6158          
                                <hr /><br />'''


# Extract details and print to console

details = parse_list(con)
print(details)

This will output a list:

['Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303']

You asked about doing this with a regex. Assuming you get a new multiline string with this data for each div, you could extract the data like this:

import re

# Raw string so that \s reaches the regex engine unchanged
m = re.search(r'</h2>\s+(.*?)<br />\s+<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />', con )
if m:
    print(m.groups())

output:

('Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303')

I see you are off to an OK start with regex. The key is to remember that you generally define a character or group of characters, followed by a quantifier that says how many times it should repeat. In this case we start with </h2>, followed by \s+, which tells the regex engine to match one or more whitespace characters (newlines included). The only other nuance is the next expression, (.*?), a lazy capture group: it grabs as little as possible, stopping as soon as it reaches the expression that follows, which here is the next <br />.
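To make the difference between a greedy and a lazy capture concrete, here is a small sketch. The string `sample` is made up for illustration (it is not the full page), and the variable names are only placeholders.

```python
import re

# Made-up two-field snippet in the same style as the page markup
sample = "Green Bay<br />WI<br />"

# Greedy: .* takes as much as it can, so it runs up to the LAST <br />
greedy = re.search(r"(.*)<br />", sample).group(1)

# Lazy: .*? takes as little as it can, so it stops at the FIRST <br />
lazy = re.search(r"(.*?)<br />", sample).group(1)

# \s+ also matches newlines, which is why it can bridge the line breaks
spans_newline = re.search(r"</h2>\s+Mr", "</h2>\n\n   Mr") is not None

print(greedy)         # Green Bay<br />WI
print(lazy)           # Green Bay
print(spans_newline)  # True
```

With the greedy version a single group would swallow several address fields at once, which is why the pattern above uses (.*?) for every field.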

Edit: you should also be able to clean up the regex by taking advantage of the fact that, after the name, all of the address information is in a uniform format. I played with it a little but couldn't get it working, so that would be one way to improve it.
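One possible way to use that uniform format, offered as a rough sketch rather than a tested solution for the real page: grab every piece of text that sits in front of a <br /> with re.findall, then discard the empty pieces. The function name `extract_contact`, the shortened `sample` string and the phone pattern are all assumptions made for this example; the phone pattern in particular only assumes the "Phone (1) 920 429 6158" layout shown in the question.

```python
import re

# Shortened copy of the markup from the question, used only for this demo
sample = '''<h2>Diploma coordinator</h2>

        Mr. Matthew Schultz<br />
<br />
        610 Maryhill Drive<br />
        Green Bay<br />
        WI<br />
        United States<br />
        54303<br />
Contact by email</a><br />
        Phone (1) 920 429 6158
        <hr /><br />'''

def extract_contact(source):
    """Return (list of name/address fields, phone string or None)."""
    start = source.find('</h2>')
    end = source.find('Contact by email')
    if start == -1 or end == -1:
        return [], None
    # Only look at the text between the heading and the email link
    block = source[start + len('</h2>'):end]
    # Each field is the run of non-tag characters just before a <br />
    fields = [f.strip() for f in re.findall(r'([^<>]*?)<br />', block)]
    fields = [f for f in fields if f]
    # Assumed layout: "Phone", a country code in parentheses, digit groups
    m = re.search(r'Phone\s+(\(\d+\)[\d ]*\d)', source)
    phone = m.group(1) if m else None
    return fields, phone

fields, phone = extract_contact(sample)
print(fields)
print(phone)
```

The first field is the name and the rest can be joined with spaces to form the address. Because the pattern does not hard-code the number of fields, it should tolerate addresses with more or fewer lines, although that has not been checked against other pages.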
