
Extracting a table from webpage with regex

I want to extract the table containing the IP blocks from this site.

Looking at the HTML source I can clearly see that the area I want is structured like this:

[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]

So I wrote this little snippet:

import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')

content = response.read()

print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)

The contents of the page are fetched correctly, without problems. However, the regex match always returns None (the print here is just for debugging).

Considering the structure of the page, I can't understand why there isn't a match. I would expect there to be three groups with the second group being the table contents.

By default, . does not match newlines. You need to specify the dot-all flag to have it do this:

re.match(..., content, re.DOTALL)

Below is a demonstration:

>>> import re
>>> content = '''
... [CONTENT BEFORE TABLE]
... <table border="1" cellpadding="6" bordercolor="#000000">
... [IP ADDRESSES AND OTHER INFO]
... </table>
... [CONTENT AFTER TABLE]
... '''
>>> pat = r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)"
>>> re.match(pat, content, re.DOTALL)
<_sre.SRE_Match object at 0x02520520>
>>> re.match(pat, content, re.DOTALL).group(2)
'\n[IP ADDRESSES AND OTHER INFO]\n'
>>>

The dot-all flag can also be activated by using re.S or by placing (?s) at the start of your pattern.
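
As a rough sketch of those three spellings side by side, the snippet below runs them against the placeholder page from the demonstration. Two things here are my own choices, not part of the original answer: it uses re.search instead of re.match, so the leading (.*) group is not needed, and it uses a non-greedy (.*?) so the match stops at the first </table> if the page has more than one table. The names table_open and table_body are made up for illustration.

```python
import re

# Stand-in for the fetched page; the real HTML is not reproduced here.
content = '''
[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]
'''

# Hypothetical helper name for the opening tag of the table we want.
table_open = r'<table border="1" cellpadding="6" bordercolor="#000000">'
pattern = table_open + r"(.*?)</table>"

# Three equivalent ways to let "." match newlines.
m_dotall = re.search(pattern, content, re.DOTALL)
m_short = re.search(pattern, content, re.S)
m_inline = re.search(r"(?s)" + pattern, content)

# Without the flag the table body cannot be matched, because it spans lines.
m_plain = re.search(pattern, content)

table_body = m_dotall.group(1)
print(repr(table_body))
```

With the flag set, all three searches capture the same text between the tags; without it, the search finds nothing, which is the None from the question.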

For parsing HTML, I would prefer BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
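# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(urllib2.urlopen('http://www.nirsoft.net/countryip/za.html').read(), 'html.parser')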
for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    print x

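For more readable output, print the text of each row on one line, with the whitespace-separated fields joined by tabs: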

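for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    for y in x:
        # The table's direct children include bare strings (whitespace), not only tags,
        # so look up .name defensively instead of swallowing every exception.
        if getattr(y, 'name', None) == 'tr':
            print "\t".join(y.get_text().split())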
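
If installing BeautifulSoup is not an option, the same row-by-row idea can be approximated with the standard library alone. This is only a minimal sketch and a swapped-in technique, not what the answer above uses: it assumes Python 3 (where the module is html.parser), the class name TableRows is hypothetical, and the sample markup and addresses are invented placeholders, not data from the real page.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up sample markup; the addresses are documentation-range placeholders.
sample = '''
<table border="1" cellpadding="6" bordercolor="#000000">
<tr><th>From IP</th><th>To IP</th></tr>
<tr><td>192.0.2.0</td><td>192.0.2.255</td></tr>
</table>
'''

parser = TableRows()
parser.feed(sample)
parser.close()
rows = parser.rows

for row in rows:
    print("\t".join(row))
```

Unlike the whitespace split above, this keeps each cell intact, so a header such as "From IP" stays in one field. It collects rows from every table on the page, so a real page would still need some filtering on the table's attributes.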


 