
Extracting data from HTML with Python

I have the following text, produced by my Python code:

<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>

Could you advise me how to extract the data from within the <td>? My idea is to put it in a CSV file with the following format: some link, some data 1, some data 2, some data 3.

I expect it might be hard without regular expressions, but honestly I still struggle with them.

I used my code more or less in the following manner:

tabulka = subpage.find("table")

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    print(col[0])  # first cell of each row

Ideally I would get each td's content into some array. The HTML above is the output of that Python code.

Get BeautifulSoup and just use it. It's great.

$> easy_install pip
$> pip install BeautifulSoup
$> python
>>> from BeautifulSoup import BeautifulSoup as BS
>>> import urllib2
>>> html = urllib2.urlopen(your_site_here)
>>> soup = BS(html)
>>> elem = soup.findAll('a', {'title': 'title here'})
>>> elem[0].text

You shouldn't use regexes on html. You should use BeautifulSoup or lxml. Here are some examples using BeautifulSoup:

Your td tags actually look like this:

<td>newline
<a>some link</a>newline
<br />newline
some data 1<br />newline
some data 2<br />newline
some data 3</td>

So td.text looks like this:

<newline>some link<newline><newline>some data 1<newline>some data 2<newline>some data 3

You can see that each string is separated by at least one newline, so that enables you to separate out each string.

from bs4 import BeautifulSoup as bs
import re

html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>"""

soup = bs(html, "html.parser")  # name the parser explicitly so results don't depend on what is installed
tds = soup.find_all('td')
csv_data = []

for td in tds:
    inner_text = td.text
    strings = inner_text.split("\n")

    csv_data.extend([string for string in strings if string])

print(",".join(csv_data))

--output:--
some link,some data 1,some data 2,some data 3

Or more concisely:

for td in tds:
    print(re.sub("\n+", ",", td.text.lstrip()))

--output:--
some link,some data 1,some data 2,some data 3

But that solution is brittle because it won't work if your html looks like this:

<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>

Now td.text looks like this:

<newline>some link<newline>some data 1some data 2some data 3

And there isn't a way to figure out where some of the strings begin and end. But that just means you can't use td.text--there are still other ways to identify each string:

1)

from bs4 import BeautifulSoup as bs
import re

html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>"""

soup = bs(html, "html.parser")  # name the parser explicitly so results don't depend on what is installed
tds = soup.find_all('td')
csv_data = []

for td in tds:
    a_tags = td.find_all('a')

    for a_tag in a_tags:
        csv_data.append(a_tag.text)
        br_tags = a_tag.findNextSiblings('br')

        for br in br_tags:
            csv_data.append(br.next.strip())  #get the element after the <br> tag

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3

2)

for td in tds:
    a_tag = td.find('a')
    if a_tag is None:
        continue  # skip cells without a link; calling a method on None would raise
    csv_data.append(a_tag.text)

    for string in a_tag.findNextSiblings(text=True):  #find only text nodes
        string = string.strip()
        if string: csv_data.append(string)

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3

3)

for td in tds:
    a_tag = td.find('a')
    if a_tag is None:
        continue  # skip cells without a link; calling a method on None would raise
    csv_data.append(a_tag.text)

    text_strings = a_tag.findNextSiblings( text=re.compile(r'\S+') )  #find only non-whitespace text nodes
    csv_data.extend(text_strings)

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3
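
4)

One more option, offered as a minimal sketch rather than as a definitive recipe: bs4 tags expose a stripped_strings generator, which yields each text node with the surrounding whitespace removed and skips whitespace-only nodes. That should cover both layouts of the cell shown above. The helper name cell_rows and the two sample constants are made up for illustration, and choosing "html.parser" as the backend is an assumption of this sketch.

```python
from bs4 import BeautifulSoup

# Two layouts of the same cell: with and without newlines after each <br />.
HTML_WITH_NEWLINES = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>"""

HTML_WITHOUT_NEWLINES = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>"""


def cell_rows(html):
    """Return one list of stripped strings per <td> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings walks every text node inside the tag, strips it,
    # and drops the ones that are empty after stripping.
    return [list(td.stripped_strings) for td in soup.find_all("td")]


for sample in (HTML_WITH_NEWLINES, HTML_WITHOUT_NEWLINES):
    for row in cell_rows(sample):
        print(",".join(row))
```

Each <td> becomes its own list here, so every cell can be written as a separate CSV line instead of being merged into one long row.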

I've never used BeautifulSoup, but I would bet that it is 'html-tag-aware' and can handle 'filler' whitespace. Since HTML markup files are structured (and usually generated by a web design program), you can also try a direct approach using Python's .split() method. Incidentally, I recently used this approach to parse a real-world HTML page and do something very similar to what the OP wanted.

Although the OP wanted to pull only one field from the <a> tag, below we pull the 'usual two' fields: the href and the link text.

CODE:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Extracting data from HTML using split()
# Link: https://stackoverflow.com/questions/17126686/extracting-data-from-html-with-python
#--------*---------*---------*---------*---------*---------*---------*---------*

import sys

page     = """blah blah blah
<td>
<a href="http://www.link1tosomewhere.net" title="title1 here">some link1</a>
<br />
some data1 1<br />
some data1 2<br />
some data1 3</td>
mlah mlah mlah
<td>
<a href="http://www.link2tosomewhere.net" title="title2 here">some link2</a>
<br />
some data2 1<br />
some data2 2<br />
some data2 3</td>
flah flah flah
"""

#--------*---------*---------*---------*---------*---------*---------*---------#
#                                  M A I N L I N E                             #
#--------*---------*---------*---------*---------*---------*---------*---------#
page = page.replace('\n', '')      # remove \n from test html page
csv = ''
li = page.split('<td><a ')
for chunk in li:
    if chunk.startswith('href="'):
        s = chunk.split('</td>')[0]
        li2 = s.split('<br />')    # li2 ready for csv
        for j, field in enumerate(li2):
            if j == 0:
                # get two fields (href and link text) from li2[0]
                li3 = field.split('"')
                csv = csv + li3[1] + ','
                li4 = li3[4].split('<')
                csv = csv + li4[0][1:] + ','
            elif j == len(li2) - 1:
                # no comma on last field - \n instead
                csv = csv + field + '\n'
            else:
                # just write out middle stuff
                csv = csv + field + ','
print(csv)

OUTPUT:

>>> 
= RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\board.py =
http://www.link1tosomewhere.net,some link1,some data1 1,some data1 2,some data1 3
http://www.link2tosomewhere.net,some link2,some data2 1,some data2 2,some data2 3

>>> 
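
The split() approach depends on the exact strings '<td><a ' and '<br />' appearing in the page, and joining with ',' by hand does not quote a field that itself contains a comma. As a hedged alternative that still avoids third-party packages, here is a sketch using the standard library's html.parser and csv modules. The class name CellTextParser, the function name cells_to_csv and the sample page are made up for illustration; like the code above, it emits the href followed by the link text and the remaining strings.

```python
import csv
import io
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the href values and non-blank text pieces of every <td> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of strings per <td>
        self._current = None  # strings of the <td> being read, or None outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []
        elif tag == "a" and self._current is not None:
            href = dict(attrs).get("href")
            if href:
                self._current.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self._current.append(text)


def cells_to_csv(html):
    """Return CSV text with one line per <td>; csv.writer quotes fields when needed."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = """blah blah blah
<td>
<a href="http://www.link1tosomewhere.net" title="title1 here">some link1</a>
<br />some data1 1<br />
some data1 2<br />some data1 3</td>
mlah mlah mlah
"""
print(cells_to_csv(sample), end="")
```

Because the parser reacts to tags rather than to fixed substrings, the mixed spacing around <br /> in the sample does not matter, and text outside any <td> is ignored.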
