
HTML data extraction with BeautifulSoup and Python

I have HTML text that looks like many instances of the following structure:

<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>

What I need to do is index each structure, with the DocNo, Headline, and Text, to later be analysed (tokenised, etc.).

I was thinking of using BeautifulSoup, and this is the code I have so far:

from bs4 import BeautifulSoup
with open("AP880212.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
num = soup.find_all('docno')

But this only gives me results of the following format:

<docno> AP880212-0166 </docno>, <docno> AP880212-0167 </docno>, <docno> AP880212-0168 </docno>, <docno> AP880212-0169 </docno>, <docno> AP880212-0170 </docno>

How do I extract the numbers inside the <docno> tags, and how do I link them with the headlines and texts?

Thank you very much,

Sasha

To get the contents of the tags:

docnos = soup.find_all('docno')
for docno in docnos:
    print(docno.contents[0])
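
As a small illustration of the snippet above, here is a sketch comparing `.contents[0]` with `get_text(strip=True)`. It assumes Python 3 with the bs4 package and the built-in html.parser (which lowercases tag names); the `sample` string is made up from the output shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample built from the output shown in the question.
sample = "<DOCNO> AP880212-0166 </DOCNO><DOCNO> AP880212-0167 </DOCNO>"
soup = BeautifulSoup(sample, "html.parser")  # html.parser lowercases tag names

# .contents[0] is the first child node of the tag, whitespace included.
raw = [tag.contents[0] for tag in soup.find_all("docno")]

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
clean = [tag.get_text(strip=True) for tag in soup.find_all("docno")]

print(raw)
print(clean)
```

The stripped form is usually the more convenient one to use as a lookup key.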

Something like this:

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

import bs4

d = {}

# features="xml" needs lxml installed and keeps tag names case-sensitive
soup = bs4.BeautifulSoup(html, features="xml")
docs = soup.find_all("DOC")
for doc in docs:
    # key: document number; value: (headline, text)
    d[doc.DOCNO.get_text()] = (doc.HEAD.get_text(), doc.TEXT.get_text())

print(d)
# {' XXX-2222 ':
#    ('Reports Former Saigon Officials Released from Re-education Camp',
#     '\nLots of text here\n')}

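Note that I pass features="xml" to the constructor because your input contains a lot of non-standard HTML tags. This parser requires lxml to be installed, and it keeps tag names case-sensitive, which is why the code looks up DOC, DOCNO, HEAD, and TEXT in upper case. You will probably also want to call .strip() on the text before saving it into the dictionary so that lookups are not sensitive to surrounding whitespace (unless that is your intention, of course).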
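
A possible variant with the stripping applied, sketched under some assumptions: it uses Python 3 and the built-in html.parser instead of features="xml" (so lxml is not needed), which means tag names are lowercased. The name `index` is only illustrative.

```python
from bs4 import BeautifulSoup

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

# Assumption: html.parser instead of features="xml"; it lowercases tag names.
soup = BeautifulSoup(html, "html.parser")

index = {}
for doc in soup.find_all("doc"):
    # Use find() rather than attribute access here: with lowercase names,
    # doc.text is the built-in "all text in this tag" property, not the
    # <text> child element.
    docno = doc.find("docno").get_text(strip=True)
    head = doc.find("head").get_text(strip=True)
    body = doc.find("text").get_text(strip=True)
    index[docno] = (head, body)

print(index)
```

Keying the dictionary on the stripped document number means a later lookup such as `index["XXX-2222"]` does not depend on the spaces inside the tag.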

Update:

If there are multiple DOC elements in the same file and features="xml" only returns the first one, it is probably because the XML parser expects a single root element.

For example, if you wrap your entire input in a single root element, it should work:

<XMLROOT>
    <!-- Existing XML (e.g. list of DOC elements) -->
</XMLROOT>

You can either do this in your file or, as I would suggest, do it programmatically on the input text before you pass it to BeautifulSoup:

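root_element_name = "XMLROOT"  # this can be anything
rooted_html = "<{0}>\n{1}\n</{0}>".format(root_element_name, html)
soup = bs4.BeautifulSoup(rooted_html, features="xml")
# the XML parser is case-sensitive, so match the tag name as written in the file
docnos = soup.find_all('DOCNO')
for docno in docnos:
    print(docno.decode_contents())

You can also use decode_contents() to extract the inner contents of a tag as a string. In current bs4, renderContents() is a deprecated method that returns bytes rather than a string.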
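
The single-root requirement described in the update can also be seen without BeautifulSoup. The sketch below swaps in the standard library's xml.etree.ElementTree purely for illustration; the two DOC records, their headlines, and their texts are made-up placeholders. It assumes the input is otherwise well-formed XML (for example, no unescaped & characters).

```python
import xml.etree.ElementTree as ET

# Two placeholder records in the same shape as the question's data.
docs = """<DOC>
<DOCNO> AP880212-0166 </DOCNO>
<HEAD>First headline</HEAD>
<TEXT>
First text
</TEXT>
</DOC>
<DOC>
<DOCNO> AP880212-0167 </DOCNO>
<HEAD>Second headline</HEAD>
<TEXT>
Second text
</TEXT>
</DOC>
"""

# Without a wrapper there are two top-level elements, which is not
# well-formed XML, so a strict parser rejects it.
try:
    ET.fromstring(docs)
    parsed_unwrapped = True
except ET.ParseError:
    parsed_unwrapped = False

# Wrap everything in one arbitrarily named root element, then parse.
root = ET.fromstring("<XMLROOT>\n{0}\n</XMLROOT>".format(docs))
records = [
    (
        doc.findtext("DOCNO").strip(),
        doc.findtext("HEAD").strip(),
        doc.findtext("TEXT").strip(),
    )
    for doc in root.findall("DOC")
]

print(parsed_unwrapped)
print(records)
```

Each tuple in `records` links one document number with its headline and text, which is the structure the question asks for.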

