I'm trying to parse a local HTML file with BeautifulSoup but having trouble navigating the tree.
The file is in the following format:
<div class="contents">
<h1>
USERNAME
</h1>
<div>
<div class="thread">
N1, N2
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
They're just friends
</p>
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
MESSAGE
</p>
...
I want to extract, for each thread:
    for each div class='message':
        span class='user' and meta data
        p directly after
This is a long file with many of these threads and many messages within each thread.
So far I've just opened the file and turned it into a soup:
from bs4 import BeautifulSoup

with open('file.html', 'r') as raw_data:
    soup = BeautifulSoup(raw_data, 'html.parser')
contents = soup.find('div', {'class': 'contents'})
I'm looking at storing this data in a dictionary in the format:
dict[USERNAME] = [(MESSAGE1, time1), (MESSAGE2, time2)]
The username and meta info are relatively easy to grab, as they are nicely contained within their own span tags, each with a class identifier. The message itself is hanging around in loose paragraph tags; this is the trickier beast...
If you have a look at the "Going Sideways" section of the Beautiful Soup documentation, it says "You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree".
With this in mind, you can extract the parts you want with this:
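from bs4 import BeautifulSoup

your_html = "your html data"
# Name the parser explicitly so the result does not depend on what is installed
souped_data = BeautifulSoup(your_html, "html.parser")

for message in souped_data.find_all("div", {"class": "message"}):
    username = message.find("span", attrs={"class": "user"}).get_text(strip=True)
    meta = message.find("span", attrs={"class": "meta"}).get_text(strip=True)
    # The message text lives in the <p> that follows the message div
    text = message.find_next_sibling("p").get_text(strip=True)

First, find all the message tags. Within each, you can search for the user and meta class names. However, find() just returns the tag itself; use .get_text() to get the text inside the tag. Finally, grab your message content from the lonely old 'p' tag that follows the message div. Plain .next_sibling can hand back the whitespace between the tags rather than the 'p' tag itself, so .find_next_sibling("p") is the safer call here.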
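To see why the sibling step needs care, here is a minimal sketch run against a trimmed copy of the question's markup. The `sample` string is an assumption based on the format shown in the question, and `html.parser` is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Trimmed, assumed sample in the layout shown in the question
sample = """
<div class="thread">
<div class="message">
<div class="message_header">
<span class="user">USERNAME</span>
<span class="meta">Thursday, 1 January 2015 at 19:52 UTC</span>
</div>
</div>
<p>MESSAGE</p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
message = soup.find("div", class_="message")

# .next_sibling returns whatever node comes next, here the newline between tags
raw_sibling = message.next_sibling

# .find_next_sibling("p") skips text nodes and returns the next <p> tag
paragraph = message.find_next_sibling("p")

print(repr(raw_sibling))
print(paragraph.get_text(strip=True))
```

If the file had no whitespace between the closing div and the 'p' tag, .next_sibling would land on the 'p' directly, which is why the shorter form sometimes appears to work.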
That gets you the data you need. As for the dictionary structure, hmmm... I would throw them all in a list of dictionary objects, then JSONify that bad boy! However, maybe that's not what you need?
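If the dictionary keyed by username from the question is what you are after, something along these lines might do. It is only a sketch: `collect_messages` is a hypothetical helper name, the usernames "alice" and "bob" are made up, and it assumes every message div is followed by its own 'p' tag, as in the sample.

```python
import json
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample in the question's layout; "alice" and "bob" are made-up names
sample = """
<div class="thread">
N1, N2
<div class="message">
<div class="message_header">
<span class="user">alice</span>
<span class="meta">Thursday, 1 January 2015 at 19:52 UTC</span>
</div>
</div>
<p>First message</p>
<div class="message">
<div class="message_header">
<span class="user">bob</span>
<span class="meta">Thursday, 1 January 2015 at 19:53 UTC</span>
</div>
</div>
<p>Second message</p>
</div>
"""


def collect_messages(html):
    """Group (message, time) tuples by username."""
    soup = BeautifulSoup(html, "html.parser")
    by_user = defaultdict(list)
    for message in soup.find_all("div", class_="message"):
        user = message.find("span", class_="user")
        meta = message.find("span", class_="meta")
        body = message.find_next_sibling("p")
        # Skip anything that does not match the assumed layout
        if user is None or meta is None or body is None:
            continue
        by_user[user.get_text(strip=True)].append(
            (body.get_text(strip=True), meta.get_text(strip=True))
        )
    return dict(by_user)


messages = collect_messages(sample)
as_json = json.dumps(messages, indent=2)
print(as_json)
```

Note that json.dumps writes the tuples out as JSON arrays, so they come back as lists if you load the JSON again.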