
Parsing an HTML file with BeautifulSoup

I'm trying to parse a local HTML file with BeautifulSoup but am having trouble navigating the tree.

The file is in the following format:

<div class="contents">
  <h1>
    USERNAME
  </h1>
<div>
  <div class="thread">
     N1, N2
     <div class="message">
       <div class="message_header">
         <span class="user">
           USERNAME
         </span>
       <span class="meta">
         Thursday, 1 January 2015 at 19:52 UTC
       </span>
     </div>
   </div>
   <p>
      They're just friends
   </p>
   <div class="message">
       <div class="message_header">
         <span class="user">
           USERNAME
         </span>
       <span class="meta">
         Thursday, 1 January 2015 at 19:52 UTC
       </span>
     </div>
   </div>
   <p>
      MESSAGE
   </p>
 ...

I want to extract, for each thread:

for each div class='message' :

  • the span class='user' and meta data
  • the message in the p directly after

This is a long file with many of these threads and many messages within each thread.

So far I've just opened the file and turned it into a soup

from bs4 import BeautifulSoup
with open('file.html', 'r') as raw_data:
    soup = BeautifulSoup(raw_data, 'html.parser')

contents = soup.find('div', {'class' : 'contents'})

I'm looking at storing this data in a dictionary in the format

dict[USERNAME] = [(MESSAGE1, time1), (MESSAGE2, time2)]

The username and meta info are relatively easy to grab, as they are nicely contained within their own span tags, each with a class identifier. The message itself sits in loose paragraph tags outside the message div, which makes it the trickier part.

The "Going Sideways" section of the BeautifulSoup documentation says: "You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree".

With this in mind, you can extract the parts you want like this:

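from bs4 import BeautifulSoup

your_html = "your html data"

# Name the parser explicitly so the result does not depend on what is installed
souped_data = BeautifulSoup(your_html, "html.parser")
for message in souped_data.find_all("div", {"class": "message"}):
    username = message.find('span', attrs={'class': 'user'}).get_text(strip=True)
    meta = message.find('span', attrs={'class': 'meta'}).get_text(strip=True)
    # In indented HTML, .next_sibling is usually the whitespace after </div>,
    # so ask for the next <p> sibling instead
    text = message.find_next_sibling("p").get_text(strip=True)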

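First, find all the message divs. Within each one, search for the user and meta class names. find() returns the tag itself, so call .get_text() to get the text inside it. Finally, the message content lives in the 'p' tag that follows each message div as a sibling. Be aware that .next_sibling returns the very next node, which in indented HTML is usually the whitespace between tags, so .find_next_sibling('p') is the more reliable way to reach the paragraph.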
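To make the sibling step concrete, here is a minimal sketch on a small sample. The sample_html string and the username "alice" are made up for illustration; only the tag and class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the structure shown in the question
sample_html = """
<div class="thread">
  <div class="message">
    <div class="message_header">
      <span class="user">alice</span>
      <span class="meta">Thursday, 1 January 2015 at 19:52 UTC</span>
    </div>
  </div>
  <p>They're just friends</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
message_div = soup.find("div", class_="message")

# The node right after </div> is the newline and indentation, not the <p> tag
whitespace = message_div.next_sibling

# find_next_sibling("p") skips text nodes and returns the first following <p> tag
paragraph = message_div.find_next_sibling("p")
text = paragraph.get_text(strip=True)
```

If a message div ever has no paragraph after it, find_next_sibling returns None, so a real script would probably want to check for that before calling .get_text().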

That gets you the data you need. As for the dictionary structure, hmmm... I would throw them all into a list of dictionary objects, then JSONify that badboy! However, maybe that's not what you need?
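Both shapes can be sketched roughly like this, assuming the loop above has already collected (username, meta, message) tuples. The records list, its values, and the key names "user", "time" and "message" are hypothetical placeholders, not something taken from the original file.

```python
import json
from collections import defaultdict

# Hypothetical records, standing in for what the parsing loop would collect
records = [
    ("alice", "Thursday, 1 January 2015 at 19:52 UTC", "They're just friends"),
    ("bob", "Thursday, 1 January 2015 at 19:53 UTC", "Are you sure?"),
    ("alice", "Thursday, 1 January 2015 at 19:54 UTC", "Pretty sure"),
]

# Option 1: a flat list of dictionaries, which serialises directly to JSON
messages = [{"user": u, "time": t, "message": m} for u, t, m in records]
as_json = json.dumps(messages, indent=2)

# Option 2: the shape asked for in the question, a list of
# (message, time) tuples per username
by_user = defaultdict(list)
for u, t, m in records:
    by_user[u].append((m, t))
```

The flat list keeps the original message order across users, while the per-user dictionary makes lookups by username easy. Note that json.dumps turns the tuples in by_user into JSON arrays, so they come back as lists when loaded again.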
