
Extract content with BeautifulSoup and Python

I'm trying to scrape a forum but I can't handle the comments, because the users use emoticons, and bold font, and quote previous messages, and so on...

For example, here's one of the comments that I have a problem with:

<div class="content">
    <blockquote>
        <div>
            <cite>User write:</cite>
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">
        </div>
    </blockquote>
    <br/>
    THIS IS THE COMMENT THAT I NEED!
</div>

I've been searching for help for the last 4 days and I couldn't find anything, so I decided to ask here.

This is the code that I'm using:

from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)

    msg = soup.find("div", {"class": "content"})

    # msg holds the whole message, exactly as I wrote previously
    print(msg)

    # Here I get:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print(item)

I'm looking for a way to deal with messages in a general way, no matter how they are structured. Sometimes users put emoticons in the middle of the text and I need to remove them and get the whole message (in that situation, BeautifulSoup puts each part of the message (first part, emoticon, second part) in a different child node).

Thanks in advance!

Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose

decompose() removes the tags that you don't want from the tree and destroys them. In your case:

soup.blockquote.decompose()

or, for all unwanted tags (find() only returns the first match, so loop over find_all()):

for name in ['blockquote', 'img', ... ]:
    for tag in soup.find_all(name):
        tag.decompose()

Your example:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
...     <blockquote>
...         <div>
...             <cite>User write:</cite>
...                I DO NOT WANT THIS  <img class="smilies" alt=":116:"    title="116">
...         </div>
...     </blockquote>
...     <br/>
...     THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'

Update

Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...    txt += elm.string
...    elm = elm.next_sibling
... 
>>> print(txt)
Yes.Yes!Yes?
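One caveat: `.string` is `None` when a sibling tag has more than one child, so `txt += elm.string` can raise a `TypeError` on richer markup. A slightly more defensive sketch of the same walk, using the `.next_siblings` generator and falling back to `get_text()` for tags:

```python
from bs4 import BeautifulSoup, NavigableString

html = "<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"
soup = BeautifulSoup(html, "html.parser")

parts = []
for elm in soup.blockquote.next_siblings:
    if isinstance(elm, NavigableString):
        parts.append(str(elm))        # plain text node
    else:
        parts.append(elm.get_text())  # a tag: flatten all its nested text
print("".join(parts))                 # Yes.Yes!Yes?
```

Note that `.next_siblings` stops at the end of the parent `<div>`, so the trailing "No!" outside it is never visited.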

BeautifulSoup has a get_text method. Maybe this is what you want.

From their documentation:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
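get_text() also accepts a separator string and a strip flag, which helps when you want the surrounding whitespace cleaned up in one call; a small sketch on the same markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

# strip=True trims each text node; non-empty pieces are joined with the separator
text = soup.get_text(" ", strip=True)
print(text)  # I linked to example.com
```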

If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:

html = '<div class="content">\
    <blockquote>\
        <div>\
            <cite>User write:</cite>\
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">\
        </div>\
    </blockquote>\
    <br/>\
    THIS IS THE COMMENT THAT I NEED!\
</div>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.find_all(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)

This gives:

THIS IS THE COMMENT THAT I NEED!

To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.
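That said, the forum in the question marks its emoticons as `<img class="smilies">`, so one way to sketch this without a hand-made list is to replace each such tag with its `alt` text (or decompose() it to drop it entirely). The HTML fragment below is made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div class="content">first part '
        '<img alt=":116:" class="smilies" title="116"/> second part</div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='content')

# Replace every emoticon <img> with its alt text so the surrounding
# text stays in one piece; use img.decompose() instead to drop it.
for img in div.find_all('img', class_='smilies'):
    img.replace_with(img.get('alt', ''))

print(div.get_text())  # first part :116: second part
```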
