
Problems scraping a web page using Python

Hi, I'm quite new to Python and my boss has asked me to scrape this data. It is not my strong point, so I was wondering how I would go about it.

The text I'm after sits inside the quote marks and changes every few minutes, so I'm also not sure how to locate it.

I am using Beautiful Soup and lxml at the moment, but if there are better alternatives I'm happy to try them.

This is the inspected element of the webpage:

<div class="sometext">
<h3> somemoretext </h3>
<p>
<span class = "title" title="text i want">text i want</span>
<br>
</p>

I have tried using:

from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[@class="title"]/text()')
print r

Thank you in advance; any help would be appreciated!

Perhaps find is the method you really need, since you're only ever looking for one element (see the Beautiful Soup docs).

r = soup.find('div', 'sometext').find('span','title')['title']
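
A minimal sketch of that chained lookup, run against a copy of the markup from the question instead of a live page. The `sample_html` string, its closing `</div>`, and the `None` guards are assumptions added for illustration; it is written for Python 3 and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the real page, based on the snippet in the question.
sample_html = """
<div class="sometext">
<h3> somemoretext </h3>
<p>
<span class="title" title="text i want">text i want</span>
<br>
</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A string passed as the second positional argument is matched
# against the element's CSS class.
div = soup.find("div", "sometext")
span = div.find("span", "title") if div is not None else None

# find() returns None when nothing matches, so guard before indexing.
title_value = span["title"] if span is not None else None
print(title_value)
```

Because the value is read from the parsed page each time, it does not matter that the text changes every few minutes; only the surrounding tags and classes need to stay the same.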

First, print the soup to see what you are actually working with:

# pass the response body (page.text), not the Response object itself
soup = BeautifulSoup(page.text)
print(soup)

That way you can double-check that you are actually dealing with what you think you are dealing with.

Then do this:

r = soup.findAll('span', attrs={"class": "title"})
for span in r:
    print(span.text)

This gets every span tag with class="title"; span.text then gives the text between the opening and closing tags.

Edited to Add

Note that esecules' answer gets the value of the title attribute of the tag ( <span class="title" title="text i want"> ), whereas mine gets the text between the tags ( <span class="title">text i want</span> ).
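
To make that distinction concrete, here is a small sketch using hypothetical markup in which the attribute and the visible text differ (in the question they happen to be identical). The `sample_html` string is an assumption; `find_all` is the current spelling of `findAll` in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup where the attribute and the visible text differ.
sample_html = '<p><span class="title" title="from attribute">from text</span></p>'

soup = BeautifulSoup(sample_html, "html.parser")

for span in soup.find_all("span", attrs={"class": "title"}):
    attribute_value = span["title"]  # value of the title="..." attribute
    inner_text = span.text           # text between <span> and </span>
    print(attribute_value, "|", inner_text)
```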

If you're familiar with XPath and don't need features specific to BeautifulSoup, then lxml alone is enough (and perhaps even better, since lxml is known to be faster):

from lxml import html
import requests

page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print(r)
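
The same XPath can be tried offline against the markup from the question. This is only a sketch: `sample_html` and its closing `</div>` are assumed, and the second query (`@title`) is added to show how the attribute could be read instead of the text. It also suggests why the original attempt returned nothing: BeautifulSoup's findAll matches tag names and attribute filters, not XPath expressions, whereas lxml's xpath() accepts them directly.

```python
from lxml import html

# Assumed stand-in for the real page, based on the snippet in the question.
sample_html = """
<div class="sometext">
<h3> somemoretext </h3>
<p>
<span class="title" title="text i want">text i want</span>
<br>
</p>
</div>
"""

root = html.fromstring(sample_html)

# xpath() returns a list, even when only one node matches.
texts = root.xpath('//span[@class="title"]/text()')    # text between the tags
titles = root.xpath('//span[@class="title"]/@title')   # title attribute values

print(texts)
print(titles)
```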


 