
Python - Scrapy unable to fetch data

I am just starting out with Python/Scrapy.

I have written a spider that crawls a website and fetches information, but I am stuck in two places.

  1. I am trying to retrieve the telephone numbers from a page, and they are marked up like this:

     <span class="mrgn_right5">(+001) 44 42676000,</span> <span class="mrgn_right5">(+011) 44 42144100</span> 

The code I have is:

getdata = soup.find(attrs={"class": "mrgn_right5"})
if getdata:
    aditem['Phone'] = getdata.get_text().strip()
    #print phone

But it fetches only the first number and not the second one. How can I fix this?

  2. On the same page there is another set of information.

I am using this code:

    getdata = soup.find(attrs={"itemprop":"pricerange"})
    if getdata:
        #print getdata
        aditem['Pricerange']=getdata.get_text().strip()
        #print pricerange

But it is not fetching anything.

Any help on fixing these two would be great.

According to the Beautiful Soup documentation, find returns only a single result (the first match). If multiple results are expected, use find_all instead. Since there are two matching elements here, a list is returned, so its elements need to be joined together (for example) before they are stored in the Phone field of your AdItem.

getdata = soup.find_all(attrs={"class": "mrgn_right5"})
if getdata:
    # Join with a space so the two numbers do not run together.
    aditem['Phone'] = ' '.join([x.get_text().strip() for x in getdata])
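To see the difference between find and find_all on the markup from the question, here is a small standalone sketch. It assumes Beautiful Soup 4 with the built-in "html.parser" backend; the variable names (first_only, phones, phone_field) are only illustrative and are not part of the original spider.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = (
    '<span class="mrgn_right5">(+001) 44 42676000,</span> '
    '<span class="mrgn_right5">(+011) 44 42144100</span>'
)
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching element.
first_only = soup.find(attrs={"class": "mrgn_right5"}).get_text().strip()

# find_all() returns every matching element as a list-like ResultSet.
spans = soup.find_all(attrs={"class": "mrgn_right5"})

# Strip surrounding whitespace and the trailing comma from each number.
phones = [span.get_text().strip().rstrip(",") for span in spans]
phone_field = ", ".join(phones)

print(first_only)
print(phone_field)
```

Keeping the numbers in a list until the last moment makes it easy to store them either as one joined string or as separate values, depending on how the item field is used later.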

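For the second issue, the value appears to be stored in the element's content attribute rather than in its text, so get_text() has nothing to return. You need to read the attribute from the returned object instead. Try the following: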

getdata = soup.find(attrs={"itemprop":"pricerange"})
if getdata:
    aditem['Pricerange'] = getdata.attrs['content']
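The sketch below illustrates why get_text() comes back empty while the attribute lookup works. The markup is an assumption, since the page's actual snippet is not shown in the question: it supposes the price range sits in a meta tag's content attribute, and the value "100 to 500" is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumed shape of the element, not taken from the page.
html = '<div><meta itemprop="pricerange" content="100 to 500"></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(attrs={"itemprop": "pricerange"})

# A <meta> tag has no text content, so get_text() yields an empty string.
text_value = tag.get_text().strip()

# The value lives in the "content" attribute instead.
content_value = tag.get("content")

# Tag.get() returns None for a missing attribute instead of raising KeyError.
missing_value = tag.get("no-such-attribute")

print(repr(text_value), repr(content_value), repr(missing_value))
```

Using tag.get("content") rather than tag.attrs['content'] is a small safety measure: if some pages omit the attribute, the spider gets None back instead of crashing with a KeyError.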

And for the address information, the following code works but is very hacky and could no doubt be improved by someone who understands Beautiful Soup better than me.

getdata = soup.find(attrs={"itemprop":"address"})
address = getdata.span.get_text()
addressLocality = getdata.meta.attrs['content']
addressRegion = getdata.find(attrs={"itemprop":"addressRegion"}).attrs['content']
postalCode = getdata.find(attrs={"itemprop":"postalCode"}).attrs['content']
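One possible way to make the address lookup less fragile is to wrap the repeated find-then-read pattern in a small helper that tolerates missing elements. This is only a sketch: the helper name itemprop_value, the markup, and all the address values are hypothetical, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

def itemprop_value(parent, name):
    """Return the value of the first descendant with the given itemprop.

    Prefers the "content" attribute and falls back to the element's text.
    Returns None if the parent or the element is missing.
    """
    if parent is None:
        return None
    tag = parent.find(attrs={"itemprop": name})
    if tag is None:
        return None
    return tag.get("content") or tag.get_text(strip=True) or None

# Hypothetical markup standing in for the real page.
html = """
<div itemprop="address">
  <span itemprop="streetAddress">12 Example Street</span>
  <meta itemprop="addressLocality" content="Exampletown">
  <meta itemprop="addressRegion" content="EX">
  <meta itemprop="postalCode" content="000000">
</div>
"""
soup = BeautifulSoup(html, "html.parser")
block = soup.find(attrs={"itemprop": "address"})

fields = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]
address = {name: itemprop_value(block, name) for name in fields}
print(address)
```

Looking each field up by its itemprop name avoids relying on the first span or the first meta tag being the right one, and a missing field simply comes back as None.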
