
About Web Scraping and missing data

I am trying to scrape some data from a Yelp page. However, some values are missing from the result, and which data is missing changes every time I execute the code (e.g. on the first execution 2 records are missing, on the second execution 1 record is missing). Do you guys know why this happens? Thank you!!

import time
import requests as r
import pandas as pd
from bs4 import BeautifulSoup

review_listings = []
cols2 = ['restaurant name', 'username', 'ratings', 'review.text']

copy = 0
for url in data_rev['url']:  # data_rev and pages are defined earlier; each page holds 20 reviews
    start = time.time()
    for p in pages:
        url_review = url + "&start={}".format(str(p))
        page = r.get(url_review)
        soup = BeautifulSoup(page.content, 'html.parser')
        res_name = soup.find("h1", {"class": "lemon--h1__373c0__2ZHSL heading--h1__373c0___56D3 undefined heading--inline__373c0__1jeAh"}).text
        tables = soup.findAll('li', {'class': 'lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY padding-b3__373c0__342DA border--bottom__373c0__3qNtD border-color--default__373c0__3-ifU'})
        if len(tables) == 0:
            print(url_review)
            break
        else:
            for table in tables:
                # name, ratings, username:
                username = table.find("span", {"class": "lemon--span__373c0__3997G text__373c0__2Kxyz fs-block text-color--blue-dark__373c0__1jX7S text-align--left__373c0__2XGa- text-weight--bold__373c0__1elNz"}).a.text
                ratings = table.find("span", {"class": "lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"}).div.get("aria-label")
                text = table.find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
                review_listings.append([res_name, username, ratings, text])

            rev_df = pd.DataFrame.from_records(review_listings, columns=cols2)

    size_df = len(rev_df)
    print("review sizes are =>", size_df - copy)
    print(res_name)
    copy = size_df
    end = time.time()
    print(end - start)

It appears that all the data you're interested in is stored as JSON in the page source. This could be a more reliable way to grab information from this page:

import re
import json
import requests
from bs4 import BeautifulSoup

## Using headers is always a good practice
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

response = requests.get('https://www.yelp.com/biz/saku-vancouver-3', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Let's find the 'script' tag which contains the restaurant's information
data_tag = soup.find('script', text=re.compile('"@type":'))

# Load it properly as JSON
data = json.loads(data_tag.text)

print(data)

Output

{'@context': 'https://schema.org',
'@type': 'Restaurant',
'name': 'Saku',
'image': 'https://s3-media0.fl.yelpcdn.com/bphoto/_TjVeAVRczn0yITxvBqrCA/l.jpg',
'priceRange': 'CA$11-30',
'telephone': '',
'address': {'streetAddress': '548 W Broadway',
'addressLocality': 'Vancouver',
'addressCountry': 'CA',
'addressRegion': 'BC',
'postalCode': 'V5Z 1E9'},
'review': [{'author': 'Jackie L.',
  'datePublished': '1970-01-19',
  'reviewRating': {'ratingValue': 5},
  'description': 'With restaurants .... }
  ...]
}
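Once parsed, the review entries can be pulled straight out of that dict. A minimal sketch, using a hypothetical sample shaped like the output above in place of the live `data`:

```python
# Hypothetical sample mimicking the JSON-LD dict parsed above.
data = {
    '@type': 'Restaurant',
    'name': 'Saku',
    'review': [
        {'author': 'Jackie L.',
         'datePublished': '1970-01-19',
         'reviewRating': {'ratingValue': 5},
         'description': 'With restaurants ...'},
    ],
}

# One row per review: (restaurant, author, rating, text)
rows = [
    (data['name'], rev['author'], rev['reviewRating']['ratingValue'], rev['description'])
    for rev in data.get('review', [])
]
print(rows[0])
```

Note that the JSON-LD block only carries the reviews embedded in that one page; for full coverage you would still need to paginate.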

Try the following to get the restaurant name, the names of all the reviewers, their reviews, and their ratings across multiple pages. Of course, that's assuming you haven't been blocked by the site already.

import requests

url = 'https://www.yelp.com/biz/XAH2HpuUUtu7CUO26pbs4w/review_feed?'

params = {
    'rl': 'en',
    'sort_by': 'relevance_desc',
    'q': '',
    'start': ''
}

page = 0

while True:
    params['start'] = page
    res = requests.get(url, params=params)
    reviews = res.json()['reviews']
    if not reviews:
        break
    for item in reviews:
        restaurant = item['business']['name']
        rating = item['rating']
        user = item['user']['markupDisplayName']
        review = item['comment']['text']
        print(restaurant, rating, user, review)

    page += 20
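The same loop can feed the asker's `review_listings`/DataFrame pipeline directly. A minimal sketch, using a hypothetical one-page payload shaped like the `review_feed` items above instead of a live request:

```python
import pandas as pd

cols2 = ['restaurant name', 'username', 'ratings', 'review.text']
review_listings = []

# Hypothetical payload mimicking one page of res.json()['reviews'].
reviews = [
    {'business': {'name': 'Saku'},
     'rating': 5,
     'user': {'markupDisplayName': 'Jackie L.'},
     'comment': {'text': 'Great ramen.'}},
]

for item in reviews:
    review_listings.append([
        item['business']['name'],
        item['user']['markupDisplayName'],
        item['rating'],
        item['comment']['text'],
    ])

rev_df = pd.DataFrame.from_records(review_listings, columns=cols2)
print(rev_df.shape)
```

In the real loop, the `for item in reviews:` body would simply replace the `print(...)` call, accumulating rows across every page before building the DataFrame once at the end.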
