Data is missing while scraping using beautifulsoup4

I'm new to parsing with Python and BeautifulSoup4. I was scraping this website and need the Current Price Per Mil shown on its front page.

I have already spent 3 hours on this. While looking for a solution online, I learned that the PyQt4 library can mimic a web browser: it loads the page, and once loading has finished you can extract the data you need. But my attempt with it crashed.

I used the approach below to collect the data as raw text. I tried other approaches too.

import requests
from bs4 import BeautifulSoup

def parseMe(url):
    # Download the raw HTML; scripts on the page are not executed here
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    osrs_text = soup.find('div', class_='col-md-12 text-center')
    print(osrs_text.encode('utf-8'))

Please have a look at this image. I think the problem is with the ::before and ::after pseudo-elements, which appear only once the page has loaded.
Any help will be highly appreciated.

You should use selenium instead of requests:

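from selenium import webdriver
from bs4 import BeautifulSoup

def parse(url):
    # Raw string so the backslashes in the Windows path are not read as escapes.
    # Passing the driver path positionally is the Selenium 3 style.
    driver = webdriver.Chrome(r'D:\Programming\utilities\chromedriver.exe')
    try:
        driver.get(url)
        # page_source is the page as the browser currently holds it
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        return soup.find('h4', {'id': 'curr-price-per-mil-text'}).text
    finally:
        driver.quit()  # close the browser even if the lookup fails

parse('https://boglagold.com/buy-runescape-gold/')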

Output:

'Current Price Per Mil: 0.80USD'

The reason is that the value of that element is filled in by JavaScript, which requests can't run. This snippet uses the Chrome driver; if you prefer, you can use the equivalent for Firefox or another browser. Either way, you will need to install the selenium library and find the matching browser driver yourself.
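To see why requests plus BeautifulSoup comes back without the price, here is a small sketch on a made-up HTML snippet. The markup and the script in it are assumptions for illustration, not the site's real source. The parser only sees the downloaded text; it never runs the script that would fill in the value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading ships with a placeholder 0, and a script
# would replace it once a real browser runs the page.
STATIC_HTML = """
<div class="col-md-12 text-center">
  <h4 id="curr-price-per-mil-text">Current Price Per Mil: <span id="price">0</span>USD</h4>
</div>
<script>
  document.getElementById("price").textContent = "0.80";
</script>
"""

def heading_text(html):
    """Return the heading text exactly as an HTML parser sees it."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h4", {"id": "curr-price-per-mil-text"})
    return heading.get_text()

static_text = heading_text(STATIC_HTML)
print(repr(static_text))  # the placeholder is still there; the script never ran
```

A browser driven by selenium would execute the script first, which is why its page source can differ from what requests downloads.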

The web page makes an XHR to fetch a JSON file with the buy price in it:

import requests

r = requests.get('https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null')
j = r.json()
# print(j)
print('sellPrice', j['sellPrice'])
print('buyPrice', j['buyPrice'])

Output:

sellPrice 0.8
buyPrice 0.62
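If the goal is the same line of text the page displays, the field can be formatted by hand. This is a minimal sketch, assuming the payload keeps the `sellPrice` and `buyPrice` keys seen in the output above. The `format_price` helper and the sample body are made up for illustration, and the real response may carry more keys.

```python
import json

# Sample body limited to the two fields shown in the output above.
SAMPLE_BODY = '{"sellPrice": 0.8, "buyPrice": 0.62}'

def format_price(payload, key="sellPrice"):
    """Build the 'Current Price Per Mil' line from a decoded JSON payload."""
    price = payload.get(key)
    if price is None:
        raise KeyError(f"{key!r} not found in payload")
    # Two decimals, so 0.8 is rendered as 0.80
    return f"Current Price Per Mil: {price:.2f}USD"

payload = json.loads(SAMPLE_BODY)
line = format_price(payload)
print(line)
```

With a live response, `payload` would simply be the result of `r.json()`.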

As mentioned in the other answers, the page itself only contains the text Current Price Per Mil: and 0USD. The value in the middle, 0.8, is fetched dynamically with JS from the URL shown below (which can be found using a process described, for example, here and in many other places). That site checks for bots, so you have to use a workaround such as the browser-like User-Agent header in the code below (described, for example, here).

So, all together:

import requests

url = 'https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

response.json()['sellPrice']

Output:

0.8

The issue is that JavaScript dynamically adds the data you want to scrape to that page. One option is to run the JS on the client side, wait for it to fetch the data, and then read the DOM contents; if you want to do it that way, see @gmds's answer to this question. The other option is to check which requests the JavaScript code makes and which of them returns the information you need. You can then make the same request(s) from Python and get the data without PyQt4 or even BS4.
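The second option can be sketched as a small function that calls the JSON endpoint directly. Everything here is an assumption layered on the answers in this thread: the endpoint URL is the one quoted there, the User-Agent string is just an example, and `fetch_sell_price`, `FakeResponse`, and `fake_get` are hypothetical names. The fake getter only exists so the function can be exercised offline.

```python
API_URL = "https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null"
# Any browser-like User-Agent string; this one is only an example.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

def fetch_sell_price(url=API_URL, get=None, timeout=10):
    """Call the JSON endpoint directly and return its sellPrice as a float."""
    if get is None:
        import requests  # lazy import so a stub can stand in during tests
        get = requests.get
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx, e.g. a bot block
    data = response.json()
    return float(data["sellPrice"])


class FakeResponse:
    """Offline stand-in for requests.Response with the fields seen above."""
    def raise_for_status(self):
        return None

    def json(self):
        return {"sellPrice": 0.8, "buyPrice": 0.62}

def fake_get(url, headers=None, timeout=None):
    """Hypothetical getter that returns canned data instead of using the network."""
    return FakeResponse()

offline_price = fetch_sell_price(get=fake_get)
print(offline_price)
```

Calling `fetch_sell_price()` with no arguments would go through requests to the live endpoint; whether that URL still responds the same way is not something this sketch can guarantee.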
