
How to scrape EXACT information from a crypto website

I've been working on a web scraper for the CoinEx website so I can have the live trades of Bitcoin in my program. I scraped this link expecting to get all the information in the elements with class_="ticker-item", but the return was "--". I think it's something to do with a scraping policy, but is there a way I can bypass this, like mimicking whatever a regular browser sends? I also tried using headers, but the result was the same. My code:

import requests
from bs4 import BeautifulSoup

url = "https://coinex.com/exchange/btc-usdt"

# Pretend to be a regular desktop browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, "html5lib")

# Grab every element that shows ticker data in the browser view.
trades = soup.find_all("div", class_="ticker-item")

print(trades[0].div.text)

Result:

--

It seems the problem is that the HTML you see when viewing the page in the browser is not the same HTML that BeautifulSoup receives. The likely reason is that the ticker items are filled in by JavaScript, which the browser executes for you but requests/BeautifulSoup does not.
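You can see the effect offline with a minimal snippet that mimics the server-rendered markup (the real page's markup will differ; this just illustrates why the placeholder comes back):

```python
from bs4 import BeautifulSoup

# Stand-in for the server-rendered HTML: the ticker value is only a
# placeholder that JavaScript would replace in a real browser, so this
# is all that requests/BeautifulSoup ever sees.
raw_html = '<div class="ticker-item"><div>--</div></div>'

soup = BeautifulSoup(raw_html, "html.parser")
print(soup.find("div", class_="ticker-item").div.text)  # --
```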

If you want to get the data, you are probably best off finding their API, if they have one. Otherwise, open the page with your browser's developer tools (inspect) and look at the Network tab. There you can find where the website is pulling its data from. It will take some digging, but somewhere in there you should be able to find another URL that the website gets the data from; you can then request that URL directly instead. The data will probably be easier to extract that way as well.
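For illustration, suppose the Network tab reveals a JSON endpoint that returns recent trades. The URL and the response shape below are assumptions for the sketch, not CoinEx's documented API; substitute whatever you actually find in the Network tab:

```python
# Hypothetical endpoint discovered via the browser's Network tab --
# both the URL and the payload layout are assumptions for this sketch.
DEALS_URL = "https://coinex.com/some/discovered/endpoint"

def parse_deals(payload):
    """Extract (price, amount) pairs from a JSON payload assumed to be
    shaped like {"data": [{"price": ..., "amount": ...}, ...]}."""
    return [(deal["price"], deal["amount"]) for deal in payload.get("data", [])]

# Live usage would be:
#   parse_deals(requests.get(DEALS_URL, timeout=10).json())
# Offline demonstration with a made-up payload of the assumed shape:
sample = {"data": [{"price": "29350.1", "amount": "0.004"}]}
print(parse_deals(sample))  # [('29350.1', '0.004')]
```

JSON endpoints like this are usually far easier to work with than scraping rendered HTML, since the fields arrive already structured.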

If you want a quick-and-dirty method, you can use the requests-html module. It renders the webpage for you, including all the scripts, because it drives a web browser under the hood. The output will therefore be the same HTML you would see if you opened the website in a browser, and your extraction method should work on it. Of course this has a lot of overhead, because it spawns browser processes, but it can be useful in some circumstances.
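A minimal sketch of that approach, reusing the CSS selector from the question (render() downloads Chromium on first use and is slow, so the import is kept lazy):

```python
def fetch_ticker_texts(url, selector=".ticker-item"):
    """Render the page with requests-html, then read the ticker items."""
    # Imported lazily so the rest of the program works without the package.
    from requests_html import HTMLSession  # pip install requests-html

    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # executes the page's JavaScript, filling the "--" placeholders
    return [element.text for element in r.html.find(selector)]

# Usage (slow on first run while Chromium is downloaded):
# print(fetch_ticker_texts("https://coinex.com/exchange/btc-usdt"))
```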
