
How to scrape EXACT information from a crypto website

I've been working on a web scraper for the CoinEx website so that I can have live Bitcoin trades in my program. I scraped the page at the URL in the code below, expecting to get all the information in the elements with class_="ticker-item", but the result was "--". I think it has something to do with the site's scraping policy. Is there a way to get around this, for example by mimicking what a regular browser sends? I also tried setting headers, but the result was the same. My code:

import requests
from bs4 import BeautifulSoup

url = "https://coinex.com/exchange/btc-usdt"

# Browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'}

# Download the page and parse the returned HTML
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html5lib")

# Find every ticker item and print the text of the first one's inner div
trades = soup.find_all("div", class_="ticker-item")
print(trades[0].div.text)

Result:

--

It seems the problem is that the HTML you see when viewing the page in the browser is not the same HTML that BeautifulSoup receives. The ticker items are probably filled in by JavaScript after the page loads. A browser runs those scripts for you, but requests only downloads the initial markup and BeautifulSoup only parses it, so all you get is the "--" placeholder.
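To illustrate this, here is a minimal sketch that uses only the standard library, so it runs without a network connection. The STATIC_HTML string is a made-up stand-in for what a server might send before any script runs (it is not CoinEx's actual markup), and TickerText is a hypothetical helper name. The parser only sees the placeholder, because nothing ever executes the script:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the initial markup; NOT the real CoinEx page.
STATIC_HTML = """
<div class="ticker-item"><div>--</div></div>
<script>
  // In a real browser, a script like this would replace "--" with live data.
  document.querySelector(".ticker-item div").textContent = "42000.00";
</script>
"""


class TickerText(HTMLParser):
    """Collect text found inside elements whose class list has 'ticker-item'."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a ticker-item element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "ticker-item" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


parser = TickerText()
parser.feed(STATIC_HTML)
print(parser.texts)  # ['--'] because the script is never executed
```

The same thing happens with requests plus BeautifulSoup: the script is downloaded as text, but it is never run.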

If you want to get the data, you are probably best off finding their API, if they have one. Otherwise, open the page in your browser's developer tools (right-click, Inspect) and look at the Network tab, which shows every request the page makes. It will take some digging, but somewhere in there you should find the URL the page pulls its data from. You can then request that URL directly instead of the page. The data will probably be easier to extract that way as well, since such endpoints usually return structured data rather than HTML.
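Once you have found such a URL, the scraper might look roughly like the sketch below. Everything specific here is an assumption: API_URL is a placeholder, not a real CoinEx endpoint, and the JSON layout in SAMPLE is invented for illustration, so last_price will need adjusting to whatever the real response looks like. It uses urllib from the standard library to avoid extra dependencies; requests.get(url).json() would do the same job.

```python
import json
from urllib.request import Request, urlopen

# Placeholder: replace with the URL you found in the browser's Network tab.
API_URL = "https://example.com/api/ticker?market=BTCUSDT"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Download a URL and decode its body as JSON."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def last_price(payload: dict) -> float:
    """Pull the last traded price out of the ASSUMED payload shape."""
    return float(payload["data"]["ticker"]["last"])


# Invented response shape, for illustration only; the real one will differ.
SAMPLE = json.loads('{"data": {"ticker": {"last": "42000.5"}}}')

if __name__ == "__main__":
    print(last_price(SAMPLE))
    # Live call, once API_URL points at a real endpoint:
    # print(last_price(fetch_json(API_URL)))
```

Keeping the download and the parsing in separate functions means you can test the parsing against a saved response without hitting the site every time.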

If you want a quick and dirty method, you can use the requests-html module. When you call its render() method, it renders the page for you, including all the scripts, because it drives a headless web browser (Chromium) under the hood. The output should therefore be the same HTML you would see if you opened the website in a browser, and your extraction method should work on it. Of course this has a lot of overhead, because it spawns browser processes, but it can be useful in some circumstances.
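A rough sketch of that approach is below, assuming requests-html is installed (pip install requests-html); it downloads Chromium the first time render() runs. The function names and the ".ticker-item" selector default are my own choices based on the question, and the sleep and timeout values are guesses that may need tuning. I have not run it against CoinEx, so treat it as a starting point rather than a tested solution.

```python
def rendered_ticker_texts(url: str, selector: str = ".ticker-item") -> list:
    """Render the page in headless Chromium and return the matching texts."""
    # Imported here so the helper below can be used without requests-html.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # Executes the page's JavaScript; sleep gives the scripts time
        # to replace the placeholders with real values.
        response.html.render(sleep=2, timeout=20)
        return [element.text for element in response.html.find(selector)]
    finally:
        session.close()


def still_placeholder(texts, placeholder: str = "--") -> bool:
    """True if nothing was found or every item still shows the placeholder."""
    return all(text.strip() == placeholder for text in texts)
```

You would call it as rendered_ticker_texts("https://coinex.com/exchange/btc-usdt") and, if still_placeholder returns True for the result, try a longer sleep before concluding that rendering does not help.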


 