BeautifulSoup - Scraping data through paginated table using Python

I am scraping data from a betting site ( https://www.pointdevente.parionssport.fdj.fr/parisouverts/rugby ).

I can scrape a limited number of events on the current page. The issue I am facing is that I am unable to scrape the rest of the data in the table. How do I go to the next page or link?

Following is my code:

import urllib2
from urllib2 import urlopen
import requests
import dryscrape
from bs4 import BeautifulSoup

dryscrape.start_xvfb()
SessionFDJ = dryscrape.Session()
SessionFDJ.visit('https://pointdevente.parionssport.fdj.fr/parisouverts/rugby/')
ResponseFDJ = SessionFDJ.body()
print(ResponseFDJ)

This page uses JavaScript to load and update all of its data. Use DevTools in Chrome/Firefox to see which files/URLs the browser requests, and you will see

https://www.pointdevente.parionssport.fdj.fr/api/1n2/offre?sport=964500

which gives all data as JSON.

It seems this page uses an API, so look for the API documentation and you will not need BeautifulSoup.


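import requests

url = 'https://www.pointdevente.parionssport.fdj.fr/api/1n2/offre?sport=964500'

r = requests.get(url)
# Decode the JSON response body into a list of events.
data = r.json()

# Each event carries the match name under the 'label' key.
for x in data:
    print(x['label'])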

Result:

Biarritz-Perpignan
Kenya-France
Australie-Japon
Etats-Unis-Ecosse
Argentine-Pays de Galles
Angleterre-Samoa
Montauban-Colomiers
Bourgoin-Angoulême
Aurillac-Mt-de-Marsan
Dax-Albi
Vannes-Béziers
Ospreys-Edimbourg
Glasgow-Munster
Sale-Exeter
Bath-Saracens
Pau-Clermont
Zebre-Llanelli
Angleterre-Australie
Connacht-Trévise
Gloucester-Bristol
Leicester-Northampton
Cardiff-Ulster
Grenoble-Montpellier
Lyon-Castres
St.Français-Bayonne
Leinster-Newport
La Rochelle-Racing 92
Toulouse-Brive
Narbonne-Oyonnax
Worcester-Wasps
Newcastle-Harlequins
Toulon-Bordeaux
Fidji-Canada
NlleZélande-Russie
Agen-Carcassonne
AfriqueduSud-Ouganda
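
The URL above selects the sport through its `sport` query parameter. As a minimal sketch, the same URL can be built from a parameter dictionary instead of a hardcoded string. The helper name `build_offer_url` is hypothetical, and whether the API accepts other sport ids or extra parameters is an assumption that the answer does not confirm.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint path taken from the URL in the answer above.
BASE_URL = "https://www.pointdevente.parionssport.fdj.fr/api/1n2/offre"


def build_offer_url(sport_id):
    """Return the offer URL for one sport id (hypothetical helper)."""
    # urlencode escapes the values and joins them as key=value pairs.
    return BASE_URL + "?" + urlencode({"sport": sport_id})


# 964500 is the id used in the URL above.
rugby_url = build_offer_url(964500)
print(rugby_url)

# Round-trip check: the query string parses back to the parameter passed in.
print(parse_qs(urlsplit(rugby_url).query))
```

With requests, passing `params={'sport': 964500}` to `requests.get` together with the base URL produces the same query string.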

This is a client-rendered application; there is no table info in the HTML that you can get via urllib. All data is retrieved and rendered with JavaScript. That makes it even easier: you don't have to parse HTML.
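
To illustrate why an HTML parser finds nothing to scrape, here is a small sketch that counts tags in the kind of initial HTML a client-rendered page typically serves. The markup in `SAMPLE_SHELL_HTML` is a made-up stand-in, not the real site's source, and `TagCounter` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of a client-rendered page: an empty container
# plus a script that fills it in later. This is NOT the real site's markup.
SAMPLE_SHELL_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(SAMPLE_SHELL_HTML)

# The static markup contains no <table> or <tr> tags; in a client-rendered
# page those rows only appear after the script has run in a browser.
print(counter.counts.get("table", 0), counter.counts.get("tr", 0))
print(counter.counts.get("script", 0))
```

Running the same kind of count on the HTML you actually download is a quick way to tell whether the data is in the page or loaded separately.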

Here is a link that has the necessary data: https://www.pointdevente.parionssport.fdj.fr/api/1n2/offre?sport=964500

It returns JSON with all events; you can use Python's json library to parse it.
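
A minimal sketch of that approach using only the standard library is shown here. The function names are hypothetical, the sample payload is invented for illustration (only the `label` key appears in the answers on this page), and the sketch assumes the response is UTF-8 encoded. It does not verify that the endpoint still responds.

```python
import json
from urllib.request import urlopen

URL = "https://www.pointdevente.parionssport.fdj.fr/api/1n2/offre?sport=964500"


def extract_labels(payload_text):
    """Parse a JSON array of events and return each event's 'label'."""
    events = json.loads(payload_text)
    # Skip events that lack a 'label' key instead of raising KeyError.
    return [event["label"] for event in events if "label" in event]


def fetch_labels(url=URL, timeout=10):
    """Download the JSON document and return the event labels (needs network)."""
    with urlopen(url, timeout=timeout) as response:
        # Assumption: the body is UTF-8 encoded JSON.
        return extract_labels(response.read().decode("utf-8"))


# Hypothetical sample payload; real events may carry more fields.
sample = '[{"label": "Biarritz-Perpignan"}, {"label": "Kenya-France"}]'
print(extract_labels(sample))
```

Calling `fetch_labels()` performs the actual request; keeping the parsing in `extract_labels` makes it possible to check that part without network access.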
