
Beautiful Soup parsing web page

I am trying to scrape the following web page with BS: https://www.racingpost.com. For example, I want to extract all the course names. Course names are under this tag:

<span class="rh-cardsMatrix__courseName">Wincanton</span>

My code is here:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.racingpost.com"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
pages = soup.find_all('span',{'class':'rh-cardsMatrix__courseName'})
for page in pages:
    print(page.text)

And I don't get any output. I think it has some issue with parsing, and I have tried all the parsers available for BS. Could someone advise here? Is it even possible to do with BS?

The data you are looking for seems to be hidden in a script block at the end of the raw HTML.

You can try something like this:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from pandas import json_normalize

url = 'https://www.racingpost.com'
res = requests.get(url).text

# cut the courses array out of the preloaded-state blob by splitting on the
# text just before and just after it, then parse that slice as JSON
raw = res.split('cardsMatrix":{"courses":')[1].split(',"date":"2020-03-06","heading":"Tomorrow\'s races"')[0]
data = json.loads(raw)
df = json_normalize(data)

Output:

id  abandoned   allWeather  surfaceType     colour  name    countryCode     meetingUrl  hashName    meetingTypeCode     races
0   1083    False   True    Polytrack   3   Chelmsford  GB  /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw   Flat    [{'id': 753047, 'abandoned': False, 'result': ...
1   1212    False   False       4   Ffos Las    GB  /racecards/1212/ffos-las/2020-03-06     ffos-las    Jumps   [{'id': 750498, 'abandoned': False, 'result': ...
2   1138    False   True    Polytrack   11  Dundalk     IRE     /racecards/1138/dundalk-aw/2020-03-06   dundalk-aw  Flat    [{'id': 753023, 'abandoned': False, 'result': ...
3   513     False   True    Tapeta  5   Wolverhampton   GB  /racecards/513/wolverhampton-aw/2020-03-06  wolverhampton-aw    Flat    [{'id': 750658, 'abandoned': False, 'result': ...
4   565     False   False       0   Jebel Ali   UAE     /racecards/565/jebel-ali/2020-03-06     jebel-ali   Flat    [{'id': 753155, 'abandoned': False, 'result': ...
5   206     False   False       0   Deauville   FR  /racecards/206/deauville/2020-03-06     deauville   Flat    [{'id': 753186, 'abandoned': False, 'result': ...
6   54  True    False       1   Sandown     GB  /racecards/54/sandown/2020-03-06    sandown     Jumps   [{'id': 750510, 'abandoned': True, 'result': F...
7   30  True    False       2   Leicester   GB  /racecards/30/leicester/2020-03-06  leicester   Jumps   [{'id': 750501, 'abandoned': True, 'result': F...
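
Since the question only asks for the course names, they can be read straight off the name column of that frame (a small follow-up sketch, not part of the original answer):

# course names only, taken from the DataFrame built above
print(df["name"].tolist())
# ['Chelmsford', 'Ffos Las', 'Dundalk', 'Wolverhampton', 'Jebel Ali', ...]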

Caveat: Be aware that you have to search the raw HTML manually to find the string on which to split res at the end.
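
If you would rather not hunt for that trailing string, one option (not from the original answer) is to parse only the JSON value that starts right after the marker, using json.JSONDecoder.raw_decode, which stops on its own at the end of the array. This assumes, as the split above implies, that the courses array begins immediately after cardsMatrix":{"courses": in the raw HTML:

import json
import requests
from pandas import json_normalize

res = requests.get('https://www.racingpost.com').text
marker = 'cardsMatrix":{"courses":'
start = res.index(marker) + len(marker)
# raw_decode parses the first complete JSON value and ignores whatever follows,
# so the trailing split string does not need to be known in advance
courses, _ = json.JSONDecoder().raw_decode(res, start)
df = json_normalize(courses)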

Edit: More robust solution.

To grab the whole script block and parse from there, try this code:

url = 'https://www.racingpost.com'
res = requests.get(url).content
soup = BeautifulSoup(res, "html.parser")

# the salient data seems to be in the 20th script block
data = soup.find_all("script")[19].text
clean = data.split('window.__PRELOADED_STATE = ')[1].split(";\n")[0]
clean = json.loads(clean)
clean.keys()

Output:

['stories', 'bookmakers', 'panelTemplate', 'cardsMatrix', 'advertisement']

Then retrieve, for example, the data stored under the key cardsMatrix:

parsed = json_normalize(clean["cardsMatrix"]).courses.values[0]
pd.DataFrame(parsed)

The output is again the same as above (but obtained with the more robust solution):

id  abandoned   allWeather  surfaceType     colour  name    countryCode     meetingUrl  hashName    meetingTypeCode     races
0   1083    False   True    Polytrack   3   Chelmsford  GB  /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw   Flat    [{'id': 753047, 'abandoned': False, 'result': ...
1   1212    False   False       4   Ffos Las    GB  /racecards/1212/ffos-las/2020-03-06     ffos-las    Jumps   [{'id': 750498, 'abandoned': False, 'result': ...
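
From there you can also build full racecard links, since meetingUrl holds a relative path (again a follow-up sketch, not part of the original answer):

df2 = pd.DataFrame(parsed)
# meetingUrl is relative, so prefix the site root
print(("https://www.racingpost.com" + df2["meetingUrl"]).tolist())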

Viewing the source code of https://www.racingpost.com, no elements have the class name rh-cardsMatrix__courseName. Querying for it on the page shows that it does exist once the page is rendered. This suggests that the elements with that class name are generated with JavaScript, which BeautifulSoup doesn't support (it doesn't run JavaScript).

Instead, you'll want to find the endpoints on the web page that return the data used to create those elements (e.g., look for XHRs in the browser's network tools) and use those to get the data you need.
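
For example, with the browser dev tools open on the Network tab, look for the XHR/fetch request that returns the card data and call that URL directly with requests. The endpoint below is only a placeholder to show the shape of such a request; it is not a real or documented racingpost.com API path:

import requests

# NOTE: placeholder endpoint - replace it with the real XHR URL found in the
# browser's Network tab; racingpost.com does not document a public API here
endpoint = "https://www.racingpost.com/some/xhr/endpoint"
resp = requests.get(endpoint, headers={"User-Agent": "Mozilla/5.0"})
data = resp.json()  # such endpoints usually return JSON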

Thanks mattbasta for your answer; it directed me to this question, which solved my problem: soup = BeautifulSoup(data, "html.parser") pages = soup.find_all('span',{'class':'rh-cardsMatrix__courseName'})

PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages
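
The linked question loads the page with PyQt so that its JavaScript can run before parsing. The same idea, sketched here with Selenium instead of PyQt (Selenium is not mentioned in the thread, so treat this as an assumption), lets the original selector work unchanged:

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser (here Chrome; a chromedriver setup is assumed),
# so the JavaScript that builds the rh-cardsMatrix__courseName spans has run
# before the HTML is handed to BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.racingpost.com")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for span in soup.find_all("span", {"class": "rh-cardsMatrix__courseName"}):
    print(span.text)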
