
Python Web Scraping: Beautiful Soup

I have a problem scraping a web page. I'm trying to get the point difference (e.g. +2, +1, ...) between two teams, but when I apply the find_all method it returns an empty list:

from bs4 import BeautifulSoup
from requests import get

url = 'https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

# Returns an empty list
html_soup.find_all('span', class_='match-history-diff-score-inc')

The problem is that the web content is generated dynamically through JavaScript. requests cannot execute JavaScript, so it never sees that content; you'd be better off using something like Selenium.

EDIT: Per @λuser's suggestion, I've modified my answer to use only Selenium, locating the elements you're looking for by XPath. Note that I used the XPath function starts-with() to match both match-history-diff-score-dec and match-history-diff-score-inc. Selecting only one of them would make you miss almost half of the relative score updates. This is why the output contains 103 results instead of 56.

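from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1")

# Match both match-history-diff-score-inc and match-history-diff-score-dec
spans = driver.find_elements(By.XPATH, '//td//span[starts-with(@class, "match-history-diff-score-")]')

# Collect the inner HTML (the score difference) of each matching span
results = [span.get_attribute('innerHTML') for span in spans]
print(results)

driver.quit()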

This outputs:

['+2', '+1', '+2', '+2', '+1', '+2', '+4', '+2', '+2', '+4', '+7', '+5', '+8', '+5', '+7', '+5', '+3', '+2', '+5', '+3', '+5', '+3', '+5', '+6', '+4', '+6', '+7', '+6', '+5', '+2', '+4', '+2', '+5', '+7', '+6', '+8', '+5', '+3', '+1', '+2', '+1', '+4', '+7', '+5', '+8', '+6', '+9', '+11', '+10', '+9', '+11', '+9', '+10', '+11', '+9', '+7', '+5', '+3', '+2', '+1', '+3', '+1', '+3', '+2', '+1', '+3', '+2', '+4', '+1', '+2', '+3', '+6', '+3', '+5', '+2', '+1', '+1', '+2', '+4', '+3', '+2', '+4', '+1', '+3', '+5', '+7', '+5', '+8', '+7', '+6', '+5', '+4', '+1', '+4', '+6', '+9', '+7', '+9', '+7', '+10', '+11', '+12', '+10']
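
The starts-with() prefix match used above can also be tried without a browser. The following is a minimal sketch, not part of the original answer: it runs Python's standard-library html.parser over a made-up HTML fragment and applies the same prefix test with str.startswith. SAMPLE_HTML, DiffScoreParser, and extract_diffs are hypothetical names chosen for illustration, and the sketch assumes the matching spans are not nested.

```python
from html.parser import HTMLParser

PREFIX = "match-history-diff-score-"

# Made-up fragment for illustration; the real page markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td><span class="match-history-diff-score-inc">+2</span></td></tr>
  <tr><td><span class="match-history-diff-score-dec">+1</span></td></tr>
  <tr><td><span class="other">ignored</span></td></tr>
</table>
"""


class DiffScoreParser(HTMLParser):
    """Collect the text of <span> tags whose class starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self._capturing = False  # True while inside a matching span
        self.results = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "span" and css_class.startswith(PREFIX):
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.results.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._capturing = False


def extract_diffs(html):
    """Return the text of every span matched by the prefix test."""
    parser = DiffScoreParser()
    parser.feed(html)
    parser.close()
    return parser.results


print(extract_diffs(SAMPLE_HTML))
```

Matching on the shared prefix picks up both the -inc and -dec spans, while the span with an unrelated class is skipped.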

Selenium might solve your problem, but I suggest you inspect the network traffic in your browser and find the request that returns the data you need. In your case it was d_mh_Q942gje8_es_1.

I prefer not to use Selenium because it is heavy and makes your script slow. It was built for automated testing, not web scraping.

Here is my script using requests, which should run considerably faster than Selenium.

import requests
from bs4 import BeautifulSoup

# Feed endpoint found in the browser's network tab
url = 'https://d.mismarcadores.com/x/feed/d_mh_Q942gje8_es_1'

# The x-fsign header value was copied from the browser's request
r = requests.get(url, headers={'x-fsign': 'SW9D1eZo'})
soup = BeautifulSoup(r.text, 'html.parser')

# Collect the text of every span with the -inc class
diff_list = [diff.text for diff in soup.find_all('span', {'class': 'match-history-diff-score-inc'})]
print(diff_list)

Output:

['+2', '+1', '+2', '+2', '+2', '+4', '+2', '+4', '+7', '+8', '+7', '+5', '+5', '+5', '+6', '+6', '+7', '+4', '+5', '+7', '+8', '+1', '+2', '+4', '+7', '+8', '+9', '+11', '+11', '+10', '+11', '+1', '+3', '+3', '+3', '+4', '+2', '+3', '+6', '+5', '+1', '+1', '+2', '+4', '+4', '+3', '+5', '+7', '+8', '+4', '+6', '+9', '+9', '+10', '+11', '+12']
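
The feed name d_mh_Q942gje8_es_1 contains the same match ID (Q942gje8) that appears in the page URL. As a rough sketch only, the feed URL could be derived from the page URL as shown below. This assumes the pattern seen for this one match holds for other matches, which the thread does not confirm; FEED_TEMPLATE, match_id_from_page_url, and feed_url_from_page_url are hypothetical names.

```python
from urllib.parse import urlparse

# ASSUMPTION: pattern inferred from the single feed URL in this thread.
FEED_TEMPLATE = "https://d.mismarcadores.com/x/feed/d_mh_{match_id}_es_1"


def match_id_from_page_url(page_url):
    """Return the path segment that follows 'partido', e.g. 'Q942gje8'."""
    segments = [s for s in urlparse(page_url).path.split("/") if s]
    if len(segments) < 2 or segments[0] != "partido":
        raise ValueError(f"unexpected page URL: {page_url!r}")
    return segments[1]


def feed_url_from_page_url(page_url):
    """Build the (assumed) feed URL for the match shown on page_url."""
    return FEED_TEMPLATE.format(match_id=match_id_from_page_url(page_url))


page_url = "https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1"
print(feed_url_from_page_url(page_url))
```

If the site changes its feed naming, check the browser's network tab again rather than relying on the template.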

If you check the page source (for example via view-source: in Chrome or Firefox, or by writing your HTML string to a file), you'll see that the element you are looking for (search for match-history-diff-score-inc) is not there. In fact, the values are loaded dynamically using JavaScript.
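
The same check can be done in code before reaching for a parser. This is a minimal sketch using two made-up HTML strings (STATIC_HTML stands in for what a plain HTTP request returns, RENDERED_HTML for what the browser shows after the scripts have run); both strings and the helper name is_in_static_html are hypothetical.

```python
MARKER = "match-history-diff-score-inc"

# Made-up examples; the real page markup may differ.
STATIC_HTML = (
    '<html><body><div id="detail"></div>'
    '<script src="app.js"></script></body></html>'
)
RENDERED_HTML = (
    '<html><body><div id="detail">'
    '<span class="match-history-diff-score-inc">+2</span>'
    '</div></body></html>'
)


def is_in_static_html(html, marker):
    """Return True if the marker text occurs anywhere in the raw HTML."""
    return marker in html


print(is_in_static_html(STATIC_HTML, MARKER))
print(is_in_static_html(RENDERED_HTML, MARKER))
```

With a real page, pass response.text as the first argument; if the marker is missing there, find_all has nothing to match.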
