Python Web Scraping: Beautiful Soup

Question

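I'm having trouble scraping a web page. I am trying to get the point difference between the two teams (e.g. +2, +1, ...), but when I apply the find_all method it returns an empty list...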

from bs4 import BeautifulSoup
from requests import get
url='https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1'
response=get(url)
html_soup=BeautifulSoup(response.text,'html.parser')


html_soup.find_all('span',class_='match-history-diff-score-inc')

Answer 1

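The problem you are running into is that the page content is generated dynamically by JavaScript. requests only downloads the initial HTML and does not execute scripts, so you are better off using something like Selenium.

Edit: Per @λuser's suggestion, I have modified my answer to use only Selenium, locating the elements you are looking for by XPath. Note that I use the XPath function starts-with() to match both match-history-diff-score-dec and match-history-diff-score-inc. Selecting only one of them would make you miss almost half of the relative score updates. That is why the output has 103 results instead of 56.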

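from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1")

# Match both the -inc and -dec spans through their shared class prefix.
# find_elements(By.XPATH, ...) is used because Selenium 4 removed find_elements_by_xpath.
table = driver.find_elements(By.XPATH, '//td//span[starts-with(@class, "match-history-diff-score-")]')

# Collect each span's inner HTML so the final print shows the whole list.
results = []
for tag in table:
    results.append(tag.get_attribute('innerHTML'))
print(results)

driver.quit()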

This outputs:

['+2', '+1', '+2', '+2', '+1', '+2', '+4', '+2', '+2', '+4', '+7', '+5', '+8', '+5', '+7', '+5', '+3', '+2', '+5', '+3', '+5', '+3', '+5', '+6', '+4', '+6', '+7', '+6', '+5', '+2', '+4', '+2', '+5', '+7', '+6', '+8', '+5', '+3', '+1', '+2', '+1', '+4', '+7', '+5', '+8', '+6', '+9', '+11', '+10', '+9', '+11', '+9', '+10', '+11', '+9', '+7', '+5', '+3', '+2', '+1', '+3', '+1', '+3', '+2', '+1', '+3', '+2', '+4', '+1', '+2', '+3', '+6', '+3', '+5', '+2', '+1', '+1', '+2', '+4', '+3', '+2', '+4', '+1', '+3', '+5', '+7', '+5', '+8', '+7', '+6', '+5', '+4', '+1', '+4', '+6', '+9', '+7', '+9', '+7', '+10', '+11', '+12', '+10']
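
The prefix match that starts-with() performs can be tried offline, without a browser. The following is only a minimal sketch using the standard library's html.parser; the class name DiffScoreCollector, the helper collect() and the SAMPLE fragment are made up for illustration, and the real page markup may differ. It shows why matching only the -inc class returns fewer entries than matching the shared prefix.

```python
from html.parser import HTMLParser


class DiffScoreCollector(HTMLParser):
    """Collect the text of <span> tags whose class attribute starts with a prefix.

    Hypothetical helper for illustration; nested spans are not handled.
    """

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.results = []
        self._inside = False  # True while we are inside a matching <span>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        css_class = dict(attrs).get("class") or ""
        if tag == "span" and css_class.startswith(self.prefix):
            self._inside = True
            self.results.append("")

    def handle_data(self, data):
        if self._inside:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


# Made-up fragment (assumption): one -inc span, one -dec span, one unrelated span.
SAMPLE = (
    '<table><tr>'
    '<td><span class="match-history-diff-score-inc">+2</span></td>'
    '<td><span class="match-history-diff-score-dec">+1</span></td>'
    '<td><span class="other">x</span></td>'
    '</tr></table>'
)


def collect(html, prefix):
    """Parse html and return the text of every span matching the class prefix."""
    parser = DiffScoreCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.results


both = collect(SAMPLE, "match-history-diff-score-")
inc_only = collect(SAMPLE, "match-history-diff-score-inc")
print(both)
print(inc_only)
```

On this fragment the shared prefix picks up both spans, while the full -inc class name picks up only one, which mirrors the 103 versus 56 difference described above.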

Answer 2

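Selenium may solve your problem, but I suggest opening the Network tab in your browser's developer tools and finding the request that returns the data you need. In your case, it is d_mh_Q942gje8_es_1.

I don't like Selenium because it is heavy and makes your script slow. It was built for automated testing, not for web scraping.

Here is my script using requests, which will undoubtedly run faster than Selenium.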

import requests
from bs4 import BeautifulSoup

url = 'https://d.mismarcadores.com/x/feed/d_mh_Q942gje8_es_1'

r = requests.get(url, headers={'x-fsign':'SW9D1eZo'}) # Got this from browser
soup = BeautifulSoup(r.text, 'html.parser')
diff_list = [diff.text for diff in soup.find_all('span',{'class' : 'match-history-diff-score-inc'})]
print(diff_list)

Output:

['+2', '+1', '+2', '+2', '+2', '+4', '+2', '+4', '+7', '+8', '+7', '+5', '+5', '+5', '+6', '+6', '+7', '+4', '+5', '+7', '+8', '+1', '+2', '+4', '+7', '+8', '+9', '+11', '+11', '+10', '+11', '+1', '+3', '+3', '+3', '+4', '+2', '+3', '+6', '+5', '+1', '+1', '+2', '+4', '+4', '+3', '+5', '+7', '+8', '+4', '+6', '+9', '+9', '+10', '+11', '+12']

Answer 3

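If you inspect the page source (for example via view-source: in Chrome or Firefox, or by writing the HTML string to a file), you will see that the element you are looking for (search for match-history-diff-score-inc) is not there. In fact, that data is loaded dynamically with JavaScript.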
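
The same check can be done in code instead of by eye. This is a rough sketch, not a definitive test: count_marker is a hypothetical helper, and the two strings below are made-up stand-ins for a static page shell and for rendered markup, not the site's actual HTML.

```python
def count_marker(html, marker="match-history-diff-score-inc"):
    """Return how many times `marker` occurs in the raw HTML string."""
    return html.count(marker)


# Made-up examples (assumption): a shell that only loads a script,
# and a fragment as it might look after JavaScript has filled it in.
static_shell = '<div id="detail"></div><script src="app.js"></script>'
rendered = '<span class="match-history-diff-score-inc">+2</span>'

print(count_marker(static_shell))
print(count_marker(rendered))
```

Applied to the question's code, count_marker(response.text) returning 0 would suggest that the spans are not in the HTML that requests downloaded, so find_all has nothing to match.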

3 answers

Answer 1 7 2018-05-13 18:12:18
Answer 2 7 2018-05-13 18:31:04
Answer 3 2 2018-05-13 18:10:46
