BeautifulSoup: extracting data from a table

Question

I am trying to extract data from the "Four Factors" table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I am having trouble getting to the table. I have tried

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

div = soup.find('div', id='all_four_factors')

Then, when I try to pull the rows with tr = div.find_all('tr'), I get nothing back.

Answer 1

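I looked at the HTML you are trying to scrape, and the problem is that the tags you want are all inside a comment. BeautifulSoup treats the contents of a comment as a plain string rather than as parsed HTML. So what you need to do is get the comment's contents and feed that string back into BeautifulSoup: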

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div', id='all_four_factors')

# Get everything in here that's a comment
comments = div.find_all(string=lambda text: isinstance(text, Comment))

# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:

    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if the first character inside is '<'.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)

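One caveat: BeautifulSoup will assume that the commented-out code is valid, well-formed HTML. It worked for me, though, so as long as the page stays relatively unchanged it should keep working.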
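
The same idea can be tried offline on a small hand-written snippet. The sketch below is only an illustration: the SAMPLE_HTML string, its cell values, and the helper name rows_from_commented_table are made up for this example and are not taken from the real page. It assumes the table sits inside a comment under a div with a known id.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_four_factors">
  <div class="placeholder"></div>
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
    <tr><td>BBB</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, div_id):
    """Return the cell text, row by row, of a table hidden inside a comment."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return []
    rows = []
    for comment in div.find_all(string=lambda s: isinstance(s, Comment)):
        if "<tr" not in comment:
            continue  # an ordinary comment with no table rows in it
        # Re-parse the comment text so its markup becomes real tags.
        inner = BeautifulSoup(str(comment), "html.parser")
        for tr in inner.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            rows.append(cells)
    return rows


print(rows_from_commented_table(SAMPLE_HTML, "all_four_factors"))
```

Returning plain lists of strings keeps the result easy to feed into csv.writer or a DataFrame later; whether the real page needs extra cleanup is not something this toy input can show.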

Answer 2

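If you look at list(div.children)[5], which is the only child that contains tr as a substring, you will see that it is a Comment object, so technically there are no tr elements under that div node. div.find_all('tr') is therefore expected to be empty.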
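
A quick way to see this for yourself is to print the type of each child. The snippet below is a rough sketch on a made-up, one-line miniature of the div (the placeholder child and the single cell are assumptions, and the child index will differ from the real page):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the div: one real child tag and one comment holding the table.
html = (
    '<div id="all_four_factors">'
    '<div class="placeholder"></div>'
    '<!-- <table><tr><td>x</td></tr></table> -->'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="all_four_factors")

# Show each child's type and whether the substring "tr" appears in it.
for child in div.children:
    print(type(child).__name__, "tr" in str(child))

# Only Comment children carry the table markup; no tr tag exists in the tree.
comment_children = [c for c in div.children if isinstance(c, Comment)]
print(len(comment_children))
print(div.find_all("tr"))
```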

Answer 3

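Why are you doing this:

div = soup.find('div',id='all_four_factors')

That gets the following line and tries to search for "tr" tags inside it.

<div id="all_four_factors" class="table_wrapper floated setup_commented commented">

You can just use the original soup variable from the first part and do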

tr = soup.find_all('tr')
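
Whether this helps depends on where the rows live. As a rough check on a toy page (the markup and the line_score id below are made up for illustration), searching the whole soup returns rows from tables in the live markup but still skips rows that sit inside a comment:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one table in the live markup, one wrapped in a comment.
html = """
<table id="line_score"><tr><td>live row</td></tr></table>
<div id="all_four_factors"><!-- <table><tr><td>commented row</td></tr></table> --></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all on the whole document only sees tags that were actually parsed.
all_rows = [tr.get_text(strip=True) for tr in soup.find_all("tr")]
print(all_rows)
```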


3 answers

Answer 1
3 accepted 2018-11-14 00:47:30

Answer 2
2 2018-11-14 00:37:19

Answer 3
0 2018-11-14 00:28:58
