Beautiful Soup can't find tags

Question

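I'm currently practicing with the requests and BeautifulSoup modules in Python 3.6, and I have run into an issue that I can't seem to find any information on in other questions and answers.

It seems that at some point in the page, Beautiful Soup stops recognizing tags and ids. I am trying to pull play-by-play data from a page like this: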

http://www.pro-football-reference.com/boxscores/201609080den.htm

import requests, bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: '+source_url)

soup = bs4.BeautifulSoup(res.text,'html.parser')

#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))

#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))

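Using the inspector in Chrome, I can see that the table definitely exists. I have also tried searching for 'div' and 'tr' elements in the later part of the HTML, and that doesn't work either. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.

Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from finding the later tags? I have run into the same issue with similar pages run by this company (hockey-reference.com, basketball-reference.com), but I have been able to use these tools properly on other sites.

If it is something with the HTML, is there a better tool or library for extracting this information?

Thanks for your help, BF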

Answer 1

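BS4 cannot execute a page's JavaScript; all it sees is the HTML returned by the GET request to the URL. I think the table in question is loaded asynchronously by client-side JavaScript.

So the client-side JavaScript needs to run first, before you scrape the HTML. This post describes how to do that!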
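Before bringing in a browser-driven tool, it may be worth checking whether the table is in the downloaded HTML at all. Here is a rough sketch using only the standard library. The names IdLocator and locate_id are made up for illustration, and the sample string is a stand-in for res.text, not the real page:

```python
from html.parser import HTMLParser


class IdLocator(HTMLParser):
    """Record whether an element id appears as a real tag or only inside a comment."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.in_markup = False
        self.in_comment = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if dict(attrs).get('id') == self.element_id:
            self.in_markup = True

    def handle_comment(self, data):
        # Comment text is never parsed into tags, so fall back to a substring check
        if 'id="%s"' % self.element_id in data:
            self.in_comment = True


def locate_id(html, element_id):
    """Return 'markup', 'comment', or 'absent' for the given id (hypothetical helper)."""
    parser = IdLocator(element_id)
    parser.feed(html)
    parser.close()
    if parser.in_markup:
        return 'markup'
    if parser.in_comment:
        return 'comment'
    return 'absent'


# Made-up sample standing in for res.text
sample = '<div id="all_pbp"><!-- <table id="pbp"><tr><td>kickoff</td></tr></table> --></div>'
print(locate_id(sample, 'pbp'))
print(locate_id(sample, 'all_pbp'))
print(locate_id(sample, 'missing'))
```

If the id comes back as 'absent', the element is probably being added by JavaScript after the page loads. If it comes back as 'comment', it is already in the response and no JavaScript needs to run to get at it.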

Answer 2

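OK, I see what the problem is. You are trying to parse content that sits inside an HTML comment, not ordinary HTML elements. For this case you should use BeautifulSoup's Comment class, like this: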

import requests
from bs4 import BeautifulSoup,Comment

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: '+source_url)

soup = BeautifulSoup(res.content,'html.parser')

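# Collect every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

play_to_play = None  # stays None if no comment contains the table
for comment in comments:
    # Re-parse the comment's text as HTML so the tags inside it become searchable
    comment_soup = BeautifulSoup(str(comment), 'html.parser')
    search_play = comment_soup.find('table', {'id': 'pbp'})
    if search_play:
        play_to_play = search_play
        break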
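Once the table is found, pulling the rows out is ordinary BeautifulSoup work. Here is a small self-contained sketch of the same idea. The sample HTML and its column names are invented for illustration and are not taken from the real page, and table_rows_from_comments is a hypothetical helper name:

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for res.content; the real page is much larger
sample_html = """
<div id="all_pbp">
<!--
<table id="pbp">
  <tr><th>Quarter</th><th>Detail</th></tr>
  <tr><td>1</td><td>Kickoff</td></tr>
  <tr><td>1</td><td>Pass complete</td></tr>
</table>
-->
</div>
"""


def table_rows_from_comments(html, table_id):
    """Return the rows of a commented-out table as lists of cell text (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # The comment body is plain text to the outer parser, so parse it again
        inner = BeautifulSoup(str(comment), 'html.parser')
        table = inner.find('table', {'id': table_id})
        if table:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
                for row in table.find_all('tr')
            ]
    return []  # no comment contained a table with that id


rows = table_rows_from_comments(sample_html, 'pbp')
for row in rows:
    print(row)
```

Returning an empty list when nothing matches lets the caller handle the "table not found" case without a NameError.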

Solution 1
3 (accepted) 2017-07-02 15:23:44

Solution 2
0 2017-07-02 05:54:54
