BeautifulSoup soup.select() returns an empty list for a CSS selector

Question

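I am trying to parse some links from the site https://news.ycombinator.com/

I want to select a specific table:

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I am aware of bs4's CSS selector limitations. The problem is that I can't even select something as simple as #hnmain > tbody with soup.select('#hnmain > tbody'), because it returns an empty list.

With the code below I cannot select tbody, whereas with JS I can (screenshot).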

from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)

Output:

soup=BeautifulSoup(html)
[]

Answer 1

Instead of going through the body and the table, why not go straight to the links? I tested this and it works well:

links = soup.find_all('a', {'class': 'storylink'})

If you want the table, since there is only one per page, you don't need to go through the other elements; you can go directly to it.

table = soup.select('table')
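
Keep in mind that `select()` returns a list of every match, nested tables included, while `select_one()` returns the first match or `None`. A small sketch of the difference, assuming a made-up `SAMPLE_HTML` snippet (invented for illustration, not the live page):

```python
from bs4 import BeautifulSoup

# Invented markup for illustration: an outer table containing one inner table.
SAMPLE_HTML = """
<table id="hnmain">
  <tr><td>
    <table class="itemlist"><tr><td>row</td></tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

all_tables = soup.select("table")          # list of every <table>, in document order
outer = soup.select_one("#hnmain")         # a single Tag, or None if nothing matches
inner = soup.select_one("#hnmain table")   # descendant selector reaches the nested table

print(len(all_tables))
print(outer["id"])
print(inner["class"])
```

If the page holds more than one table, selecting by id or class is likely safer than indexing into the list returned by `select('table')`.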

Answer 2

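The HTML I get back from BeautifulSoup, or from a curl script, does not contain a tbody tag. The tbody shown in the browser's inspector is inserted by the browser when it builds the DOM; it is not in the raw markup. This means that

soup.select('tbody')

returns an empty list. That is the same reason you are getting an empty list.

To extract the links you are looking for, just do

soup.select("a.storylink")

It will fetch the links you want from the site.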
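
The missing tbody can be reproduced offline. The sketch below is only an illustration: `SAMPLE_HTML` is a hand-written stand-in for the raw page source (an assumption, not the real markup), parsed with the built-in `html.parser`, which does not add the implied tbody.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
SAMPLE_HTML = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td>
    <table class="itemlist">
      <tr><td><a class="storylink" href="https://example.com">Story</a></td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The selector copied from the browser fails: tbody is not in the source.
with_tbody = soup.select("#hnmain > tbody")

# Dropping tbody from the selector matches the nested table.
without_tbody = soup.select("#hnmain > tr:nth-child(3) > td > table")

# Selecting the links directly avoids depending on the table layout at all.
links = [a["href"] for a in soup.select("a.storylink")]

print(with_tbody)
print(len(without_tbody))
print(links)
```

So a selector copied from the browser's dev tools usually needs its `tbody` steps removed before it is passed to `soup.select()`. A parser that follows the HTML5 tree-building rules, such as `html5lib`, should insert tbody the way a browser does, but that depends on which parser is installed and chosen.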

Answer 3

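The data is arranged in groups of three rows, where the third row is an empty spacer. Loop over the top rows and use next_sibling to get the associated second row for each one. Requires BS4 4.7.1+.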

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
# Each group of rows starts with a row of class "athing"
top_rows = soup.select('.athing')

for row in top_rows:
    # First row of the group: title, link and source site
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    # Second row of the group: score, user, age and the sixth-child link
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
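
With some parsers the whitespace between rows is kept as a text node, so `next_sibling` can return a string instead of the next row. A minimal sketch of a more defensive variant, assuming a hand-written `SAMPLE_ROWS` snippet that mimics the three-row grouping described above (the class names come from this answer; the markup itself is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented markup: title row, detail row, spacer row, repeated twice.
SAMPLE_ROWS = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span> by <a class="hnuser">alice</a></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
  <tr><td class="subtext"><span class="score">5 points</span> by <a class="hnuser">bob</a></td></tr>
  <tr class="spacer"></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_ROWS, "html.parser")

stories = []
for row in soup.select(".athing"):
    title = row.select_one(".storylink")
    # find_next_sibling skips whitespace text nodes and returns the next <tr> tag.
    detail = row.find_next_sibling("tr")
    score = detail.select_one(".score") if detail else None
    user = detail.select_one(".hnuser") if detail else None
    stories.append({
        "title": title.get_text(strip=True),
        "href": title["href"],
        # Guard against rows that lack a score or a user element.
        "score": score.get_text(strip=True) if score else None,
        "user": user.get_text(strip=True) if user else None,
    })

for story in stories:
    print(story)
```

Checking each `select_one()` result for `None` before reading `.text` avoids an `AttributeError` when a row is missing one of the expected elements.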

Answer 1
1 2019-10-20 04:46:08

Answer 2
1 2019-10-20 05:09:43

Answer 3
1 2019-10-20 08:31:07
