Web scraping: unable to loop into div elements by class to get text and URL

Question

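I am trying to scrape a website that uses divs and classes, and I want to get the content inside them.

I can get the correct data, but when I put it in a loop I get an error.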

html = BeautifulSoup(response, 'html.parser')
post_list = html.find_all('div', class_='eodLhs')
print(post_list)
i = 0

for values in post_list:
     url_json = {'title': values.ul.li[i].a.text, 'url': values.ul.li[i].a['href']}
     names.append(values.ul.li[i].a.text)
i = i+1

Output of the print statement is: https://gist.github.com/parikhparth23/48669444506502f11409d43b30a4250d

It throws an error on this line:

url_json = {'title': values.ul.li[i].a.text, 'url': values.ul.li[i].a['href']}

I want to get the text and URL after scraping.

Answer 1

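Based on your gist, I think you can just use a CSS selector, which ensures you only pick up child elements with an href inside the parent class. In your existing code, the i increment should happen inside the loop, but if you rewrite it as I describe, the counter isn't needed at all. Use the attribute-value "starts with" operator (^=) to drop the share links, since I suspect you only want the original links to the content.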

# `soup` is the parsed BeautifulSoup object (named `html` in the question).
for i in soup.select(".eodLhs [href^='/']"):
    print({i.text: i['href']})
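
For anyone who wants to try the selector approach without the original page, here is a minimal, self-contained sketch. The markup in `SAMPLE_HTML`, the link paths, and the helper name `extract_links` are all made up for illustration; the real structure is whatever the gist shows, so treat this as an approximation rather than the exact fix for that page. It collects each match into the `{'title': ..., 'url': ...}` shape the question was building.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page (assumption, not the real gist).
SAMPLE_HTML = """
<div class="eodLhs">
  <ul>
    <li><a href="/posts/first">First post</a>
        <a href="https://twitter.com/share?u=first">Share</a></li>
    <li><a href="/posts/second">Second post</a>
        <a href="https://twitter.com/share?u=second">Share</a></li>
  </ul>
</div>
<div class="other"><a href="/ignored">Outside the target div</a></div>
"""


def extract_links(markup):
    """Return title/url dicts for site-relative links inside .eodLhs elements."""
    soup = BeautifulSoup(markup, "html.parser")
    # Descendant selector: any element under .eodLhs whose href starts with "/".
    # Absolute share links (https://...) do not match the ^= test and are skipped.
    return [
        {"title": a.get_text(strip=True), "url": a["href"]}
        for a in soup.select(".eodLhs [href^='/']")
    ]


links = extract_links(SAMPLE_HTML)
for entry in links:
    print(entry)
```

As a side note on the original error: in Beautiful Soup, `tag[key]` looks up an HTML attribute on that tag, so `values.ul.li[i]` does not select the i-th list item. That is most likely why the line in the question fails, and it is another reason to let the selector do the iteration.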

Solution 1
1 Accepted 2019-10-26 05:53:38
