Scraping a website whose HTML uses repeated div class names

Question

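I am currently working through the HTML of baka-update in order to scrape it. However, the div class names are repeated.

Since my goal is CSV or JSON output, I want to use the information in [sCat] as the column names and [sContent] as the stored values..... Is there a way to scrape this kind of website?

Thanks,

Sample: https://www.mangaupdates.com/series.html?id=75363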


from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')

print('sCat: ', sCat)
print('sContent: ', sContent)

I tried that, but I could not find it. @Jasper Nichol M Fabella

Answer 1

I tried editing your code and got the output below. Maybe it will help.


from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)

# Select the div elements that hold the column names
sCat = tree.xpath('//div[@class="sCat"]')
# Select the div elements that hold the actual data
sContent = tree.xpath('//div[@class="sContent"]')

print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}

# Pair each sCat with the sContent at the same position
for i in range(len(sCat)):
    # itertext() also collects text inside nested tags
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)

I got the following output

Hope this helps.
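A possible reason the /text() query in the question returned less than expected is that text() selects only text nodes that are direct children of the div, while itertext() walks the whole subtree. The sketch below illustrates the difference on a hypothetical fragment that is invented for illustration and is not taken from the real page.

```python
from lxml import html

# Hypothetical fragment, invented for illustration; the real page differs.
div = html.fromstring('<div class="sCat"><b>Type</b> of series</div>')

# text() selects only text nodes that are direct children of the div,
# so the text inside <b> is skipped.
direct = div.xpath('text()')

# itertext() walks the whole subtree, so the text inside <b> is included.
full = ''.join(div.itertext())

print(direct)
print(full)
```

If the labels on the real page are wrapped in nested tags such as <b>, this would explain why the element-based version above recovers them.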

Answer 2

You can use an XPath expression and build an absolute path to the content you want to scrape.
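As a minimal sketch of that idea, the example below spells out an absolute path to the label divs and then picks the nearest following sContent sibling for each label. The markup, the id "main", and the values are assumptions made up for illustration; the path on the real page would have to be read from its actual HTML.

```python
from lxml import html

# Hypothetical markup, invented for illustration; the real page differs.
SAMPLE = """
<html><body>
  <div id="main">
    <div class="sCat"><b>Type</b></div>
    <div class="sContent">Manga</div>
    <div class="sCat"><b>Year</b></div>
    <div class="sContent">2010</div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Absolute path: every step from the document root is spelled out.
cats = tree.xpath('/html/body/div[@id="main"]/div[@class="sCat"]')

record = {}
for cat in cats:
    # The first following sibling with class sContent belongs to this label.
    content = cat.xpath('following-sibling::div[@class="sContent"][1]')
    if content:
        record[cat.text_content().strip()] = content[0].text_content().strip()

print(record)
```

Pairing through following-sibling avoids relying on both lists having the same length, at the cost of assuming each label is immediately followed by its content div.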

Answer 3

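What are you scraping with? If you are using BeautifulSoup, you can use the find_all method with a class filter to search the page for all matching elements and iterate over them. Because class is a reserved word in Python, you use the special "class_" keyword argument.

Something like: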

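import bs4
import requests

# Fetch the page first; this is the sample URL from the question
page = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(page.content, 'html.parser')
cats = soup.find_all('div', class_='sCat')
# do the rest of your logic here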

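Edit: I was typing on my phone from a cached page before you made your edit, so I did not see the change. I now see that you are parsing with the raw lxml library. Yes, that is faster, but I am less familiar with it, since I have only used raw lxml for one project. I think you can chain two search methods to extract the equivalent.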
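For the BeautifulSoup route, one way to finish the logic is to follow find_all with find_next_sibling so that each label is matched to the content div after it. This is only a sketch: the markup and values below are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page differs.
SAMPLE = """
<div id="main">
  <div class="sCat"><b>Type</b></div>
  <div class="sContent">Manga</div>
  <div class="sCat"><b>Year</b></div>
  <div class="sContent">2010</div>
</div>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

record = {}
for cat in soup.find_all('div', class_='sCat'):
    # Look for the next sibling div that carries the sContent class.
    content = cat.find_next_sibling('div', class_='sContent')
    if content is not None:
        record[cat.get_text(strip=True)] = content.get_text(strip=True)

print(record)
```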

Answer 4

Here is an example using the requests and lxml libraries:

from lxml import html
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)

sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
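Since the stated goal is CSV or JSON, the two lists can then be zipped into a single record and serialized with the standard library. The sketch below uses invented stand-in values for sCat and sContent, and assumes the lists line up by position.

```python
import csv
import io
import json

# Stand-ins for the two scraped lists; the values are invented.
sCat = ['Type', 'Year', 'Status']
sContent = ['Manga', '2010', 'Complete']

# zip pairs items by position and stops at the shorter list.
record = dict(zip(sCat, sContent))

# JSON: one object whose keys are the sCat labels.
as_json = json.dumps(record, ensure_ascii=False, indent=2)

# CSV: the sCat labels become the header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

To write to disk instead, the StringIO buffer could be replaced with a file opened with newline='' as the csv module documentation recommends.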

4 answers

Answer 1  1  2019-11-14 07:59:49
Answer 2  0  2019-11-14 07:25:39
Answer 3  0  2019-11-14 07:55:52
Answer 4  0  2019-11-14 08:05:05
