
I cannot extract information from a particular div using web-scraping with Python. What can I do?

I am trying to practice some web scraping. What I want is to extract information (the ID and URL) about past articles from a newspaper, so I have a URL to which I will apply the scraping.

My problem comes when I want to extract information from those articles. No matter which library I use, I cannot reach that information with web scraping, because there is a "div" that does not let me go any deeper in the extraction.

Each article has a class named "searchRecordList Detail_search search_divider clearfix", which stores the image, the URL and other information. All of these articles are also stored inside another div named "divSearchResults". Even so, it does not let me extract from it or loop over it; Python always reads it as empty or something similar.

This is the HTML structure that contains the article information:

 <div id="divSearchResults" class="searchRecordContent"> <div class="searchRecordList Detail_search search_divider clearfix"> <div class="image"> <a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link"> <img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div> <div class="detail"> <div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&amp;pr=10&amp;pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div> <h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3> <div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div> <div class="bottomBtn"> <a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a> </div> <div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div> </div> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> </div>

I have used both BeautifulSoup and an xpath approach, but I cannot reach the article divs.

I have also tried searching for different classes inside each article, without success (classes: detail, result-link).

# First method
# Code
import requests
from bs4 import BeautifulSoup

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
results = soup.find_all("div", class_="searchRecordContent")
print(results)

# Second method
# Code
from lxml import html
import requests

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
page = requests.get(url)
tree = html.fromstring(page.content)
r = tree.xpath('//*[@id="divSearchResults"]')
print(r)
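
Both attempts run without errors but come back with the container essentially empty. A quick way to see what the static (pre-JavaScript) response actually contains is to count the matching nodes directly; this is only a diagnostic sketch reusing the same URL and BeautifulSoup setup as above:

import requests
from bs4 import BeautifulSoup

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# How many article rows and result links are present in the raw HTML returned by requests?
print(len(soup.select("div.searchRecordList")))
print(len(soup.select("a.result-link")))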

This is the expected result, from which I could get the URL and ID of each article:

# Expected
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&amp;pr=10&amp;pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
.... 
### (the same way for the other 9 articles)

So the question is:

How can I access the "searchRecordList Detail_search search_divider clearfix" div of each article using Python?

The content is loaded dynamically, I think from a POST request that may even be asynchronous. One approach is to use Selenium, which lets the JavaScript on the page run. You need an additional wait condition for the content to be present. I wait for one of the elements associated with the loading spinner, with class ajax-loading-block-window, to take on the style attribute value it has once the page has finished loading.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Start Chrome via the local chromedriver executable
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency/')

# Wait until the loading spinner's wrapper takes the style value it has once the results have rendered
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.ajax-loading-block-window[style="height: 100%; display: none;"]')))

# Collect the id and URL of every article from its .result-link anchor
data = [(i.get_attribute('id'), i.get_attribute('href')) for i in d.find_elements_by_css_selector('.result-link')]
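
If the wait condition is met, data ends up as a list of (id, href) pairs, one per .result-link anchor. A minimal way to inspect the result and close the browser afterwards (a sketch following on from the snippet above):

# Print each extracted article id and URL, then shut the browser down
for article_id, article_url in data:
    print(article_id, article_url)

d.quit()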
