
I cannot extract information from a particular div using web-scraping with Python. What can I do?

I am trying to practice some web scraping. What I want is to extract information (the ID and URL) about past articles from a newspaper, so I have a URL to which I will apply the scraping.

My problem comes when I want to extract information from those articles. No matter which library I use, I cannot reach that information with web scraping, because there is a "div" that does not let me go any deeper in the extraction.

Each article has a class named "searchRecordList Detail_search search_divider clearfix", which stores the image, the URL and other information. All of these articles are also stored inside another div named "divSearchResults". Even so, it does not let me extract from it or loop over it; Python always reads it as empty or something similar.

This is the HTML structure that contains the article information:

 <div id="divSearchResults" class="searchRecordContent"> <div class="searchRecordList Detail_search search_divider clearfix"> <div class="image"> <a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link"> <img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div> <div class="detail"> <div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&amp;pr=10&amp;pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div> <h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3> <div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div> <div class="bottomBtn"> <a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a> </div> <div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div> </div> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> <div class="searchRecordList Detail_search search_divider clearfix"> </div> </div>

I have used both BeautifulSoup and an xpath approach, but I cannot reach the article divs.

I have also tried searching for different classes inside each article, without success (classes: detail, result-link).

# First method
# Code
import requests
from bs4 import BeautifulSoup

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
results = soup.find_all("div", class_="searchRecordContent")
print(results)

# Second method
# Code
from lxml import html
import requests

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
page = requests.get(url)
tree = html.fromstring(page.content)
r = tree.xpath('//*[@id="divSearchResults"]')
print(r)
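
Both attempts run without errors but come back with the container essentially empty. A quick way to see what the static (pre-JavaScript) response actually contains is to count the matching nodes directly; this is only a diagnostic sketch reusing the same URL and BeautifulSoup setup as above:

import requests
from bs4 import BeautifulSoup

url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# How many article rows and result links are present in the raw HTML returned by requests?
print(len(soup.select("div.searchRecordList")))
print(len(soup.select("a.result-link")))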

This is the expected result, from which I could get the URL and ID of each article:

# Expected
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&amp;pr=10&amp;pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
.... 
### (the same way for the other 9 articles)

So the question is:

How can I access the "searchRecordList Detail_search search_divider clearfix" div of each article using Python?

The content is loaded dynamically, I think from a POST request that may even be asynchronous. One approach is to use Selenium, which lets the JavaScript on the page run. You need an additional wait condition for the content to be present. I wait for one of the elements associated with the loading spinner, with class ajax-loading-block-window, to take on the style attribute value it has once the page has finished loading.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Start Chrome via the local chromedriver executable
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency/')

# Wait until the loading spinner's wrapper takes the style value it has once the results have rendered
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.ajax-loading-block-window[style="height: 100%; display: none;"]')))

# Collect the id and URL of every article from its .result-link anchor
data = [(i.get_attribute('id'), i.get_attribute('href')) for i in d.find_elements_by_css_selector('.result-link')]
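
If the wait condition is met, data ends up as a list of (id, href) pairs, one per .result-link anchor. A minimal way to inspect the result and close the browser afterwards (a sketch following on from the snippet above):

# Print each extracted article id and URL, then shut the browser down
for article_id, article_url in data:
    print(article_id, article_url)

d.quit()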
