
Web scraping a table with Python just returns an empty list

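I am trying to use Python and BeautifulSoup to scrape all the data in the table, across every page of this website, into a dictionary, as shown in the code below. However, it just returns an empty list.

In addition, each company has its own separate page, and I am trying to scrape those pages into the same dictionary as well.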

from bs4 import BeautifulSoup
import requests 
from pprint import pprint

case_data = []

case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url) 
soup_case = BeautifulSoup(case_page.content, 'html.parser') 
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})

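pprint(case_table)

# The table is filled in by JavaScript, so render the page in a headless
# browser first and let pandas parse the resulting HTML.
from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://masked_per_user_request/")
    time.sleep(2)  # crude fixed wait for the table to finish rendering

    # Wrap the HTML in StringIO: pandas 2.1+ deprecates passing a literal HTML string.
    df = pd.read_html(StringIO(driver.page_source))[0]
    df.to_csv('result.csv', index=False)
finally:
    driver.quit()  # always close the browser, even if parsing fails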

Output: click here
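The fixed `time.sleep(2)` in the snippet above is a guess at how long rendering takes. Selenium ships `WebDriverWait` for waiting on a condition; the block below is only a dependency-free sketch of the same polling idea. The helper name `wait_until` and its defaults are assumptions made for illustration, not part of the original answer.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the `driver` from the Selenium snippet:
# wait_until(lambda: "<table" in driver.page_source, timeout=15)
```

Polling returns as soon as the table appears and fails loudly if it never does, whereas a fixed sleep either wastes time or silently parses a page that is not ready.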

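Note that the data is rendered through an XHR request to a JSON backend. Once you have the XHR URL, you can call it directly with a POST request that includes the JSON body data and the cookies.

Something like the following: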

import requests


data = {
    'message': '{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&amp;C;</span></li></ul><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\"text-align: justify;\"><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p>&nbsp;</p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
    'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
    'aura.pageURI': '/Complaint/s/casetracker',
    'aura.token': 'undefined'
}

# Note: no cookies are sent yet; they still need to be worked out.
r = requests.post("https://masked_per_user_request/", json=data).json()

print(r)
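Since the question asks for every page, it may help to generate the request fields per page instead of hand-editing the long `message` string. The sketch below rebuilds the fields with `json.dumps`, which also avoids escaping quotes by hand. It assumes that the `searchCaseList` action alone is sufficient (the two other actions from the captured request are left out) and that the `fwuid` and `loaded` values copied from the capture are still accepted by the server; `build_payload` is a hypothetical helper name.

```python
import json


def build_payload(page_number, page_size="10"):
    """Build the POST fields for one page of the case list.

    Assumption: only the searchCaseList action is sent; the other two
    actions present in the captured request are omitted.
    """
    message = {
        "actions": [
            {
                "id": "88;a",
                "descriptor": "apex://ComplaintsCaseController/ACTION$searchCaseList",
                "callingDescriptor": "markup://c:CaseList",
                "params": {
                    "searchString": "",
                    "pageNumber": page_number,
                    "defaultPageSize": page_size,
                },
            }
        ]
    }
    # Values copied from the captured request; they may change over time.
    context = {
        "mode": "PROD",
        "fwuid": "5fuxCiO1mNHGdvJphU5ELQ",
        "app": "siteforce:communityApp",
        "loaded": {
            "APPLICATION@markup://siteforce:communityApp": "0luQG4JZE_TU28tAfQgGSA"
        },
        "dn": [],
        "globals": {},
        "uad": False,
    }
    return {
        "message": json.dumps(message),
        "aura.context": json.dumps(context),
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }


# Hypothetical usage, mirroring the call in the snippet above:
# r = requests.post("https://masked_per_user_request/", json=build_payload(2)).json()
```

Looping `page_number` from 1 upward until a page comes back empty would be one way to collect all pages, though the shape of the response is not shown in the original and would need to be inspected first.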

You will need to work out the cookie parameters.
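One possible way to obtain those cookies, offered as an assumption rather than something the original answer states, is to reuse the ones the headless browser already received. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, and `requests` accepts a plain mapping through its `cookies=` argument. The helper name `cookies_for_requests` is hypothetical.

```python
def cookies_for_requests(selenium_cookies):
    """Turn Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Hypothetical usage, with `driver` and `data` from the snippets above.
# Call get_cookies() before driver.quit(), while the session is still open.
# cookies = cookies_for_requests(driver.get_cookies())
# r = requests.post("https://masked_per_user_request/", json=data, cookies=cookies)
```

Whether the endpoint accepts these cookies, or needs additional headers or a token, is something only a test against the real site can confirm.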

