Unable to access pdf document via requests or selenium
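I have a huge list of URLs, each of which loads a different PDF document. Here is one of them: https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0
It will most likely open the site's home page on the first attempt, but if you paste the link again, it opens a PDF document.
I am trying to write a Python script that downloads these documents locally so I can extract their content with tika, but this behavior of opening the home page first defeats everything I have tried.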
import requests
from tika import parser

link = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
resp = requests.get(link)
# Save the response body to disk, then let tika extract the text
with open('metadata.pdf', 'wb') as f:
    f.write(resp.content)
raw = parser.from_file('metadata.pdf', xmlContent=False)
print(raw['content'])
output:
\n\n\n\n\n\n\n\n\n\n \n \t\t\n\n\t\tSkip to Main Content\xa0\xa0\xa0\xa0Logout\xa0\xa0\xa0\xa0My
Account\xa0\xa0\xa0\xa0\t\t\tHelp\n\n\n\n\n\n\n\t\t\t\nSelect a location\nPinellas County\n\n\xa0\nAll Case
Records Search\nCivil, Family Case Records\nCriminal & Traffic Case Records\nProbate Case Records\nCourt
Calendar\n\nAttorney Login\nRegistered User Login\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\n\t\t\
t\xa0\t\n\t\n\t\tClerk of the Circuit Court|Mortgage Foreclosure Sales|Pinellas County Government|Pinellas
County Sheriff's Office|Public Defender|Sixth Judicial Circuit|State of Florida|State Attorney|Self Help
Center|Court Forms|How-To Videos|Florida Courts eFiling Portal Video|Attorney Account Setup|Reports and
Statistics|Terms of Use|Contact UsCopyright 2003 Tyler Technologies. All rights Reserved.\n\t\n\n\n\n
\n
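The text above is the site's home page, so the file saved as metadata.pdf is presumably HTML rather than a PDF. A minimal sketch of a check that could run before handing the file to tika, assuming a real PDF carries the `%PDF-` signature near the start of the file. `looks_like_pdf` is a hypothetical helper name, not part of requests or tika:

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes carry the PDF signature near the start."""
    # PDF files begin with "%PDF-"; a small window is searched because
    # some files have a few stray bytes before the header.
    return b"%PDF-" in content[:1024]


# Possible use with the response from the snippet above:
#     if not looks_like_pdf(resp.content):
#         print("got something other than a PDF, probably the home page")
```

Checking the bytes rather than the file extension catches the case where the server answers with an HTML page under a PDF-looking URL.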
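import requests
from selenium import webdriver

driver = webdriver.Chrome()  # assumption: the browser is not stated; any Selenium WebDriver would do
url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
# Copy the browser's cookies into a requests session and retry the download
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
resp = s.get(url)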
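It does not work, and when I inspect the CookieJar of the response object, it is empty. I have to admit I know very little about how cookies work, and this was only a shot in the dark. What am I misunderstanding here? I appreciate any input.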
# Opens a new window and makes it the working window
def open_window(driver, link):
    driver.execute_script(f"window.open('{link}')")
    new_window = driver.window_handles[-1]
    driver.switch_to.window(new_window)

url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
open_window(driver, url)
# Print the source of the new window
print(driver.page_source)
The output is just this:
<html><head></head><body></body></html>
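After some tinkering, solution #2 (copying the cookies from the Selenium driver into a requests session) worked. However, instead of grabbing the cookies from the driver right after visiting the home page, I had the browser run another query first (this site requires a few extra steps), and only then did I use the cookies. They looked like this: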
[{'domain': 'ccmspa.pinellascounty.org',
'expiry': 1670679832, #this is the time the cookie expires in epoch time
'httpOnly': True,
'name': '.ASPXFORMSPUBLICACCESS',
'path': '/',
'secure': True,
'value': '1DBB1EADBA199D246E84CCE7243202DCA6BBD7E383FE360ECBFC2E6150102C79F3EC2F6B232B85589C51976AF20EF7EBDF52CF74122A7A6E78B4C6F31434C58AB57E10005C41DE019814B704F12B150A0818585E85F0237EFCF1A11B205414325CA1850605FF932BC43CC5B36395488F40D58DA594899C4D62FF3ECCBE729C6BC001194225B6653CB89C1305C7FBCB26E1BCFCFF75476784D24ADFCA0AFF679A3BAA3131'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': True,
'name': 'ASP.NET_SessionId',
'path': '/',
'secure': True,
'value': '24552pqtb1tomjbw2gkzko55'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': False,
'name': 'EDLFDCVM',
'path': '/',
'sameSite': 'None',
'secure': True,
'value': '02282de498-9595-48s0hGpl59SkUKRZpRrS_b1TKJfXlz_3dGN9xGZ2tcTXrHuDsR5rN90I_Rp192pX48C1k'}]
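For reference, a minimal sketch of how cookies in this shape could be copied into a requests session while keeping the domain and path that Selenium reports. `session_from_selenium_cookies` is a hypothetical helper name, and the cookie value in the usage lines is a placeholder, not a real session. Note that `resp.cookies` only holds cookies the server set on that one response, while cookies copied in by hand live on `s.cookies`, which may be why the response's CookieJar looked empty earlier.

```python
import requests


def session_from_selenium_cookies(cookies):
    """Build a requests.Session carrying the cookies from driver.get_cookies()."""
    s = requests.Session()
    for cookie in cookies:
        # Keep domain and path so each cookie is only sent to the matching site
        s.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return s


# Placeholder cookie in the same shape as the list above
example = [{"domain": "ccmspa.pinellascounty.org", "name": "ASP.NET_SessionId",
            "path": "/", "value": "placeholder"}]
s = session_from_selenium_cookies(example)
print(s.cookies.get("ASP.NET_SessionId"))
```

With a session built this way, `s.get(url)` would send the copied cookies along with each request, so one session could be reused across the whole URL list for as long as the cookies stay valid.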