
Getting bytes of a webpage from selenium

I am trying to scrape a webpage with a pdf.

With requests, I use the following code to get the bytes and save them with open():

    pdf_response = requests.get(pdf_url)
    
    with open("sample.pdf", 'wb') as f:
        f.write(pdf_response.content)

That works just fine.

On the page below, however, I am using Selenium and cannot get the bytes of the response to use with the code above:

from selenium import webdriver

# Unlike requests, this does not give me the bytes of the PDF
driver = webdriver.Chrome()
driver.get(base)  # base is the URL of the page (link below)

content = driver.page_source.encode('utf-8').strip()

Link to the PDF (it is behind a captcha, which I solve with 2captcha)

The response I currently receive:

''
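
For context: driver.page_source is the browser's DOM serialized as text, not the raw body of the HTTP response, so encoding it does not produce the PDF bytes. One possible workaround, sketched here under the assumption that the download depends only on cookies the browser session already holds, is to copy the Selenium cookies into requests and fetch the file there. The helper name cookies_for_requests is hypothetical; driver and pdf_url are the names used in the question.

```python
def cookies_for_requests(selenium_cookies):
    """Convert the list of cookie dicts returned by driver.get_cookies()
    into a plain {name: value} dict that requests accepts via cookies=."""
    return {c["name"]: c["value"] for c in selenium_cookies}


# Hypothetical usage (not run here, needs a live browser session):
#   import requests
#   cookies = cookies_for_requests(driver.get_cookies())
#   pdf_response = requests.get(pdf_url, cookies=cookies)
#   with open("sample.pdf", "wb") as f:
#       f.write(pdf_response.content)
```

Whether this is enough for a particular site depends on what else the server checks (headers, referrer, captcha state), so treat it as a starting point rather than a guaranteed fix.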

I can get the PDF using only requests.

The only problem: I use Pillow to combine the captcha images into one image and display it, and I have to read the code manually. If you have a method to recognize it automatically, that is not a problem.

import requests
import lxml.html
from PIL import Image
import io

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
}

# --- create Session ---

s = requests.Session()
s.headers.update(headers)

# --- load main page ---

url = 'https://www.sedar.com/GetFile.do?lang=EN&docClass=8&issuerNo=00028264&issuerType=03&projectNo=03079934&docId=4755532'  # page with the captcha images

r = s.get(url)

# --- get images ---

soup = lxml.html.fromstring(r.text)

image_urls = soup.xpath('//img/@src')

# --- generate one image ---

full_image = Image.new('RGB', (40*5, 50))

for i, url in enumerate(image_urls):
    #print(url)
    r = s.get('https://www.sedar.com/' + url)
    
    image = Image.open(io.BytesIO(r.content))
    
    full_image.paste(image, (40*i, 0))

# --- ask for code --- 

full_image.show()

code = input('code> ')

#print('code:', code)

# --- get PDF ---

r = s.post('https://www.sedar.com/CheckCode.do', data={'code': code})

if not r.headers.get('Content-Type', '').startswith('application/pdf'):
    print('It is not PDF file')
else:
    with open('output.pdf', 'wb') as fh:
        print('size:', fh.write(r.content))
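
As an extra check besides the Content-Type header, the body itself can be inspected: a PDF file starts with the marker %PDF-. Below is a minimal sketch of that idea; the function names looks_like_pdf and save_if_pdf are made up for illustration and are not part of requests.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the PDF header marker."""
    return data.startswith(b"%PDF-")


def save_if_pdf(data: bytes, path: str) -> bool:
    """Write data to path only if it looks like a PDF.

    Returns True if the file was written, False otherwise.
    """
    if not looks_like_pdf(data):
        return False
    with open(path, "wb") as fh:
        fh.write(data)
    return True
```

With the code above it could be called as save_if_pdf(r.content, 'output.pdf'). A wrong captcha code would then be reported instead of an HTML error page being saved under a .pdf name.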

