
Selenium & Beautiful Soup scraping is returning an unexpected result

The USPTO site offers public data that is updated every week. Each release is published as "delta data" relative to the previous week. I'm trying to download this data using Python so I won't have to do it manually every week.

There are a few strange things happening:

First, browser.page_source holds HTML, but not the right HTML (I checked). And when I pass that HTML (as a string) to BeautifulSoup, soup.current_data is empty.

Second, the HTML that comes back is not the full page and does not contain "delta" or that section at all, even though the section is present in the page HTML when viewed in the browser.

Any ideas on how to get that file to download? Ultimately I need to call the deltaJsonDownload() JS function.

Code to reproduce:

from bs4 import BeautifulSoup
from selenium import webdriver


url = 'https://ped.uspto.gov/peds/'
browser = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # explicit parser avoids a bs4 warning
assert 'delta' in html  # fails: the JS-rendered section is missing

When you analyse the website's network calls, you can see it makes an AJAX request to fetch all the links for the downloadable data.

import requests

res = requests.get("https://ped.uspto.gov/api/")

data = res.json()

print(data)

Output:

{'message': None,
 'helpText': '{}',
 'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
   'sizeInBytes': 10429068701,
   'fileName': 'pairbulk-delta-20200815-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
   'sizeInBytes': 100685778,
   'fileName': '1900-1919-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
   'sizeInBytes': 13877,
   'fileName': '1920-1939-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 93016,
   'fileName': '1940-1959-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 82353484,
   'fileName': '1960-1979-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
   'sizeInBytes': 5019098918,
   'fileName': '1980-1999-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
   'sizeInBytes': 33231977060,
   'fileName': '2000-2019-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
   'sizeInBytes': 24313575,
   'fileName': '2020-2020-pairbulk-full-20200809-xml',
   'updatedFile': False}],
 'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
   'sizeInBytes': 5957650088,
   'fileName': 'pairbulk-delta-20200815-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
   'sizeInBytes': 66467976,
   'fileName': '1900-1919-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
   'sizeInBytes': 10100,
   'fileName': '1920-1939-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
   'sizeInBytes': 69891,
   'fileName': '1940-1959-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
   'sizeInBytes': 54076774,
   'fileName': '1960-1979-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
   'sizeInBytes': 3009216952,
   'fileName': '1980-1999-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
   'sizeInBytes': 18853619536,
   'fileName': '2000-2019-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
   'sizeInBytes': 17518389,
   'fileName': '2020-2020-pairbulk-full-20200809-json',
   'updatedFile': False}],
 'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}
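The sizeInBytes values above run into the tens of gigabytes, which is hard to eyeball. A small helper (a sketch, not part of the API response) renders them human-readable:

```python
def human_size(n):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print(human_size(5957650088))  # size of the delta JSON file -> 5.5 GB
print(human_size(33231977060))  # largest XML file -> 30.9 GB
```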

Parse the JSON, and using these links you can easily download the file you are looking for. Note that these files are quite large, so it is better to use a streaming download with requests.

The link you are looking for is built from the first element of data["jsonDownloadMetadata"].
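Relying on list position can be brittle if the API ever reorders its entries; a defensive sketch (my own addition, not guaranteed by the API) picks the entry whose fileName contains "delta" instead:

```python
def find_delta(metadata):
    """Pick the weekly delta file from the API response.

    Matching on the 'delta' substring is more robust than assuming
    the delta entry is always first in the list.
    """
    for entry in metadata["jsonDownloadMetadata"]:
        if "delta" in entry["fileName"]:
            return entry
    raise ValueError("no delta file in this release")

# Example with a trimmed-down copy of the API response above:
sample = {"jsonDownloadMetadata": [
    {"fileName": "pairbulk-delta-20200815-json", "sizeInBytes": 5957650088},
    {"fileName": "1900-1919-pairbulk-full-20200809-json", "sizeInBytes": 66467976},
]}
print(find_delta(sample)["fileName"])  # pairbulk-delta-20200815-json
```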

To get all the downloadable links, parse the JSON:

data = res.json()

for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")

Output:

https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
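Putting it together, a streaming download writes the response to disk in chunks instead of buffering multiple gigabytes in memory. This is a sketch: the actual call is commented out because the delta file is several GB, and the .zip extension on the saved file is an assumption, not something the API response specifies.

```python
import requests


def stream_download(url, dest, chunk_size=1024 * 1024):
    """Stream the response to disk in 1 MiB chunks (stream=True avoids
    loading the whole body into memory)."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:  # skip keep-alive chunks
                    fh.write(chunk)


file_name = "pairbulk-delta-20200815-json"
url = f"https://ped.uspto.gov/api/full-download?fileName={file_name}"
# stream_download(url, file_name + ".zip")  # ~6 GB download; uncomment to run
```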
