
Selenium & Beautiful Soup scraping is returning an unexpected result

The USPTO site offers public data that is updated every week. Each release is published as "delta data" relative to the previous week. I'm trying to download this data using Python so I won't have to do it manually every week.

There are a few strange things happening:

First, browser.page_source holds HTML, but not the right HTML (I checked). And when I pass that HTML (as a string) to BeautifulSoup, soup.current_data is empty.

Second, the HTML that comes back is not the full page and does not contain "delta" or that section at all, even though the section is present in the page HTML when viewed in the browser.

Any ideas on how to get that file to download? Ultimately I need to call the deltaJsonDownload() JS function.

Code to reproduce:

from bs4 import BeautifulSoup
from selenium import webdriver


url = 'https://ped.uspto.gov/peds/'
browser = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # explicit parser avoids a bs4 warning
assert 'delta' in html  # fails: the JS-rendered section is missing

When you analyse the website's network calls, you can see it makes an AJAX request to fetch all the links for the downloadable data.

import requests

res = requests.get("https://ped.uspto.gov/api/")

data = res.json()

print(data)

Output:

{'message': None,
 'helpText': '{}',
 'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
   'sizeInBytes': 10429068701,
   'fileName': 'pairbulk-delta-20200815-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
   'sizeInBytes': 100685778,
   'fileName': '1900-1919-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
   'sizeInBytes': 13877,
   'fileName': '1920-1939-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 93016,
   'fileName': '1940-1959-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 82353484,
   'fileName': '1960-1979-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
   'sizeInBytes': 5019098918,
   'fileName': '1980-1999-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
   'sizeInBytes': 33231977060,
   'fileName': '2000-2019-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
   'sizeInBytes': 24313575,
   'fileName': '2020-2020-pairbulk-full-20200809-xml',
   'updatedFile': False}],
 'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
   'sizeInBytes': 5957650088,
   'fileName': 'pairbulk-delta-20200815-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
   'sizeInBytes': 66467976,
   'fileName': '1900-1919-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
   'sizeInBytes': 10100,
   'fileName': '1920-1939-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
   'sizeInBytes': 69891,
   'fileName': '1940-1959-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
   'sizeInBytes': 54076774,
   'fileName': '1960-1979-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
   'sizeInBytes': 3009216952,
   'fileName': '1980-1999-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
   'sizeInBytes': 18853619536,
   'fileName': '2000-2019-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
   'sizeInBytes': 17518389,
   'fileName': '2020-2020-pairbulk-full-20200809-json',
   'updatedFile': False}],
 'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}
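The sizeInBytes values above run into the tens of gigabytes, which is hard to eyeball. A small helper (a sketch, not part of the API response) renders them human-readable:

```python
def human_size(n):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print(human_size(5957650088))  # size of the delta JSON file -> 5.5 GB
print(human_size(33231977060))  # largest XML file -> 30.9 GB
```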

Parse the JSON, and using these links you can easily download the file you are looking for. Note that these files are quite large, so it is better to use a streaming download with requests.

The link you are looking for is built from the first element of data["jsonDownloadMetadata"].
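Relying on list position can be brittle if the API ever reorders its entries; a defensive sketch (my own addition, not guaranteed by the API) picks the entry whose fileName contains "delta" instead:

```python
def find_delta(metadata):
    """Pick the weekly delta file from the API response.

    Matching on the 'delta' substring is more robust than assuming
    the delta entry is always first in the list.
    """
    for entry in metadata["jsonDownloadMetadata"]:
        if "delta" in entry["fileName"]:
            return entry
    raise ValueError("no delta file in this release")

# Example with a trimmed-down copy of the API response above:
sample = {"jsonDownloadMetadata": [
    {"fileName": "pairbulk-delta-20200815-json", "sizeInBytes": 5957650088},
    {"fileName": "1900-1919-pairbulk-full-20200809-json", "sizeInBytes": 66467976},
]}
print(find_delta(sample)["fileName"])  # pairbulk-delta-20200815-json
```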

To get all the downloadable links, parse the JSON:

data = res.json()

for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")

Output:

https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
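Putting it together, a streaming download writes the response to disk in chunks instead of buffering multiple gigabytes in memory. This is a sketch: the actual call is commented out because the delta file is several GB, and the .zip extension on the saved file is an assumption, not something the API response specifies.

```python
import requests


def stream_download(url, dest, chunk_size=1024 * 1024):
    """Stream the response to disk in 1 MiB chunks (stream=True avoids
    loading the whole body into memory)."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:  # skip keep-alive chunks
                    fh.write(chunk)


file_name = "pairbulk-delta-20200815-json"
url = f"https://ped.uspto.gov/api/full-download?fileName={file_name}"
# stream_download(url, file_name + ".zip")  # ~6 GB download; uncomment to run
```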
