The USPTO site offers public data that is updated every week. Each release is published as "delta data" relative to the previous week. I'm trying to download this data with Python so I won't have to do it manually every week.
A few odd things are happening:
First, browser.page_source holds HTML (but not the right HTML, I checked). When I pass that HTML as a string to BeautifulSoup, soup.current_data is empty.
Second, the HTML that is returned is not the full page and does not contain delta or that section at all, even though it is present in the site's HTML in the browser.
Any ideas on how to get that file to download? Eventually I need to call the deltaJsonDownload() JS function.
Code to reproduce:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ped.uspto.gov/peds/'
browser = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
assert 'delta' in html  # fails: the returned HTML has no delta section
If you inspect the site's network calls, you will see that it makes an AJAX request to fetch the metadata for all the downloadable files.
import requests
res = requests.get("https://ped.uspto.gov/api/")
data = res.json()
print(data)
Output:
{'message': None,
'helpText': '{}',
'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
'sizeInBytes': 10429068701,
'fileName': 'pairbulk-delta-20200815-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
'sizeInBytes': 100685778,
'fileName': '1900-1919-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
'sizeInBytes': 13877,
'fileName': '1920-1939-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 93016,
'fileName': '1940-1959-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 82353484,
'fileName': '1960-1979-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
'sizeInBytes': 5019098918,
'fileName': '1980-1999-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
'sizeInBytes': 33231977060,
'fileName': '2000-2019-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
'sizeInBytes': 24313575,
'fileName': '2020-2020-pairbulk-full-20200809-xml',
'updatedFile': False}],
'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
'sizeInBytes': 5957650088,
'fileName': 'pairbulk-delta-20200815-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
'sizeInBytes': 66467976,
'fileName': '1900-1919-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
'sizeInBytes': 10100,
'fileName': '1920-1939-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
'sizeInBytes': 69891,
'fileName': '1940-1959-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
'sizeInBytes': 54076774,
'fileName': '1960-1979-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
'sizeInBytes': 3009216952,
'fileName': '1980-1999-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
'sizeInBytes': 18853619536,
'fileName': '2000-2019-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
'sizeInBytes': 17518389,
'fileName': '2020-2020-pairbulk-full-20200809-json',
'updatedFile': False}],
'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}
Parse the JSON and use the file names in it to build the download links for the files you want. These files are very large (the JSON delta alone is several gigabytes), so it is better to use a streaming download in requests.
The entry you are looking for is the first element in data["jsonDownloadMetadata"].
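Relying on the delta being the first element works for the response shown above, but the ordering is not something the response itself guarantees. As a small sketch, assuming the delta file name always contains the word "delta" (as it does in the output above), you could select it by name instead. The helper name `find_delta_entry` and the trimmed `sample` list are made up for illustration:

```python
def find_delta_entry(metadata):
    """Return the first entry whose fileName contains 'delta', or None."""
    for entry in metadata:
        if "delta" in entry["fileName"]:
            return entry
    return None


# Trimmed-down sample in the same shape as data["jsonDownloadMetadata"]
sample = [
    {"fileName": "1900-1919-pairbulk-full-20200809-json", "sizeInBytes": 66467976},
    {"fileName": "pairbulk-delta-20200815-json", "sizeInBytes": 5957650088},
]

entry = find_delta_entry(sample)
print(entry["fileName"])  # pairbulk-delta-20200815-json
```

With the real response you would pass data["jsonDownloadMetadata"] instead of `sample`.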
To get the downloadable links, loop over the parsed JSON:
data = res.json()
for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")
Output:
https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
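Since the files are this large, here is a rough sketch of the streaming download mentioned earlier. It is an illustration rather than a tested recipe for this site: the function name `download_file`, the `.part` temporary suffix and the 1 MiB chunk size are my own choices, and `session` is assumed to be a `requests.Session` (anything with a compatible `get(url, stream=True)` method would work too).

```python
import os


def download_file(session, url, dest_path, chunk_size=1024 * 1024):
    """Stream url to dest_path in chunks and return the number of bytes written.

    session is assumed to be a requests.Session, or any object whose
    get(url, stream=True) returns a response usable as a context manager.
    """
    tmp_path = dest_path + ".part"  # hypothetical suffix marking an unfinished file
    written = 0
    with session.get(url, stream=True) as response:
        response.raise_for_status()  # stop early on HTTP errors
        with open(tmp_path, "wb") as fh:
            # iter_content yields the body piece by piece, so the whole
            # file is never held in memory at once
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    fh.write(chunk)
                    written += len(chunk)
    os.replace(tmp_path, dest_path)  # rename only after the download completed
    return written
```

You would call it as `download_file(requests.Session(), url, path)`, with one of the links printed above as `url` and a destination `path` of your choice. Writing to a temporary name first means an interrupted run does not leave something that looks like a complete file.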