
Selenium & Beautiful Soup scraping is returning an unexpected result

The USPTO site offers public data that updates every week. Every time they release new data, they release it as "delta data" covering the changes since the previous week. I'm trying to download this data with Python so I won't have to do it manually every week.

There are a few weird things happening:

First, browser.page_source holds HTML (but not the right HTML - I checked). But when I pass that HTML (as a string) to BeautifulSoup, soup.current_data is empty.

Second, the HTML that comes back is incomplete and does not contain delta or that section at all, even though the section is visible in the page's HTML in the browser.


Any ideas on how to get that file to download? I eventually need to call the deltaJsonDownload() JS function.

Code to reproduce:

from bs4 import BeautifulSoup
from selenium import webdriver


url = 'https://ped.uspto.gov/peds/'
browser = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # pass a parser explicitly to avoid the bs4 warning
assert 'delta' in html  # fails: the returned HTML never contains "delta"

When you analyse the website's network calls, you can see it makes an AJAX request to get all the download links for the data.

import requests

res = requests.get("https://ped.uspto.gov/api/")

data = res.json()

print(data)

Output:

{'message': None,
 'helpText': '{}',
 'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
   'sizeInBytes': 10429068701,
   'fileName': 'pairbulk-delta-20200815-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
   'sizeInBytes': 100685778,
   'fileName': '1900-1919-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
   'sizeInBytes': 13877,
   'fileName': '1920-1939-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 93016,
   'fileName': '1940-1959-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
   'sizeInBytes': 82353484,
   'fileName': '1960-1979-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
   'sizeInBytes': 5019098918,
   'fileName': '1980-1999-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
   'sizeInBytes': 33231977060,
   'fileName': '2000-2019-pairbulk-full-20200809-xml',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
   'sizeInBytes': 24313575,
   'fileName': '2020-2020-pairbulk-full-20200809-xml',
   'updatedFile': False}],
 'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
   'sizeInBytes': 5957650088,
   'fileName': 'pairbulk-delta-20200815-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
   'sizeInBytes': 66467976,
   'fileName': '1900-1919-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
   'sizeInBytes': 10100,
   'fileName': '1920-1939-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
   'sizeInBytes': 69891,
   'fileName': '1940-1959-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
   'sizeInBytes': 54076774,
   'fileName': '1960-1979-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
   'sizeInBytes': 3009216952,
   'fileName': '1980-1999-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
   'sizeInBytes': 18853619536,
   'fileName': '2000-2019-pairbulk-full-20200809-json',
   'updatedFile': False},
  {'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
   'sizeInBytes': 17518389,
   'fileName': '2020-2020-pairbulk-full-20200809-json',
   'updatedFile': False}],
 'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}

Parse the JSON, and using these links you can easily download the file you are looking for. These files are quite large, though, so it is better to use a streaming download with requests.
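A minimal streaming-download sketch (the `.zip` extension for the saved file and the 1 MiB chunk size are assumptions, not something the API documents):

```python
import requests

API_DOWNLOAD = "https://ped.uspto.gov/api/full-download?fileName={}"


def download_url(file_name):
    # Build the direct download URL for a given bulk-data file name.
    return API_DOWNLOAD.format(file_name)


def stream_download(file_name, chunk_size=1024 * 1024):
    # stream=True keeps the multi-GB archive out of memory: the body
    # is read from the socket and written to disk one chunk at a time.
    local_path = file_name + ".zip"  # assumption: the server sends an archive
    with requests.get(download_url(file_name), stream=True) as res:
        res.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in res.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return local_path
```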

The link you are looking for is the first element in data["jsonDownloadMetadata"].
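Rather than relying on the delta file always being the first element, it is a bit more robust to match on the file name itself, since the weekly delta entry is the only one containing "delta" (the historical dumps are all named "...-full-..."). A small sketch against the metadata shown above:

```python
def latest_delta(metadata):
    # Pick the weekly delta entry by name; the full historical
    # dumps all contain "full" instead of "delta".
    return next(m for m in metadata if "delta" in m["fileName"])

# usage: latest_delta(data["jsonDownloadMetadata"])["fileName"]
```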

To get the downloadable links, parse the JSON:

data = res.json()

for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")

Output:

https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
