
Extracting “hidden” HTML with Jsoup

I am trying to get at HTML data that does not appear in the page source but is visible through, for example, "Inspect element" in Google Chrome.

Example page: http://assignment.uspto.gov/#/search?q=9000000&sort=patAssignorEarliestExDate%20desc%2C%20id%20desc&synonyms=false

In the inspected page, a number of div elements containing assignment data for US Patent No. 9,000,000 appear below this line:

<script async="async" type="text/javascript" src="https://components.uspto.gov/js/ais/2-2-assignment-search.js"></script>

Is there a way to extract this hidden HTML with Jsoup?

The data seems to be loaded with AJAX. Jsoup does not execute JavaScript, so it only sees the HTML the server sends back, not the elements that scripts add afterwards.
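A small, hedged illustration of why a plain fetch misses the data: in the example URL, everything after `#` is a URL fragment, and fragments are not sent in an HTTP request. The sketch below uses only the JDK's `java.net.URI`; the class name `FragmentDemo` and the helper `requestTarget` are made up for this example. It suggests that a non-browser client asking for the example page effectively requests just `/`, while the search parameters are left for the page's JavaScript to interpret.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class FragmentDemo {

    /** Returns the path (plus query, if any) that an HTTP client would put in the request. */
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        // The fragment (everything after '#') is deliberately left out:
        // it stays on the client side and never reaches the server.
        return query == null ? path : path + "?" + query;
    }

    public static void main(String[] args) {
        String url = "http://assignment.uspto.gov/#/search?q=9000000"
                + "&sort=patAssignorEarliestExDate%20desc%2C%20id%20desc&synonyms=false";
        URI uri = URI.create(url);
        System.out.println("Sent to the server:  " + requestTarget(url));
        System.out.println("Kept by the browser: " + uri.getRawFragment());
    }
}
```

Because the `?q=...` part sits after the `#`, `java.net.URI` reports it as part of the fragment rather than as a query string, which matches how a browser treats it.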

What you need is a "headless browser" API, one that executes JavaScript without actually rendering anything on screen.

HtmlUnit seems to be the best-known tool, although I've never used it myself. As suggested before, Selenium WebDriver is also an option.

I believe you will have to load the URL and wait for all the AJAX requests to finish. You should then end up, in Java, with almost the same parse tree you see in Chrome, to do with as you wish.

If this is the only information you need, here is the JSON URL that returns it:

http://prod-proxy-lb-2117675230.us-east-1.elb.amazonaws.com/solr/aotw/search?json.wrf=jQuery1102004354461841285229_1448413727331&q=9000000&facet.date.other=before&rows=20&start=0&wt=json&facet.date.start=NOW%2FYEAR-50YEARS&fl=id%2CreelNo%2CframeNo%2CconveyanceText%2CpatAssigneeName%2CpatAssignorName%2CinventionTitleFirst%2CapplNumFirst%2CpublNumFirst%2CpatNumFirst%2CintlRegNumFirst%2CcorrName%2CcorrAddress1%2CcorrAddress2%2CcorrAddress3%2CpatAssignorEarliestExDate%2CfilingDateFirst%2CpublDateFirst%2CissueDateFirst%2CintlPublDateFirst%2CpatNumSize&hl.fl=reelNo%2CframeNo%2CpatAssigneeName%2CpatAssignorName%2CconveyanceText%2CinventionTitleFirst%2CapplNumFirst%2CpublNumFirst%2CpatNumFirst%2CintlRegNumFirst%2CcorrName%2CcorrAddress1%2CcorrAddress2%2CcorrAddress3&hl.requireFieldMatch=true&sort=patAssignorEarliestExDate+desc%2C+id+desc

I found it by inspecting the Network tab of the Chrome developer tools. You can get the contents of this URL by using HttpConnection. An example can be found here. After getting the JSON response, you can parse it to retrieve whatever information you need.
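As a rough sketch of that last step, the code below fetches a URL and strips a JSONP wrapper from the response. It rests on a few assumptions: it uses the JDK's `java.net.http.HttpClient` (Java 11+) rather than whichever HttpConnection class is meant above; the class name `JsonpFetch` and its methods are invented for this example; and the sample string in `main` is made-up data, not real output from the service. The unwrapping matters because the URL above carries a `json.wrf=...` parameter, which asks Solr to wrap the JSON in a JavaScript callback call. Removing that parameter from the URL may avoid the wrapper altogether.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, for illustration only.
final class JsonpFetch {

    /** Turns "callback({...});" into "{...}"; plain JSON is returned unchanged (trimmed). */
    static String unwrapJsonp(String body) {
        String s = body.trim();
        if (s.isEmpty() || s.charAt(0) == '{' || s.charAt(0) == '[') {
            return s; // already bare JSON
        }
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (open > 0 && close > open) {
            return s.substring(open + 1, close).trim();
        }
        return s;
    }

    /** Performs a GET request and returns the response body as text. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Offline demo with made-up data, so the sketch runs without network access.
            System.out.println(unwrapJsonp("cb({\"example\":true});"));
            return;
        }
        // Pass the JSON URL as the first argument to fetch it for real.
        System.out.println(unwrapJsonp(fetch(args[0])));
    }
}
```

The resulting string can then be handed to any JSON parser. Jsoup itself is not needed for this route, since the response is JSON rather than HTML.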

