Using JSoup with Android Studio to collect information from a website or RSS feed
I'm (slowly) learning to use JSoup to parse the source code of a website, but I feel I'm nearing the end of my rope. That is, I'm not entirely sure which methods I should be looking into.
In theory, I want to develop an app that sifts through a particular search page (e.g. Google, Monster, Craigslist, eBay) and pulls out certain pieces of data. This data might be on the first page, but it could be 10+ pages down the list (on Google, obviously, it could be hundreds). Whether it's a search result (Google), a job posting (Monster), or an item for sale (Craigslist/eBay), how do I go about getting this done?
I didn't know about JSoup until recently, and I'm not set on using it. But my research has led me to believe that JSoup will give me the desired result, so I'm trying to learn how to use it to do what I want. (If anyone knows of any extensive beginner tutorials, let me know.)
So how should I go about doing this? I know it's a vague question, but I have a goal and I'm not sure how to get to it.
I've also contemplated using/searching RSS feeds when available (e.g. Craigslist). Is this also possible, and is it easier or harder than attempting to pull straight from the site/source code?
On top of that, I want to be able to send the user notifications when new information appears (new items for sale on Craigslist/eBay, a new job on Monster, etc.). That's a separate topic, I know, and one I'm sure I can figure out.
Thanks in advance!
Your question is a little vague, so I'll try to cover as much as possible. Jsoup is an HTML/XHTML parser library. This means that it can make a GET or a POST request to a certain URL and get the content generated by the server. Then it parses this content to build a DOM representation using Java objects.
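As a minimal sketch of the parse-then-query step, the snippet below builds a DOM from an HTML string and pulls out links with a CSS selector. The class name JsoupDomSketch, the "result" CSS class, and the sample markup are all made up for illustration; a real site will need its own selectors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class JsoupDomSketch {

    // Parses an HTML string into a DOM and returns "text -> href" for every
    // link inside an element with the (hypothetical) CSS class "result".
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select(".result a[href]")) {
            links.add(a.text() + " -> " + a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Inline markup stands in for a server response. Against a live page
        // the document would come from Jsoup.connect(url).get() instead.
        String html = "<html><body>"
                + "<div class=\"result\"><a href=\"/item/1\">First item</a></div>"
                + "<div class=\"result\"><a href=\"/item/2\">Second item</a></div>"
                + "<div class=\"ad\"><a href=\"/ad\">Sponsored</a></div>"
                + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

On Android, the network variant (Jsoup.connect(url).get()) has to run off the main thread, and the app needs the INTERNET permission.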
The problem is your examples. You mention Google and Monster. As far as I know, these sites use Ajax to retrieve content from the server: they use JavaScript to generate dynamic content. Jsoup cannot handle JavaScript-generated content, because it cannot execute JavaScript. It can certainly "see" the script, since it's included in the server's response, but it treats it as plain text, not code. Browsers can handle this kind of content because they include a JavaScript execution engine.
In general, it is safer/better/easier to get the content from an API of the source. Does it provide an RSS feed or an API? Then use that. For instance, Google provides a way to programmatically execute search queries.
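To illustrate why a feed tends to be easier than scraping, here is a rough sketch that reads item titles out of an RSS document using only the XML parser that ships with Java (and Android). The class name RssSketch and the sample feed are invented for the example; real feeds may use different element names or namespaces, so treat the tag names as assumptions.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class RssSketch {

    // Returns the trimmed <title> text of every <item> element in the feed.
    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single failure type.
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical feed content; a real app would download this text first.
        String feed = "<rss version=\"2.0\"><channel><title>Listings</title>"
                + "<item><title>Used bike</title><link>http://example.com/1</link></item>"
                + "<item><title>Desk lamp</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        for (String title : itemTitles(feed)) {
            System.out.println(title);
        }
    }
}
```

Because a feed is already structured, there are no page-specific selectors to maintain, and comparing the titles or links against those seen on the previous fetch is one simple way to detect new items.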
If the source doesn't provide a programmatic way of accessing it, then you can parse the content using Jsoup if and only if the content is static. To determine whether the content is static or generated by JavaScript, visit the site you want to parse and press Ctrl + U. The mass of HTML you see is what Jsoup will receive when you make a request to the site. If the content you need is not in there, then the content is dynamic. In that case, you must use a headless browser, which is a library/framework that, among other things, includes a JavaScript execution engine. This way you can fully simulate the browser.
As far as tutorials go, this will cover all your needs regarding Jsoup. If the content is JavaScript-generated, you can retrieve it with a headless browser and then use Jsoup only to parse the already-retrieved content, not to fetch it.
The material I provided is sufficient for continuing your research. To get better information, you must be specific about the problem you are facing.
Update
Check Selenium for Android and Selendroid.