
Using JSoup with Android Studio to collect information from a website or RSS feed

Original post: 2015-10-18 03:33:55 7 1 java/ android/ parsing/ rss/ jsoup

I'm learning to use JSoup (slowly) to parse the source code of a website, but I feel myself nearing the end of my rope. That is, I'm not entirely sure what methods I should be looking into.

In theory, I want to develop an app that sifts through a particular search page (e.g. Google, Monster, Craigslist, eBay, etc.) and pulls out certain pieces of data. This data might be on the first page, but it could be 10+ pages down the list (in Google, obviously it could be hundreds). Whether it's a search result (Google), a job posting (Monster), or an item for sale (Craigslist/eBay), how do I go about getting this done?

I didn't know about JSoup until recently, and I'm not "stuck" on using it. But my research has led me to believe that using JSoup will give me the desired result, so I'm trying to learn how to use it to do what I want. (If anyone knows any extensive beginner tutorials, let me know.)

So how should I go about doing this? I know it's a vague question, but I have a goal and I'm not sure how to get to it.

I've also contemplated using/searching RSS feeds when available (e.g. Craigslist). Is this also possible, and is it easier or harder than attempting to pull straight from the site/source code?

On top of that, I want to be able to send the user notifications when new information appears (new items for sale on Craigslist/eBay, a new job on Monster, etc.). That's a separate topic, I know, and one I'm sure I can figure out.

Thanks in advance!

1 Answer

Your question is a little vague, so I'll try to cover as much as possible. Jsoup is an HTML/XHTML parser library. It can make a GET or a POST request to a certain URL and get the content generated by the server. Then it parses this content to build a DOM representation using Java objects.
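As a rough illustration of the "parse into a DOM" step, here is a minimal sketch. It assumes the jsoup library is on the classpath; the class name `ListingParser`, the sample markup, and the selector `li.item > a` are all made up for the example and will differ on any real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ListingParser {

    // Parses HTML that has already been downloaded and returns the text of
    // every link matching the hypothetical selector "li.item > a".
    static List<String> itemTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element link : doc.select("li.item > a")) {
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a server response.
        String html = "<ul>"
                + "<li class=\"item\"><a href=\"/1\">Red bike</a></li>"
                + "<li class=\"item\"><a href=\"/2\">Blue couch</a></li>"
                + "</ul>";
        System.out.println(itemTitles(html)); // prints [Red bike, Blue couch]
    }
}
```

To work on a live page instead of a string, `Jsoup.connect(url).get()` returns the same kind of `Document`. On Android that call has to run off the main thread.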

The problem is your examples. You mention Google and Monster. As far as I know, these sites use Ajax to retrieve the content from the server; they use JavaScript to generate dynamic content. Jsoup cannot handle JavaScript-generated content because it cannot execute JavaScript. It can certainly "see" the script, since it's included in the server's response, but it treats it as plain text, not code. Browsers can handle this kind of content because they include a JavaScript execution engine.

In general, it is safer/better/easier to get the content from an API of the source. Does it provide an RSS feed or an API? Then use that. For instance, Google provides a way to programmatically execute search queries.
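To give a feel for the feed route, here is a minimal sketch that pulls item titles out of an RSS document using only the XML parser that ships with Java (and Android). The class name `RssTitles` and the sample feed are invented for illustration; a real feed would first be downloaded from the site's feed URL, and its element layout may differ.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class RssTitles {

    // Returns the <title> text of every <item> element in the feed.
    // The channel's own <title> is skipped because only items are searched.
    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed standing in for a downloaded RSS document.
        String feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss version=\"2.0\"><channel><title>Example listings</title>"
                + "<item><title>Red bike</title><link>http://example.com/1</link></item>"
                + "<item><title>Blue couch</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed)); // prints [Red bike, Blue couch]
    }
}
```

Because a feed is structured data rather than page layout, this kind of code tends to survive site redesigns better than CSS selectors do. Comparing the returned list against the titles seen on the previous run is one simple way to spot new items.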

If the source doesn't provide a programmatic way of accessing it, then you can parse the content using Jsoup if and only if the content is static. To determine whether the content is static or generated by JavaScript, visit the site you want to parse and press Ctrl + U to view the page source. The HTML shown there is what Jsoup will receive when you make a request to the site. If the content you need is not included in there, then the content is dynamic. In that case, you must use a headless browser, which is a library/framework that, among other things, includes a JavaScript execution engine. This way you can reproduce what a browser does.
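The Ctrl + U check can also be done in code. The sketch below assumes the jsoup library is available; the helper name `isInStaticHtml`, the two sample pages, and the selector `p.job` are hypothetical. It asks whether a selector matches anything in the raw HTML, which is all Jsoup ever gets to see.

```java
import org.jsoup.Jsoup;

final class StaticContentCheck {

    // True if the raw server response already contains an element matching
    // cssQuery, i.e. the content is static and Jsoup alone can reach it.
    static boolean isInStaticHtml(String rawHtml, String cssQuery) {
        return !Jsoup.parse(rawHtml).select(cssQuery).isEmpty();
    }

    public static void main(String[] args) {
        // Static page: the job posting is part of the markup itself.
        String staticPage =
                "<div id=\"results\"><p class=\"job\">Java developer</p></div>";

        // Dynamic page: the posting only exists after a browser runs the script.
        // Jsoup keeps the script body as text and never executes it.
        String dynamicPage = "<div id=\"results\"></div>"
                + "<script>document.getElementById('results').innerHTML ="
                + " '<p class=\"job\">Java developer</p>';</script>";

        System.out.println(isInStaticHtml(staticPage, "p.job"));  // prints true
        System.out.println(isInStaticHtml(dynamicPage, "p.job")); // prints false
    }
}
```

A `false` result for a selector that clearly matches in the browser's element inspector is a strong hint that the content is generated by JavaScript and needs a headless browser.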

As far as tutorials go, this will cover all your needs regarding Jsoup. If the content is JavaScript-generated, you can retrieve it with a headless browser and then hand the retrieved HTML to Jsoup, using Jsoup only for parsing and not for fetching the content.

The material I provided should be sufficient for continuing your research. To get better information, you must be specific about the problem you are facing.

Update

Check Selenium for Android and Selendroid.