
Using JSoup with Android Studio to collect information from a website or RSS feed

I'm learning to use JSoup (slowly) to parse the source code of a website, but I feel myself nearing the end of my rope. That is, I'm not entirely sure what methods I should be looking into.

In theory, I want to develop an app that sifts through a particular search page (e.g. Google, Monster, Craigslist, eBay, etc.) and pulls out certain pieces of data. This data might be on the first page, but it could be 10+ pages down the list (in Google, obviously, it could be hundreds). Whether it's a search result (Google), a job posting (Monster), or an item for sale (Craigslist/eBay), how do I go about getting this done?

I didn't know about JSoup until recently, and I'm not "stuck" on using it. But my research has led me to believe that JSoup will give me the desired result, so I'm trying to learn how to use it to do what I want. (If anyone knows any extensive beginner tutorials, let me know.)

So how should I go about doing this? I know it's a vague question, but I have a goal and I'm not sure how to get to it.

I've also contemplated using/searching RSS feeds when available (e.g. Craigslist). Is this also possible, and is it easier or harder than attempting to pull straight from the site's source code?

On top of that, I want to be able to send the user notifications when new information appears (new items for sale on Craigslist/eBay, a new job on Monster, etc.). That's a separate topic, I know, and one I'm sure I can figure out.

Thanks in advance!

Your question is a little vague, so I'll try to cover as much as possible. Jsoup is an HTML/XHTML parser library. It can also make a GET or POST request to a given URL and fetch the content the server returns. It then parses that content to build a DOM representation out of Java objects.
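As a rough illustration of that parse-then-query flow, here is a minimal sketch. The HTML snippet and the `a.listing-title` selector are made up for the example; a real site will need its own selector, found by inspecting its markup. The sketch parses a string so it runs without network access; for a live page you would obtain the `Document` from `Jsoup.connect(url).get()` instead (off the main thread on Android).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ListingParser {

    // Returns the text of every element matching the hypothetical
    // "a.listing-title" selector in the given HTML.
    static List<String> extractTitles(String html) {
        Document doc = Jsoup.parse(html);            // build the DOM from a string
        List<String> titles = new ArrayList<>();
        for (Element link : doc.select("a.listing-title")) {
            titles.add(link.text());                 // visible text of the link
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a search results page.
        String html = "<html><body>"
                + "<a class=\"listing-title\" href=\"/item/1\">Red bike</a>"
                + "<a class=\"listing-title\" href=\"/item/2\">Blue bike</a>"
                + "<a href=\"/about\">About us</a>"
                + "</body></html>";
        System.out.println(extractTitles(html));
    }
}
```

The same `select` call works on a document fetched over the network, so the extraction logic can be developed and checked against saved HTML first.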

The problem is your examples. You mention Google and Monster. As far as I know, these sites use Ajax to retrieve content from the server and JavaScript to generate dynamic content. Jsoup cannot handle JavaScript-generated content because it cannot execute JavaScript. It can certainly "see" the script, since it is included in the server's response, but it treats it as plain text, not code. Browsers can handle this kind of content because they include a JavaScript execution engine.

In general, it is safer, better, and easier to get the content from an API of the source. Does it provide an RSS feed or an API? Then use that. For instance, Google provides a way to execute search queries programmatically.
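If the source does offer an RSS feed, the feed is plain XML and can be read without any scraping. The sketch below assumes a standard RSS 2.0 layout (`item` elements each containing a `title`) and uses the XML parser that ships with Java; the feed text and the `RssTitles` class are invented for the example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RssTitles {

    // Returns the <title> of every <item> in an RSS 2.0 document.
    // Parsing problems are rethrown as an unchecked exception.
    static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look for <title> inside the item only, so the
                // channel-level <title> is not picked up.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed standing in for a real one.
        String feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss version=\"2.0\"><channel><title>Bikes for sale</title>"
                + "<item><title>Red bike</title><link>https://example.com/1</link></item>"
                + "<item><title>Blue bike</title><link>https://example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```

Comparing the titles (or links) from one fetch against those from the previous fetch is one simple way to spot new items.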

If the source doesn't provide a programmatic way of accessing it, you can parse the content with Jsoup if and only if the content is static. To determine whether the content is static or generated by JavaScript, visit the site you want to parse and press Ctrl + U. The HTML source you see there is what Jsoup will receive when you make a request to the site. If the content you need is not in there, the content is dynamic. In that case, you must use a headless browser, which is a library/framework that, among other things, includes a JavaScript execution engine. This lets you closely simulate what a real browser does.
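The Ctrl + U check can also be done in code. The following is only a sketch: `isInStaticHtml` is a hypothetical helper, and the two pages and the `li.result` selector are made up. It shows that markup written out by a script is not part of the DOM Jsoup builds, because Jsoup keeps the script body as text and never runs it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticContentCheck {

    // True if the raw HTML (what Ctrl + U shows) already contains at least
    // one element matching the selector, i.e. Jsoup alone can extract it.
    static boolean isInStaticHtml(String rawHtml, String cssSelector) {
        Document doc = Jsoup.parse(rawHtml);
        return !doc.select(cssSelector).isEmpty();
    }

    public static void main(String[] args) {
        // Static page: the result is present in the server's response.
        String staticPage = "<ul id=\"results\">"
                + "<li class=\"result\">Red bike</li></ul>";

        // Dynamic page: the list is empty until a browser runs the script.
        String dynamicPage = "<ul id=\"results\"></ul>"
                + "<script>document.getElementById('results').innerHTML = "
                + "\"<li class='result'>Red bike</li>\";</script>";

        System.out.println(isInStaticHtml(staticPage, "li.result"));
        System.out.println(isInStaticHtml(dynamicPage, "li.result"));
    }
}
```

If the check fails for the data you need, that is the signal to look for an API or feed, or to fall back to a headless browser.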

As far as tutorials go, this will cover all your needs regarding Jsoup. If the content is JavaScript-generated, you can retrieve it with a headless browser and then pass the retrieved HTML to Jsoup, using Jsoup only for parsing and not for fetching the content.

The material I provided should be sufficient for continuing your research. To get better information, you will need to be more specific about the problem you are facing.

Update

Check out Selenium for Android and Selendroid.

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address.

 