
Java - Get text from webpage

I am beginning a new project. It's something I have never attempted in Java, and I have been researching beforehand. My research has not gotten me much further than where I started.

Basically, my project will do this:

  • Search a website and get the corresponding data (essentially, query the site's search engine with whatever the user inputs, then return the matching results)

  • The user clicks on one of the results, and the program then shows certain values (the values will be on the result's webpage)
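
The first step above (sending the user's query to the site's search page and downloading the result) might be sketched roughly as follows, using only classes from the JDK. The base URL `https://example.com/search`, the query parameter name `q`, and the class name `SearchFetcher` are assumptions made for illustration; the real site will have its own URL scheme.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper: builds a search URL and downloads the page as text.
class SearchFetcher {

    // Appends the URL-encoded query to the base URL.
    // The parameter name "q" is an assumption; check the target site's search form.
    static String buildSearchUrl(String baseUrl, String query) {
        try {
            return baseUrl + "?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }

    // Downloads the page at the given URL and returns its HTML as one string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
            return html.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String query = args.length > 0 ? args[0] : "java scraping";
        String url = buildSearchUrl("https://example.com/search", query);
        System.out.println("Search URL: " + url);
        // Only touch the network when a query was passed explicitly.
        if (args.length > 0) {
            System.out.println("Downloaded " + fetch(url).length() + " characters");
        }
    }
}
```

On Android, `fetch` would have to run off the main thread, and the app needs the `INTERNET` permission in its manifest.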

So far, all I really know about how to do this is that it involves web scraping. I couldn't find any examples, so I am still somewhat in the dark about this.

Is this really possible? I will be using Java with the Android SDK. I have a rough idea, but my Java knowledge does not cover anything to do with web pages, etc.

Thanks in advance, Brandon

Nutch is a great tool, but it may be a bit overkill for a small project. If you are looking for something really quick and dirty and easy to understand, you should look into crawler.

See an example of its use here: http://java.net/projects/crawler/sources/svn/content/trunk/src/examples/com/torunski/crawler/examples/ExampleDownloadWithHTMLParser.java?rev=429

You can probably drop this into your project and be scraping in 10 minutes.
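
To give a rough idea of what the parsing step involves once a page has been downloaded, here is a minimal sketch that does not use the crawler library and is not taken from the linked example. It pulls link targets and their visible text out of an HTML string with regular expressions. The class name `ResultExtractor` and the sample markup are made up for illustration, and regex-based parsing is brittle on real-world HTML, so a proper HTML parser is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts links and plain text from an HTML string.
class ResultExtractor {

    // Matches <a ... href="..." ...>inner</a>; only handles double-quoted href values.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any single tag such as <b> or </span>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes all tags and trims the remaining text.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    // Returns one {href, visibleText} pair per link found in the HTML.
    static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), stripTags(m.group(2)) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a downloaded search results page.
        String html = "<ul><li><a href=\"/item/1\">First <b>result</b></a></li>"
                + "<li><a class=\"x\" href=\"/item/2\">Second</a></li></ul>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[1] + " -> " + link[0]);
        }
    }
}
```

Each extracted `href` can then be fetched in turn, and `stripTags` applied to the part of that page holding the values to display.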

Of course it is possible. Probably the best library for this is Apache Nutch. It is based on powerful library stacks like Lucene and is very mature. Look into their tutorials and you might find all the necessary information for a quick proof of concept.
