
Java Web crawler and scraper

My goal is to read a product's price from several websites so that I can show a price comparison on an HTML page in my Spring application. Can anyone suggest how to do this? Are there technologies for it, so that my application always reads up-to-date data from the other sites? I have seen web scraper tools that run as Chrome extensions, but they produce an Excel workbook. How could I use that output in my Spring application and display it on an HTML page?

You can send HTTP requests from your Spring application and parse the responses to update your data. Alternatively, you can use an external tool to scrape whatever you need and save the results (for example, as an Excel workbook); your application then reads those results and processes them however you like.
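As a rough illustration of the first option (fetch a page over HTTP, then parse the response), here is a minimal sketch using only the JDK's `java.net.http.HttpClient` (Java 11+). The class name `PriceFetcher`, the `class="price"` markup, and the user-agent string are assumptions made for the example; real shops use different markup, and a regular expression is only a stopgap compared with a proper HTML parser.

```java
import java.io.IOException;
import java.math.BigDecimal;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceFetcher {

    // Assumed markup: <span class="price">$1,299.00</span>. Adjust per site.
    private static final Pattern PRICE = Pattern.compile(
            "class=\"price\"[^>]*>\\s*[^0-9<]*([0-9][0-9,]*(?:\\.[0-9]+)?)");

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Downloads the page body as text; fails on any status other than 200. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "price-compare-demo/0.1") // hypothetical value
                .GET()
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    /** Returns the first price found in the HTML, or empty if none matches. */
    static Optional<BigDecimal> extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Drop thousands separators before parsing.
        return Optional.of(new BigDecimal(m.group(1).replace(",", "")));
    }
}
```

A Spring service could call `PriceFetcher.extractPrice(PriceFetcher.fetch(url))` for each shop and put the results into the model that the HTML view renders. Keeping download and parsing in separate methods makes the parsing part easy to test against saved pages.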

Plenty of open-source Java- and Python-based crawlers are readily available, and you can configure them for your requirements. Some of them are listed below.

Apache Nutch
StormCrawler
Jsoup
Jaunt

In your case, since you only need the price from each product page, you can build your own scraper with Jsoup, an HTML parsing library for Java, or with the Beautiful Soup module in Python.
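A small Jsoup-based price scraper might look like the sketch below. It assumes the Jsoup library is on the classpath; the class name `JsoupPriceScraper`, the user-agent string, and the example selector `span.price` are made up for illustration, and the parsing assumes prices use a decimal point.

```java
import java.io.IOException;
import java.math.BigDecimal;
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class JsoupPriceScraper {

    /** Finds the first element matching cssQuery and parses its text as a price. */
    static Optional<BigDecimal> priceFrom(Document page, String cssQuery) {
        Element el = page.selectFirst(cssQuery);
        if (el == null) {
            return Optional.empty();
        }
        // Keep digits and the decimal point only: "$1,299.00" becomes "1299.00".
        String digits = el.text().replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigDecimal(digits));
        } catch (NumberFormatException e) {
            // Text such as "19.99." or "1.299.00" is not handled by this sketch.
            return Optional.empty();
        }
    }

    /** Downloads the page with Jsoup and extracts the price. */
    static Optional<BigDecimal> scrape(String url, String cssQuery) throws IOException {
        Document page = Jsoup.connect(url)
                .userAgent("price-compare-demo/0.1") // hypothetical value
                .timeout(10_000)                     // milliseconds
                .get();
        return priceFrom(page, cssQuery);
    }
}
```

For example, `JsoupPriceScraper.scrape("https://shop.example/item/42", "span.price")` would return the price if that (hypothetical) page marks it up with `<span class="price">`. Pages that render prices with JavaScript will not work with this approach, because Jsoup only parses the HTML the server returns.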

If scale isn't a concern and you just want to crawl a few pages daily, I recommend building your own crawler. Otherwise, you can use Nutch or StormCrawler.
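For the small-scale, once-a-day case, a home-grown crawler can be little more than a scheduled task that re-scrapes a fixed set of URLs and keeps the latest prices in memory. The sketch below uses only the JDK; `PriceCache`, the shop-to-URL map, and the `scraper` function are illustrative names, and the actual scraping logic is supplied by the caller.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class PriceCache {

    private final Map<String, BigDecimal> latest = new ConcurrentHashMap<>();
    private final Map<String, String> shopUrls;          // shop name -> product URL
    private final Function<String, BigDecimal> scraper;  // URL -> price, supplied by caller

    PriceCache(Map<String, String> shopUrls, Function<String, BigDecimal> scraper) {
        this.shopUrls = Map.copyOf(shopUrls);
        this.scraper = scraper;
    }

    /** Re-scrapes every shop once; if one shop fails, its previous price is kept. */
    void refresh() {
        shopUrls.forEach((shop, url) -> {
            try {
                latest.put(shop, scraper.apply(url));
            } catch (RuntimeException e) {
                System.err.println("Skipping " + shop + ": " + e.getMessage());
            }
        });
    }

    /** Immutable copy of the most recent prices, safe to hand to a view. */
    Map<String, BigDecimal> snapshot() {
        return Map.copyOf(latest);
    }

    /** Runs refresh() immediately and then every 24 hours on a background thread. */
    ScheduledExecutorService startDaily() {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(this::refresh, 0, 24, TimeUnit.HOURS);
        return pool;
    }
}
```

In a Spring application the same `refresh()` method could instead be triggered by Spring's `@Scheduled` annotation, and a controller would pass `snapshot()` to the comparison page. Catching failures per shop matters here because `scheduleAtFixedRate` stops rescheduling a task once it throws.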

Also, for a custom-built scraper, try not to maintain a separate selector for every web page. Instead, look for a common tag, CSS class, or template that will get you the price.


 