Java web scraper

Original: 2011-03-03 10:22:22 6 3 java/ html-parsing/ htmlunit/ html-content-extraction

What is the best library for a Java web scraper? I know of the following options:

Selenium
HTMLUnit
Lobo browser

I need to pick one of them to build a scraper for a project that has to scale.

3 answers

If you are scraping, why do you need a browser? Making basic cURL-style calls to a page and reading the response gives you everything you need for scraping.

Skipping the browser also helps with scalability. If you do want a browser, go for HTMLUnit, as that would again help with scalability.
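As a rough illustration of the "plain HTTP call" approach, here is a minimal sketch using the `java.net.http.HttpClient` that ships with Java 11 and later. The class name `PlainHttpFetcher`, the User-Agent string, the timeouts, and the `https://example.com/` URL are all placeholders I made up for the example, not anything from the question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: fetches a page over plain HTTP, no browser involved.
class PlainHttpFetcher {
    // One shared client; it is thread-safe and can be reused across requests.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    // Builds a GET request with an illustrative User-Agent and timeout.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "example-scraper/0.1")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a String.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                CLIENT.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass a real one as the first argument.
        String html = fetch(args.length > 0 ? args[0] : "https://example.com/");
        System.out.println(html.length() + " characters fetched");
    }
}
```

This only returns the raw HTML the server sends. Pages that build their content with JavaScript would still need something browser-like, such as HTMLUnit.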

Web Harvest was recently recommended to me, and I found it works well out of the box, apart from some issues around HTTP 500 response codes...

Use jsoup. It works great for fetching the response from a URL, and you can then use an XPath expression to parse data out of that response. I've implemented this and it works well.
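A minimal sketch of that jsoup workflow might look like the following. It assumes jsoup 1.14.3 or newer on the classpath, since that is the release where `selectXpath` was added; with older versions the usual route is CSS selectors via `select("a[href]")`. The class name `JsoupLinkScraper`, the XPath expression, and the example URL are illustrative choices of mine, not details from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: fetch a page with jsoup and pull out its links via XPath.
class JsoupLinkScraper {
    // Returns the absolute URL of every <a> element that has an href attribute.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.selectXpath("//a[@href]")) {
            // absUrl resolves relative hrefs against the document's base URI.
            links.add(a.absUrl("href"));
        }
        return links;
    }

    // Downloads and parses the page; the timeout (milliseconds) is an arbitrary value.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).timeout(10_000).get();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        Document doc = fetch(args.length > 0 ? args[0] : "https://example.com/");
        extractLinks(doc).forEach(System.out::println);
    }
}
```

Keeping `extractLinks` separate from `fetch` means the parsing step can be tried on a saved HTML string with `Jsoup.parse(html, baseUri)` without touching the network.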


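Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.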
