
Java Web Crawler Libraries


I wanted to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java was the way to go if this is your first time. However, I have two important questions.

  1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.)

  2. What libraries should I use? I would assume I need a library for connecting to web pages, a library for the HTTP/HTTPS protocol, and a library for HTML parsing.

Crawler4j is the best solution for you.

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!
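As a rough sketch of what that setup looks like (adapted from crawler4j's quickstart example; MyCrawler, the seed URL, and the storage folder are placeholders, and the exact API should be verified against the crawler4j version you use):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Only follow links that stay on the seed site
            return url.getURL().startsWith("https://www.ics.uci.edu/");
        }

        @Override
        public void visit(Page page) {
            // Called once per successfully fetched page
            System.out.println("Visited: " + page.getWebURL().getURL());
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("https://www.ics.uci.edu/");
            controller.start(MyCrawler.class, 4); // number of crawler threads
        }
    }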

Also visit this page for more Java-based web crawler tools and a brief explanation of each.

This is how your program 'visits' or 'connects' to web pages:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    BufferedReader reader = null;
    String line;

    try {
        URL url = new URL("http://stackoverflow.com/");
        // openStream() throws an IOException.
        // DataInputStream.readLine() is deprecated, so wrap the stream
        // in a BufferedReader to read it line by line instead.
        reader = new BufferedReader(new InputStreamReader(url.openStream()));

        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }

This will download the source of the HTML page.

For HTML parsing, see this.

Also take a look at jSpider and jsoup.
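For instance, a minimal jsoup fetch-and-parse might look like this (a sketch; the URL is just a placeholder):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Fetch the page and parse it into a DOM in one step
    // (connect(...).get() throws IOException)
    Document doc = Jsoup.connect("https://example.com/").get();
    System.out.println(doc.title());

    // Extract every hyperlink with a CSS selector;
    // "abs:href" resolves relative URLs against the page URL
    for (Element link : doc.select("a[href]")) {
        System.out.println(link.attr("abs:href"));
    }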

For parsing content, I am using Apache Tika.
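A minimal sketch of Tika's facade API (assuming the tika-core and tika-parsers dependencies are on the classpath; the file name is a placeholder):

    import java.io.File;
    import org.apache.tika.Tika;

    // Tika detects the content type and extracts plain text from HTML,
    // PDF, Office documents, and many other formats
    // (parseToString throws IOException and TikaException)
    Tika tika = new Tika();
    String text = tika.parseToString(new File("downloaded-page.html"));
    System.out.println(text);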

There are now many Java-based HTML parsers that support visiting and parsing HTML pages.

Here's the complete list of HTML parsers with a basic comparison.

I recommend you use the HttpClient library. You can find examples here.
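A minimal GET with HttpClient might look like this (a sketch against the Apache HttpClient 4.x API; check the class names against the version you depend on):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    // try-with-resources closes both the client and the response
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response =
                 client.execute(new HttpGet("http://stackoverflow.com/"))) {
        // Read the response body into a String
        String body = EntityUtils.toString(response.getEntity());
        System.out.println(body);
    }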

I would prefer crawler4j. Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in a few hours.

Have a look at these existing projects if you want to learn how it can be done:

A typical crawler process is a loop consisting of fetching, parsing, link extraction, and processing of the output (storing, indexing). Though the devil is in the details, i.e. how to be "polite" and respect robots.txt, meta tags, redirects, rate limits, URL canonicalization, infinite depth, retries, revisits, etc.
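A bare-bones version of that loop might look like the sketch below (using jsoup for fetching and link extraction; it deliberately ignores all of the politeness concerns just listed, and the seed URL and page limit are arbitrary):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TinyCrawler {
        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
            Set<String> visited = new HashSet<>();       // URLs already processed
            frontier.add("https://example.com/");

            while (!frontier.isEmpty() && visited.size() < 100) {
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue; // already seen
                }
                try {
                    Document doc = Jsoup.connect(url).get();        // fetch + parse
                    System.out.println(url + " -> " + doc.title()); // "process" the page
                    for (Element link : doc.select("a[href]")) {    // link extraction
                        frontier.add(link.attr("abs:href"));
                    }
                } catch (Exception e) {
                    // skip pages that fail to fetch or parse
                }
            }
        }
    }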


Flow diagram courtesy of Norconex HTTP Collector.

You can explore Apache Droids or Apache Nutch to get a feel for Java-based crawlers.

Though mainly used for unit testing web applications, HttpUnit traverses a website, clicks links, analyzes tables and form elements, and gives you metadata about all the pages. I use it for web crawling, not just for unit testing. - http://httpunit.sourceforge.net/
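A rough sketch of what crawling with HttpUnit's conversation API looks like (method names as I recall them; verify against the HttpUnit documentation):

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebLink;
    import com.meterware.httpunit.WebResponse;

    // A WebConversation plays the role of a browser session
    // (getResponse throws IOException and SAXException)
    WebConversation wc = new WebConversation();
    WebResponse response = wc.getResponse("http://httpunit.sourceforge.net/");

    // Enumerate every link on the page, as a crawler would
    for (WebLink link : response.getLinks()) {
        System.out.println(link.getURLString());
    }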

I think jsoup is better than the others; jsoup runs on Java 1.5 and up, Scala, Android, OSGi, and Google App Engine.

Here is a list of available crawlers:

https://java-source.net/open-source/crawlers

But I suggest using Apache Nutch.

I will propose another solution that no one has mentioned. There is a library called Selenium; it is an open-source automated testing tool used for automating web applications for testing purposes, but it is certainly not limited to that. You can write a web crawler and benefit from this automation testing tool just as a human would.

As an illustration, I will provide a quick tutorial to give a better look at how it works. If you are bored reading this post, take a look at this video to understand the capabilities this library offers for crawling web pages.

Selenium Components

To begin with, Selenium consists of various components that coexist in a single process and perform their actions on behalf of the Java program. The main component is called WebDriver, and it must be included in your program in order for it to work properly.

Go to the following site and download the latest release for your computer's OS (Windows, Linux, or macOS). It is a ZIP archive containing chromedriver.exe. Save it on your computer and then extract it to a convenient location such as C:\WebDrivers\User\chromedriver.exe. We will use this location later in the Java program.

The next step is to include the jar library. Assuming you are using a Maven project to build the Java program, you need to add the following dependency to your pom.xml:

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.8.1</version>
</dependency>

Selenium WebDriver Setup

Let us get started with Selenium. The first step is to create a ChromeDriver instance:

System.setProperty("webdriver.chrome.driver", "C:\\WebDrivers\\User\\chromedriver.exe");
WebDriver driver = new ChromeDriver();

Now it is time to get deeper into the code. The following example shows a simple program that opens a web page and extracts some useful HTML components. It is easy to understand, as it has comments that explain the steps clearly. Please take a brief look to understand how to capture the objects.

      // Launch website
      driver.navigate().to("http://www.calculator.net/");

      //Maximize the browser
      driver.manage().window().maximize();

      // Click on Math Calculators
      driver.findElement(By.xpath(".//*[@id = 'menu']/div[3]/a")).click();

      // Click on Percent Calculators
      driver.findElement(By.xpath(".//*[@id = 'menu']/div[4]/div[3]/a")).click();

      // Enter value 10 in the first number of the percent Calculator
      driver.findElement(By.id("cpar1")).sendKeys("10");

      // Enter value 50 in the second number of the percent Calculator
      driver.findElement(By.id("cpar2")).sendKeys("50");

      // Click Calculate Button
      driver.findElement(By.xpath(".//*[@id = 'content']/table/tbody/tr[2]/td/input[2]")).click();


      // Get the Result Text based on its xpath
      String result =
         driver.findElement(By.xpath(".//*[@id = 'content']/p[2]/font/b")).getText();


      // Print the result to the screen
      System.out.println(" The Result is " + result);

Once you are done with your work, the browser window can be closed with:

driver.quit();

Selenium Browser Options

There is a great deal of functionality you can use when working with this library. For example, assuming you are using Chrome, you can add to your code:

ChromeOptions options = new ChromeOptions();

Take a look at how we can use WebDriver to open Chrome extensions using ChromeOptions:

options.addExtensions(new File("src\\test\\resources\\extensions\\extension.crx"));

This is for using Incognito mode:

options.addArguments("--incognito");

This one is for disabling JavaScript and info bars:

options.addArguments("--disable-infobars");
options.addArguments("--disable-javascript");

And this one if you want the browser to scrape silently, hiding the browser crawling in the background:

options.addArguments("--headless");

Once you are done with the options, create the driver with them:

WebDriver driver = new ChromeDriver(options);

To sum up, let's see what Selenium has to offer that makes it a unique choice compared with the other solutions proposed in this post so far:

  • Language and Framework Support
  • Open Source Availability
  • Multi-Browser Support
  • Support Across Various Operating Systems
  • Ease of Implementation
  • Reusability and Integrations
  • Parallel Test Execution and Faster Go-to-Market
  • Easy to Learn and Use
  • Constant Updates
