How to extract elements from an HTML page using HtmlUnit

I have two questions (problems, actually) about parsing an HTML page with HtmlUnit. I tried their "Getting Started" guide and searched Google, but found nothing that helped. Here is my first problem.

1) I want to extract the text of the following bold tag from the page:

<b class="productPrice">Five Dollars</b>

2) I want to extract the entire text (including any nested span or link text, if present) of the last paragraph in the following structure:

<div class="alertContainer">
<p>Hello</p>
<p>Haven't you registeret yet?</p>
<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>
</div>

Could you please give one-line code snippets showing how to do that? I am new to HtmlUnit.

EDIT:

HtmlUnit has getElementByName() and getElementById(), so what do we use if we want to select by class?

This will be the answer to my first question.

Actually, I'd suggest you use XPath and JTidy instead, like this:

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlItalic;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlRadioButtonInput;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextArea;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class WebScraper {

    private static final String TEXT = "some random text here";
    private static final String SWALLOW = "continental";
    private static final String COLOR = "indigo2";
    private static final String QUESTION = "why?";
    private static final String NAME = "Leo";

    /**
     * @param args
     * @throws IOException
     * @throws MalformedURLException
     * @throws FailingHttpStatusCodeException
     */
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
   
        //to get the HTML Xpath, download and install firefox plugin Xpather from
        //http://jassage.com/xpather-1.4.5b.xpi
        //
        //then right-click on any part of the html and choose "show in xpather"
        //
        //HtmlUnit is a suite for functional web app tests (headless) with a
        //built-in "browser". Very useful for screen scraping.
        //
        //for HtmlUnit examples and usage, try
        //http://htmlunit.sourceforge.net/gettingStarted.html
        //
        //sometimes, the HTML is malformed, so you'll need to "clean it"
        //that's why I've also added JTidy to this project
       
        WebClient webClient = new WebClient();
       
        HtmlPage page = webClient.getPage("http://cgi-lib.berkeley.edu/ex/simple-form.html");
       
//        System.out.println(page.asXml());
       
        HtmlForm form = (HtmlForm) page.getByXPath("/html/body/form").get(0);
       
        HtmlTextInput name = form.getInputByName("name");
        name.setValueAttribute(NAME);
       
        HtmlTextInput quest = form.getInputByName("quest");
        quest.setValueAttribute(QUESTION);
       
        HtmlSelect color = form.getOneHtmlElementByAttribute("select", "name", "color");
        List<HtmlOption> options = color.getOptions();
        for(HtmlOption op:options){
            if (op.getValueAttribute().equals(COLOR)){
                op.setSelected(true);
            }
        }
       
        HtmlTextArea text = form.getOneHtmlElementByAttribute("textarea", "name", "text");
        text.setText(TEXT);
       
        //swallow
        HtmlRadioButtonInput swallow = form.getInputByValue(SWALLOW);
        swallow.click();
       
        HtmlSubmitInput submit = form.getInputByValue("here");

        //submit
        HtmlPage page2 = submit.click();
       
//        System.out.println(page2.asXml());
       
        String color2 = ((HtmlItalic)page2.getByXPath("//dd[1]/i").get(0)).getTextContent();
        String name2 = ((HtmlItalic)page2.getByXPath("//dd[2]/i").get(0)).getTextContent();
        String quest2 = ((HtmlItalic)page2.getByXPath("//dd[3]/i").get(0)).getTextContent();
        String swallow2 = ((HtmlItalic)page2.getByXPath("//dd[4]/i").get(0)).getTextContent();
        String text2 = ((HtmlItalic)page2.getByXPath("//dd[5]/i").get(0)).getTextContent();

        System.out.println(COLOR.equals(color2)
                && NAME.equals(name2)
                && QUESTION.equals(quest2)
                && SWALLOW.equals(swallow2)
                && TEXT.equals(text2));
       
        webClient.closeAllWindows();

    }

}
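
Applied to the two fragments in the question, selecting by class comes down to an XPath predicate on the class attribute. The sketch below is only an illustration: it uses the JDK's built-in javax.xml.xpath engine instead of HtmlUnit so that it runs without extra dependencies, and it wraps the question's fragments in a hypothetical <body> root so they parse as XML. The class name ClassXPathSketch and the helper textOf are made up for this example. The same expression strings should be usable with HtmlUnit's page.getByXPath(...) or page.getFirstByXPath(...), followed by getTextContent() as in the code above, though that part is not exercised here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ClassXPathSketch {

    // The two fragments from the question, wrapped in a root element
    // (an assumption of this sketch) so that they form one XML document.
    static final String HTML =
        "<body>"
        + "<b class=\"productPrice\">Five Dollars</b>"
        + "<div class=\"alertContainer\">"
        + "<p>Hello</p>"
        + "<p>Haven't you registeret yet?</p>"
        + "<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>"
        + "</div>"
        + "</body>";

    // 1) The <b> element whose class attribute is exactly "productPrice".
    static final String PRICE_XPATH = "//b[@class='productPrice']";

    // Variant for elements that carry several classes, e.g. class="big productPrice".
    static final String PRICE_XPATH_MULTI =
        "//b[contains(concat(' ', normalize-space(@class), ' '), ' productPrice ')]";

    // 2) The last <p> inside the alertContainer div. Its string value
    //    includes the text of nested elements such as <span> or <a>.
    static final String LAST_P_XPATH = "//div[@class='alertContainer']/p[last()]";

    // Evaluates the expression and returns the string value of the first matching node.
    static String textOf(String xpathExpr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpr, new InputSource(new StringReader(HTML)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(textOf(PRICE_XPATH));       // prints "Five Dollars"
        System.out.println(textOf(PRICE_XPATH_MULTI)); // prints "Five Dollars"
        System.out.println(textOf(LAST_P_XPATH));      // prints "Registrations will close on 3 July 2012.So don't wait"
    }
}
```

Note that @class='productPrice' only matches when the attribute is exactly that string; the contains(concat(...)) form is the usual XPath 1.0 workaround when an element may have more than one class.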
