
Extracting data from HTML to Java objects

I have a message log from a messaging application stored as HTML. In the file, a single message is presented in the following way:

<div class="message">
  <div class="message_header">
    <span class="user">User Name</span>
    <span class="meta">10 february 2018 at 16:17 UTC+01</span>
  </div>
  <p>Message content</p>
</div>

The messages are not nicely arranged in the file: there may be multiple messages per line, and sometimes a line ends in the middle of a message.

I'd like to create an instance of class Message with fields like userName, date and messageContent for each item in the file. Is there an elegant way to do this?

I was planning to iterate over the file, split each line wherever a new message starts, and then extract the data from the resulting strings, but I'd rather avoid that if there's a less tedious way.

My answer probably won't be useful to the author of this question (I am 5 months late, so the timing is not right, I guess), but I think it will be useful for many other developers who come across this answer.

Today, I just released (in the name of my company) a complete HTML to POJO framework that you can use to map HTML to any POJO class with just a few annotations. The library itself is quite handy and offers many other features while remaining very pluggable. You can have a look at it right here: https://github.com/whimtrip/jwht-htmltopojo

How to use: Basics

Imagine we need to parse the following HTML page:

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to:

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector( 
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector( 
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class as well:

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provide some more explanations of the above code on our GitHub page.
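To make the format and indexForRegexPattern attributes more concrete, here is a rough sketch of the same extraction done by hand with plain java.util.regex. It is not the library's internal code, only an illustration of what selecting a regex group and then stripping the commas amounts to. The class name RegexGroupDemo and the method extractNumber are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: mimics by hand what the annotations above describe.
final class RegexGroupDemo {

    // Same pattern as the "format" attribute, with Java string escaping.
    private static final Pattern RANKING = Pattern.compile(
        "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$");

    /** Returns the requested capture group as a number, with "," separators removed. */
    static long extractNumber(String text, int group) {
        Matcher matcher = RANKING.matcher(text.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected text: " + text);
        }
        // Equivalent in spirit to @ReplaceWith(value = ",", with = "")
        return Long.parseLong(matcher.group(group).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Restaurant n*18,190. Ranked 113 out of 1,550 restaurants";
        System.out.println(extractNumber(text, 1)); // group 1: the restaurant id
        System.out.println(extractNumber(text, 2)); // group 2: the rank
    }
}
```

Group 1 corresponds to the id field and group 2 to the rank field in the Restaurant class.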

For the moment, let's see how to scrape this.

// JDK imports needed by this snippet (library imports omitted)
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

private static final String MY_HTML_FILE = "my-html-file.html";

// getHtmlBody() can throw IOException, so main has to declare it
public static void main(String[] args) throws IOException {

    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If there were several restaurants in the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}


// Reads the whole HTML file into a String, decoded as UTF-8
private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, StandardCharsets.UTF_8);
}

Another short example can be found here

Hope this will help someone out there!

You can treat the HTML as XML and use the DOM package for Java (https://docs.oracle.com/javase/tutorial/jaxp/dom/readingXML.html), or you can use JAXB for unmarshalling.
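A minimal sketch of the DOM approach, applied to the message markup from the question. It assumes the log is well-formed XML, meaning every tag is closed and there are no HTML-only entities such as &nbsp;. Real-world HTML often breaks that assumption, in which case the parser will throw. The names MessageLogParser, Message and parse are invented for this example, and the date is kept as a plain String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads <div class="message"> blocks with the JDK's DOM parser.
final class MessageLogParser {

    // Hypothetical holder for the three fields named in the question.
    static final class Message {
        final String userName;
        final String date;
        final String messageContent;

        Message(String userName, String date, String messageContent) {
            this.userName = userName;
            this.date = date;
            this.messageContent = messageContent;
        }
    }

    static List<Message> parse(String html) {
        // A synthetic root element lets several sibling messages form one XML document.
        String xml = "<log>" + html + "</log>";
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<Message> messages = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"message".equals(div.getAttribute("class"))) {
                continue; // skips message_header and any other div
            }
            String user = null;
            String date = null;
            NodeList spans = div.getElementsByTagName("span");
            for (int j = 0; j < spans.getLength(); j++) {
                Element span = (Element) spans.item(j);
                String cssClass = span.getAttribute("class");
                if ("user".equals(cssClass)) {
                    user = span.getTextContent().trim();
                } else if ("meta".equals(cssClass)) {
                    date = span.getTextContent().trim();
                }
            }
            NodeList paragraphs = div.getElementsByTagName("p");
            String content = paragraphs.getLength() > 0
                    ? paragraphs.item(0).getTextContent().trim()
                    : "";
            messages.add(new Message(user, date, content));
        }
        return messages;
    }

    public static void main(String[] args) {
        String html = "<div class=\"message\"><div class=\"message_header\">"
                + "<span class=\"user\">User Name</span>\n"
                + "<span class=\"meta\">10 february 2018 at 16:17 UTC+01</span></div>"
                + "<p>Message content</p></div>";
        for (Message m : parse(html)) {
            System.out.println(m.userName + " | " + m.date + " | " + m.messageContent);
        }
    }
}
```

Because the parser works on the element tree rather than on lines, it does not matter that several messages share a line or that a line ends in the middle of a message.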
