
Parsing an HTML file using Jsoup and mapping it to key-value pairs in Java

I have parsed an HTML file using Jsoup and I'm trying to get key-value pairs out of it.

This is the HTML file. The keys are in the dt elements (class "dt dlterm") and the values are in the dd elements:

<div class="section" id="GUID-1BF02E47-1ECC-4CCF-A903-2A8621DB5FBA__GUID-20A253C1-02AD-4413-9570-C0178C01E616">
  <div class="p">
    <dl class="dl">
      <dt class="dt dlterm"><a name="GUID-1BF02E47-1ECC-4CCF-A903-2A8621DB5FBA__GUID-942CC4F1-90F8-4B83-9647-A3D086063B0C"><!----></a>Incident</dt>
      <dd class="dd">detials of one</dd>
      <dt class="dt dlterm"><a name="GUID-1BF02E47-1ECC-4CCF-A903-2A8621DB5FBA__GUID-0F5CFEC5-6714-4000-A733-79DDB49B4C63"><!----></a>Risk</dt>
      <dd class="dd">details of it two</dd>
      <dt class="dt dlterm"><a name="GUID-1BF02E47-1ECC-4CCF-A903-2A8621DB5FBA__GUID-C731C50A-947F-431B-9CEE-1FFD1BA40EEA"><!----></a>Event</dt>
      <dd class="dd">detials of it three.</dd>
    </dl>
  </div>
</div>

This is what I tried:

static Map<Object, Object> maps;

public static Map<Object, Object> getSet(Document doc) {
    maps = new HashMap<Object, Object>();
    String key ="";
    String value = "";
    Elements elemname1 = doc.getElementsByClass("dt dlterm");
    Elements elemname2 = doc.getElementsByClass("dd");

    List<Object> keys = new ArrayList<Object>();
    List<Object> values = new ArrayList<Object>();
    for (Element i : elemname1) {
        key = i.ownText();
        keys.add(key);
    }
    for(Element j : elemname2) {
        value = j.ownText();
        values.add(value);
    }
    System.out.println(maps);
    return maps;
}

public static void main (String args[]) throws Exception {
    String filePath ="someFilePath.html";
    File input = new File(filePath);
    Document doc = Jsoup.parse(input, "UTF-8", "");
    getSet(doc);
}

The expected result looks like this:

{ 
    Event = detials of one,
    Incident = detials of two,
    Risk = detials of three 
}

What I'm getting is:

{[Incident, Risk, Event] = [detials of one,detials of two,detials of three]}

You can put the results into the map while collecting them, using a single loop. Replace both for loops with this one:

for (int i = 0; i < elemname1.size(); i++) {
    key = elemname1.get(i).ownText();
    value = elemname2.get(i).ownText();
    maps.put(key, value);
}

Output:

{Risk=details of it two, Event=detials of it three., Incident=detials of one}

You can just use this:

Document document = Jsoup.parse(html);

Elements dts = document.getElementsByClass("dt dlterm");
Elements dds = document.getElementsByClass("dd");

if (dts.size() != dds.size()) {
    // ensure same sizes of both lists
}

HashMap<String, String> values = new HashMap<>();
for (int i = 0; i < dts.size(); i++) {
    values.put(dts.get(i).text(), dds.get(i).text());
}
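
The size check above is left as a placeholder. One way to fill it in is sketched below, as an assumption rather than the answer's actual code: the two plain lists stand in for the text extracted from dts and dds, and pairUp is a hypothetical helper name. It fails fast when the lists differ in length, because the index-based pairing would otherwise go out of bounds or silently drop entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PairUpSketch {

    // Hypothetical helper: pairs keys.get(i) with values.get(i).
    // The lists stand in for the dt and dd texts collected with Jsoup.
    static Map<String, String> pairUp(List<String> keys, List<String> values) {
        if (keys.size() != values.size()) {
            // Without this check, a missing dd would shift or lose pairs.
            throw new IllegalArgumentException(
                    "Expected one value per key, got " + keys.size()
                            + " keys and " + values.size() + " values");
        }
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            result.put(keys.get(i), values.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Incident", "Risk", "Event");
        List<String> values = Arrays.asList(
                "detials of one", "details of it two", "detials of it three.");
        // HashMap does not guarantee any particular iteration order.
        System.out.println(pairUp(keys, values));
    }
}
```

Whether to throw or to fall back to the shorter list is a design choice; throwing makes malformed input visible instead of producing a partially filled map.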

Or in just one statement using Java Streams:

Map<String, String> values = IntStream.range(0, Math.min(dts.size(), dds.size())).boxed()
        .collect(Collectors.toMap(i -> dts.get(i).text(),i -> dds.get(i).text()));

The result will be this:

{Risk=details of it two, Event=detials of it three., Incident=detials of one}

If you want the order in the map to match the order in the HTML, use a LinkedHashMap instead of a HashMap.
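
For the loop version, switching to a LinkedHashMap only means changing the constructor call. For the stream version, the two-argument Collectors.toMap makes no guarantee about the type of map it returns and throws an IllegalStateException on a duplicate key, so the four-argument overload is needed. The following is a minimal sketch under assumptions: the plain lists stand in for the dt and dd texts, toOrderedMap is a hypothetical helper name, and keeping the first value for a repeated key is just one possible policy.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OrderedPairsSketch {

    // Hypothetical helper: pairs entries by index and keeps document order.
    static Map<String, String> toOrderedMap(List<String> keys, List<String> values) {
        return IntStream.range(0, Math.min(keys.size(), values.size()))
                .boxed()
                .collect(Collectors.toMap(
                        i -> keys.get(i),
                        i -> values.get(i),
                        (first, second) -> first, // on a repeated key, keep the first value
                        LinkedHashMap::new));     // preserve insertion order
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Incident", "Risk", "Event");
        List<String> values = Arrays.asList(
                "detials of one", "details of it two", "detials of it three.");
        // prints {Incident=detials of one, Risk=details of it two, Event=detials of it three.}
        System.out.println(toOrderedMap(keys, values));
    }
}
```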
