
Regex to parse HTML source in Jsoup

I am trying to fetch values from a web page's HTML source. These are the Jsoup selectors I have:

e=d.select("li[id=result_48]");
e=d.select("div[id=result_48]");

These are the HTML tags they should match:

<li id="result_48" data-asin="0781774047" class="s-result-item">
<div id="result_48" data-asin="0781774047" class="s-result-item">

Whatever tag appears in place of "li" or "div", I want to get the value inside the id. So I want to use a regex in place of "li" or "div".

In other words, the Jsoup selector should check for id=result_48, and whenever an element matches, I want its data. How can I do that?

Thanks in advance.

Tested with the attributes in different orders. I might have missed some cases, so test with your actual data. This assumes that there are no spaces or quotes in the id attribute.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractIdDemo {
    public static void main(String[] args) throws Exception {
        // The same tag with the id attribute in different positions
        String[] lines = {
                "<li id=\"result_48\" data-asin=\"0781774047\" class=\"s-result-item\">",
                "<div id=\"result_48\" data-asin=\"0781774047\" class=\"s-result-item\">",
                "<div data-asin=\"0781774047\" id=\"result_48\" class=\"s-result-item\">",
                "<div data-asin=\"0781774047\" class=\"s-result-item\" id=\"result_48\">" };
        for (String str : lines) {
            System.out.println(extractId(str));
        }
    }

    private static String extractId(String line) {
        String regex = "";
        regex = regex + "(?:[<](?:li|div)).*id=\""; // match from "<li" or "<div" up to id="
        regex = regex + "([^\\s\"]+)"; // capture the id value (no whitespace
                                       // or quotes allowed)
        regex = regex + "(?:.*\">)"; // match the rest of the tag, which must end in ">
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(line);
        // matches() requires the whole line to fit the pattern
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
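
The question asks for "whatever comes in place of li or div", while the pattern above only accepts those two tag names. As a rough sketch of how the same idea might be loosened to any tag name, the variant below captures the tag name instead of hard-coding it. The class name AnyTagIdExtractor and the pattern are illustrative assumptions, not part of the original answer. It also assumes double-quoted id values without spaces and no ">" inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate matching an arbitrary tag name.
class AnyTagIdExtractor {
    // Group 1: tag name. Group 2: value of the id attribute.
    // "\\sid=" requires whitespace before "id", so "data-id" is not mistaken for "id".
    private static final Pattern OPEN_TAG_WITH_ID = Pattern.compile(
            "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*?\\sid=\"([^\\s\"]+)\"[^>]*>");

    /** Returns the id of the first opening tag that carries one, or null if none does. */
    static String extractId(String html) {
        Matcher m = OPEN_TAG_WITH_ID.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId(
                "<li id=\"result_48\" data-asin=\"0781774047\" class=\"s-result-item\">"));
        System.out.println(extractId(
                "<span data-asin=\"0781774047\" class=\"s-result-item\" id=\"result_48\">"));
        System.out.println(extractId(
                "<div data-id=\"other\" class=\"s-result-item\">"));
    }
}
```

Unlike the answer's version, this sketch uses find() rather than matches(), so the tag may sit anywhere inside a longer string. It is still a pattern match on text, not an HTML parse, so test it against your actual data as well.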
