
Get all attributes in an HTML string with jsoup

I have a string in HTML format and I am trying to get all of its attributes and their values using jsoup.

The string is:

String string =
    "<button class=submit btn primary-btn flex-table-btn js-submit type=submit>Sign in</button>";

Document doc = Jsoup.parse(string);
try {
    // Walk every element in the parsed document and print each attribute it carries.
    for (org.jsoup.nodes.Element element : doc.getAllElements()) {
        for (Attribute attribute : element.attributes()) {
            System.out.println(attribute.getKey() + " --::-- " + attribute.getValue());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

My desired output is:

key: **class** , Value is: **submit btn primary-btn flex-table-btn js-submit**

key: **type** , Value is: **submit**

But what I get is this:

key: class , Value is: submit

key: btn , Value is:

key: primary-btn , Value is:

key: flex-table-btn , Value is:

key: js-submit , Value is:

key: type , Value is: submit

This is because of the quotes. If I use

String string= 
"<button class='submit btn primary-btn flex-table-btn js-submit' type='submit'>       Sign in</button>";

I get my desired output. But I am trying to get it without quotes.

You can't do it without the quotes, because quotes are not optional when an attribute value contains spaces. Without them, the HTML you've quoted describes an element with one class (submit) followed by a series of separate, non-standard attributes with empty values, named btn, primary-btn, flex-table-btn, and js-submit, and that's how any browser will interpret it, just as jsoup is doing. If those are meant to be additional classes on the element, quotes are required.
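To make the splitting concrete, here is a minimal sketch of how an attribute list is tokenized when values are unquoted. It is a deliberately simplified, hypothetical tokenizer written for illustration (the class name `UnquotedAttributeDemo` and method `parseAttributes` are my own), not jsoup's or any browser's actual parser. It assumes no spaces around `=` and does not decode character references.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UnquotedAttributeDemo {

    // Simplified illustration only: handles name, name=value, name='value'
    // and name="value". An unquoted value ends at the first whitespace.
    static Map<String, String> parseAttributes(String attrText) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int i = 0;
        int n = attrText.length();
        while (i < n) {
            // Skip whitespace between attributes.
            while (i < n && Character.isWhitespace(attrText.charAt(i))) i++;
            if (i >= n) break;

            // Read the attribute name up to whitespace or '='.
            int start = i;
            while (i < n && !Character.isWhitespace(attrText.charAt(i))
                    && attrText.charAt(i) != '=') i++;
            String name = attrText.substring(start, i);

            String value = "";
            if (i < n && attrText.charAt(i) == '=') {
                i++; // consume '='
                if (i < n && (attrText.charAt(i) == '"' || attrText.charAt(i) == '\'')) {
                    // Quoted value: everything up to the matching quote, spaces included.
                    char quote = attrText.charAt(i++);
                    start = i;
                    while (i < n && attrText.charAt(i) != quote) i++;
                    value = attrText.substring(start, i);
                    if (i < n) i++; // consume closing quote
                } else {
                    // Unquoted value: stops at the first whitespace character.
                    start = i;
                    while (i < n && !Character.isWhitespace(attrText.charAt(i))) i++;
                    value = attrText.substring(start, i);
                }
            }
            // Keep the first occurrence if a name repeats.
            attrs.putIfAbsent(name, value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Prints {class=submit, btn=, primary-btn=, flex-table-btn=, js-submit=, type=submit}
        System.out.println(parseAttributes(
            "class=submit btn primary-btn flex-table-btn js-submit type=submit"));
        // Prints {class=submit btn primary-btn flex-table-btn js-submit, type=submit}
        System.out.println(parseAttributes(
            "class='submit btn primary-btn flex-table-btn js-submit' type='submit'"));
    }
}
```

The unquoted input yields six attributes, five of them with empty values, while the quoted input yields the two attributes the question is after.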

From the specification:

Unquoted attribute value syntax

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, **must not contain any literal space characters**, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), "=" (U+003D) characters, "<" (U+003C) characters, ">" (U+003E) characters, or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.

Note the "must not contain any literal space characters" part, which I've emphasized.
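The rule quoted above can be turned into a small check that tells you whether a given value may be written without quotes. This is only a sketch: the class and method names are hypothetical, it treats "space characters" as space, tab, line feed, form feed, and carriage return, and it does not check the "requirements given above for attribute values" that the passage refers to.

```java
class UnquotedValueCheck {

    // Characters the quoted passage forbids in an unquoted value:
    // space characters, then " ' = < > and the grave accent.
    private static final String FORBIDDEN = " \t\n\f\r\"'=<>`";

    // Returns true if the value satisfies the restrictions listed in the
    // quoted passage, i.e. it is non-empty and has no forbidden character.
    static boolean canBeUnquoted(String value) {
        if (value.isEmpty()) {
            return false; // the empty string must be quoted
        }
        for (int i = 0; i < value.length(); i++) {
            if (FORBIDDEN.indexOf(value.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A single token is fine without quotes.
        System.out.println(canBeUnquoted("submit")); // prints true
        // A space-separated class list is not.
        System.out.println(canBeUnquoted("submit btn primary-btn")); // prints false
    }
}
```

For the button in the question, `type=submit` passes the check, while the multi-class `class` value fails it and therefore needs quotes.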

It's simple with jsoup:

// Needs org.jsoup.Jsoup, org.jsoup.nodes.{Document, Element, Attribute}
// and java.util.{ArrayList, HashSet, List}.
// HTML is assumed to be a String that holds the markup to parse.
Document doc = Jsoup.parse(HTML);
List<String> tags = new ArrayList<String>(); // every tag name, in document order

for (Element e : doc.getAllElements()) { // all elements in the parsed document
    tags.add(e.tagName().toLowerCase());
    // System.out.println("Tag: " + e.tag() + " attributes = " + e.attributes());          // attributes as one string
    // System.out.println("Tag: " + e.tag() + " attributes = " + e.attributes().asList()); // attributes as List<Attribute>

    // Print each attribute of this element as a key/value pair.
    for (Attribute att : e.attributes().asList()) {
        System.out.println("Key: " + att.getKey() + " , Value: " + att.getValue());
    }
}

System.out.println("*****************");
System.out.println("All Tags = " + tags);
System.out.println("Distinct Tags = " + new HashSet<String>(tags));
