
Parsing HTML meta tags with the jsoup library

I just started exploring the jsoup library, as I will use it for one of my projects. I tried googling but could not find an answer that solves my exact problem. Here it is: I have an HTML file with meta tags like the ones below.

<meta content="this is the title value" name="d.title">
<meta content="this is the description value" name="d.description">
<meta content="language is french" name="d.language">

And a Java POJO like so:

public class Example {
    private String title;
    private String description;
    private String language;

    public Example() {}

    // setters and getters go here
} 

Now I want to parse the HTML file, extract the "content" value of d.title and store it in Example.title, extract the "content" value of d.description and store it in Example.description, and so on.

What I have done so far, after reading the jsoup cookbook, is something like this:

Document doc = Jsoup.parse("test.html");
Elements metaTags = doc.getElementsByTag("meta");

for (Element metaTag : metaTags) {
    String content = metaTag.attr("content");
    String content = metaTag.attr("name");
}

That walks through all meta tags and gets the values of their "content" and "name" attributes. What I want is the value of the "content" attribute of the tag whose "name" attribute is, say, "d.title", so that I can store it in Example.title.

Update: @PJMeisch's answer below actually solves the problem, but it is too much code for my liking (I was trying to avoid doing exactly that). I was thinking it might be possible to do something like

String title = metaTags.getContent("d.title")

where d.title is the value of the "name" attribute. That would reduce the lines of code. I have not found such a method, but maybe that is because I am still new to jsoup, which is why I asked. If such a method does not exist (it would be nice if it did, because it makes life easier), I will just go with what PJMeisch said.

OK, here is all the code:

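// Jsoup.parse(String) treats its argument as HTML markup, not as a file name,
// so the file is loaded through the File overload (which throws IOException).
// UTF-8 is an assumed encoding; java.io.File must be imported.
Document doc = Jsoup.parse(new File("test.html"), "UTF-8");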
Elements metaTags = doc.getElementsByTag("meta");

Example ex = new Example();

for (Element metaTag : metaTags) {
  String content = metaTag.attr("content");
  String name = metaTag.attr("name");

  if("d.title".equals(name)) {
    ex.setTitle(content);
  }
  if("d.description".equals(name)) {
    ex.setDescription(content);
  }
  if("d.language".equals(name)) {
    ex.setLanguage(content);
  }
}

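To answer your updated question: jsoup has no method like getContent("d.title"), as the jsoup document just reflects the XML/DOM structure of the HTML document. One way to get close is to iterate over the meta tags yourself and collect them into a map, something like this:

// Jsoup.parse(String) treats its argument as HTML markup, not as a file name,
// so the file is loaded through the File overload (which throws IOException).
// UTF-8 is an assumed encoding; java.io.File must be imported.
Document doc = Jsoup.parse(new File("test.html"), "UTF-8");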

Map<String, String> metas = new HashMap<>();
Elements metaTags = doc.getElementsByTag("meta");

for (Element metaTag : metaTags) {
  String content = metaTag.attr("content");
  String name = metaTag.attr("name");
  metas.put(name, content);
}

Example ex = new Example();
ex.setTitle(metas.get("d.title"));
ex.setDescription(metas.get("d.description"));
ex.setLanguage(metas.get("d.language"));
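
Although Elements has no getContent method, jsoup's CSS selector support may get close to the one-liner asked for in the question, since an attribute selector can match a meta tag by its "name" attribute. The following is only a minimal sketch: the class name MetaSelectSketch and the helper metaContent are hypothetical names made up for illustration, and the HTML is inlined as a string so that the example does not depend on a file being present.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class MetaSelectSketch {

    // Hypothetical helper (not part of jsoup): returns the "content" value of the
    // first <meta> tag whose "name" attribute equals the given name.
    // Elements.attr returns an empty string when nothing matches.
    // Assumes the name contains no quote or bracket characters.
    static String metaContent(Document doc, String name) {
        return doc.select("meta[name=\"" + name + "\"]").attr("content");
    }

    public static void main(String[] args) {
        // Same meta tags as in the question, inlined for the sketch.
        String html = "<html><head>"
                + "<meta content=\"this is the title value\" name=\"d.title\">"
                + "<meta content=\"this is the description value\" name=\"d.description\">"
                + "<meta content=\"language is french\" name=\"d.language\">"
                + "</head><body></body></html>";

        // Here the argument really is markup, so Jsoup.parse(String) is the right overload.
        Document doc = Jsoup.parse(html);

        System.out.println(metaContent(doc, "d.title"));
        System.out.println(metaContent(doc, "d.description"));
        System.out.println(metaContent(doc, "d.language"));
    }
}
```

With a helper like this, filling the POJO would come down to calls such as ex.setTitle(metaContent(doc, "d.title")). The map-based loop may still be preferable when many meta tags are needed, since it walks the document only once.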

You assign the values of both attributes to the same variable named content. Assign the name attribute to a variable called name and compare that to your desired value of 'd.title'.

Using a library built specifically for this task simplifies everything.

Check out my library for parsing meta tag content

poshjosh/bcmetaselector

package com.bc.meta.selector;

import com.bc.meta.selector.htmlparser.AttributeContextHtmlparser;
import com.bc.meta.selector.util.SampleConfigPaths;
import com.bc.meta.ArticleMetaNames;
import com.bc.meta.impl.ArticleMetaNameIsMultiValue;
import java.util.Map;
import java.util.Iterator;
import java.util.function.BiFunction;
import org.json.simple.JSONValue;
import org.htmlparser.Parser;
import org.htmlparser.Node;
import org.htmlparser.Tag;

public class ReadMe {

    public static void main(String... args) throws Exception {

        final BiFunction<String, Node, String> nodeContentExtractor =
                (prop, node) -> node instanceof Tag ? ((Tag)node).getAttributeValue("content") : null;

        final SelectorBuilder<Node, String, Object> builder = Selector.builder();

        final Selector<Node> selector = builder.filter()
                .attributeContext(new AttributeContextHtmlparser(false))
                .configFilePaths(SampleConfigPaths.APP_ARTICLE_LIST)
                .jsonParser((reader) -> (Map)JSONValue.parse(reader))
                .propertyNames(ArticleMetaNames.values())
                .back()
                .multipleValueTest(new ArticleMetaNameIsMultiValue())
                .nodeConverter(nodeContentExtractor)
                .build();

        final Parser parser = new Parser();

        final String url = "https://edition.cnn.com/2018/06/21/africa/noura-hussein-asequals-intl/index.html";

        parser.setURL(url);

        Iterator<Node> nodes = parser.elements().iterator();

        final Map map = selector.selectAsMap(nodes, ArticleMetaNames.values());

        System.out.println("Printing meta tags data for: " + url + "\n" + map);
    }
}
