
Parsing HTML meta tags with the jsoup library

I just started exploring the jsoup library, as I will use it for one of my projects. I tried googling but could not find an exact answer that helps. Here is my problem: I have an HTML file with meta tags like the ones below.

<meta content="this is the title value" name="d.title">
<meta content="this is the description value" name="d.description">
<meta content="language is french" name="d.language">

And a Java POJO like so:

public class Example {
    private String title;
    private String description;
    private String language;

    public Example() {}

    // setters and getters go here
} 

Now I want to parse the HTML file, extract the "content" value of the d.title meta tag and store it in Example.title, extract the "content" value of d.description and store it in Example.description, and so on.

What I have done after reading the jsoup cookbook is something like:

Document doc = Jsoup.parse("test.html");
Elements metaTags = doc.getElementsByTag("meta");

for (Element metaTag : metaTags) {
    String content = metaTag.attr("content");
    String content = metaTag.attr("name");
}

That walks through all the meta tags and gets the values of their "content" and "name" attributes. What I want is the value of the "content" attribute of the tag whose "name" attribute is, say, "d.title", so that I can store it in Example.title.

Update: @PJMeisch's answer below actually solves the problem, but that is too much code for my liking (I was trying to avoid doing exactly that). I was thinking it might be possible to do something like

String title = metaTags.getContent("d.title");

where d.title is the value of the "name" attribute. That would reduce the lines of code. I have not found such a method, but maybe that is because I am still new to jsoup, which is why I asked. If such a method does not exist (it would be nice if it did, because it would make life easier), I will just go with what PJMeisch said.

OK, all the code:

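// Jsoup.parse(String) treats its argument as HTML markup, not as a file name,
// so load the file explicitly (needs java.io.File, throws IOException; UTF-8 is assumed).
Document doc = Jsoup.parse(new File("test.html"), "UTF-8");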
Elements metaTags = doc.getElementsByTag("meta");

Example ex = new Example();

for (Element metaTag : metaTags) {
  String content = metaTag.attr("content");
  String name = metaTag.attr("name");

  if("d.title".equals(name)) {
    ex.setTitle(content);
  }
  if("d.description".equals(name)) {
    ex.setDescription(content);
  }
  if("d.language".equals(name)) {
    ex.setLanguage(content);
  }
}

To answer your updated question: jsoup's Elements class has no getContent(name) method like that, as the jsoup document just reflects the XML/DOM structure of the HTML document. You can iterate over the metas yourself and do something like this:

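// Load the file itself; Jsoup.parse("test.html") would parse that literal string as HTML.
// Needs java.io.File and throws IOException; UTF-8 is assumed.
Document doc = Jsoup.parse(new File("test.html"), "UTF-8");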

Map<String, String> metas = new HashMap<>();
Elements metaTags = doc.getElementsByTag("meta");

for (Element metaTag : metaTags) {
  String content = metaTag.attr("content");
  String name = metaTag.attr("name");
  metas.put(name, content);
}

Example ex = new Example();
ex.setTitle(metas.get("d.title"));
ex.setDescription(metas.get("d.description"));
ex.setLanguage(metas.get("d.language"));
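For what it's worth, jsoup's CSS selector support can get close to the one-liner asked about in the update. The sketch below is only an illustration under a few assumptions: metaContent is a hypothetical helper (not a jsoup method), the class name MetaSelectorSketch is made up, and the HTML is inlined instead of being read from test.html so that the example is self-contained. It also assumes the meta names contain no characters such as "]" that would break the attribute selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MetaSelectorSketch {

    // Hypothetical helper (not part of jsoup): returns the "content" attribute of the
    // first <meta> tag whose "name" attribute equals the given name, or "" if none matches.
    static String metaContent(Document doc, String name) {
        // "meta[name=d.title]" is an attribute-value selector; Elements.attr reads the
        // attribute from the first matched element that has it.
        return doc.select("meta[name=" + name + "]").attr("content");
    }

    public static void main(String[] args) {
        // Inline HTML stands in for test.html (an assumption made for this sketch).
        String html = "<html><head>"
                + "<meta content=\"this is the title value\" name=\"d.title\">"
                + "<meta content=\"this is the description value\" name=\"d.description\">"
                + "<meta content=\"language is french\" name=\"d.language\">"
                + "</head><body></body></html>";

        // Here the argument really is HTML markup, so Jsoup.parse(String) is appropriate.
        Document doc = Jsoup.parse(html);

        System.out.println(metaContent(doc, "d.title"));
        System.out.println(metaContent(doc, "d.description"));
        System.out.println(metaContent(doc, "d.language"));
    }
}
```

With a helper like this, filling the POJO becomes one call per field, for example ex.setTitle(metaContent(doc, "d.title")). The trade-off is that each call runs its own selector query over the document, whereas the map approach above walks the meta tags only once.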

You assign the values of both attributes to the same variable named content. Assign the name attribute to a variable called name and compare that to your desired value of 'd.title'.

Using the right specialized library simplifies everything.

Check out my library for parsing meta tag content:

poshjosh/bcmetaselector

package com.bc.meta.selector;

import com.bc.meta.selector.htmlparser.AttributeContextHtmlparser;
import com.bc.meta.selector.util.SampleConfigPaths;
import com.bc.meta.ArticleMetaNames;
import com.bc.meta.impl.ArticleMetaNameIsMultiValue;
import java.util.Map;
import java.util.Iterator;
import java.util.function.BiFunction;
import org.json.simple.JSONValue;
import org.htmlparser.Parser;
import org.htmlparser.Node;
import org.htmlparser.Tag;

public class ReadMe {

    public static void main(String... args) throws Exception {

        final BiFunction<String, Node, String> nodeContentExtractor =
                (prop, node) -> node instanceof Tag ? ((Tag)node).getAttributeValue("content") : null;

        final SelectorBuilder<Node, String, Object> builder = Selector.builder();

        final Selector<Node> selector = builder.filter()
                .attributeContext(new AttributeContextHtmlparser(false))
                .configFilePaths(SampleConfigPaths.APP_ARTICLE_LIST)
                .jsonParser((reader) -> (Map)JSONValue.parse(reader))
                .propertyNames(ArticleMetaNames.values())
                .back()
                .multipleValueTest(new ArticleMetaNameIsMultiValue())
                .nodeConverter(nodeContentExtractor)
                .build();

        final Parser parser = new Parser();

        final String url = "https://edition.cnn.com/2018/06/21/africa/noura-hussein-asequals-intl/index.html";

        parser.setURL(url);

        Iterator<Node> nodes = parser.elements().iterator();

        final Map map = selector.selectAsMap(nodes, ArticleMetaNames.values());

        System.out.println("Printing meta tags data for: " + url + "\n" + map);
    }
}
