
Trying to get attribute value with Apache Tika and XPath

I have tried many different XPath expressions and don't understand why I can't retrieve what I want with Apache Tika. I want to retrieve the href attribute value of links on arbitrary webpages. I managed to work out how to extract the content inside the tags, but trying to get the attribute values always returns an empty result. What am I doing wrong? My code is below. Thanks a lot.

XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher anchorLinkContentMatcher = xhtmlParser.parse("//xhtml:a/@xhtml:href/text()");
ContentHandler handler = new MatchingContentHandler(
    new ToHTMLContentHandler(), anchorLinkContentMatcher);
HtmlParser parser = new HtmlParser();
ParseContext pcontext = new ParseContext();

try {
    parser.parse(urlContentStream, handler, new Metadata(), pcontext);
    System.out.println(handler);
} catch (Exception e) {....}

I have tried these different XPaths:

//xhtml:a/@xhtml:href
//xhtml:a/@href/text()
//xhtml:a/@href
//@xhtml:href/text()

You were almost there. You will need:

//xhtml:a/@href
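
One likely reason the `/text()` variants come back empty: in XPath, an attribute node has no child nodes, so `@href/text()` selects nothing, while `@href` selects the attribute itself. The sketch below illustrates this with the JDK's built-in `javax.xml.xpath` engine rather than Tika's `XPathParser` (which only supports a subset of XPath), so treat it as an illustration of the XPath rule, not of Tika's behaviour. The class name `HrefXPathDemo`, the `select` helper, and the sample markup are hypothetical, and the sample omits the XHTML namespace to keep the expressions short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HrefXPathDemo {
    // Hypothetical, well-formed sample markup (no namespace, for brevity).
    static final String SAMPLE =
        "<html><body>"
        + "<a href=\"https://example.com/a\">First</a>"
        + "<a href=\"https://example.com/b\">Second</a>"
        + "</body></html>";

    // Evaluates the expression against the markup and returns the value
    // of each selected node, in document order.
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Selecting the attribute node yields its value.
        System.out.println(select(SAMPLE, "//a/@href"));
        // Attribute nodes have no text() children, so this selects nothing.
        System.out.println(select(SAMPLE, "//a/@href/text()"));
    }
}
```

Running `main` prints `[https://example.com/a, https://example.com/b]` followed by `[]`.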
