Parsing links with JTidy

I am currently using JTidy to parse an HTML document and fetch a collection of all anchor tags in it. I then extract the value of each tag's href attribute to build a collection of the links on the page.

Unfortunately, these links can be expressed in a few different ways: some absolute (http://www.example.com/page.html), some relative (/page.html, page.html, or ../page.html). On top of that, some are just anchors (#paragraphA). When I visit the page in a browser, it automatically knows how to handle these different href values when I click a link. However, to follow one of the links retrieved from JTidy programmatically with an HTTPClient, I first need to provide a valid URL (so, e.g., I would first need to transform /page.html, page.html, and http://www.example.com/page.html to http://www.example.com/page.html).

Is there some built-in functionality, whether in JTidy or elsewhere, that can do this for me? Or will I need to write my own rules to transform these different URLs into absolute URLs?

The vanilla java.net.URL class might get you most of the way there, assuming you can work out which context to use. Here are some examples:

package grimbo.url;

import java.net.MalformedURLException;
import java.net.URL;

public class TestURL {
    public static void main(String[] args) {
        // context1
        URL c1 = u(null, "http://www.example.com/page.html");
        u(c1, "http://www.example.com/page.html");
        u(c1, "/page.html");
        u(c1, "page.html");
        u(c1, "../page.html");
        u(c1, "#paragraphA");

        System.out.println();

        // context2
        URL c2 = u(null, "http://www.example.com/path/to/page.html");
        u(c2, "http://www.example.com/page.html");
        u(c2, "/page.html");
        u(c2, "page.html");
        u(c2, "../page.html");
        u(c2, "#paragraphA");
    }

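    // Resolves url against context (or parses it on its own when context is null),
    // prints the result, and returns it; returns null if the URL is malformed.
    public static URL u(URL context, String url) {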
        try {
            URL u = null != context ? new URL(context, url) : new URL(url);
            System.out.println(u);
            return u;
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return null;
        }
    }
}

Results in:

http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/../page.html
http://www.example.com/page.html#paragraphA

http://www.example.com/path/to/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/path/to/page.html
http://www.example.com/path/page.html
http://www.example.com/path/to/page.html#paragraphA

As you can see, some of these results aren't what you want (notably http://www.example.com/../page.html). So you could first try parsing the value on its own with new URL(value), and if that throws a MalformedURLException, try resolving it relative to a context URL.
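That "absolute first, then relative to a context" idea could be sketched roughly as follows. This is only an illustration: the class name LinkResolver and the method resolve are made up for this example, and returning null for hrefs that cannot be resolved (such as an unknown scheme like javascript:) is an assumption about what the caller wants. It also does nothing about the leftover ../ case shown above.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of JTidy or the JDK.
final class LinkResolver {

    /**
     * Tries href as an absolute URL first. If that fails, resolves it
     * against context. Returns null when neither attempt succeeds
     * (an assumption made for this sketch).
     */
    static URL resolve(URL context, String href) {
        try {
            return new URL(href);
        } catch (MalformedURLException notAbsolute) {
            try {
                return new URL(context, href);
            } catch (MalformedURLException unresolvable) {
                return null;
            }
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://www.example.com/path/to/page.html");
        String[] hrefs = {
            "http://www.example.com/page.html",
            "/page.html",
            "page.html",
            "../page.html",
            "#paragraphA"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + resolve(page, href));
        }
    }
}
```

Here page stands for the URL the HTML was fetched from. Each href pulled out of the JTidy document would be passed through resolve before being handed to the HTTP client.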

Your best bet is most likely to follow the same resolution process that browsers do, as outlined in the HTML spec:

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

  1. The base URI is set by the BASE element.
  2. The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
  3. By default, the base URI is that of the current document. Not all HTML documents have a base URI (e.g., a valid HTML document may appear in an email and may not be designated by a URI). Such HTML documents are considered erroneous if they contain relative URIs and rely on a default base URI.

In practice, you're probably most concerned with numbers 1 and 3 (i.e. check for a <base href="..."> element and use either that, if it exists, or the URI of the current document).
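A minimal sketch of that check, working on an org.w3c.dom.Document (which, as far as I know, is what JTidy's Tidy.parseDOM returns). The names BaseUriFinder, baseFor, and parse are invented for this example. The parse helper uses the JDK's XML parser on well-formed markup purely as a stand-in so the sketch runs without JTidy, and the lookup assumes element names come back in lowercase.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of JTidy or the JDK.
final class BaseUriFinder {

    /**
     * Returns the href of the first base element that has a non-empty
     * href, or documentUrl when no such element exists.
     */
    static String baseFor(Document doc, String documentUrl) {
        NodeList bases = doc.getElementsByTagName("base");
        for (int i = 0; i < bases.getLength(); i++) {
            String href = ((Element) bases.item(i)).getAttribute("href");
            if (!href.isEmpty()) {
                return href;
            }
        }
        return documentUrl;
    }

    /** Stand-in for JTidy: parses well-formed markup with the JDK XML parser. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><base href=\"http://www.example.com/docs/\"/></head><body/></html>");
        System.out.println(baseFor(doc, "http://www.example.com/path/to/page.html"));
    }
}
```

The string returned by baseFor would then serve as the context URL when resolving each href. Handling a base href that is itself relative, or a base supplied via an HTTP header, is left out of this sketch.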
