
Is there a way to parse an entire HTML tag in JSoup?

Hi, I'm wondering if there's a way to parse an entire HTML tag using JSoup. In my example pictures below, the five elements (4 images and 1 string) are all inside the "li" container. However, when you open the "li" tag, there are multiple nested containers. Is there a way to parse it so that I have access to all 5 elements contained in this "li" tag? I'm thinking of using getElementsMatchingOwnText("Collins"), but that seems to only get me "span class="text text_14 mix-text_color7">Panorama". Any help would be appreciated, thanks!

[screenshots]

Yes, you can iterate over the children of your <li> tag using jsoup.

Here is a simplified version of the HTML in your screenshot, showing the 5 elements:

<li>
    <span class="foo"><img src="bar" class="img"></span>
    <span class="bar">Collins</span>
    <i class="baz1"><img src="baz1" class="img"></i>
    <i class="baz2"><img src="baz2" class="img"></i>
    <i class="baz3"><img src="baz3" class="img"></i>
</li>

Assuming you have selected this specific <li> tag in your document, you can use the following approach:

// Requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;
String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span><span class=\"bar\">Collins</span><i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i><i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i><i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";

Document document = Jsoup.parse(html);

Element element = document.selectFirst("li");
element.children().forEach(child -> {
    // do your processing here - this is just an example:
    if (child.hasText()) {
        System.out.println(child.text());
    } else {
        System.out.println(child.html());
    }
});

The above code prints the following output:

<img src="bar" class="img">
Collins
<img src="baz1" class="img">
<img src="baz2" class="img">
<img src="baz3" class="img">
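If you need the nested elements themselves rather than only the direct children, a descendant selector on the same <li> may be simpler. The following is a minimal sketch, assuming the simplified HTML above; the class name LiContents and the helper imageSources are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.List;

public class LiContents {

    // Hypothetical helper: collects the src attribute of every <img>
    // nested anywhere inside the given element.
    static List<String> imageSources(Element li) {
        return li.select("img").eachAttr("src");
    }

    public static void main(String[] args) {
        String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span>"
                + "<span class=\"bar\">Collins</span>"
                + "<i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i>"
                + "<i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i>"
                + "<i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";

        Element li = Jsoup.parse(html).selectFirst("li");

        // text() gathers the text of the element and all of its descendants.
        System.out.println(li.text());          // Collins
        // select() searches all descendants, not just direct children.
        System.out.println(imageSources(li));   // [bar, baz1, baz2, baz3]
    }
}
```

This way the nesting depth inside the <li> does not matter, because select() walks the whole subtree.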

UPDATE

If the starting point is a URL, then you would need to start with this:

Document document = Jsoup.connect("https://www...").get();
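In practice the request can fail with an IOException and may benefit from a timeout. Here is a minimal sketch, assuming the URL is passed as the first command-line argument (the address above is elided, so none is filled in here). The class name FetchPage, the helper firstListItem, the user agent string, and the 10-second timeout are illustrative choices, not requirements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class FetchPage {

    // Hypothetical helper: returns the first <li> in the document,
    // or null if the document contains none.
    static Element firstListItem(Document document) {
        return document.selectFirst("li");
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: FetchPage <url>");
            return;
        }
        try {
            Document document = Jsoup.connect(args[0])
                    .userAgent("Mozilla/5.0")   // assumed value, adjust as needed
                    .timeout(10_000)            // milliseconds
                    .get();
            Element li = firstListItem(document);
            System.out.println(li == null ? "No <li> found" : li.outerHtml());
        } catch (IOException e) {
            System.err.println("Could not fetch page: " + e.getMessage());
        }
    }
}
```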

Then the exercise is about identifying a unique way to find your specific element. So, if we update my earlier example, let's assume your web page is like this:

<html>
    <head>...</head>
    <body>
        <div>
            <ul class="vList_4">
 
                <li>
                    <span class="foo"><img src="bar" class="img"></span>
                    <span class="bar">Collins</span>
                    <i class="baz1"><img src="baz1" class="img"></i>
                    <i class="baz2"><img src="baz2" class="img"></i>
                    <i class="baz3"><img src="baz3" class="img"></i>
                </li>
            </ul>
        </div>
    </body>
</html>

Here we have a <ul> tag with a class called vList_4. If that is a unique class name, we can use it to jump to that section of the HTML page (IDs are better than class names because they are meant to be unique within a page - but I did not see any IDs in your screenshot).

Now, instead of my previous selector:

Element element = document.selectFirst("li");

We can use this more specific selector:

Element element = document.selectFirst("ul.vList_4 li");

The same results will be printed as before.
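To see why the more specific selector matters, consider a page with more than one list. Here is a minimal sketch, assuming a second, unrelated <ul class="nav"> appears earlier in the page (that list and the class name ScopedSelector are invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScopedSelector {

    // Hypothetical page: the "nav" list is invented to show the effect of scoping.
    static final String HTML =
            "<html><body>"
            + "<ul class=\"nav\"><li>Home</li></ul>"
            + "<div><ul class=\"vList_4\"><li>"
            + "<span class=\"foo\"><img src=\"bar\" class=\"img\"></span>"
            + "<span class=\"bar\">Collins</span>"
            + "</li></ul></div>"
            + "</body></html>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);

        // The bare selector picks whichever <li> comes first in the page.
        Element unscoped = document.selectFirst("li");
        // The scoped selector only considers <li> tags inside ul.vList_4.
        Element scoped = document.selectFirst("ul.vList_4 li");

        System.out.println(unscoped.text()); // Home
        System.out.println(scoped.text());   // Collins
    }
}
```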

So, it's all about looking at the page structure and figuring out how to jump to the relevant section of the page.

See the jsoup selector documentation for technical details describing how selectors are constructed.
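As one example of what selectors can express, jsoup's :has and :containsOwn pseudo-selectors can pick the <li> by the text it contains, which relates to the getElementsMatchingOwnText attempt in the question. Here is a minimal sketch, assuming the name "Collins" identifies the item; the class name FindByText, the helper findItem, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindByText {

    // Hypothetical helper: the first <li> that has a descendant <span>
    // whose own text contains the given name (null if there is none).
    // The name is placed into the selector unescaped, so names containing
    // parentheses may not work as expected.
    static Element findItem(Document document, String name) {
        return document.selectFirst("li:has(span:containsOwn(" + name + "))");
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<ul><li><span>Smith</span><i><img src=\"x\"></i></li>"
                + "<li><span>Collins</span><i><img src=\"baz1\"></i></li></ul>");

        Element li = findItem(document, "Collins");
        // From the matching <li>, every nested element is reachable again.
        System.out.println(li.select("img").attr("src")); // baz1
    }
}
```

Selecting the enclosing <li> instead of the <span> that holds the text avoids the problem of ending up with only the text element.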
