
Jsoup: Remove Everything Before an H2 Tag

I have HTML source that I fetch from a website using the Jsoup.connect() method. The following is a piece of that HTML source (link: https://docs.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community ):

.....
<p>When you set dependencies in your VSIX manifest, you must specify Component IDs 
   only. Use the tables on this page to determine our minimum component dependencies. 
   In some scenarios, this might mean that you specify only one component from a workload. 
   In other scenarios, it might mean that you specify multiple components from a single 
   workload or multiple components from multiple workloads. For more information, see 
   the 
<a href="../extensibility/how-to-migrate-extensibility-projects-to-visual-studio-2017" data-linktype="relative-path">How to: Migrate Extensibility Projects to Visual Studio 2017</a> page.</p>
.....
<h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>
.....
<h2 id="see-also">See also</h2>
.....

Using jsoup, I would like to remove all the HTML before <h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>, and everything from <h2 id="see-also">See also</h2> onward (including that tag).

I have an attempt like this, but it did not work for me:

        try {
            document = Jsoup.connect(Constants.URL).get();
        }
        catch (IOException iex) {
            iex.printStackTrace();
        }
        document = Parser.parse(document.toString().replaceAll(".*?<a href=\"workload-and-component-ids\" data-linktype=\"relative-path\">Visual Studio 2017 Workload and Component IDs</a> page.</p>", "") , Constants.URL);
        document = Parser.parse(document.toString().replaceAll("<h2 id=\"see-also\">See also</h2>?.*", "") , Constants.URL);
        return null;

Any help would be appreciated.
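One likely reason the replaceAll() attempt above has no effect: in Java regular expressions, `.` does not match line terminators unless DOTALL mode is enabled, so `.*?` cannot span a multi-line HTML string. The sketch below illustrates the same idea with the `(?s)` flag on a small in-memory string. The class name `DotallTrim`, the method `trim`, and the sample markup are made up for illustration, and the marker strings still have to match the serialized HTML exactly.

```java
import java.util.regex.Pattern;

public class DotallTrim {
    // Removes everything before the first occurrence of startMarker and
    // everything from the first occurrence of endMarker onward.
    static String trim(String html, String startMarker, String endMarker) {
        // (?s) enables DOTALL so '.' also matches line terminators.
        // Pattern.quote treats the markers as literal text, not regex syntax.
        String head = "(?s)\\A.*?(?=" + Pattern.quote(startMarker) + ")";
        String tail = "(?s)" + Pattern.quote(endMarker) + ".*\\z";
        return html.replaceFirst(head, "").replaceFirst(tail, "");
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for the real page.
        String html = "<p>intro</p>\n<h2 id=\"start\">Start</h2>\n<p>keep</p>\n"
                + "<h2 id=\"see-also\">See also</h2>\n<p>drop</p>";
        System.out.println(trim(html, "<h2 id=\"start\">", "<h2 id=\"see-also\">"));
    }
}
```

If a marker is not found, the corresponding replaceFirst() call leaves the string unchanged, so it is worth checking the result before parsing it again.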

A simple way could be: get the whole HTML of the page as a string, take a substring of the part you need, and parse that substring again with jsoup.

        Document doc = Jsoup.connect("https://docs.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community").get();
        String full = doc.html();
        // "<h2 id=\"" is 8 characters long, so subtracting 8 moves back to the start of the <h2> tag
        int start = full.indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017") - 8;
        int end = full.indexOf("unaffiliated-components") - 8;
        Document doc2 = Jsoup.parse(full.substring(start, end));
        System.out.println(doc2);
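The substring approach above relies on a fixed offset of 8 characters, which breaks if the tag carries other attributes before `id`. A minimal sketch of the same idea that looks for the opening `<` instead is shown here. The names `TagSlice`, `tagStart`, and `slice` are hypothetical, and the code assumes ids are written with double quotes and that no `<` appears inside the tag before the `id` attribute.

```java
public class TagSlice {
    // Returns the index of the '<' that opens the tag containing the first
    // occurrence of id="idValue", or -1 if that id is not present.
    static int tagStart(String html, String idValue) {
        int idPos = html.indexOf("id=\"" + idValue + "\"");
        return idPos < 0 ? -1 : html.lastIndexOf('<', idPos);
    }

    // Keeps the HTML from the tag with startId up to, but not including,
    // the tag with endId.
    static String slice(String html, String startId, String endId) {
        int start = tagStart(html, startId);
        int end = tagStart(html, endId);
        if (start < 0 || end < 0 || end < start) {
            throw new IllegalArgumentException("markers not found in the expected order");
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for the real page.
        String html = "<p>intro</p><h2 class=\"x\" id=\"start\">Start</h2><p>keep</p>"
                + "<h2 id=\"see-also\">See also</h2><p>drop</p>";
        System.out.println(slice(html, "start", "see-also"));
    }
}
```

The returned string can then be passed to Jsoup.parse() in the same way as in the answer above.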

I'll just make a small change to @eritrean's answer above. A small modification was needed for me to get the required output.

document = Jsoup.parse(document.html().substring(document.html().indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017")-26,
                document.html().indexOf("see-also")-8));
System.out.println(document);
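An alternative that avoids character offsets altogether is to walk the parsed DOM: start at the element with the first id and collect it and its following siblings until the element with the second id is reached. This is only a sketch, not a tested solution for the page in the question. It assumes both h2 elements share the same parent, it requires jsoup on the classpath, and the class name `SectionExtractor`, the method `between`, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectionExtractor {
    // Returns the outer HTML of the element with startId and of each following
    // sibling element, stopping before the element whose id equals endId.
    // Returns an empty string if startId is not found.
    static String between(Document doc, String startId, String endId) {
        StringBuilder out = new StringBuilder();
        Element current = doc.getElementById(startId);
        while (current != null && !endId.equals(current.id())) {
            out.append(current.outerHtml()).append('\n');
            // nextElementSibling() skips bare text nodes between elements.
            current = current.nextElementSibling();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for the real page.
        String html = "<p>intro</p><h2 id=\"start\">Start</h2><p>keep</p>"
                + "<h2 id=\"see-also\">See also</h2><p>drop</p>";
        Document doc = Jsoup.parse(html);
        Document trimmed = Jsoup.parse(between(doc, "start", "see-also"));
        System.out.println(trimmed.body().html());
    }
}
```

Because the loop follows element siblings only, loose text that sits directly between the sibling elements is not copied; on a page where the headings are nested in different containers, the traversal would need to be adapted.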
