Remove <div> tag from HTML in Jsoup

I tried to scrape some content from a website using Jsoup. This is what I tried:

List<String> songs = new ArrayList<String>();
for (Element s : doc.select("#core")) {
    System.out.println(s.html());
    songs.add(s.text());
}

for (String chord : songs) {
    System.out.println(chord);
}

#core is a <pre> tag. Inside this <pre> tag, I have a div like the following:

Intro: <u>G</u> - <u>Em</u> - <u>C</u> - <u>D</u>
<u>G</u>
Would you dance,
<u>Em</u>
If I asked you to dance?
<u>C</u>
Would you run,
<u>D</u>
And never look back?
<u>G</u>
Would you cry,
<u>Em</u>
If you saw me crying?
<u>C</u>        <u>D</u>     <u>G</u>
Would you save my soul tonight?

<div id="part1">

    <div class="inner">
        <u>G</u>
        <u>D</u>
        <u>C</u> I can be your hero baby
        <u>G</u>
        <u>D</u>
        <u>C</u> I can kiss away the pain
        <u>G</u>
        <u>D</u>
        <u>C</u> I will stand by you forever
        <u>G</u>
        <u>D</u>
        <u>C</u> You can take my breath away
    </div>
 </div>

When I scrape this, Jsoup doesn't maintain the correct formatting inside the div. Is there a way to get the <pre> tag content as it is?

If you just want to scrape the content without parsing it, you can do something like this:

Connection.Response response = Jsoup.connect("URL_HERE").execute();
System.out.println(response.body()); //This will keep the format as it is from the server.

If you want to parse the content after that, do this:

Document doc = response.parse(); // returns an org.jsoup.nodes.Document

If you want to remove an element, you have to parse the content. But once you parse it, any formatting that was there will be lost.

A workaround would be to escape the element whose whitespace you want to keep. See this answer from the author of Jsoup: https://stackoverflow.com/a/5830454/1138559. Note that you have to escape the contents of <pre>, since it contains HTML elements too.
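One way the escaping step might look is sketched below, using only the Java standard library. The class and method names (PreEscaper, escapeCorePre, escapeHtml) are hypothetical and are not part of Jsoup. The regex assumes that the target is a single <pre> whose id attribute is written exactly as id="core" with double quotes, and that it contains no nested <pre>. Existing entities such as &amp; inside the block would be escaped a second time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Jsoup API): escapes the markup inside <pre id="core">. */
final class PreEscaper {

    // Assumption: the tag looks like <pre ... id="core" ...>; DOTALL lets '.' span line breaks.
    private static final Pattern CORE_PRE = Pattern.compile(
            "(<pre\\b[^>]*\\bid=\"core\"[^>]*>)(.*?)(</pre>)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Replaces the characters that a parser would otherwise read as markup. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")   // must run first so later entities are not re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Returns rawHtml with the inner content of the #core <pre> escaped; the rest is untouched. */
    static String escapeCorePre(String rawHtml) {
        Matcher m = CORE_PRE.matcher(rawHtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + escapeHtml(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the page content from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result of PreEscaper.escapeCorePre(response.body()) could then be passed to Jsoup.parse(String). The <pre> would contain only text at that point, so the tags inside it, including the <div>, would need to be stripped from that text separately.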
