Remove <div> tag from HTML in Jsoup
I tried to scrape some content from a website using Jsoup. This is what I tried:
List<String> songs = new ArrayList<String>();
for (Element s : doc.select("#core")) {
    System.out.println(s.html());
    songs.add(s.text());
}
for (String chord : songs) {
    System.out.println(chord);
}
#core is a <pre> tag. Inside this <pre> tag, I have a div like the following:
Intro: <u>G</u> - <u>Em</u> - <u>C</u> - <u>D</u>
<u>G</u>
Would you dance,
<u>Em</u>
If I asked you to dance?
<u>C</u>
Would you run,
<u>D</u>
And never look back?
<u>G</u>
Would you cry,
<u>Em</u>
If you saw me crying?
<u>C</u> <u>D</u> <u>G</u>
Would you save my soul tonight?
<div id="part1">
<div class="inner">
<u>G</u>
<u>D</u>
<u>C</u> I can be your hero baby
<u>G</u>
<u>D</u>
<u>C</u> I can kiss away the pain
<u>G</u>
<u>D</u>
<u>C</u> I will stand by you forever
<u>G</u>
<u>D</u>
<u>C</u> You can take my breath away
</div>
</div>
When I scrape this, Jsoup doesn't maintain the correct formatting inside the div. Is there a way to get the <pre> tag content as it is?
If you just want to scrape the content without parsing it, you can do something like this:
Connection.Response response = Jsoup.connect("URL_HERE").execute();
System.out.println(response.body()); //This will keep the format as it is from the server.
If you want to parse the content after that, do this:
response.parse();
If you want to remove some elements, you have to parse the content. But if you parse it, any formatting that was there will be lost.
A workaround would be to escape the element whose whitespace you want to keep. See this answer from the author of Jsoup: https://stackoverflow.com/a/5830454/1138559. Note that you would have to escape the contents of <pre>, since it contains HTML elements too.
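If parsing is what disturbs the whitespace, one possible alternative (not part of the answer above, and only a rough sketch) is to leave the document unparsed and strip the <div> tags from the raw string returned by response.body(). The class name DivStripper and the method stripDivTags are hypothetical names made up for this illustration, and the sketch assumes the div tags have simple attributes with no '>' character inside an attribute value. Regular expressions are not a general HTML parser, so this is only suitable for small, predictable markup like the <pre> block in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jsoup.
final class DivStripper {

    // Matches an opening <div ...> or a closing </div> tag, case-insensitively.
    // Assumption: attribute values never contain a '>' character.
    private static final Pattern DIV_TAG =
            Pattern.compile("</?div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes only the div tags themselves; the text between them,
    // including every space and newline, is left untouched.
    static String stripDivTags(String html) {
        return DIV_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // A shortened version of the <pre> content from the question.
        String pre = "<u>G</u>\n"
                + "Would you dance,\n"
                + "<div id=\"part1\">\n"
                + "<div class=\"inner\">\n"
                + "<u>C</u>   I can be your hero baby\n"
                + "</div>\n"
                + "</div>";
        System.out.println(stripDivTags(pre));
    }
}
```

The lines that held the tags remain as empty lines; all other whitespace is left exactly as it was in the input.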