Jsoup - Convert HTML text into a list of Strings
Using Jsoup, I want to add the text found in each HTML tag to a List<String>, in document order.
This is fairly easy with BeautifulSoup4 in Python, but I'm having a hard time doing it in Java.
BeautifulSoup code:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    # Skip text whose parent tag is never rendered on the page.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text_list = []
    for t in visible_texts:
        text_list.append(t.strip())
    # Drop strings that are empty after stripping.
    return list(filter(None, text_list))

html = urllib.request.urlopen('https://someURL.com/something').read()
print(text_from_html(html))
This code prints ["text1", "text2", "text3", ...]
My initial attempt was to follow the Jsoup documentation for text conversion.
Jsoup code, attempt 1:
Document doc = Jsoup.connect("https://someURL.com/something")
        .userAgent("Bot")
        .get();
Elements divElements = doc.select("*");
List<String> texts = divElements.eachText();
System.out.println(texts);
What ends up happening is that the text is duplicated: ["text1 text2 text3", "text2 text3", "text3", ...]
My assumption is that Jsoup goes through each Element and returns all the text inside that Element, including the text of every child node. It then moves on to the child node and returns that node's text again, and so on.
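A minimal sketch of that behaviour, assuming an inline HTML string instead of a fetched page. The class name EachTextDemo, the method allElementTexts, and the sample markup are illustrative and not part of the original code:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class EachTextDemo {
    // Returns the combined text of every element that has any text,
    // which is what select("*").eachText() does in attempt 1.
    static List<String> allElementTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("*").eachText();
    }

    public static void main(String[] args) {
        // Each ancestor repeats the text of its descendants, so the same
        // words show up once per enclosing element (html, body, div, p, span).
        String html = "<div>text1 <p>text2 <span>text3</span></p></div>";
        System.out.println(allElementTexts(html));
    }
}
```

Because text() on an element includes the text of its descendants, the outer elements each report "text1 text2 text3", the p reports "text2 text3", and the span reports "text3".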
I have seen many people work around this problem by specifying tags or attributes in a cssQuery, but my project needs this to work on any scrapeable website.
Any suggestions are appreciated.
Your assumption is right, but BeautifulSoup would probably do the same. Only the text=True in findAll(text=True) limits the result to pure text nodes. To get the equivalent in Jsoup, use the following selector:
Elements divElements = doc.select(":matchText");
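For comparison, here is a rough sketch of the whole BeautifulSoup function ported to Jsoup. Note that it swaps in a different technique from the selector above: it walks the node tree with a NodeVisitor and collects TextNode instances directly. The class name VisibleText, the method visibleTexts, and the sample HTML are assumptions made for illustration, and the set of skipped parent tags simply mirrors tag_visible() from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class VisibleText {
    // Parent tags whose text is skipped, mirroring tag_visible() in the question.
    private static final Set<String> HIDDEN =
            new HashSet<>(Arrays.asList("style", "script", "head", "title", "meta"));

    static List<String> visibleTexts(String html) {
        Document doc = Jsoup.parse(html);
        final List<String> texts = new ArrayList<>();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Only text nodes are collected; comments and script/style
                // data are other node types and are ignored here.
                if (!(node instanceof TextNode)) return;
                Node parent = node.parent();
                if (parent != null && HIDDEN.contains(parent.nodeName())) return;
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) texts.add(text);
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title><style>p{}</style></head>"
                + "<body><div>text1 <p>text2 <span>text3</span></p></div>"
                + "<!-- note --><script>var x = 1;</script></body></html>";
        System.out.println(visibleTexts(html)); // prints [text1, text2, text3]
    }
}
```

Each text node is visited exactly once, so no string is repeated for its ancestors. To run it against a live page, the Document from Jsoup.connect(...).get() in the question could be traversed the same way instead of parsing a string.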