Jsoup - Convert HTML text into a list of Strings

Using Jsoup, I want to add the text found in each HTML tag to a List<String>, in document order.

This is fairly easy with BeautifulSoup4 in Python, but I'm having a hard time doing it in Java.

BeautifulSoup code:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)

    text_list = []

    for t in visible_texts:
        text_list.append(t.strip())

    return list(filter(None, text_list))

html = urllib.request.urlopen('https://someURL.com/something').read()
print(text_from_html(html))

This code will print ["text1", "text2", "text3",...]


My initial attempt was to follow the Jsoup documentation for text conversion.

Jsoup code, attempt 1:

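import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Java string literals need double quotes
Document doc = Jsoup.connect("https://someURL.com/something")
                        .userAgent("Bot")
                        .get();
Elements divElements = doc.select("*");
List<String> texts = divElements.eachText();
System.out.println(texts);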

The result contains duplicated text: ["text1 text2 text3","text2 text3", "text3",...]

My assumption is that Jsoup goes through each Element and returns all the text within that Element, including the text in each child node. Then it moves to the child node and returns the remaining text, and so on.

I have seen many people work around this by specifying tags or attributes in the cssQuery, but my project needs this to work for any scrapeable website.

Any suggestions are appreciated.
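
To make the duplication concrete, here is a minimal sketch that reproduces it on a small inline document. The HTML snippet and the class name `EachTextDemo` are made up for illustration, and the sketch assumes jsoup is on the classpath. Because `text()` on an element includes the text of all its descendants, every ancestor repeats what its children already contain.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class EachTextDemo {

    // Hypothetical sample markup, not taken from any real site.
    static final String HTML = "<div>text1 <p>text2 <b>text3</b></p></div>";

    // Collects text() of every element that has text, as in attempt 1.
    static List<String> allElementTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("*").eachText();
    }

    public static void main(String[] args) {
        // Each ancestor element repeats the text of its descendants,
        // so "text3" appears once per enclosing element.
        System.out.println(allElementTexts(HTML));
    }
}
```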

Your assumption is right, but BeautifulSoup would probably do the same. It is only the text=True in findAll(text=True) that limits the result to pure text nodes. To get the equivalent in Jsoup, use the following selector:

Elements divElements = doc.select(":matchText");
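
For comparison, the BeautifulSoup filter from the question can also be mirrored by walking the node tree and collecting `TextNode` instances directly. This is a different technique from the `:matchText` selector above, shown only as a hedged sketch: the class name `VisibleTextExtractor`, the method `textsFromHtml`, and the sample HTML are hypothetical, and jsoup is assumed to be on the classpath. One reason to consider it is that the jsoup selector documentation notes that `:matchText` modifies the DOM, whereas a read-only traversal leaves the document untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class VisibleTextExtractor {

    // Parent tags whose text is skipped, mirroring tag_visible in the question.
    private static final Set<String> HIDDEN_PARENTS =
            new HashSet<>(Arrays.asList("style", "script", "head", "title", "meta"));

    // Returns the trimmed, non-empty text of each visible text node, in document order.
    static List<String> textsFromHtml(String html) {
        Document doc = Jsoup.parse(html);
        final List<String> texts = new ArrayList<>();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Comments and script/style data are not TextNode instances in jsoup.
                if (!(node instanceof TextNode)) {
                    return;
                }
                Node parent = node.parent();
                if (parent != null && HIDDEN_PARENTS.contains(parent.nodeName())) {
                    return;
                }
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    texts.add(text);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup for illustration.
        String html = "<html><head><title>t</title><style>p{}</style></head>"
                + "<body><div>text1 <p>text2 <b>text3</b></p></div>"
                + "<!-- hidden --><script>var x = 1;</script></body></html>";
        System.out.println(textsFromHtml(html)); // prints [text1, text2, text3]
    }
}
```

For a live page, the `Document` could come from `Jsoup.connect(...).get()` as in the question instead of `Jsoup.parse`; the traversal itself stays the same.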
