Storing text into a String using jSoup

I'm trying to understand how to use HtmlUnit and jSoup together, and I have the basics working. However, when I try to store the text from a specific webpage in a string, I get only a single line back rather than the whole text.

I know the code I've written works, because when I print out p.text() for each paragraph, I get the whole text of the page.

private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}


When I introduce a string to store the text from p.text(), the method returns only a single line rather than the whole text.

private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text=p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}

Ultimately, all I want to do is store the whole text in a string. Any help would be greatly appreciated, thanks in advance.

Document doc = Jsoup.connect(url).get();
String text = doc.text();

That's basically it. Because jSoup already takes care of stripping the HTML tags from the text, you can call doc.text() and you'll receive the content of the whole page with the tags removed.

    for (Element p : paragraphs)
        text+=p.text(); // Append the text.

In your code, you are overwriting the value of the variable text on every iteration of the loop. That's why the function returns only the last line.
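To make the difference between overwriting and appending concrete, here is a minimal sketch that needs neither jSoup nor HtmlUnit. It assumes a plain List<String> standing in for the values of p.text(); the class name ParagraphJoin and the methods lastOnly and joinAll are hypothetical names chosen for illustration. Unlike the += snippet above, it puts a newline between paragraphs, which is a design choice rather than a requirement.

```java
import java.util.List;

public class ParagraphJoin {

    // Mirrors the question's loop: each iteration replaces the previous value,
    // so only the last paragraph survives.
    static String lastOnly(List<String> paragraphs) {
        String text = "";
        for (String p : paragraphs) {
            text = p;
        }
        return text;
    }

    // Appends every paragraph, with a newline between consecutive paragraphs.
    static String joinAll(List<String> paragraphs) {
        StringBuilder text = new StringBuilder();
        for (String p : paragraphs) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(p);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in the real code these would come from p.text().
        List<String> paragraphs = List.of(
                "First paragraph.", "Second paragraph.", "Third paragraph.");
        System.out.println(lastOnly(paragraphs));
        System.out.println(joinAll(paragraphs));
    }
}
```

A StringBuilder is used here instead of += because repeated string concatenation in a loop creates a new String on each pass; for a handful of paragraphs either approach works.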

I think it is a strange idea to use the HtmlUnit result as the starting point for jSoup. There are various drawbacks to your approach (e.g. think about cookies). And of course HtmlUnit has already parsed the HTML code, so you will do the work twice.

I hope this code will fulfill your requirements without jSoup.

private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}
