How to preserve line breaks when extracting text from an HtmlElement

Question

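I am trying to extract three separate strings from the following page: https://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122

Owner name: Johnson Tommy A & Nell H Cprs
Owner street address: 133 Maricopa Dr
Owner city, state, and zip code, as a single string: Winslow AZ 86047-2013

I tried the following code: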

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.*;
import java.io.*;

public class PropertyOwner {

    public static void PropertyOwner () {

        try (final WebClient webClient = new WebClient()) {
            System.getProperties().put("org.apache.commons.logging.simplelog.defaultlog", "fatal");
            java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);

            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            webClient.getOptions().setCssEnabled(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            HtmlPage page = webClient.getPage("http://taxtest.navajocountyaz.gov/Pages/WebForm1.aspx?p=1&apn=103-03-122");
            webClient.waitForBackgroundJavaScriptStartingBefore(10000);     
            page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            HtmlTable pnlGridView_nextYear = (HtmlTable) page.getElementById("pnlGridView_nextYear");
            HtmlTableDataCell ownershipCell = (HtmlTableDataCell) pnlGridView_nextYear.getCellAt(0,0);
            String ownershipCellAsText = ownershipCell.toString();
            HtmlElement onwershipElement = (HtmlElement) page.getElementById("lblOwnership_NextYear");
            System.out.println("ownershipCellAsText = " + ownershipCellAsText);
            System.out.println("onwershipElement.getTextContent() = " + onwershipElement.getTextContent());


        }

        catch (Exception e) {
            System.out.println("Error: "+ e);
        }
  
    }
  
    public static void main(String[] args) {
        File file = new File("validParcelIDs.txt");
        PropertyOwner();
    }

}

Then I ran the following two commands:

> javac -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner.java
> java -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner

and got the following output:

ownershipCellAsText = HtmlTableDataCell[<td style="border:solid 1px black;">]
onwershipElement.getTextContent() = Johnson Tommy A & Nell H Cprs133 Maricopa DrWinslow AZ 86047-2013

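As you can see, onwershipElement.getTextContent() is very close to what I want, except that it strips the line breaks out of the HtmlElement.

I tried the following solution, proposed 8 years ago: "Java getting text content from an element to include line breaks", by adding just three lines of code to my program. The three (non-consecutive) lines were: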

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
.....
WebView webView = new WebView();

This gave me the following compilation errors:

achab@HP-Envy [Navajo] $javac -classpath ".:/opt/htmlunit_2.69.0/*"  PropertyOwner.java 
PropertyOwner.java:15: error: cannot find symbol
            WebView webView = new WebView(); 
            ^
  symbol:   class WebView
  location: class PropertyOwner
PropertyOwner.java:15: error: cannot find symbol
            WebView webView = new WebView(); 
                                  ^
  symbol:   class WebView
  location: class PropertyOwner
2 errors

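So this solution seems to be outdated. Version 2.69.0 of HtmlUnit was released on January 5, 2023.

Before that, I had tried HtmlUnit version 2.47.1, released about two years ago. It had the same two problems: the first version of the code did not preserve line breaks, and the second version failed with the symbol WebView not being found.

What do I need to change to get the three separate strings I want?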

Answer 1

Instead of onwershipElement.getTextContent(), use onwershipElement.asNormalizedText().
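Once the line breaks survive, the text still has to be cut into the three strings the question asks for. The following is a minimal sketch of that last step using only the standard library. It assumes that asNormalizedText() returns the three parts separated by line breaks, which has not been verified against this particular page. The class name OwnerLines, the helper splitLines, and the hard-coded sample string are hypothetical and stand in for the value returned by onwershipElement.asNormalizedText().

```java
import java.util.ArrayList;
import java.util.List;

public class OwnerLines {

    // Splits text on any line break sequence (\n, \r\n, ...),
    // trims each piece, and drops blank lines.
    public static List<String> splitLines(String normalizedText) {
        List<String> lines = new ArrayList<>();
        for (String line : normalizedText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for onwershipElement.asNormalizedText();
        // the real value depends on the page markup.
        String text = "Johnson Tommy A & Nell H Cprs\n133 Maricopa Dr\nWinslow AZ 86047-2013";

        List<String> lines = splitLines(text);
        if (lines.size() == 3) {
            System.out.println("name          = " + lines.get(0));
            System.out.println("street        = " + lines.get(1));
            System.out.println("cityStateZip  = " + lines.get(2));
        } else {
            // Assumption violated: the element did not contain exactly three lines.
            System.out.println("Unexpected number of lines: " + lines.size());
        }
    }
}
```

Checking the number of lines before indexing keeps the program from failing with an exception if the page layout differs from the one shown in the question.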

Solution 1
1 2023-01-11 23:41:58