
Using Jsoup to extract text from HTML, but left and right quotation mark chars are returned

So I used Jsoup to extract some text from an HTML snippet, shown here:

<font style="font-family:Times New Roman" size="2">BHI Finance LLC (“BHI Finance”)</font>

Notice the left and right quotation marks. I store this text in a String in Java here:

Element firstEntry = row.select("td").first();
String toAdd = firstEntry.select("font").text();

String toAdd gets printed as BHI Finance LLC (?BHI Finance?)

The two question mark chars, when cast to an int, are 147 and 148 respectively, which I found to be the left and right quotation marks in (certain?) HTML character encodings. My question is: how do I make Jsoup parse left and right quotation marks as just regular ASCII quotation marks?

Doesn't look like a parsing problem. More likely, whatever you're using to display the parsed content doesn't correctly display those characters.
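One way to tell a display problem from a decoding problem is to print the code points instead of the characters. The sketch below is only an illustration and uses no Jsoup at all: the class name QuoteDiagnostics and the sample bytes are made up, and it assumes the page bytes were windows-1252 (where 0x93 and 0x94 are the curly quotes). It shows that decoding those bytes as ISO-8859-1 yields the values 147 and 148 reported in the question, while decoding them as windows-1252 yields the real quotation marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jsoup.
final class QuoteDiagnostics {

    // Lists every non-ASCII code point in the text as U+XXXX,
    // so the result does not depend on what the console can display.
    static String describeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp > 127)
            .forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed input: the curly quotes as windows-1252 bytes around "BHI".
        byte[] raw = {(byte) 0x93, 'B', 'H', 'I', (byte) 0x94};

        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(raw, Charset.forName("windows-1252"));

        System.out.println(describeNonAscii(asLatin1)); // U+0093 U+0094 (147 and 148)
        System.out.println(describeNonAscii(asCp1252)); // U+201C U+201D (curly quotes)
    }
}
```

If describeNonAscii(toAdd) reports U+201C and U+201D, the string is fine and only the output is wrong. If it reports U+0093 and U+0094, the HTML was probably decoded with the wrong charset before or during parsing.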

The easiest is probably to just replace them:

toAdd = toAdd.replaceAll("[“”]", "\"");
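A slightly more defensive variant of the same idea, as a sketch: if the chars really are 147 and 148 as the question reports, they are U+0093 and U+0094, not the curly quotes U+201C and U+201D, so a pattern containing only the literal curly quotes may not match them. The helper below (the name QuoteNormalizer is made up for illustration) replaces both pairs and writes them as Unicode escapes so the result does not depend on the encoding of the source file.

```java
// Hypothetical helper, not part of Jsoup.
final class QuoteNormalizer {

    // Replaces left/right double quotation marks with the ASCII quote.
    // U+201C/U+201D are the real curly quotes; U+0093/U+0094 are what they
    // become when windows-1252 bytes are decoded as ISO-8859-1.
    static String toAsciiQuotes(String text) {
        return text.replaceAll("[\u201C\u201D\u0093\u0094]", "\"");
    }

    public static void main(String[] args) {
        String toAdd = "BHI Finance LLC (\u201CBHI Finance\u201D)";
        System.out.println(toAsciiQuotes(toAdd)); // BHI Finance LLC ("BHI Finance")
    }
}
```

If the HTML is being read from a file or stream rather than from a String, passing the charset name explicitly when parsing, for example Jsoup.parse(file, "windows-1252"), may avoid the wrong characters in the first place. Whether that applies here depends on how the document was loaded, which the question does not show.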

