Extract Text from HTML String Java

I've got a string of HTML code filled with tags and special characters, for example:

 <p class="MsoNormal"><span style="font-size: 14pt; font-family: TimesNewRoman;"> I Just want this Text here?<o:p></o:p></span></p>

or

<div>This is more text i would like. :( </div><div> </div>

I'm just wondering if there is any way to extract the text from the HTML strings. I have tried using some regex to replace strings, but it didn't seem like the best way to do it. I have also tried JSoup but didn't have much luck with that.

Any ideas? Regards.

Are you sure you were using JSoup correctly? It would be perfect for this, and I use it all the time to do the same thing.

Your code would look like this:

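// Requires: import org.jsoup.Jsoup;
String stringWithHtml = "<div>&nbsp;test&nbsp;</div>";
// parse() builds a document from the string; text() returns its text content without tags
String extractedText = Jsoup.parse(stringWithHtml).text();
// extractedText is now "test"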

Make sure the JSoup library is on your classpath.

This is actually a possible duplicate. Your solution would look something like this:

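    String inputString = "&lt;div&gt;This is more text i would like. :( &lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;";
    // Turn the escaped angle brackets back into real ones so the tags can be matched
    inputString = inputString.replace("&lt;", "<");
    inputString = inputString.replace("&gt;", ">");
    // Remove everything that looks like a tag: '<', any run of non-'>' characters, '>'
    inputString = inputString.replaceAll("<[^>]*>", "");
    System.out.println(inputString);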

This extracts all text that is not inside HTML tags. I wasn't sure if you wanted the first element or all elements, so it assumes all HTML tags should be removed, leaving all text in place, including the escaped ampersand (the output still ends with "&amp;nbsp;"). The escaped ampersand could be handled with another replace or a similar strategy.
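As a rough illustration of the "handle it with another replace" idea, here is a minimal sketch that uses only the JDK. The class name `HtmlTextSketch` and its methods are hypothetical helpers made up for this example, not part of any library. It only decodes the few entities that appear in the question, maps `&nbsp;` to a plain space for simplicity, and will misbehave on input such as attribute values containing `>`, so treat it as a sketch rather than a replacement for a real parser like JSoup.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; regex-based, not a real HTML parser.
final class HtmlTextSketch {

    // Anything that looks like a tag: '<', any run of non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Decodes only the entities seen in the question (an assumption, not a full table).
    // "&amp;" is handled last so that "&amp;nbsp;" becomes "&nbsp;" in a single pass
    // instead of being decoded twice.
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")   // simplification: plain space, not U+00A0
                .replace("&amp;", "&");
    }

    // For ordinary HTML: drop the tags first, then decode entities in the remaining
    // text, then collapse runs of whitespace and trim the ends.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        String decoded = decodeBasicEntities(withoutTags);
        return decoded.replaceAll("\\s+", " ").trim();
    }

    // For entity-escaped HTML (tags stored as "&lt;div&gt;"): unescape one level
    // to get real markup back, then treat it as ordinary HTML.
    static String extractFromEscaped(String escapedHtml) {
        return stripTags(decodeBasicEntities(escapedHtml));
    }

    public static void main(String[] args) {
        String plain = "<div>This is more text i would like. :( </div><div>&nbsp;</div>";
        String escaped = "&lt;div&gt;This is more text i would like. :( "
                + "&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;";

        System.out.println(stripTags(plain));            // This is more text i would like. :(
        System.out.println(extractFromEscaped(escaped)); // This is more text i would like. :(
    }
}
```

Decoding after the tags are stripped means text such as `1 &lt; 2` is not mistaken for the start of a tag.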

Another option is Aspose. Have a look at the link:

http://www.aspose.com/java/word-component.aspx

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertHtml(
        "<P align='right'>Paragraph right</P>" +
                "<b>Implicit paragraph left</b>" +
                "<div align='center'>Div center</div>" +
                "<h1 align='left'>Heading 1 left.</h1>");

doc.save(getMyDir() + "DocumentBuilder.InsertHtml Out.doc");

You can solve this issue by combining Jsoup with a regular expression:

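  // Requires: import org.jsoup.Jsoup;
  // text() decodes the escaped entities into literal "<...>" text; the regex then strips those tags
  String st = "&lt;p class=&quot;MsoNormal&quot;&gt;&lt;span style=&quot;font-size: 14pt; font-family: TimesNewRoman;&quot;&gt; I Just want this Text here?&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;";
  System.out.println(Jsoup.parse(st).text().replaceAll("\\<.*?>", ""));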
