
How can I convert HTML text, including unordered and ordered lists, to plain text with Jsoup?

I need to convert HTML to plain text so I can send it by email. Currently I'm using

Jsoup.parse(html).wholeText();

This preserves line breaks, but not lists. Something like

 - List item
 - List item 2
   - Nested list item

gets converted to List itemList item2Nested list item

How can I keep most of the text formatting, but remove all HTML tags, including images, links, etc.?

What you're asking for is to render HTML (not parse it, though parsing is naturally part of any HTML rendering engine). Not render it the way e.g. Chromium would (as an image on a screen), but render it into a string.

This is highly complicated, and involves CSS support as well. Basically, what you are asking for is multiple person-years of effort, and as far as I know no library exists that does it. You can have a look at text-based HTML renderers such as Lynx or w3m: you can probably install them and execute them with ProcessBuilder. This does, of course, make your app entirely architecture- and OS-dependent. You'll have to ship a w3m or lynx binary for each and every platform you want to support, or ask whoever installs your app to also install lynx and/or w3m and tell your app where it is. Note that lynx/w3m tend to assume full terminal support, meaning bold, colours, etc.
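A minimal sketch of the ProcessBuilder route, assuming a w3m binary is installed and that its -dump, -T and -cols options behave as documented in its man page. The class name ExternalTextRenderer and the method names are made up for illustration, and character-set handling is left to w3m's defaults.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: pipes HTML through an external w3m process.
public final class ExternalTextRenderer {

    /** Builds the w3m command line; w3mPath is wherever the binary lives on this machine. */
    static List<String> w3mCommand(String w3mPath, int columns) {
        // -dump: write the rendered page to stdout; -T: treat stdin as HTML;
        // -cols: the width of the imaginary "window" the text is laid out in.
        return List.of(w3mPath, "-dump", "-T", "text/html",
                "-cols", Integer.toString(columns));
    }

    /** Sends the HTML to w3m on stdin and returns whatever it prints. */
    static String render(String html, String w3mPath, int columns)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(w3mCommand(w3mPath, columns))
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        // Write everything first, then read. This assumes the renderer consumes
        // its whole input before producing output; for very large documents a
        // separate writer thread would be the safer design.
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(html.getBytes(StandardCharsets.UTF_8));
        }
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("w3m exited with code " + exitCode + ": " + text);
        }
        return text;
    }
}
```

Calling ExternalTextRenderer.render(html, "w3m", 72) would then return the page laid out for a 72-column window, provided w3m is on the PATH. The installation and portability caveats above still apply.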

Imagine an HTML page that doesn't use <ul> and <li> to create a bulleted list, but instead uses some CSS to make something that looks a lot like a bulleted list. Or what if inline CSS is used to align something to the right? Presumably you would then expect the string to be right-aligned as well, except that this is completely impossible unless [A] you know the size of the 'window' the string will be rendered into, [B] the output is not a basic text string but some sort of markup language that supports right alignment (be it HTML, RTF or similar), or [C] terminal command sequences are available to move the cursor around.

This should highlight how your question is in essence 'weird': it asks either for something incredibly complicated, or for a seemingly arbitrary tiny subset of what HTML can do.

If the latter piques your interest, it isn't too difficult to write a simplistic tree walker that inserts "\n * " any time a <li> element inside a <ul> is visited, and String.format("\n%2d. ", n), with n being the item number, any time a <li> is visited inside an <ol>.
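A sketch of such a tree walker, using Jsoup's NodeVisitor and Element.traverse. It is one possible reading of the idea rather than a complete renderer: the class name ListAwareText, the two-space indent per nesting level, and the handling of <p> and <br> are all choices made for this example. It numbers <ol> items by their position and ignores attributes such as start.

```java
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper: keeps text, <br>, <p> and <ul>/<ol> lists, drops the rest.
public final class ListAwareText {

    public static String toPlainText(String html) {
        StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            private int listDepth = 0; // number of enclosing <ul>/<ol> elements

            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    appendText(out, ((TextNode) node).text());
                } else if (node instanceof Element) {
                    Element el = (Element) node;
                    switch (el.tagName()) {
                        case "ul": case "ol": listDepth++; break;
                        case "li": newLine(out, false); out.append(marker(el, listDepth)); break;
                        case "br": newLine(out, true); break;
                        case "p": newLine(out, false); break;
                        default: break; // every other tag contributes only its text
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (!(node instanceof Element)) return;
                String tag = ((Element) node).tagName();
                if (tag.equals("ul") || tag.equals("ol")) {
                    listDepth--;
                    if (listDepth == 0) newLine(out, false); // text after the list starts a new line
                } else if (tag.equals("p")) {
                    newLine(out, false);
                }
            }
        });
        return out.toString().stripTrailing();
    }

    /** " * " for <ul> items, " 1. " style for <ol> items, indented per nesting level. */
    private static String marker(Element li, int listDepth) {
        String indent = "  ".repeat(Math.max(0, listDepth - 1));
        Element parent = li.parent();
        boolean ordered = parent != null && parent.tagName().equals("ol");
        return indent + (ordered
                ? String.format(Locale.ROOT, "%2d. ", li.elementSiblingIndex() + 1)
                : " * ");
    }

    /** Appends text, dropping leading whitespace right after a line break or marker. */
    private static void appendText(StringBuilder out, String text) {
        boolean afterWhitespace = out.length() == 0
                || Character.isWhitespace(out.charAt(out.length() - 1));
        out.append(afterWhitespace ? text.stripLeading() : text);
    }

    /** Trims trailing spaces, then breaks the line (only if needed unless forced). */
    private static void newLine(StringBuilder out, boolean force) {
        int end = out.length();
        while (end > 0 && out.charAt(end - 1) == ' ') end--;
        out.setLength(end);
        if (force || (end > 0 && out.charAt(end - 1) != '\n')) out.append('\n');
    }
}
```

Every tag not named in the switch is simply skipped, so links and images disappear while their text content (if any) stays. Supporting further constructs means adding further cases, which is exactly the "arbitrary subset" trade-off described above.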

In other words, given that what you ask for is either impossible or an arbitrary choice of which HTML and CSS stylings you do and don't want to support, write it yourself. If you truly are only interested in <ol> / <ul> based lists and nothing else, this will be about a page of code and no more.
