
Converting HTML to plain text in PHP for e-mail

I use TinyMCE to allow minimal formatting of text within my site. I'd like to convert the HTML it produces to plain text for e-mail. I've been using a class called html2text, but it's really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain-text formatting, like putting underscores around text that previously had <i> tags in the HTML.

Does anyone use a similar approach to converting HTML to plain text in PHP? If so, do you recommend any third-party classes that I can use? Or how do you best tackle this issue?

Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load the HTML, then iterates over the resulting DOM to extract plain text. Usage:

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome.

Issues with other conversion scripts:

Here is another solution:

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/ttodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

Converting from HTML to text using DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5:

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.

Note the restrictions for commercial use.

There's the trusty strip_tags function. It's not pretty, though: it only sanitizes. You could combine it with a string replace to get your fancy underscores.


<?php
// Strip all tags, wrapping italic text in underscores
$plain = strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// Preserve anchors by telling strip_tags() to keep <a> elements
// (both the opening and the closing tag survive)
$withLinks = strip_tags($text, '<a>');

?>
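As a slightly fuller illustration of the "replace, then strip" idea, here is a minimal sketch. The function name `tagsToPlainText` and the choice of markers (underscores for italics, asterisks for bold) are assumptions made for this example, not part of any library mentioned in the thread.

```php
<?php
// Hypothetical helper (not from the thread): map a few inline tags to
// plain-text markers, turn block boundaries into newlines, strip the rest.
function tagsToPlainText(string $html): string
{
    // <i>/<em> become underscores, <b>/<strong> become asterisks.
    // The \b keeps <img> and <br> from matching the "i" and "b" alternatives.
    $text = preg_replace('~</?(i|em)\b[^>]*>~i', '_', $html);
    $text = preg_replace('~</?(b|strong)\b[^>]*>~i', '*', $text);

    // <br> and closing block tags become newlines so paragraphs don't run together.
    $text = preg_replace('~<br\s*/?>|</(p|div|h[1-6]|li)>~i', "\n", $text);

    // Remove whatever markup is left, then decode entities once.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of three or more newlines and trim the ends.
    return trim(preg_replace("~\n{3,}~", "\n\n", $text));
}

echo tagsToPlainText('<p>Hello <i>world</i>,</p><p>this is <b>bold</b> &amp; plain.</p>');
```

Regular expressions like these are only reasonable for the small, predictable markup an editor such as TinyMCE emits; for arbitrary HTML a DOM-based converter is the safer choice.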

You can use lynx with the -stdin and -dump options to achieve that:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

// No "2>&1" here: stderr is already routed to the log file by descriptor 2
$process = proc_open('lynx -stdin -dump', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}

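You can try this function:

function html2text($Document) {
    // Each pattern in $Rules is paired with the entry at the same index in $Replace.
    $Rules = array ('@<script[^>]*?>.*?</script>@si', // script blocks
                    '@<[\/\!]*?[^<>]*?>@si',          // all remaining tags
                    '@([\r\n])[\s]+@',                // whitespace after a line break
                    '@&(quot|#34);@i',
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i'
             );
    // The literal characters below assume this file is saved as UTF-8.
    $Replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      '¡',
                      '¢',
                      '£',
                      '©',
                      '®'
                );
    $Document = preg_replace($Rules, $Replace, $Document);

    // The /e modifier was removed in PHP 7, so decode any remaining
    // numeric entities (&#NNN;) with a callback instead.
    return preg_replace_callback('@&#\d+;@', function ($m) {
        return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
    }, $Document);
}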

None of the existing solutions fit my use case: converting simple HTML e-mails to simple plain-text files.

I've opened up this repository; I hope it helps someone. MIT license, by the way :)

https://github.com/RobQuistNL/SimpleHtmlToText

Example:

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

Returns:

**This is HTML**
### Header ###


Newlines

function plainText($text)
{
    // Keep only the tags that mark line boundaries
    $text = strip_tags($text, '<br><p><li>');
    // Replace each run of remaining tags with a single line break
    $text = preg_replace('/(<[^>]*>)+/', PHP_EOL, $text);

    return $text;
}

$text = "string 1<br>string 2<br/><ul><li>string 3</li><li>string 4</li></ul><p>string 5</p>";

echo plainText($text);

Output:
string 1
string 2
string 3
string 4
string 5

If you want to convert the HTML special characters rather than just remove them, as well as strip the markup and prepare the result for plain text, this is the solution that worked for me:

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode with ENT_QUOTES | ENT_XML1 converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.
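One point worth illustrating about the decoding steps described above is their order. If entities are decoded before strip_tags runs, text that was meant literally (for example &lt;b&gt;) turns into a real tag and is then stripped, and repeated decoding can decode the same text twice. The sketch below strips first and decodes once; the function name `stripThenDecode` is an assumption made for this example.

```php
<?php
// Hypothetical helper: remove markup first, then decode entities exactly once.
function stripThenDecode(string $html): string
{
    // 1. Remove real tags while entities are still encoded.
    $text = strip_tags($html);

    // 2. Decode named and numeric entities in a single pass.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 3. &nbsp; decodes to U+00A0; use an ordinary space in plain-text mail.
    $text = str_replace("\u{00A0}", ' ', $text);

    return trim($text);
}

echo stripThenDecode('<p>Use &lt;b&gt; for bold &amp; &#39;quotes&#39;&nbsp;here</p>');
```

With this ordering the literal "<b>" in the sample survives as text, which is usually what a plain-text e-mail should show.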

Markdownify converts HTML to Markdown, the plain-text formatting system used on this very site.

I ran into the same problem as the OP, and the solutions from the top answers above didn't work for my scenarios. See why at the end.

Instead, I found this helpful script. To avoid confusion, let's call it html2text_roundcube; it is available under the GPL:

It's actually an updated version of an already mentioned script - http://www.chuggnutt.com/html2text.php - updated by RoundCube mail.

Usage:

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

Why html2text_roundcube proved better than the others:

  • Script http://www.chuggnutt.com/html2text.php didn't work out of the box for cases with special HTML codes/names (e.g. &auml;) or unpaired quotes (e.g. <p>25" Monitor</p>).

  • Script https://github.com/soundasleep/html2text had no option to hide or group the links at the end of the text, making a usual HTML page look bloated with links when in plain-text format; customizing the code for special treatment of how the transformation is done is not as straightforward as simply editing an array in html2text_roundcube.

For UTF-8 text, mb_convert_encoding worked for me. To process everything regardless of parse errors, make sure you use the "@" error-suppression operator.

The basic code I use is:

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;
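Note that using 'HTML-ENTITIES' as an mbstring target encoding is deprecated as of PHP 8.2. A possible alternative, sketched below under that assumption, is to declare the charset in the markup itself so that libxml reads the input as UTF-8, and to collect parse warnings with libxml_use_internal_errors() instead of suppressing them with "@". The function name `htmlBodyText` is hypothetical.

```php
<?php
// Hypothetical helper; a sketch, not the answer author's code.
function htmlBodyText(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings instead of suppressing them with "@".
    $previous = libxml_use_internal_errors(true);

    // Declaring the charset up front tells libxml to treat the input as UTF-8.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $body->textContent;
}

echo htmlBodyText('<p>Grüße aus Köln</p>');
```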

If you want something more advanced, you can walk the nodes iteratively, but you will run into many problems with whitespace.
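The node-by-node approach mentioned above could be sketched roughly as follows. This is an illustration only: the function names `nodeToText` and `domHtmlToText`, the tag-to-marker mapping, and the whitespace rules are all assumptions chosen for the example, not the behavior of any converter named in this thread.

```php
<?php
// Hypothetical sketch: walk the DOM recursively, map a few tags to
// plain-text markers, and normalize whitespace along the way.
function nodeToText(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // HTML renders any run of whitespace as a single space.
        return preg_replace('~\s+~u', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // comments, processing instructions, etc.
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= nodeToText($child);
    }

    switch (strtolower($node->nodeName)) {
        case 'script':
        case 'style':
            return '';
        case 'i':
        case 'em':
            return '_' . $inner . '_';
        case 'b':
        case 'strong':
            return '*' . $inner . '*';
        case 'br':
            return "\n";
        case 'p':
        case 'div':
        case 'li':
            // Block elements sit on their own lines.
            return "\n" . trim($inner) . "\n";
        default:
            return $inner;
    }
}

function domHtmlToText(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The meta tag makes libxml read the input as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Drop spaces around line breaks, then collapse runs of blank lines.
    $text = preg_replace('~[ \t]*\n[ \t]*~', "\n", nodeToText($body));
    return trim(preg_replace("~\n{3,}~", "\n\n", $text));
}

echo domHtmlToText('<p>Hello   <i>world</i></p><p>Second <b>line</b><br>here</p>');
```

Most of the code deals with whitespace rather than tags, which is the difficulty the answer warns about: collapsing runs inside text nodes, trimming block content, and tidying line breaks afterwards.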

I have implemented a converter based on what I describe here. If you are interested, you can download it from GitHub: https://github.com/kranemora/html2text

It may serve as a reference for writing your own.

You can use it like this:

$html = <<<EOF
<p>Welcome to <strong>html2text</strong></p>
<p>It's <em>works</em> for you?</p>
EOF;

$html2Text = new \kranemora\Html2Text\Html2Text;
$text = $html2Text->convert($html);

Markdownify worked wonderfully for me! One thing worth mentioning: it supports UTF-8 perfectly, which was the main reason I was looking for a solution other than html2text (mentioned earlier in this thread).

I have just found the PHP function strip_tags(), and it's working in my case.

I tried to convert the following HTML:

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

After applying the strip_tags() function, I got the following output:

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?

If you want to drop the tags but keep the content inside them, you can use DOMDocument and extract the textContent of the root node like this:

function html2text($html) {
    $dom = new DOMDocument();
    $dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('body')->item(0);
    return $node->textContent; // text
}

$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

One advantage of this approach is that it does not require any external packages.

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this content, please cite this site's URL or the original source.

 