Convert HTML table to text

I'm working on a project that requires converting HTML email into text. Below is a simplified version of the HTML code:

<table>
    <tr>
        <td width="10%"></td>
        <td width="60%"> test product </td>
        <td width="20%">5</td>
        <td width="10%"> £50.00 </td>
    </tr>
    <tr>
        <td></td>
        <td colspan="3" width="100%"> Project Name: Test Project </td>
    </tr>
    <tr>
        <td width="10%"> </td>
        <td colspan="2" width="80%"> Page 1 : 01 New York 1.jpg </td>
        <td width="10%"> £0.00 </td>
    </tr>
</table>

The expected outcome should look like this in a text file (with columns aligned nicely):

test product                                      5            £50.00
Project Name: Test Project                                                            
Page 1 :  01 New York 1.jpg                                    £0.00

My idea is to parse the HTML content with DOMDocument. I would then set a default width for the table (e.g. 100 spaces) and convert the width of each column from a percentage to a number of spaces (based on the colspan and width attributes of the <td> tag). Finally, I would subtract the strlen of the data in each cell from its column width to get the number of spaces I need to pad on the right of the string so that everything aligns vertically.
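The arithmetic described above can be sketched as follows. This is only an illustration under assumptions: the rows are assumed to be already extracted from the HTML as lists of [text, colspan], the grid widths are taken from the width attributes of the first row, and renderRow() is a hypothetical helper. It uses mb_strlen() (which needs the mbstring extension) rather than strlen(), because strlen() counts bytes and "£" is two bytes in UTF-8.

```php
<?php
// Sketch only: renderRow() and the [text, colspan] cell format are hypothetical.

// Render one table row as a single line of text.
// $cells is a list of [text, colspan]; $colWidths holds the width of each
// grid column in characters.
function renderRow(array $cells, array $colWidths): string
{
    $line = '';
    $col = 0;
    foreach ($cells as [$text, $colspan]) {
        // A cell spanning several columns gets the sum of their widths.
        $width = array_sum(array_slice($colWidths, $col, $colspan));
        $text = trim($text);
        // mb_strlen() counts characters; text longer than the cell overflows.
        $pad = max(0, $width - mb_strlen($text, 'UTF-8'));
        $line .= $text . str_repeat(' ', $pad);
        $col += $colspan;
    }
    return rtrim($line);
}

$tableWidth = 100;            // assumed total width in characters
$percents = [10, 60, 20, 10]; // width attributes of the first row
$colWidths = array_map(
    function ($p) use ($tableWidth) { return intdiv($tableWidth * $p, 100); },
    $percents
);

$rows = [
    [['', 1], ['test product', 1], ['5', 1], ['£50.00', 1]],
    [['', 1], ['Project Name: Test Project', 3]],
    [['', 1], ['Page 1 : 01 New York 1.jpg', 2], ['£0.00', 1]],
];

$lines = [];
foreach ($rows as $row) {
    $lines[] = renderRow($row, $colWidths);
}
echo implode("\n", $lines), "\n";
```

Resolving colspan against one fixed grid, instead of trusting the width attribute on every cell, keeps the price column at the same offset in the first and third rows. The empty leading cell still takes up its 10 columns, and nothing here handles text that is longer than its cell.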

I have been working along these lines but haven't achieved what I want yet. I'm just wondering whether this approach is stupid; if anyone knows a better way, please help me out.

Also, when it comes to multibyte languages (Japanese, Korean, etc.), I don't think my approach would work, because their characters are wider than one space and the result ends up a mess.
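For the wide-character concern, one possible starting point is PHP's mb_strwidth(), which reports display width rather than character count and treats East Asian wide characters as two columns. The sketch below assumes the mbstring extension is loaded and that the output is viewed in a monospaced font that really draws those characters at double width; padToWidth() is a hypothetical helper, not a built-in.

```php
<?php
// Sketch only: padToWidth() is a hypothetical helper, not a built-in.

// Pad $text on the right until it occupies $width display columns.
function padToWidth(string $text, int $width): string
{
    // mb_strwidth() counts East Asian wide characters as 2 columns.
    $pad = max(0, $width - mb_strwidth($text, 'UTF-8'));
    return $text . str_repeat(' ', $pad);
}

// The trailing "|" should line up if the font renders wide characters
// at exactly twice the width of a space.
echo padToWidth('test product', 16), "|\n";
echo padToWidth('日本語', 16), "|\n";
echo padToWidth('한국어', 16), "|\n";
```

Whether the columns line up in practice still depends on the viewer, so this only addresses the counting side of the problem.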

Can someone help me out please?

Don't reinvent the wheel. Table rendering is difficult, and rendering tables using only text is even more difficult. To get a sense of the complexity of a text-based table renderer that offers all the features of HTML, take a look at w3m, which is open source: it has about 3000 lines of code that are there only to display HTML tables.

Transform HTML to Text

There are text-based browsers that can be used from the command line, like lynx. You could fwrite your HTML table into a file, pass that file to the text-based browser, and take its output.

Note: text-based browsers are generally used in a shell, which generally displays text in a monospaced font. This remains a prerequisite: the output only lines up when it is viewed in a monospaced font.

lynx and w3m are both available on Windows, and you don't need to "install" them; you just need to have the executables and the permission to run them from PHP.

Code example:

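<?php
$table = '<table><tr><td>foo</td><td>bar</td></tr></table>'; // this contains your table
$html = "<html><body>$table</body></html>";

// Write the HTML to a temporary file
$tmpfname = tempnam(sys_get_temp_dir(), "tblemail");

$handle = fopen($tmpfname, "w");
fwrite($handle, $html);
fclose($handle);

// The temp file has no .html extension, so tell w3m the content type with -T;
// escapeshellarg() quotes the path safely for the shell
$myTextTable = shell_exec("w3m.exe -dump -T text/html " . escapeshellarg($tmpfname));
unlink($tmpfname);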

w3m.exe needs to be in your working directory.

(I haven't tried it.)

Render a Text table

If you want a native PHP solution, there's also at least one framework ( https://github.com/c9s/CLIFramework ) aimed at console applications for PHP which has a table renderer.

It doesn't transform HTML to text, but it helps you build a text-formatted table with support for multiline cells (which seems to be the most complicated part).

Using CLIFramework, you would need code like this to render your table:

<?php
require 'vendor/autoload.php';
use CLIFramework\Component\Table\Table;

$table = new Table;
$table->addRow(array( 
    "test product", "5", "£50.00"
));
$table->addRow(array( 
    "Project Name: Test Project", "", ""
));
$table->addRow(array( 
    "Page 1 : 01 New York 1.jpg", "", "£0.00"
));

$myTextTable = $table->render();

However, the CLIFramework table renderer doesn't seem to support anything similar to "colspan".

Here's the documentation for the table component: https://github.com/c9s/CLIFramework/wiki/Using-Table-Component
