
Convert HTML table to text

I'm working on a project that requires converting HTML email into plain text. Below is a simplified version of the HTML:

<table>
    <tr>
        <td width="10%"></td>
        <td width="60%"> test product </td>
        <td width="20%">5</td>
        <td width="10%"> £50.00 </td>
    </tr>
    <tr>
        <td></td>
        <td colspan="3" width="100%"> Project Name: Test Project </td>
    </tr>
    <tr>
        <td width="10%"> </td>
        <td colspan="2" width="80%"> Page 1 : 01 New York 1.jpg </td>
        <td width="10%"> £0.00 </td>
    </tr>
</table>

The expected outcome should look like this in a text file (with columns aligned nicely):

test product                                      5            £50.00
Project Name: Test Project                                                            
Page 1 :  01 New York 1.jpg                                    £0.00

My idea is to parse the HTML content with DOMDocument, set a default width for the table (e.g. 100 characters), and then convert the width of each column from a percentage to a number of characters (based on the colspan and width attributes of each <td> tag). I would then subtract the strlen of the data in each cell from its column width to get the number of spaces I need to right-pad the string with so that everything aligns vertically.
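A minimal sketch of the parsing step in this plan, assuming the markup looks like the sample above (one flat table containing only <td> cells). The function name extractRows and the array keys are hypothetical, chosen only for illustration; the XML declaration prefix is a common workaround to make DOMDocument treat the input as UTF-8.

```php
<?php
// Hypothetical helper: returns one array per <tr>, with one entry per <td>.
function extractRows(string $tableHtml): array
{
    $doc = new DOMDocument();
    // Email HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $tableHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $row = [];
        // Note: this also picks up cells of nested tables, which the sample does not have.
        foreach ($tr->getElementsByTagName('td') as $td) {
            $width = $td->getAttribute('width'); // '' when the attribute is missing
            $row[] = [
                'text'    => trim($td->textContent),
                'colspan' => max(1, (int) $td->getAttribute('colspan')),
                'width'   => $width === '' ? null : (int) $width, // "60%" becomes 60
            ];
        }
        $rows[] = $row;
    }
    return $rows;
}
```

Each percentage can then be turned into a character count, for example (int) round($tableWidth * $cell['width'] / 100). In the sample, the second row declares width="100%" for a cell that spans only three of the four columns, so the width attributes alone are not always consistent.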

I have been working along these lines but haven't achieved what I want yet. Is this a poor approach? If anyone knows a better way, please help me out.

Also, when it comes to multibyte languages (Japanese, Korean, etc.), I don't think my approach would work, because their characters are wider than one space and the output would end up a mess.
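One way to deal with this, sketched under the assumption that the mbstring extension is available and the text is UTF-8: measure the display width with mb_strwidth, which counts East Asian wide characters as two columns, instead of strlen, which counts bytes. padToWidth is a hypothetical helper name.

```php
<?php
// Hypothetical helper: right-pad $text with spaces up to $width display columns.
function padToWidth(string $text, int $width): string
{
    // mb_strwidth counts wide (e.g. CJK) characters as 2 columns, others as 1.
    $missing = $width - mb_strwidth($text, 'UTF-8');
    // Text that is already too wide is returned unchanged (no truncation).
    return $text . str_repeat(' ', max(0, $missing));
}

echo padToWidth('test product', 20), "|\n";
echo padToWidth('テスト製品', 20), "|\n";
echo padToWidth('£50.00', 20), "|\n";
```

Byte-based padding is already off for the sample data: in UTF-8 the £ sign takes two bytes, so strlen('£50.00') is 7 rather than 6. Whether the columns line up on screen still depends on the viewer using a monospace font that draws wide characters at exactly two cells.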

Can someone help me out please?

Don't reinvent the wheel. Rendering tables is difficult, and rendering them using only text is even more difficult. To get a sense of how complex a text-based table renderer that offers all the features of HTML is, take a look at w3m, which is open source: these 3000 lines of code are there only to display HTML tables.

Transform HTML to Text

There are text-based browsers that can be run from the command line, such as lynx. You could fwrite your HTML table to a file, pass that file to the text-based browser, and capture its output.

Note: text-based browsers are generally used in a shell, which generally displays text in a monospace font. The output is only aligned when it is viewed in monospace.

Both lynx and w3m are available on Windows, and you don't need to "install" them: you only need the executables and permission to run them from PHP.

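Code example:

<?php
$table = '<table><tr><td>foo</td><td>bar</td></tr></table>'; // this contains your table
$html = "<html><body>$table</body></html>";

// Write the HTML to a temporary file
$tmpfname = tempnam(sys_get_temp_dir(), "tblemail");

$handle = fopen($tmpfname, "w");
fwrite($handle, $html);
fclose($handle);

// Render the file as plain text. The temporary file has no .html extension,
// so -T text/html tells w3m to treat it as HTML; escapeshellarg quotes the path safely.
$myTextTable = shell_exec("w3m.exe -T text/html -dump " . escapeshellarg($tmpfname));
unlink($tmpfname);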

w3m.exe needs to be in your working directory.

(I haven't tested this.)
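As a variation on the example above, the temporary file can be avoided by piping the HTML to the renderer's standard input. This is only a sketch: it assumes a w3m binary on the PATH that accepts -T text/html (treat the input as HTML), -dump and -cols (output width), and htmlToText is a hypothetical helper name. Passing the command as an array to proc_open requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper: feed $html to an external renderer and return its output,
// or null if the process could not be started or exited with an error.
function htmlToText(
    string $html,
    array $command = ['w3m', '-T', 'text/html', '-dump', '-cols', '100']
): ?string {
    $spec = [
        0 => ['pipe', 'r'], // the child's stdin, which we write to
        1 => ['pipe', 'w'], // the child's stdout, which we read from
    ];
    $process = proc_open($command, $spec, $pipes);
    if (!is_resource($process)) {
        return null;
    }

    // Fine for email-sized input; very large documents would need
    // non-blocking I/O so that neither pipe buffer fills up.
    fwrite($pipes[0], $html);
    fclose($pipes[0]);

    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $exitCode = proc_close($process);
    return $exitCode === 0 ? $text : null;
}
```

The command is a parameter so that lynx or another renderer that reads from standard input can be substituted; its flags would have to be checked against that tool's documentation.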

Render a Text table

If you want a native PHP solution, there's also at least one framework (https://github.com/c9s/CLIFramework) aimed at console applications for PHP which has a table renderer.

It doesn't transform HTML to text, but it helps you build a text-formatted table with support for multiline cells (which seems to be the most complicated part).

Using CLIFramework, you would need code like this to render your table:

<?php
require 'vendor/autoload.php';
use CLIFramework\Component\Table\Table;

$table = new Table;
$table->addRow(array( 
    "test product", "5", "£50.00"
));
$table->addRow(array( 
    "Project Name: Test Project", "", ""
));
$table->addRow(array( 
    "Page 1 : 01 New York 1.jpg", "", "£0.00"
));

$myTextTable = $table->render();

The CLIFramework table renderer doesn't seem to support anything similar to "colspan" however.
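If colspan is needed, one option is a small hand-written row renderer. The following is a rough sketch, not a replacement for a real layout engine: renderRow is a hypothetical helper, the column widths are given in characters by the caller, a spanning cell simply receives the combined width of the columns it covers, and text longer than its cell is neither wrapped nor truncated. It assumes the mbstring extension and UTF-8 input.

```php
<?php
// Hypothetical helper. $cells: list of ['text' => string, 'colspan' => int];
// $colWidths: width of each table column in characters.
function renderRow(array $cells, array $colWidths): string
{
    $line = '';
    $col = 0; // index of the first column covered by the current cell
    foreach ($cells as $cell) {
        // A spanning cell gets the summed width of the columns it covers.
        $width = array_sum(array_slice($colWidths, $col, $cell['colspan']));
        $col += $cell['colspan'];
        // Pad by display width so multibyte characters such as £ stay aligned.
        $pad = max(0, $width - mb_strwidth($cell['text'], 'UTF-8'));
        $line .= $cell['text'] . str_repeat(' ', $pad);
    }
    return rtrim($line);
}

// 10% / 60% / 20% / 10% of a 100-character table, as in the question.
$widths = [10, 60, 20, 10];
echo renderRow([
    ['text' => '', 'colspan' => 1],
    ['text' => 'test product', 'colspan' => 1],
    ['text' => '5', 'colspan' => 1],
    ['text' => '£50.00', 'colspan' => 1],
], $widths), "\n";
echo renderRow([
    ['text' => '', 'colspan' => 1],
    ['text' => 'Page 1 : 01 New York 1.jpg', 'colspan' => 2],
    ['text' => '£0.00', 'colspan' => 1],
], $widths), "\n";
```

With these widths, the price column starts at the same offset in both rows because the spanning cell in the second row takes up the space of columns two and three.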

Here's the documentation for the table component: https://github.com/c9s/CLIFramework/wiki/Using-Table-Component


 