Email HTML to CSV file

I have an email in HTML format. I need to download it and write the table it contains to a new CSV file that uses a semicolon as the field separator.

Example of the email received:

Content-Type: text/html; charset=UTF-8
<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st= yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">DH</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">Mon</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
(UTC)</th></tr><tr><th style=3D"padding: 8px;">Arael Amarel</th><th style=
=3D"padding: 8px;">30549214</th><th style=3D"padding: 8px;">099981496</th><=
th style=3D"padding: 8px;">43</th><th style=3D"padding: 8px;">-</th><th sty=
le=3D"padding: 8px;">2019-07-11T10:06:34.311Z</th></tr><tr><th style=3D"pad=
ding: 8px;background-color: #dddddd">MATIN TARDEI</th><th style=3D"padding=
: 8px;background-color: #dddddd">45159820</th><th style=3D"padding: 8px;bac=
kground-color: #dddddd">094432451</th><th style=3D"padding: 
8px;background-=
color: #dddddd">32</th><th style=3D"padding: 8px;background-color: #dddddd"=
-</th><th style=3D"padding: 8px;background-color: #dddddd">2019-07- 
11T10:2=
8:41.198Z</th></tr>

Needed CSV output:

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

If I open this mail in a mail client, the table renders fine. But I think there is a format problem with procmail: if I put the content saved by procmail into an .html file and open it, the content is impossible to process. Every line ends with a "=", which causes a lot of problems. Furthermore, there are several problems with the alignment of the table and other things that make it a nightmare to extract the content.

I made a procmailrc with a filter to convert the HTML format to plain text.

procmailrc file:

MAILDIR=/new/mail/htmlconvert
:0
* ^Content-Type: text/html.*;
{
:0c
$MAILDIR/converted/
:0fwb
| `which html2text`
:0fwh
| `which formail` -i "Content-Type: text/plain; charset=UTF-8"
}

This was attempt number 1, and it didn't work. I thought the converter to use was html2text. If I run html2text directly on the original file, the result is:

html2text

===============================================================================
 1px solid #dddddd;border-collapse: collapse;text-align: left;">
px;background-color: #cce6ff">NAME
px;background-color: #cce6ff">CI
= px;background-color: #cce6ff">DH
px;backgro= und-color: #cce6ff">FG
px;background-color: #c= ce6ff">Mon
px;background-color: #cce6ff">DATE= (UTC)
px;">Arael Amarel
px;">30549214
px;">099981496
<= th style=3D"padding: 8px;">43
px;">-
px;">2019-07-11T10:06:34.311Z
px;background-color: #dddddd">MATIN TARDEI
 8px;background-color: #dddddd">45159820
px;bac= kground-color: #dddddd">094432451
px;background-= color: #dddddd">32
px;background-color: #dddddd"= >-
px;background-color: #dddddd">2019-07-11T10:2= 8:41.198Z
px;">

I already tried lynx -dump -force-html on the file, and the result is no good for reaching the CSV output format.

html2text -nobs (file)

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

Update: I have applied tripleee's solution to the procmailrc. However, the format of the mail is still the same as the original source; qprint didn't change the format with this change. I did try running it directly on the file, and that works fine. The actual solution:

qprint -d -n <1563019338.1197_0.localhost.localdomain |
html2text -style pretty |
awk '/^-------------------------------------------------------------------------------/{p=1}p'

The line of dashes is the separator between the body of the mail and the content before it. This prints:

-------------------------------------------------------------------------------

NAME         CI       CD   FG  HJ DATE (UTC)
Yaiaa Fereeira        52104575 097325303 20    -     2019-07-12T10:46:24.716Z
Gabtiel Aosta Sclavi   42445135 098322361 42    -     2019-07-12T11:07:36.110Z

Now I need to turn this content into the CSV output. I thought this would be easier than the first part, but I want to automate it in procmail so that it happens when the mail is downloaded.

The result of procmail after changing the procmailrc is a mail whose body still has "=" at the end of each line, but whose header has:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8 

Update: the resulting email source with qprint in the procmailrc:

Return-Path: 
Delivered-To: 
Return-path: 
Envelope-to: 
Delivery-date: Sat, 13 Jul 2019 08:03:48 -0300
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Jul 2019 11:03:02 +0000 (UTC)
From: 
Mime-Version: 1.0
To: 
Message-ID: 
Subject:Fri Jul 12 2019
X-Spam-Flag: NO

<b>Fri Jul 12 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">NAME</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">CD</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">HJ</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
 (UTC)</th></tr><tr><th style=3D"padding: 8px;">Yaiaa Fereeira</th><th st=
yle=3D"padding: 8px;">52104575</th><th style=3D"padding: 8px;">097325303</t=
h><th style=3D"padding: 8px;">20</th><th style=3D"padding: 8px;">-</th><th =
style=3D"padding: 8px;">2019-07-12T10:46:24.716Z</th></tr>

I have the log in the stdin because procmail can't write the logfile, as you can see in this log detail:

1 message for aaa@aaa.com at aaa.com (25330 octets).
reading message aaa@aaa.com@aaa.com:1 of 1 (25330 octets)........................procmail: Error while writing to "/info/in/log"
procmail: [20191] Mon Jul 15 08:55:34 2019
procmail: Assigning "FORMAIL=/usr/bin/formail"
procmail: Assigning "QPRINT=/usr/local/bin/qprint"
procmail: Match on "^Content-Type: text/html;"
procmail: Assigning "LASTFOLDER=converted/new/1563191734.20191_0.localhost.localdomain"
 Subject: Sun Jul 14 2019
  Folder: converted/new/1563191734.20191_0.localhost.localdomain          24985
procmail: Executing " qprint -d -n | html2text -nobs "
procmail: Executing " formail -I "Content-Type: text/html; charset=UTF-8"
procmail: Skipped "Mail"
procmail: Skipped "/"
From aaaaaa.com@aaa.com  Mon Jul 15 08:55:34 2019
 Subject: Sun Jul 14 2019
  Folder: **Bounced**                                                     24985
fetchmail: MDA returned nonzero status 73
 not flushed

The sample in your post does not look like a valid email body at all. I'm guessing it's a body part within a MIME message with Content-type: text/html (as vaguely indicated) and Content-transfer-encoding: quoted-printable. The latter is what introduces the = escapes which you regard as problematic. Decoding them is actually fairly trivial, but how exactly to do that from Procmail depends on the overall composition of the containing message, and the utilities available to you. Unfortunately, Procmail itself has no idea about MIME structures, so you'll have to rely on external tools.
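To illustrate what the quoted-printable layer does, here is a minimal sketch using Python's standard quopri module rather than qprint. The `encoded` fragment is a shortened, hypothetical excerpt modelled on the sample in the question; it is not the full message. A trailing = is a soft line break that joins two lines, and =3D is an encoded equals sign.

```python
import quopri

# Shortened, hypothetical fragment of a quoted-printable HTML body.
# A trailing "=" is a soft line break; "=3D" encodes a literal "=".
encoded = (
    b'<table style=3D"border=\n'
    b': 1px solid #dddddd;"><tr><th st=\n'
    b'yle=3D"padding: 8px;">Name</th></tr></table>\n'
)

# decodestring() works on bytes; decode to text afterwards using the
# charset announced in the Content-Type header (UTF-8 here).
decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```

After decoding, the soft line breaks are gone and the markup is ordinary HTML again, which is what html2text or any other HTML tool expects to receive.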

As an aside, the `which...` commands in your recipe are completely redundant. For which to work, the utilities you are looking for need to be in your PATH ... which means Procmail can find them without which.

If something is not in Procmail's default PATH, simply update PATH near the top of your .procmailrc file. This should also remove the need to use variables like $FORMAIL etc. Just use formail and make sure it's available on Procmail's PATH.

For your recipe to work, the MIME structure needs to be a single-part message. If that is indeed the case, and your html2text is otherwise correct, the only fix you need is to decode the content-transfer-encoding before piping through that. Assuming you have qprint, and with the superfluous which calls removed, that leaves

:0
* ^Content-Type: text/html.*;
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | qprint -d | html2text
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

If in fact the MIME body structure is more complex, perhaps edit your question to include the actual email source instead of your current ad-lib paraphrase.

In other words, and in some more detail, if your input message looks like

From: sender <sender@example.net>
To: you <you@example.org>
Subject: HTML table
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

then the recipe above should basically work. But on the other hand, if your actual message is more like

From: sender <sender@example.net>
To: you <you@example.org>
Subject: HTML table
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=0xdeadbeef

This is a multi-part MIME message.

--0xdeadbeef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

--0xdeadbeef--

then not only will the first condition fail to match (the top-level headers don't contain Content-type: text/html), but the actions inside the block will also need to be updated in several places, because the MIME wrapping around the HTML body part needs to be unwrapped or otherwise restructured. Here is a really quick and dirty attempt at solving this.

:0
* ^Content-Type: multipart/mixed
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | perl -0777 -pe 's/=([0-9A-F]{2})/ chr(oct("0x$1"))/ge; \
    s/=\n//g; \
    s%</table>.*%%s; \
    s%.*<table[^<>]*>%%s; \
    s%<tr[^<>]*><t[dh][^<>]*>%\n%g; \
    s%<t[dh][^<>]*>%;%g; \
    s%</t[rdh]>%%g; \
    s%^\n+%%;'
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

With minor adaptations, it should work for the single-part variation, too. But you should realize that the Perl script is a really rough cut, not a proper HTML parser.
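For comparison, here is a rough sketch of the same table extraction done with an actual HTML parser, using only Python's standard library. It assumes the body has already been decoded from quoted-printable; the names `TableRows` and `table_to_csv` are made up for this illustration and are not part of any tool mentioned here.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. the <b> date heading) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._row is not None and self._cell is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table cells in html as semicolon-separated lines."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()
```

Unlike the regex version, this does not care how the style attributes are written or where the line breaks fall, and the csv module takes care of quoting if a cell ever contains a semicolon.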

The f flag causes Procmail to replace the input message with the output from the pipeline. The formail call is then necessary because the original MIME headers are no longer correct after you have replaced the original content with content of a different type and with a different encoding. If you just want to extract the CSV data into an external file instead, the formail call can be skipped and the filtering recipe can be simplified to just

:0
* ^Content-type: text/html
{
  :0c
  converted/
  :0b  # no w flag necessary either once we drop f
  | qprint -d | html2text >>result.csv
}

where again we assume a single-part MIME message as input. Whether to overwrite the output file instead of appending (or perhaps write to a different CSV file each time) will depend on your specific use case, and how often you expect to receive these messages.
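If the single-part assumption is uncertain, one possible way around it is an external script that understands MIME. The sketch below uses Python's standard email package to find the first text/html part in either a single-part or a multipart message and return it with the transfer encoding already decoded. The function name `html_body` is hypothetical, and how the script is wired into a recipe (for example, reading the message from standard input) is left open.

```python
import email
from email import policy


def html_body(raw_message):
    """Return the decoded text of the first text/html part, or None.

    raw_message is the complete message as bytes, headers included.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # walk() yields the message itself first, so a single-part
    # text/html message is handled by the same loop as a multipart one.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes quoted-printable or base64 and
            # decodes the bytes using the part's declared charset.
            return part.get_content()
    return None
```

Because the transfer encoding is handled by the library, no separate qprint step is needed with this approach.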


(Not in particular an endorsement of qprint; there are many comparable utilities, but nothing particularly ubiquitous. It's unfortunate that the GNU Coreutils maintainers steadfastly refuse to include a similar utility.)
