
How to extract file data from an HTTP MIME-encoded message in Linux?

I have a program that accepts HTTP POST file uploads and writes the entire POST request to a file. I want to write a script that strips the HTTP headers and leaves only the binary file data. How can I do this?

The file content is below (the data between Content-Type: application/octet-stream and ------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3 is what I want):

POST /?user_name=vvvvvvvv&size=837&file_name=logo.gif& HTTP/1.1^M
Accept: text/*^M
Content-Type: multipart/form-data; boundary=----------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
User-Agent: Shockwave Flash^M
Host: 192.168.0.198:9998^M
Content-Length: 1251^M
Connection: Keep-Alive^M
Cache-Control: no-cache^M
Cookie: cb_fullname=ddddddd; cb_user_name=cdc^M
^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Filename"^M
^M
logo.gif^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Filedata"; filename="logo.gif"^M
Content-Type: application/octet-stream^M
^M
GIF89an^@I^^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Upload"^M
^M
Submit Query^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3-

If you use Python, email.parser.Parser will let you parse a multipart MIME document.
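A minimal sketch of that suggestion, assuming the whole POST has been saved as raw bytes. The HTTP request line is not a MIME header, so it is dropped before parsing. The function name extract_field and its field parameter are illustrative, not part of any library, and the sketch uses email.message_from_bytes rather than the text-based Parser because the payload is binary.

```python
from email import message_from_bytes


def extract_field(raw: bytes, field: str = "Filedata") -> bytes:
    """Return the body of the multipart/form-data part named `field`."""
    # Drop the request line ("POST ... HTTP/1.1"); the rest parses as MIME.
    _request_line, _, rest = raw.partition(b"\r\n")
    msg = message_from_bytes(rest)
    for part in msg.walk():
        name = part.get_param("name", header="content-disposition")
        if name == field:
            # decode=True returns bytes rather than str.
            return part.get_payload(decode=True)
    raise ValueError(f"no form-data part named {field!r}")
```

Reading the dump with open("input", "rb").read() and passing it to extract_field should give back the uploaded bytes, which can then be written out in binary mode.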

Do you want to do this while the file is being transferred, or after the transfer has finished?

Almost any scripting language should work. My AWK is a bit rusty, but...

awk '/^Content-Type: application\/octet-stream/,/^--------/'

That should print everything between the application/octet-stream and ---------- lines. It might also include both of those lines, which means you'll have to do something a bit more complex:

BEGIN {state = 0}
{
    if ($0 ~ /^------------/) {
        state = 0;
    }
    if (state == 1) {
        print $0
    }
    if ($0 ~ /^Content-Type: application\/octet-stream/) {
        state = 1;
    }
}

The application/octet-stream check comes after the print statement because you want to set state to 1 only after you have seen the application/octet-stream line, so that the line itself is not printed.
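The same state machine can be sketched in Python on bytes, which avoids feeding binary data through a text tool. This is only an illustration under a few assumptions: the function name extract_octet_stream is made up, an extra state is added to skip the blank line that ends the part headers, and the boundary is recognised by the same twelve leading dashes as in the awk script.

```python
def extract_octet_stream(raw: bytes) -> bytes:
    """Line-based state machine, mirroring the awk script but on bytes."""
    SEEK, HEADERS, BODY = range(3)
    state = SEEK
    body = []
    # keepends=True preserves every original line ending inside the data.
    for line in raw.splitlines(keepends=True):
        if state == SEEK:
            if line.startswith(b"Content-Type: application/octet-stream"):
                state = HEADERS
        elif state == HEADERS:
            # The blank line ends the part headers.
            if line in (b"\r\n", b"\n"):
                state = BODY
        elif state == BODY:
            if line.startswith(b"------------"):
                break
            body.append(line)
    data = b"".join(body)
    # The line break right before the boundary belongs to the boundary.
    if data.endswith(b"\r\n"):
        return data[:-2]
    return data[:-1] if data.endswith(b"\n") else data
```

A data line that happens to begin with twelve dashes would end the body early, so this remains a heuristic rather than a full multipart parser.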

Of course, this being Unix, you could pipe the output of your program through awk and then save the file.

This may be a crazy idea, but I would try stripping the headers with procmail.

Look at the MIME::Tools suite for Perl. It has a rich set of classes; I'm sure you could put something together in just a few lines.

This probably contains some typos or something, but bear with me anyway. First determine the boundary (input is the file containing the data; pipe it in if necessary):

boundary=`grep '^Content-Type: multipart/form-data; boundary=' input|sed 's/.*boundary=//'`

Then filter the Filedata part:

fd='Content-Disposition: form-data; name="Filedata"'
sed -n "/$fd/,/$boundary/p"

The last step is to filter out a few extra lines: the header lines up to and including the empty line, and the boundary itself. So change the last line above to:

sed -n "/$fd/,/$boundary/p" | sed '1,/^\r\?$/d' | sed '$d'
  • sed -n "/$fd/,/$boundary/p" prints the lines from the Filedata header to the boundary (inclusive),
  • sed '1,/^\r\?$/d' deletes everything up to and including the first empty line (so it removes the part headers); the \r\? is there because the lines in this dump end in CR LF, which a plain /^$/ would never match (\r and \? are GNU sed syntax), and
  • sed '$d' removes the last line (the boundary).

After this, you wait for Dennis (see comments) to optimize it and you get this:

sed "1,/$fd/d;/^$/d;/$boundary/,\$d"

Now that you've come this far, scratch all of this and do what Ignacio suggested. The reason: this probably won't work reliably here, as a GIF is binary data.
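For comparison, the same three steps (find the boundary, pick the Filedata part, trim the headers and the trailing line break) can be done on bytes, where binary content is not a problem. This is a rough sketch, not a full multipart parser: extract_filedata and its field parameter are hypothetical names, and it assumes CR LF line endings as in the dump above.

```python
import re


def extract_filedata(raw: bytes, field: bytes = b"Filedata") -> bytes:
    """Binary-safe take on the sed steps: find the boundary, pick the part."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # Step 1: read the boundary from the request headers.
    match = re.search(rb"boundary=([^\r\n;]+)", head)
    if match is None:
        raise ValueError("no multipart boundary in the request headers")
    delimiter = b"--" + match.group(1).strip(b'"')
    # Step 2: split the body into parts and look for the wanted field.
    for part in body.split(delimiter):
        part_head, sep, data = part.partition(b"\r\n\r\n")
        if sep and b'name="' + field + b'"' in part_head:
            # Step 3: drop the CRLF that precedes the next boundary.
            return data[:-2] if data.endswith(b"\r\n") else data
    raise ValueError("field not found")
```

Splitting on the delimiter anywhere in the body is a simplification; it relies on the boundary string not occurring inside the uploaded data, which is what the boundary is chosen for.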

Ah, it was a good exercise! Anyway, for the lovers of sed, here's the excellent page:

Outstanding information.
