cgi.FieldStorage with multipart/form-data tries to decode binary file as UTF-8 if "filename=" not specified
When I use cgi.FieldStorage to parse a multipart/form-data request (or any web framework like Pyramid which uses cgi.FieldStorage), I have trouble processing file uploads from certain clients which don't provide a filename=file.ext in the part's Content-Disposition header.
If the filename= option is missing, FieldStorage() tries to decode the contents of the file as UTF-8 and returns a string. Obviously many files are binary rather than UTF-8, so the result is bogus. For example:
>>> import cgi
>>> import io
>>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' +
... b'Content-Disposition: form-data; name=payload\r\n\r\n' +
... b'\xff\xd8\xff\xe0\x00\x10JFIF')
>>> env = {
... 'REQUEST_METHOD': 'POST',
... 'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH-itP09uVKjjZiegh7',
... 'CONTENT_LENGTH': len(body),
... }
>>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env)
>>> (fs['payload'].filename, fs['payload'].file.read())
(None, '����\x00\x10JFIF')
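The lossy step can be reproduced without cgi at all. This is a minimal sketch, assuming the part is decoded as UTF-8 with the 'replace' error handler (the defaults for FieldStorage's encoding and errors arguments); the variable names are only for illustration:

```python
# The same JPEG header bytes used in the request body above.
data = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Decoding with errors='replace' turns each undecodable byte into U+FFFD.
text = data.decode('utf-8', errors='replace')

# Encoding the text again does not give back the original bytes,
# because the replacement characters carry no trace of what they replaced.
roundtrip = text.encode('utf-8')

print(repr(text))
print(roundtrip == data)
```

Once the replacement characters are in the string, the original bytes are gone, which is why the upload cannot be repaired after the fact.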
Browsers and most HTTP libraries do include the filename= option for file uploads, but I'm currently dealing with a client that doesn't (and omitting the filename does seem to be valid according to the spec).
Currently I'm using a pretty hacky workaround: subclassing FieldStorage and replacing the relevant Content-Disposition header with one that does have the filename:
import cgi
import os

class FileFieldStorage(cgi.FieldStorage):
    """To use, subclass FileFieldStorage and override _file_fields with a tuple
    of the names of the file field(s). You can also override _file_name with
    the filename to add.
    """

    _file_fields = ()
    _file_name = 'file_name'

    def __init__(self, fp=None, headers=None, outerboundary=b'',
                 environ=os.environ, keep_blank_values=0, strict_parsing=0,
                 limit=None, encoding='utf-8', errors='replace'):
        if self._file_fields and headers and headers.get('content-disposition'):
            content_disposition = headers['content-disposition']
            key, pdict = cgi.parse_header(content_disposition)
            # Only touch form-data parts for the configured fields that
            # arrived without a filename.
            if (key == 'form-data' and pdict.get('name') in self._file_fields and
                    'filename' not in pdict):
                del headers['content-disposition']
                quoted_file_name = self._file_name.replace('"', '\\"')
                headers['content-disposition'] = '{}; filename="{}"'.format(
                    content_disposition, quoted_file_name)

        super().__init__(fp=fp, headers=headers, outerboundary=outerboundary,
                         environ=environ, keep_blank_values=keep_blank_values,
                         strict_parsing=strict_parsing, limit=limit,
                         encoding=encoding, errors=errors)
Using the body and env from my first test, this now works:
>>> class TestFieldStorage(FileFieldStorage):
...     _file_fields = ('payload',)
...
>>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env)
>>> (fs['payload'].filename, fs['payload'].file.read())
('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')
Is there some way to avoid this hack and tell FieldStorage not to decode as UTF-8? It would be nice if you could provide encoding=None or something, but it doesn't look like it supports that.
"I have trouble processing file uploads from certain clients which don't provide a filename=file.ext in the part's Content-Disposition header."
The filename= parameter is effectively the only way the server side can determine that a part represents a file upload. If a client omits this parameter, it isn't really sending a file upload, but a plain text form field. It's still technically legitimate to send arbitrary binary data in such a field, but many server environments, including Python's cgi, would be confused by it.
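The distinction can be sketched with the standard library's email.message.Message, which parses header parameters. The helper name is_file_upload is hypothetical, and this is a simplified illustration of the kind of check a multipart parser makes, not cgi's actual code:

```python
from email.message import Message


def is_file_upload(content_disposition: str) -> bool:
    """Return True if the part's Content-Disposition carries a filename.

    Hypothetical helper: it treats the presence of filename= as the
    only signal that a form-data part is a file upload.
    """
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_filename() returns None when no filename parameter is present.
    return msg.get_filename() is not None


print(is_file_upload('form-data; name=payload'))
print(is_file_upload('form-data; name="payload"; filename="photo.jpg"'))
```

Under this view, the first header describes an ordinary form field, so a parser has no reason to treat its body as binary.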
"It would be nice if you could provide encoding=None or something"
If you set errors to surrogateescape, you would at least be able to recover the original bytes from the decoded characters.
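As a minimal sketch of why this works, independent of cgi: the surrogateescape handler maps each undecodable byte to a lone surrogate code point, and encoding with the same handler maps it back. The helper name recover_bytes is hypothetical.

```python
def recover_bytes(text: str, encoding: str = 'utf-8') -> bytes:
    """Undo a decode that was done with errors='surrogateescape'.

    Hypothetical helper: it assumes `text` was produced by decoding
    with the same encoding and the surrogateescape error handler.
    """
    return text.encode(encoding, 'surrogateescape')


data = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of U+FFFD.
text = data.decode('utf-8', 'surrogateescape')

print(recover_bytes(text) == data)
```

The intermediate string is not meant for display or for encoding without the handler; a plain text.encode('utf-8') would raise UnicodeEncodeError on the lone surrogates.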
I ended up working around this using a somewhat simpler FieldStorage subclass, so I'm posting it here as an answer. Instead of overriding __init__ and adding a filename to the Content-Disposition header, you can just override the .filename attribute to be a property that returns a filename if one wasn't provided for that input:
class MyFieldStorage(cgi.FieldStorage):
    @property
    def filename(self):
        if self._original_filename is not None:
            return self._original_filename
        elif self.name == 'payload':
            # No filename was sent for the file field, so supply one.
            return 'file_name'
        else:
            return None

    @filename.setter
    def filename(self, value):
        # FieldStorage assigns self.filename while parsing; keep the real value.
        self._original_filename = value
Additionally, as @bobince's answer pointed out, you can use the surrogateescape error handler and then encode the result back to bytes. It's a bit roundabout, but also probably the simplest workaround:
>>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env, errors='surrogateescape')
>>> fs['payload'].file.read().encode('utf-8', 'surrogateescape')
b'\xff\xd8\xff\xe0\x00\x10JFIF'