cgi.FieldStorage with multipart/form-data tries to decode binary file as UTF-8 if “filename=” not specified

Question

When I use cgi.FieldStorage to parse a multipart/form-data request (or any web framework, such as Pyramid, that uses cgi.FieldStorage), I have trouble processing file uploads from certain clients that don't provide filename=file.ext in the part's Content-Disposition header.

If the filename= option is missing, FieldStorage() decodes the contents of the file as UTF-8 and returns a string. Many files are binary rather than valid UTF-8, so the decoded result is corrupted.

For example:

>>> import cgi
>>> import io
>>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' +
...         b'Content-Disposition: form-data; name=payload\r\n\r\n' +
...         b'\xff\xd8\xff\xe0\x00\x10JFIF')
>>> env = {
...     'REQUEST_METHOD': 'POST',
...     'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH-itP09uVKjjZiegh7',
...     'CONTENT_LENGTH': len(body),
... }
>>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env)
>>> (fs['payload'].filename, fs['payload'].file.read())
(None, '����\x00\x10JFIF')
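
The replacement characters in that output cannot be turned back into the original bytes. As a minimal sketch that doesn't involve cgi at all (the variable names are just for illustration), this is what decoding the same bytes with the 'replace' error handler does:

```python
payload = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Every byte that is not valid UTF-8 is replaced by U+FFFD,
# so the original byte values are discarded.
text = payload.decode('utf-8', errors='replace')
print(ascii(text))

# Encoding the string again produces the UTF-8 form of U+FFFD,
# not the bytes we started with.
round_tripped = text.encode('utf-8')
print(round_tripped == payload)  # False
```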

Browsers and most HTTP libraries do include the filename= option for file uploads, but I'm currently dealing with a client that doesn't (and omitting the filename does seem to be valid according to the spec).

Currently I'm using a pretty hacky workaround by subclassing FieldStorage and replacing the relevant Content-Disposition header with one that does have the filename:

import cgi
import os

class FileFieldStorage(cgi.FieldStorage):
    """To use, subclass FileFieldStorage and override _file_fields with a tuple
    of the names of the file field(s). You can also override _file_name with
    the filename to add.
    """

    _file_fields = ()
    _file_name = 'file_name'

    def __init__(self, fp=None, headers=None, outerboundary=b'',
                 environ=os.environ, keep_blank_values=0, strict_parsing=0,
                 limit=None, encoding='utf-8', errors='replace'):

        if self._file_fields and headers and headers.get('content-disposition'):
            content_disposition = headers['content-disposition']
            key, pdict = cgi.parse_header(content_disposition)
            if (key == 'form-data' and pdict.get('name') in self._file_fields and
                    'filename' not in pdict):
                del headers['content-disposition']
                quoted_file_name = self._file_name.replace('"', '\\"')
                headers['content-disposition'] = '{}; filename="{}"'.format(
                        content_disposition, quoted_file_name)

        super().__init__(fp=fp, headers=headers, outerboundary=outerboundary,
                         environ=environ, keep_blank_values=keep_blank_values,
                         strict_parsing=strict_parsing, limit=limit,
                         encoding=encoding, errors=errors)

Using the body and env in my first test, this works now:

>>> class TestFieldStorage(FileFieldStorage):
...     _file_fields = ('payload',)
>>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env)
>>> (fs['payload'].filename, fs['payload'].file.read())
('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')

Is there some way to avoid this hack and tell FieldStorage not to decode as UTF-8? It would be nice if you could provide encoding=None or something, but it doesn't look like it supports that.

Answer 1

> I have trouble processing file uploads from certain clients which don't provide a filename=file.ext in the part's Content-Disposition header.

The filename= parameter is effectively the only way the server side can determine that a part represents a file upload. If a client omits this parameter, it isn't really sending a file upload but a plain text form field. It's still technically legitimate to send arbitrary binary data in such a field, but many server environments, including Python's cgi module, will be confused by it.
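
As a rough sketch of that rule, the check below parses a part's Content-Disposition value with the standard library's email.message.Message and looks for a filename parameter. is_file_part is a hypothetical helper written for illustration; it is not how cgi implements the decision internally.

```python
from email.message import Message


def is_file_part(content_disposition):
    """Return True if the header value carries a filename= parameter."""
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_param returns None (the default failobj) when the parameter is absent.
    filename = msg.get_param('filename', header='content-disposition')
    return filename is not None


print(is_file_part('form-data; name="payload"; filename="photo.jpg"'))
print(is_file_part('form-data; name=payload'))
```

Under this rule the part in the question counts as a text field, which is why its body ends up decoded.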

> It would be nice if you could provide encoding=None or something

If you pass errors='surrogateescape', you can at least recover the original bytes by encoding the decoded string again with the same error handler.
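
A minimal sketch of that round trip, using plain bytes and str methods rather than cgi (the variable names are just for illustration):

```python
payload = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Bytes that are not valid UTF-8 are mapped to lone surrogates
# in the range U+DC80..U+DCFF instead of being discarded.
text = payload.decode('utf-8', errors='surrogateescape')
print(ascii(text))

# Encoding with the same handler turns each surrogate back into its byte.
recovered = text.encode('utf-8', errors='surrogateescape')
print(recovered == payload)  # True
```

This only holds as long as the decoded string is passed through unchanged; it is not meaningful text and shouldn't be treated as such.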

Answer 2

I ended up working around this using a somewhat simpler FieldStorage subclass, so I'm posting it here as an answer. Instead of overriding __init__ and adding a filename to the Content-Disposition header, you can just override the .filename attribute to be a property that returns a filename if one wasn't provided for that input:

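import cgi

class MyFieldStorage(cgi.FieldStorage):
    # FieldStorage.__init__ assigns self.filename (which runs the setter below)
    # and then reads it back to decide between binary and text mode.
    @property
    def filename(self):
        if self._original_filename is not None:
            return self._original_filename
        elif self.name == 'payload':
            # No filename= was sent for this field; report a placeholder
            # so the part is handled as a binary file upload.
            return 'file_name'
        else:
            return None

    @filename.setter
    def filename(self, value):
        # Remember what the Content-Disposition header provided (or None).
        self._original_filename = value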

Additionally, as @bobince's answer pointed out, you can use the surrogateescape error handler and then encode the resulting string back to bytes. It's a bit roundabout, but it's also probably the simplest workaround:

>>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env, errors='surrogateescape')
>>> fs['payload'].file.read().encode('utf-8', 'surrogateescape')
b'\xff\xd8\xff\xe0\x00\x10JFIF'

2 answers

solution1: 1, 2017-02-15 17:09:42
solution2: 0, ACCEPTED, 2017-02-15 19:58:14
