
Decoding a Python 2 `tempfile` with python-future

I'm attempting to write a Python 2/3 compatible routine to fetch a CSV file, decode it from latin_1 into Unicode and feed it to a csv.DictReader in a robust, scalable manner.

  • For Python 2/3 support, I'm using python-future, including importing open from builtins and importing unicode_literals for consistent behaviour
  • I'm hoping to handle exceptionally large files by spilling to disk, using tempfile.SpooledTemporaryFile
  • I'm using io.TextIOWrapper to handle decoding from the latin_1 encoding before feeding to DictReader

This all works fine under Python 3.

The problem is that TextIOWrapper expects to wrap a stream which conforms to BufferedIOBase. Unfortunately, under Python 2, although I have imported the Python 3-style open, the vanilla Python 2 tempfile.SpooledTemporaryFile of course still returns a Python 2 cStringIO.StringO instead of the Python 3 io.BytesIO required by TextIOWrapper.
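To illustrate the requirement, here is a minimal sketch, assuming Python 3. MinimalStream is a hypothetical stand-in for cStringIO.StringO (it is not part of any library): it can read(), but has no readable()/writable()/seekable() methods, so TextIOWrapper rejects it, while io.BytesIO wraps cleanly.

```python
import csv
import io


class MinimalStream(object):
    """Hypothetical stand-in for cStringIO.StringO: it offers read(),
    but none of the readable()/writable()/seekable() methods."""

    def __init__(self, data):
        self._data = data

    def read(self, size=-1):
        data, self._data = self._data, b""
        return data


raw = "name,city\nJos\xe9,M\xe1laga\n".encode("latin_1")

# Wrapping the minimal stream fails because TextIOWrapper probes for
# methods that the object does not provide.
try:
    io.TextIOWrapper(MinimalStream(raw), encoding="latin_1")
    wrapped_ok = True
except AttributeError:
    wrapped_ok = False

# io.BytesIO implements the BufferedIOBase interface, so it can be wrapped.
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="latin_1", newline="")
rows = list(csv.DictReader(text_file))
```

This only reproduces the shape of the failure; it does not run the Python 2 code path itself.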

I can think of these possible approaches:

  1. Wrap the Python 2 cStringIO.StringO as a Python 3-style io.BytesIO. I'm not sure how to approach this: would I need to write such a wrapper, or does one already exist?
  2. Find a Python 2 alternative to wrap a cStringIO.StringO stream for decoding. I haven't found one yet.
  3. Do away with SpooledTemporaryFile and decode entirely in memory. How big would the CSV file need to be for operating entirely in memory to become a concern?
  4. Do away with SpooledTemporaryFile and implement my own spill-to-disk. This would allow me to call open from python-future, but I'd rather not, as it would be very tedious and probably less secure.

What's the best way forward? Have I missed anything?


Imports:

from __future__ import (absolute_import, division,
                    print_function, unicode_literals)
from builtins import (ascii, bytes, chr, dict, filter, hex, input,  # noqa
                  int, map, next, oct, open, pow, range, round,  # noqa
                  str, super, zip)  # noqa
import csv
import tempfile
from io import TextIOWrapper
import requests

Init:

...
self._session = requests.Session()
...

Routine:

def _fetch_csv(self, path):
    raw_file = tempfile.SpooledTemporaryFile(
        max_size=self._config.get('spool_size')
    )
    csv_r = self._session.get(self.url + path)
    for chunk in csv_r.iter_content():
        raw_file.write(chunk)
    raw_file.seek(0)
    text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
    return csv.DictReader(text_file)

Error:

...in _fetch_csv
    text_file = TextIOWrapper(raw_file._file, encoding='utf-8')
AttributeError: 'cStringIO.StringO' object has no attribute 'readable'

Not sure whether this will be useful. The situation is only vaguely analogous to yours.

I wanted to use NamedTemporaryFile to create a CSV file encoded in UTF-8 with OS-native line endings. That is possibly not quite standard, but it is easily accommodated by using the Python 3-style io.open.

The difficulty is that NamedTemporaryFile in Python 2 opens a byte stream, which causes problems with line endings. The solution I settled on, which I think is a bit nicer than separate cases for Python 2 and 3, is to create the temp file, then close it and reopen it with io.open. The final piece is the excellent backports.csv library, which provides Python 3-style CSV handling in Python 2.

from __future__ import absolute_import, division, print_function, unicode_literals
from builtins import str
import tempfile, io, os
from backports import csv

data = [["1", "1", "John Coltrane",  1926],
        ["2", "1", "Miles Davis",    1926],
        ["3", "1", "Bill Evans",     1929],
        ["4", "1", "Paul Chambers",  1935],
        ["5", "1", "Scott LaFaro",   1936],
        ["6", "1", "Sonny Rollins",  1930],
        ["7", "1", "Kenny Burrel",   1931]]

## create CSV file
with tempfile.NamedTemporaryFile(delete=False) as temp:
    filename = temp.name

with io.open(filename, mode='w', encoding="utf-8", newline='') as temp:
    writer = csv.writer(temp, quoting=csv.QUOTE_NONNUMERIC, lineterminator=str(os.linesep))
    headers = ['X', 'Y', 'Name', 'Born']
    writer.writerow(headers)
    for row in data:
        print(row)
        writer.writerow(row)
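Because the file is created with delete=False, it is not removed automatically. A minimal sketch of reading it back and cleaning up afterwards follows, assuming Python 3 and the standard-library csv module (on Python 2, backports.csv would take its place); the two sample rows are only illustrative.

```python
import csv
import io
import os
import tempfile

# Create the temp file, remember its name, and let the with block close it.
with tempfile.NamedTemporaryFile(delete=False) as temp:
    filename = temp.name

try:
    # Reopen by name in text mode to write the CSV.
    with io.open(filename, mode="w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(["X", "Y", "Name", "Born"])
        writer.writerow(["1", "1", "John Coltrane", 1926])

    # Read it back; QUOTE_NONNUMERIC turns unquoted fields into floats.
    with io.open(filename, mode="r", encoding="utf-8", newline="") as src:
        rows = list(csv.reader(src, quoting=csv.QUOTE_NONNUMERIC))
finally:
    # delete=False means removal is our responsibility.
    os.remove(filename)
```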

@cbare's approach should probably be avoided. It does work, but here is what happens with it:

  1. We use tempfile.NamedTemporaryFile() to create a temporary file and remember its name.
  2. We leave the with statement, and the file is closed.
  3. Now that the file is closed (but not removed), we open it again by name with io.open().

At first glance it looks okay, and at second glance too. But I am not sure whether on some platforms (like nt) another user could remove the file while it is not open and then create it again under the same name, thereby getting access to its contents. Somebody please correct me if this is not possible.

Here is what I would suggest instead:

import csv, io, tempfile

# Create temporary file
with tempfile.NamedTemporaryFile() as tf_oldstyle:
    # get its file descriptor - note that it will also work with tempfile.TemporaryFile
    # which has no meaningful name at all
    fd = tf_oldstyle.fileno()
    # open that fd with io.open, using desired mode (could use binary mode or whatever);
    # closefd=False leaves ownership of the fd with tf_oldstyle, so it is closed only once
    tf = io.open(fd, 'w+', encoding='utf-8', newline='', closefd=False)
    # note we don't use a with statement here, because this fd will be closed once we leave the outer with block
    # now work with the tf
    writer = csv.writer(tf, ...)
    writer.writerow(...)
    # flush buffered text to the fd before the outer with block closes it
    tf.flush()

# At this point, fd is closed, and the file is deleted.
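For reference, a self-contained sketch of the same idea that can be run as-is, assuming Python 3 and the standard-library csv module. The sample rows are illustrative, and closefd=False is used here so that only the tempfile object ever closes the descriptor.

```python
import csv
import io
import tempfile

with tempfile.TemporaryFile() as tf_raw:
    # Wrap the existing descriptor in a text-mode file object without
    # taking ownership of it (closefd=False).
    tf = io.open(tf_raw.fileno(), "w+", encoding="utf-8", newline="",
                 closefd=False)
    writer = csv.writer(tf)
    writer.writerow(["Name", "Born"])
    writer.writerow(["Bill Evans", 1929])

    # Flush and rewind so the rows can be read back through the same wrapper.
    tf.flush()
    tf.seek(0)
    rows = list(csv.reader(tf))

    # Closing the wrapper flushes it but leaves the descriptor open,
    # so the outer with block closes it exactly once.
    tf.close()
```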

Or we could directly use tempfile.mkstemp(), which creates the file and returns its fd and name as a tuple, although using *TemporaryFile is probably more secure and more portable between platforms.

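fd, name = tempfile.mkstemp()
try:
    # closefd=False: the fd is closed exactly once, by os.close() below
    tf = io.open(fd, 'w+', encoding='utf-8', newline='', closefd=False)
    writer = csv.writer(tf, ...)
    writer.writerow(...)
    tf.flush()  # push buffered text to the fd before it is closed
finally:
    os.close(fd)
    os.unlink(name)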

And to answer the original question regarding SpooledTemporaryFile:

I would try subclassing SpooledTemporaryFile under Python 2 and overriding its rollover method.

Warning: this is not tested.

import io
import sys
import tempfile

if sys.version_info >= (3,):
    SpooledTemporaryFile = tempfile.SpooledTemporaryFile
else:
    class SpooledTemporaryFile(tempfile.SpooledTemporaryFile):
        def __init__(self, max_size=0, mode='w+b', **kwargs):
            # replace cStringIO with io.BytesIO or io.StringIO
            # (explicit base-class call: Python 2's tempfile.SpooledTemporaryFile appears to be
            # an old-style class, for which super() would raise TypeError)
            tempfile.SpooledTemporaryFile.__init__(self, max_size, mode, **kwargs)
            if 'b' in mode:
                self._file = io.BytesIO()
            else:
                self._file = io.StringIO(newline='\n')  # see python3's tempfile sources for reason

        def rollover(self):
            if self._rolled:
                return
            # call the base implementation and then replace underlying file object
            tempfile.SpooledTemporaryFile.rollover(self)
            fd = self._file.fileno()
            name = self._file.name
            mode = self._file.mode
            delete = self._file.delete
            pos = self._file.tell()
            # self._file is a tempfile._TemporaryFileWrapper.
            # It caches methods so we cannot just replace its .file attribute,
            # so let's create another _TemporaryFileWrapper
            file = io.open(fd, mode)
            file.seek(pos)
            self._file = tempfile._TemporaryFileWrapper(file, name, delete)
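Another option, closer to approach 2 in the question, might be codecs.getreader, whose StreamReader only calls read() on the underlying stream and so does not need readable(). The sketch below is an assumption rather than a verified fix: it was written for Python 3, iter_csv_rows is a hypothetical helper name, and on Python 2 the standard csv module expects byte strings, so backports.csv would probably be needed there.

```python
import codecs
import csv
import tempfile


def iter_csv_rows(chunks, encoding="latin_1", max_size=1024):
    """Spool byte chunks (to disk once past max_size) and yield decoded CSV rows.

    Hypothetical helper for illustration; not part of any library.
    """
    with tempfile.SpooledTemporaryFile(max_size=max_size) as raw_file:
        for chunk in chunks:
            raw_file.write(chunk)
        raw_file.seek(0)
        # StreamReader decodes lazily by calling raw_file.read().
        text_file = codecs.getreader(encoding)(raw_file)
        for row in csv.DictReader(text_file):
            yield row


# Usage: a tiny max_size forces the spill-to-disk path.
rows = list(iter_csv_rows([b"name,ci", b"ty\r\nJos\xe9,M\xe1laga\r\n"],
                          max_size=8))
```

One caveat: StreamReader splits lines with str.splitlines, which treats a few extra characters (for example \x85) as line boundaries, so data containing those would need a different decoding step.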
