How to pass a binary file as stdin to a Docker containerized Python script using argparse?


Update based on Anthony Sottile's Answer


I re-implemented his solution to simplify the problem. Let's take Docker and Django out of the equation. The goal is to use pandas to read an Excel file by both of the following methods:

  1. python example.py - < /path/to/file.xlsx
  2. cat /path/to/file.xlsx | python example.py -

where example.py is reproduced below:

import argparse
import contextlib
from typing import IO
import sys
import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield sys.stdin.buffer
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()

The problem is that pandas (see the traceback below) fails with method 2, although it works fine with method 1.

Simply printing the raw contents of the Excel file, by contrast, works with both 1 and 2.


In case you want to easily reproduce the Docker environment:


First build Docker image named pandas:

docker build --pull -t pandas - <<EOF
FROM python:latest
RUN pip install pandas xlrd
EOF

Then use the pandas Docker image to run the script:

docker run --rm -i -v /path/to/example.py:/example.py pandas python example.py - < /path/to/file.xlsx

Note that it correctly prints the raw contents of the Excel file, but pandas is unable to read it.

This produces a more concise traceback, similar to the following:

Traceback (most recent call last):
  File "example.py", line 29, in <module>
    main()
  File "example.py", line 24, in main
    df = pd.read_excel(input_file)
  File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.

To show the code working when the Excel file is mounted into the container (i.e. not passed via stdin):

docker run --rm -i -v /path/to/example.py:/example.py -v /path/to/file.xlsx:/file.xlsx pandas python example.py file.xlsx


Original problem description (for additional context)


Take the scenario where, on the host system, you have a file at /tmp/test.txt and you want to use head on it, but within a Docker container (run echo 'Hello World!' > /tmp/test.txt to reproduce my example data):

You can run:

docker run -i busybox head -1 - < /tmp/test.txt

to print the first line to the screen, or:

cat /tmp/test.txt | docker run -i busybox head -1 -

and the output is:

Hello World!

Even with a binary format like .xlsx instead of plaintext, the above can be done and you would get some weird output similar to:

�Oxl/_rels/workbook.xml.rels���j�0
                                  ��}

The point above is that head works with both binary and text formats even through the abstraction of Docker.

But in my own argparse-based CLI (actually a custom Django management command, which I believe makes use of argparse), I get the following error when attempting to use pandas' read_excel within a Docker context.

The error that is printed is as follows:

Traceback (most recent call last):
  File "./manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/home/jovyan/sequence_databaseApp/management/commands/seq_db.py", line 54, in handle
    df_snapshot = pd.read_excel(options['FILE'].buffer, sheet_name='Snapshot', header=0, dtype=dtype)
  File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.

Concretely,

docker run -i <IMAGE> ./manage.py my_cli import - < /path/to/file.xlsx

does not work, but

./manage.py my_cli import - < /path/to/file.xlsx

does work!

Somehow there is a difference within the Docker context.

However, I also note that, even taking Docker out of the equation,

cat /path/to/file.xlsx | ./manage.py my_cli import -

does not work, though

./manage.py my_cli import - < /path/to/file.xlsx

does work (as mentioned before).
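One way to see what differs between these invocations is to check what kind of object stdin is connected to in each case. The following is only a minimal sketch using the standard library; describe_stream is a hypothetical helper name and is not part of the original command.

```python
import os
import stat
from typing import BinaryIO


def describe_stream(stream: BinaryIO) -> str:
    """Report what a binary stream is backed by and whether it can seek.

    Hypothetical diagnostic helper, not part of the original script.
    """
    mode = os.fstat(stream.fileno()).st_mode
    if stat.S_ISREG(mode):
        kind = 'regular file'
    elif stat.S_ISFIFO(mode):
        kind = 'pipe'
    elif stat.S_ISCHR(mode):
        kind = 'character device'
    elif stat.S_ISSOCK(mode):
        kind = 'socket'
    else:
        kind = 'other'
    return f'{kind}, seekable={stream.seekable()}'
```

Printing describe_stream(sys.stdin.buffer) at the start of the command should report a regular file for the "< /path/to/file.xlsx" form and a pipe for the "cat ... |" form. The traceback above suggests that stdin is likewise not seekable under docker run -i.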

Finally, the code I am using (you should be able to save it as my_cli.py under management/commands to get it working within a Django project):

import argparse
import sys

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE', type=argparse.FileType('r'), default=sys.stdin)

    def handle(self, *args, **options):
        import pandas as pd
        df = pd.read_excel(options['FILE'].buffer, header=0)
        print(df)

It looks as though you're reading the file in text mode (FileType('r') / sys.stdin).

According to this bpo issue, argparse does not support opening binary files directly.

I'd suggest handling the file type yourself with code similar to this (I'm not familiar with the Django / pandas way, so I've simplified it down to plain Python):

import argparse
import contextlib
import io
import sys
from typing import IO


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

    with file_ctx(args.FILE) as input_file:
        # do whatever you need with that input file
        ...

    return 0

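Based very heavily on Anthony Sottile's answer, but with one small change that completely solves the problem: stdin is read fully into an io.BytesIO, which is seekable, instead of yielding sys.stdin.buffer directly: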

import argparse
import contextlib
import io
from typing import IO
import sys

import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()

I got the idea after reading this answer to "Pandas 0.25.0 and xlsx from response content stream".
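To illustrate why the buffering matters, here is a minimal sketch that uses os.pipe() as a stand-in for piped stdin (an assumption made so the example is self-contained; seekability_demo is a hypothetical name). It shows that the pipe itself cannot seek, while an io.BytesIO built from its contents can, which is what the seek(0) call in the traceback requires.

```python
import io
import os
from typing import Tuple


def seekability_demo() -> Tuple[bool, bool]:
    """Return (pipe is seekable, BytesIO copy is seekable)."""
    read_fd, write_fd = os.pipe()

    # Write a few bytes and close the write end so the reader sees EOF.
    with os.fdopen(write_fd, 'wb') as writer:
        writer.write(b'example bytes')

    with os.fdopen(read_fd, 'rb') as pipe_reader:
        pipe_seekable = pipe_reader.seekable()
        # Same trick as file_ctx: read everything into memory.
        buffered = io.BytesIO(pipe_reader.read())

    buffered.seek(0)  # works, unlike seek() on the pipe
    return pipe_seekable, buffered.seekable()


if __name__ == '__main__':
    print(seekability_demo())
```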


How this looks in terms of the original question's Django based context:

import contextlib
import io
import sys
from typing import IO

import pandas as pd

from django.core.management.base import BaseCommand


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE')

    def handle(self, *args, **options):
        with file_ctx(options['FILE']) as input_file:
            df = pd.read_excel(input_file)
            print(df)
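Reading all of stdin into an io.BytesIO keeps the whole workbook in memory. If that is a concern for large files, one possible variation (not from the original answers, just a sketch under that assumption) is to copy the stream into a temporary file, which is also seekable. The name seekable_copy is hypothetical.

```python
import contextlib
import shutil
import tempfile
from typing import IO, Iterator


@contextlib.contextmanager
def seekable_copy(stream: IO[bytes]) -> Iterator[IO[bytes]]:
    """Copy a non-seekable binary stream into a temp file and yield it.

    Hypothetical helper: the temp file is deleted when the block exits.
    """
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)  # copies in chunks
        tmp.seek(0)  # rewind so the consumer reads from the start
        yield tmp
```

In file_ctx, the '-' branch could then use "with seekable_copy(sys.stdin.buffer) as f: yield f" in place of the io.BytesIO line.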
