
Processing a file in memory line by line and producing a file-like object in Python

I need to import a large CSV file into PostgreSQL. The file uses two delimiters: "," (comma) and "_" (underscore).

The Postgres COPY command cannot handle two delimiter characters, so I preprocess the file in bash before loading it into the database:

```
cat large_file.csv \
| sed -e 's/_/,/' \
| psql -d db -c "COPY large_table FROM STDIN DELIMITER ',' CSV HEADER"
```

I'm trying to reproduce this command in Python, and I'm having a hard time finding the Python equivalent of sed.
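(The line-editing half of sed is easy enough to mirror in Python: `str.replace` with a count of 1 matches what `s/_/,/` does, i.e. first occurrence per line only. A tiny sketch:

```python
# Equivalent of sed -e 's/_/,/': rewrite only the first underscore per line.
lines = ["a_b,c\n", "d_e,f\n"]
fixed = (line.replace("_", ",", 1) for line in lines)
print("".join(fixed))  # -> "a,b,c\nd,e,f\n"
```

The hard part is turning that stream of rewritten lines back into something file-like.)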

Using psycopg I can COPY from STDIN in Python:

```
with unzip('large_zip.zip', 'large_file.csv') as file:
    cr.copy_expert('''
        COPY large_table
        FROM STDIN
        DELIMITER ',' CSV HEADER
    ''', file)
```

The file is very big and is loaded directly from the zip file; I'm trying to avoid saving a local copy.
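(For what it's worth, the standard library can stream a member out of a zip archive without extracting it: `zipfile.ZipFile.open` returns a binary file-like object that decompresses on the fly. A sketch using the file names from the question:

```python
import zipfile

# Stream the CSV member directly out of the archive; nothing is written to disk.
with zipfile.ZipFile("large_zip.zip") as zf:
    with zf.open("large_file.csv") as f:
        header = f.readline()  # bytes; here, the CSV header row
```

)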

What is the best way to process the file line by line, creating a file-like object I can send as standard input to another command in Python?
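One direct way to build such a file-like object is a small `io.RawIOBase` subclass that rewrites each line as it is read: anything with a working `readinto` gets `read()` for free, and `read()` is what `copy_expert` consumes. An untested sketch, assuming a binary source stream (the underscore fix mirrors the sed expression above):

```python
import io

class LineFixer(io.RawIOBase):
    """Wrap a binary stream, replacing the first '_' on each line with ','."""

    def __init__(self, source):
        self.source = source
        self.leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill the internal buffer from the next source line if needed.
        if not self.leftover:
            line = self.source.readline()
            if not line:
                return 0  # EOF
            self.leftover = line.replace(b"_", b",", 1)
        # Hand out as much of the current line as fits in buf.
        n = min(len(buf), len(self.leftover))
        buf[:n] = self.leftover[:n]
        self.leftover = self.leftover[n:]
        return n

# e.g. cr.copy_expert("COPY large_table FROM STDIN DELIMITER ',' CSV HEADER",
#                     LineFixer(file))
```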

I did this recently and I can tell you there are a few ugly parts but it is definitely possible. I can't paste the code here verbatim because it's company internal.

The basic idea is this:

  1. Start the program which consumes data from stdin by spawning it like this:

         command = subprocess.Popen(command_list, stdin=subprocess.PIPE)

  2. For each pipe (e.g. `command.stdin`) start a `threading.Thread` which writes to or reads from it. If you have multiple pipes you need multiple threads.
  3. Wait for the program to exit with `command.wait()` in the main thread.
  4. Stop (join) all your threads so the program does not block. Make sure the threads exit by themselves by `return`ing from their target function.

Simple Example (not tested!):

```python
import io
import shutil
import subprocess
import sys
import threading


lots_of_data = io.BytesIO()  # pipes carry bytes, so BytesIO rather than StringIO


def feed(source, sink):
    # Copy everything into the pipe, then close it so the child process
    # sees EOF and can exit; otherwise wait() below may block forever.
    shutil.copyfileobj(source, sink)
    sink.close()


import_to_db = subprocess.Popen(["import_to_db"], stdin=subprocess.PIPE)

# Make sure your input stream is at pos 0
lots_of_data.seek(0)

writer = threading.Thread(target=feed,
                          args=(lots_of_data, import_to_db.stdin))
writer.start()

return_code = import_to_db.wait()
writer.join()
if return_code:
    print("Error")
    sys.exit(1)
```
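Applied to the original question, the same pattern could look roughly like this (also untested; the file, database, and table names are the ones from the question):

```python
import subprocess
import threading
import zipfile

psql = subprocess.Popen(
    ["psql", "-d", "db",
     "-c", "COPY large_table FROM STDIN DELIMITER ',' CSV HEADER"],
    stdin=subprocess.PIPE)

def feed():
    # Stream the CSV out of the zip, fix the delimiter on each line
    # (the sed 's/_/,/' step), then close stdin so psql sees EOF.
    with zipfile.ZipFile("large_zip.zip") as zf:
        with zf.open("large_file.csv") as src:
            for line in src:
                psql.stdin.write(line.replace(b"_", b",", 1))
    psql.stdin.close()

writer = threading.Thread(target=feed)
writer.start()
return_code = psql.wait()
writer.join()
```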
