I am hitting a webservice with Python's requests library and the endpoint is returning a (very large) CSV file which I then want to stream into a database. The code looks like this:
response = requests.get(url, auth=auth, stream=True)
if response.status_code == 200:
    stream_csv_into_database(response)
Now when the database is a MongoDB database, the loading works perfectly using a DictReader:
def stream_csv_into_database(response):
    ...
    for record in csv.DictReader(response.iter_lines(), delimiter='\t'):
        product_count += 1
        product = {k: v for (k, v) in record.iteritems() if v}
        product['_id'] = product_count
        collection.insert(product)
However, I am switching from MongoDB to Amazon RedShift, which I can already access just fine using psycopg2. I can open connections and make simple queries, but what I want to do is take my streamed response from the webservice and use psycopg2's copy_expert to load the RedShift table. Here is what I tried so far:
def stream_csv_into_database(response, campaign, config):
    print 'Loading product feed for {0}'.format(campaign)
    conn = new_redshift_connection(config)  # My own helper, works fine.
    table = 'products.' + campaign
    cur = conn.cursor()
    reader = response.iter_lines()
    # Error on following line:
    cur.copy_expert("COPY {0} FROM STDIN WITH CSV HEADER DELIMITER '\t'".format(table), reader)
    conn.commit()
    cur.close()
    conn.close()
The error that I get is:
file must be a readable file-like object for COPY FROM; a writable file-like object for COPY TO.
I understand what the error is saying; in fact, I can see from the psycopg2 documentation that copy_expert calls copy_from, which:
Reads data from a file-like object appending them to a database table (COPY table FROM file syntax). The source file must have both read() and readline() method.
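For comparison, this is the shape of call that copy_expert is happy with: anything exposing read() and readline(), such as an ordinary open file (the file name and table below are made up purely for illustration):

# Hypothetical illustration only: copy_expert works when handed a real file,
# because open() returns an object with read() and readline().
with open('products.tsv') as f:
    cur.copy_expert("COPY products.example FROM STDIN WITH CSV HEADER DELIMITER '\t'", f)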
My problem is that I cannot find a way to make the response object behave like a file-like object! I tried both .data and .iter_lines without success. I certainly do not want to download the entire multi-gigabyte file from the webservice and then upload it to RedShift. There must be a way to use the streaming response as a file-like object that psycopg2 can copy into RedShift. Anyone know what I am missing?
You could use the response.raw file object, but take into account that any content encoding (such as GZIP or Deflate compression) will still be in place unless you set the decode_content flag to True when calling .read(), which psycopg2 will not do.
You can set the flag on the raw file object to change the default to decompressing-while-reading:

response.raw.decode_content = True
and then pass the response.raw file object to copy_expert() (it works just as well as input to csv.DictReader()).
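Putting the two together, a minimal sketch of your loading function (reusing the new_redshift_connection helper and the products.<campaign> table naming from your question) might look like this:

def stream_csv_into_database(response, campaign, config):
    print 'Loading product feed for {0}'.format(campaign)
    conn = new_redshift_connection(config)  # Your own helper.
    table = 'products.' + campaign
    cur = conn.cursor()
    # Make .read() transparently decompress gzip/deflate transfer encodings,
    # since psycopg2 only calls read()/readline() and knows nothing about them.
    response.raw.decode_content = True
    # response.raw is a file-like object, so copy_expert can read from it directly.
    cur.copy_expert("COPY {0} FROM STDIN WITH CSV HEADER DELIMITER '\t'".format(table), response.raw)
    conn.commit()
    cur.close()
    conn.close()

The key change from your attempt is handing copy_expert the response.raw file object instead of the iterator returned by response.iter_lines().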