
Read data from a large tar.gz file on a website

1) How should I read the data from all the CSV files inside the tar.gz file on the website and write them out to CSV files in a folder in the most memory- and space-efficient way? 2) How can I loop over all the CSVs in the tar.gz file? 3) Since the CSV files are huge, how can I read and write them in chunks of, say, 1 million rows at a time?

I have only gotten this far using code from other Stack Overflow answers:

import pandas as pd
import urllib2
import tarfile

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
r = urllib2.Request(url)
o = urllib2.urlopen(r)

# The HTTP response object is not seekable, so pass it via fileobj= and use
# the streaming mode 'r|gz' rather than 'r:gz'.
thetarfile = tarfile.open(fileobj=o, mode='r|gz')
thetarfile.close()
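For reference, one way this approach could be continued so that each CSV member of the archive is streamed straight to disk in chunks is sketched below. This is a minimal sketch, not a tested solution: it assumes each member is a plain CSV file with a header row, and the output folder name output_csvs and the 1-million-row chunk size are illustrative values, not anything from the original post.

import os
import tarfile
import urllib2

import pandas as pd

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
out_dir = 'output_csvs'      # hypothetical output folder
chunk_rows = 1000000         # rows to read and write per chunk

if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

response = urllib2.urlopen(urllib2.Request(url))

# 'r|gz' reads the gzip stream sequentially, so the whole archive is never
# stored on disk or held in memory at once.
archive = tarfile.open(fileobj=response, mode='r|gz')
for member in archive:
    if not member.isfile() or not member.name.endswith('.csv'):
        continue
    src = archive.extractfile(member)
    dst = os.path.join(out_dir, os.path.basename(member.name))
    first_chunk = True
    # read_csv with chunksize yields DataFrames of at most chunk_rows rows,
    # so only one chunk is in memory at a time.
    for chunk in pd.read_csv(src, chunksize=chunk_rows):
        chunk.to_csv(dst, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False
archive.close()

Because the archive is opened in streaming mode, each member must be consumed fully before moving on to the next one, which the inner chunk loop does here.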
  1. Download the archive to your local storage.
  2. Display the list of files in the archive. Run man tar to see the command-line options.
  3. Extract the files one by one from the archive.
  4. Use a SAX XML parser ( https://docs.python.org/2/library/xml.sax.reader.html ).
  5. Remove each file after parsing it.
  6. Remove the archive. (A sketch of this workflow follows the list.)
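A minimal sketch of that download-extract-process-delete sequence is below, assuming the work happens on disk rather than in memory. The answer suggests a SAX parser, which applies to XML; since the question is about CSV members, the parsing step here uses Python's csv module instead, and the actual row handling is left as a stub.

import csv
import os
import tarfile
import urllib

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
local_archive = 'mysql-2016-08-01.tar.gz'

# 1. Download the archive to local storage.
urllib.urlretrieve(url, local_archive)

archive = tarfile.open(local_archive, mode='r:gz')

# 2. Display the list of files in the archive.
for name in archive.getnames():
    print name

# 3-5. Extract, parse, and remove each file one by one.
for member in archive.getmembers():
    if not member.isfile():
        continue
    archive.extract(member)
    with open(member.name, 'rb') as f:
        for row in csv.reader(f):
            pass                    # replace with real row handling
    os.remove(member.name)          # 5. remove the file after parsing

archive.close()

# 6. Remove the archive itself.
os.remove(local_archive)

This variant needs enough disk space for the archive plus one extracted file at a time, whereas the streaming sketch earlier avoids storing the archive at all but cannot revisit a member once the stream has moved past it.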
