
Read data from a large tar.gz file on a website

1) How should I read the data from all the CSV files inside the tar.gz file on the website and write them out to CSV files in a folder in the most memory- and space-efficient way? 2) How can I loop over all the CSVs in the tar.gz file? 3) Since the CSV files are huge, how can I read and write them in chunks of, say, 1 million rows at a time?

I have only gotten this far using code from other Stack Overflow answers:

import pandas as pd
import urllib2
import tarfile

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
r = urllib2.Request(url)
o = urllib2.urlopen(r)

# The HTTP response object is not seekable, so pass it via fileobj= and use
# the streaming mode 'r|gz' rather than 'r:gz'.
thetarfile = tarfile.open(fileobj=o, mode='r|gz')
thetarfile.close()
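For reference, one way this approach could be continued so that each CSV member of the archive is streamed straight to disk in chunks is sketched below. This is a minimal sketch, not a tested solution: it assumes each member is a plain CSV file with a header row, and the output folder name output_csvs and the 1-million-row chunk size are illustrative values, not anything from the original post.

import os
import tarfile
import urllib2

import pandas as pd

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
out_dir = 'output_csvs'      # hypothetical output folder
chunk_rows = 1000000         # rows to read and write per chunk

if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

response = urllib2.urlopen(urllib2.Request(url))

# 'r|gz' reads the gzip stream sequentially, so the whole archive is never
# stored on disk or held in memory at once.
archive = tarfile.open(fileobj=response, mode='r|gz')
for member in archive:
    if not member.isfile() or not member.name.endswith('.csv'):
        continue
    src = archive.extractfile(member)
    dst = os.path.join(out_dir, os.path.basename(member.name))
    first_chunk = True
    # read_csv with chunksize yields DataFrames of at most chunk_rows rows,
    # so only one chunk is in memory at a time.
    for chunk in pd.read_csv(src, chunksize=chunk_rows):
        chunk.to_csv(dst, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False
archive.close()

Because the archive is opened in streaming mode, each member must be consumed fully before moving on to the next one, which the inner chunk loop does here.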
  1. Download the archive to your local storage.
  2. Display the list of files in the archive. Run man tar to see the command-line options.
  3. Extract the files one by one from the archive.
  4. Use a SAX XML parser ( https://docs.python.org/2/library/xml.sax.reader.html ).
  5. Remove each file after parsing it.
  6. Remove the archive. (A sketch of this workflow follows the list.)
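A minimal sketch of that download-extract-process-delete sequence is below, assuming the work happens on disk rather than in memory. The answer suggests a SAX parser, which applies to XML; since the question is about CSV members, the parsing step here uses Python's csv module instead, and the actual row handling is left as a stub.

import csv
import os
import tarfile
import urllib

url = 'https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-08-01.tar.gz'
local_archive = 'mysql-2016-08-01.tar.gz'

# 1. Download the archive to local storage.
urllib.urlretrieve(url, local_archive)

archive = tarfile.open(local_archive, mode='r:gz')

# 2. Display the list of files in the archive.
for name in archive.getnames():
    print name

# 3-5. Extract, parse, and remove each file one by one.
for member in archive.getmembers():
    if not member.isfile():
        continue
    archive.extract(member)
    with open(member.name, 'rb') as f:
        for row in csv.reader(f):
            pass                    # replace with real row handling
    os.remove(member.name)          # 5. remove the file after parsing

archive.close()

# 6. Remove the archive itself.
os.remove(local_archive)

This variant needs enough disk space for the archive plus one extracted file at a time, whereas the streaming sketch earlier avoids storing the archive at all but cannot revisit a member once the stream has moved past it.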
