I have a huge compressed file, and I want to read the individual dataframes inside it one at a time so that I don't run out of memory.
Also, due to time and space constraints, I can't extract the .tar.gz.
This is the code I've got so far:
import pandas as pd
# With this library we can navigate a compressed file
# without extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSVs contained in the compressed file
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')), header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    # But dataframe is the whole file, which is too big
    pass
I tried the answers to "How to create Panda Dataframe from csv that is compressed in tar.gz?" but still can't solve it.
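You can use the glob module to select files that match a pattern. Note that glob matches paths on the filesystem, so it applies to files that have already been extracted rather than to members inside an archive. For example, to have cv2 read every image in a folder:
import glob
import os
import cv2

# filepath is the folder that holds the files; replace "*.extension" with your pattern
files = glob.glob(os.path.join(filepath, "*.extension"))
for path in files:
    image = cv2.imread(path)
Hope it works.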
You actually can iterate over the chunks inside a compressed file with the following function:
def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (member.name, chunk)
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None, chunksize=chunksize)
    )
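One caveat: the function above still calls .read() on each member, so a whole CSV is decompressed into memory before pandas splits it into chunks. A possible variant, given here only as a minimal sketch, passes the file object returned by extractfile straight to pd.read_csv, so data is pulled from the archive as each chunk is requested. The function name iter_csv_chunks and the archive path in the usage lines are hypothetical, and encoding="ascii" is assumed only because the original code decodes as ASCII.

```python
import io
import tarfile

import pandas as pd


def iter_csv_chunks(tar_path, chunksize=10**4):
    """Yield (member name, DataFrame chunk) pairs for every regular file in the archive."""
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz or none)
    with tarfile.open(tar_path, mode="r:*") as tar:
        for member in tar:
            if not member.isreg():
                continue  # skip directories, links and other non-file entries
            # extractfile returns a binary file-like object over the member's data
            with tar.extractfile(member) as fileobj:
                # pandas reads from fileobj incrementally, one chunk at a time
                reader = pd.read_csv(
                    fileobj, header=None, chunksize=chunksize, encoding="ascii"
                )
                for chunk in reader:
                    yield member.name, chunk
```

It could be used like this (the path is a placeholder):

```python
for filename, chunk in iter_csv_chunks(r'\\path\to\the\tar\file.tar.gz'):
    print(filename, chunk.shape)
```

Each chunk is an ordinary DataFrame with at most chunksize rows, so peak memory depends on chunksize rather than on the size of the CSV.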