
How to get pandas dataframes in chunks from csv files inside a huge tar.gz, without unzipping, by iterating over them?

I have a huge compressed file from which I want to read the individual dataframes chunk by chunk, so as not to run out of memory.

Also, due to time and space constraints, I can't unzip the .tar.gz.

This is the code I've got so far:

import pandas as pd
# With this library we can navigate a compressed file
# without even extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSV files contained in the archive
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')), header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # But dataframe is the whole file, which is too big

I tried How to create Panda Dataframe from csv that is compressed in tar.gz? but still can't solve it...

You can use the glob module to select certain files by pattern (note that glob matches paths on the filesystem, not inside an archive). For example, here I want cv2 to read the images in a folder:

import glob
import cv2

# filepath is the folder containing the files; the pattern selects them by extension
files = glob.glob(f"{filepath}/*.extension")
for image_file in files:
    image = cv2.imread(image_file)

Hope it works.
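
In the same spirit, pattern matching can be applied to the member names inside the tar archive itself, which glob cannot see into. A minimal sketch, assuming the members of interest end in .csv; fnmatch and tarfile are standard-library modules:

import fnmatch
import tarfile

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# Keep only regular members whose names match the pattern
csv_members = [
    member for member in tar_file.getmembers()
    if member.isreg() and fnmatch.fnmatch(member.name, '*.csv')
]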

You actually can iterate over the chunks inside a compressed file with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None,
            chunksize=chunksize,
        )
    )
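
A minimal usage sketch, assuming each chunk is processed independently; process_chunk is a hypothetical placeholder for whatever per-chunk work is needed:

for filename, chunk in generate_individual_df(tar_file, chunksize=10**4):
    # Each chunk is a DataFrame holding at most `chunksize` rows of the member named `filename`
    process_chunk(filename, chunk)  # hypothetical per-chunk processing step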
