
How to get pandas dataframes in chunks from csv files inside a huge tar.gz, without unzipping, by iterating over them?

I have a huge compressed file from which I want to read the individual dataframes chunk by chunk, so as not to run out of memory.

Also, due to time and space constraints, I can't unzip the .tar.gz.

This is the code I've got so far:

import pandas as pd
# With this library we can navigate a compressed file
# without even extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSV files contained in the archive
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')), header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # But dataframe is the whole file, which is too big

I tried How to create Panda Dataframe from csv that is compressed in tar.gz? but still can't solve it...

You can use the glob module to select certain files by pattern (note that glob matches paths on the filesystem, not inside an archive). For example, here I want cv2 to read the images in a folder:

import glob
import cv2

# filepath is the folder containing the files; the pattern selects them by extension
files = glob.glob(f"{filepath}/*.extension")
for image_file in files:
    image = cv2.imread(image_file)

Hope it works.
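
In the same spirit, pattern matching can be applied to the member names inside the tar archive itself, which glob cannot see into. A minimal sketch, assuming the members of interest end in .csv; fnmatch and tarfile are standard-library modules:

import fnmatch
import tarfile

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# Keep only regular members whose names match the pattern
csv_members = [
    member for member in tar_file.getmembers()
    if member.isreg() and fnmatch.fnmatch(member.name, '*.csv')
]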

You actually can iterate over the chunks inside a compressed file with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None,
            chunksize=chunksize,
        )
    )
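
A minimal usage sketch, assuming each chunk is processed independently; process_chunk is a hypothetical placeholder for whatever per-chunk work is needed:

for filename, chunk in generate_individual_df(tar_file, chunksize=10**4):
    # Each chunk is a DataFrame holding at most `chunksize` rows of the member named `filename`
    process_chunk(filename, chunk)  # hypothetical per-chunk processing step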
