How to get pandas dataframe by chunks from csv files in huge tar.gz without unzipping and iterating over them?

I have a huge compressed file from which I want to read the individual dataframes one at a time, so as not to run out of memory.

Also, due to time and space constraints, I can't unzip the .tar.gz.

This is the code I've got so far:

import pandas as pd
# With tarfile we can navigate the members of a compressed archive
# without extracting its content to disk
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# Iterate over the csv files contained in the compressed archive,
# yielding (member name, dataframe) pairs
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')), header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # But dataframe is the whole file, which is too big

I tried How to create Panda Dataframe from csv that is compressed in tar.gz? but still can't solve it...

You can use the glob module to select certain files; for example, here I use glob so that cv2 can read the images in a folder:

import glob
import cv2

# Collect every file with the given extension under filepath
file1 = glob.glob(filepath + "/*.extension")
for image_path in file1:
    image = cv2.imread(image_path)
Hope it works.
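As a side note, here is a minimal sketch of how that glob pattern would map onto the CSVs in this question. It assumes the archive has already been extracted to a hypothetical directory extracted_dir, which the original constraint of not unzipping rules out, so it only illustrates the glob-based selection itself:

import glob
import pandas as pd

# Hypothetical folder where the archive's members would live after extraction
extracted_dir = r'\\path\to\extracted'

for csv_path in glob.glob(extracted_dir + r'\*.csv'):
    # Each csv becomes its own dataframe
    df = pd.read_csv(csv_path, header=None)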

You actually can iterate over the chunks inside a compressed file with the following function:

# Yields (member name, chunk) pairs, one chunk at a time
def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None, chunksize=chunksize
        )
    )
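For completeness, a minimal usage sketch of the generator above (the path and chunksize are illustrative): each chunk is a DataFrame of at most chunksize rows, so only one chunk has to be held in memory at a time.

import io
import tarfile
import pandas as pd

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

for filename, chunk in generate_individual_df(tar_file, chunksize=10**4):
    # chunk is a DataFrame with up to 10**4 rows from the csv named filename
    print(filename, chunk.shape)

tar_file.close()

Note that extractfile(member).read().decode('ascii') still loads each member's raw text into memory before pandas splits it into chunks; since extractfile returns a file-like object, passing it to pd.read_csv directly should also work and avoid that intermediate copy, but that is an untested variation on the answer above.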
