Reading a tar.Z file into a pandas data frame in Python 3.7.4?

Question

I want to download a dataset from the UCI repository.

The dataset is in the tar.Z format, and ideally I'd like to read it in as a pandas data frame.

I've checked out "uncompressing tar.Z file with python?", which suggested the gzip library, so following https://docs.python.org/3/library/gzip.html I tried the code below, but I got an error message.

Thanks for any help!

import gzip
with gzip.open('https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z', 'rb') as f:
    file_content = f.read()

ERROR MESSAGE:
OSError: [Errno 22] Invalid argument: 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'

Answer 1

I do not think that you can read the .Z data with any module in Python; you could browse PyPI and see if there is a module for the .Z extension. You could, however, use the command line to process the data.

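import subprocess
from io import StringIO

import pandas as pd

# curl streams the archive to stdout, gzip -dc decompresses the .Z (compress) data,
# and tar reads the archive from stdin ("-f -") and writes matching members to stdout (-O).
data = subprocess.run(
    """curl https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z |
    gzip -dc |
    tar -xOvf - --wildcards 'Diabetes-Data/data-*' """,
    shell=True,
    capture_output=True,
    text=True,
).stdout


df = pd.read_csv(StringIO(data), sep="\t", header=None)

df.head()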

            0      1   2    3
0  04-21-1991   9:09  58  100
1  04-21-1991   9:09  33  009
2  04-21-1991   9:09  34  013
3  04-21-1991  17:08  62  119
4  04-21-1991  17:08  33  007

You can read this ebook for more on command line options.
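For context on the error in the question: gzip.open expects a local file name or a file object, so passing a URL raises the OSError, and Python's gzip module does not decode the .Z (compress) format in any case. If you would rather avoid a shell pipeline, here is a rough sketch of the same idea in Python. It assumes the external gzip program is on your PATH (gzip -dc can decompress .Z data), and the helper names decompress_z, read_members and fetch_tar_z_text are made up for this illustration, not part of any library.

```python
import io
import subprocess
import tarfile
import urllib.request


def decompress_z(raw: bytes) -> bytes:
    """Decompress .Z bytes by piping them through the external gzip tool.

    Assumption: a gzip executable is available on PATH.
    """
    result = subprocess.run(
        ["gzip", "-dc"], input=raw, capture_output=True, check=True
    )
    return result.stdout


def read_members(tar_bytes: bytes, prefix: str) -> str:
    """Concatenate the text of every regular file whose name starts with prefix."""
    parts = []
    # mode="r:" opens an uncompressed tar archive held in memory.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.startswith(prefix):
                content = archive.extractfile(member).read()
                parts.append(content.decode("utf-8", errors="replace"))
    return "".join(parts)


def fetch_tar_z_text(url: str, prefix: str) -> str:
    """Download a .tar.Z archive and return the text of the matching members."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    return read_members(decompress_z(raw), prefix)
```

Calling fetch_tar_z_text with the URL from the question and the prefix "Diabetes-Data/data-" should give you text comparable to the data variable above, which you can then pass to pd.read_csv(StringIO(...), sep="\t", header=None) in the same way.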

Solution 1
0 accepted 2020-08-04 11:23:50
