Reading in a tar.Z file as a pandas data frame in Python 3.7.4?

I want to download a dataset from the UCI repository.

The dataset is in the tar.Z format, and ideally I'd like to read it in as a pandas data frame.

I've checked out "uncompressing tar.Z file with python?", which suggested the gzip library, so following https://docs.python.org/3/library/gzip.html I tried the code below, but I got an error message.

Thanks for any help!

import gzip
with gzip.open('https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z', 'rb') as f:
    file_content = f.read()

ERROR MESSAGE:
OSError: [Errno 22] Invalid argument: 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'

I do not think that any module in Python's standard library can read .Z data; you could browse PyPI and see whether there is a package for the .Z format. You could, however, use the command line to process the data.
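As background on why the attempt in the question fails: gzip.open opens a local path or file object, so passing a URL raises the OSError shown, and even a downloaded .Z file is not gzip data. A .Z file is produced by the Unix compress tool and starts with different magic bytes than a gzip stream. The sketch below only illustrates that difference; sniff_compression is a hypothetical helper name, not part of any library.

```python
import gzip

# Leading magic bytes of the two formats.
GZIP_MAGIC = b"\x1f\x8b"      # gzip streams
COMPRESS_MAGIC = b"\x1f\x9d"  # Unix compress (.Z) streams


def sniff_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(COMPRESS_MAGIC):
        return "compress"
    return "unknown"


print(sniff_compression(b"\x1f\x9d\x90"))          # compress
print(sniff_compression(gzip.compress(b"hello")))  # gzip
```

The command-line route suggested above then looks like this: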

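import subprocess
from io import StringIO

import pandas as pd

# Stream the download through gzip (its CLI can also decode .Z data) into tar,
# which reads the archive from stdin ("-f -") and writes matching members to stdout (-O).
data = subprocess.run(
    """curl -sL https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z |
    gzip -dc |
    tar -xOf - --wildcards 'Diabetes-Data/data-*' """,
    shell=True,
    capture_output=True,
    text=True,
).stdout

# The data files are tab-separated and have no header row.
df = pd.read_csv(StringIO(data), sep="\t", header=None)

df.head()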

            0      1   2    3
0  04-21-1991   9:09  58  100
1  04-21-1991   9:09  33  009
2  04-21-1991   9:09  34  013
3  04-21-1991  17:08  62  119
4  04-21-1991  17:08  33  007
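Once the text is in memory, the four unnamed columns can be given names and types. This is a minimal sketch: the column names (date, time, code, value) are an assumption inferred from the sample output above, and parse_records is a hypothetical helper, so adjust both to the dataset's own description.

```python
from io import StringIO

import pandas as pd

# Assumed column names, inferred from the sample rows (not confirmed by the source).
COLUMNS = ["date", "time", "code", "value"]


def parse_records(text: str) -> pd.DataFrame:
    """Parse tab-separated records into a frame with a combined timestamp."""
    # Read everything as strings first so values like "009" are not altered silently.
    df = pd.read_csv(StringIO(text), sep="\t", header=None, names=COLUMNS, dtype=str)
    # errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
    df["timestamp"] = pd.to_datetime(
        df["date"] + " " + df["time"], format="%m-%d-%Y %H:%M", errors="coerce"
    )
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    return df


sample = "04-21-1991\t9:09\t58\t100\n04-21-1991\t9:09\t33\t009\n"
df = parse_records(sample)
print(df[["timestamp", "code", "value"]])
```

Reading as strings and converting afterwards keeps malformed entries visible as missing values rather than failing the whole load.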

You can read this ebook for more on command line options.
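If a shell pipeline is not convenient, a rough alternative is to download the archive with urllib.request, pass the bytes to the gzip command-line tool for decompression (its documentation says it can decompress files created by compress; this assumes gzip is on PATH), and let the standard tarfile module pick out the members. The helper names fetch_tar_z and extract_matching are hypothetical, and this is only a sketch of the idea.

```python
import fnmatch
import io
import subprocess
import tarfile
import urllib.request


def fetch_tar_z(url: str) -> bytes:
    """Download a .tar.Z file and return the decompressed tar bytes."""
    with urllib.request.urlopen(url) as response:
        compressed = response.read()
    # Assumption: the gzip CLI is installed; "-dc" decompresses stdin to stdout.
    result = subprocess.run(
        ["gzip", "-dc"], input=compressed, capture_output=True, check=True
    )
    return result.stdout


def extract_matching(tar_bytes: bytes, pattern: str) -> str:
    """Concatenate the text of every regular file whose name matches pattern."""
    parts = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            if member.isfile() and fnmatch.fnmatch(member.name, pattern):
                raw = archive.extractfile(member).read()
                parts.append(raw.decode("utf-8", errors="replace"))
    return "".join(parts)


# Usage (needs network access, so it is left commented out):
# url = "https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z"
# text = extract_matching(fetch_tar_z(url), "Diabetes-Data/data-*")
# df = pd.read_csv(io.StringIO(text), sep="\t", header=None)
```

Keeping the download and the member selection in separate functions means the tar handling can be tried on any local archive without touching the network.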
