Reading in a tar.Z file as a pandas data frame in Python 3.7.4?
I want to download a dataset from the UCI repository. The dataset is in the tar.Z format, and ideally I'd like to read it in as a pandas data frame.
I've checked out "uncompressing tar.Z file with python?", which suggested the gzip library, so from https://docs.python.org/3/library/gzip.html I tried using the code below, but I got an error message.
Thanks for any help!
import gzip
with gzip.open('https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z', 'rb') as f:
    file_content = f.read()
ERROR MESSAGE:
OSError: [Errno 22] Invalid argument: 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'
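Two separate things seem to be going on in the attempt above. First, gzip.open expects a local filename or an already-open file object, so passing a URL is treated as a (nonexistent) path. Second, even with the file downloaded, the gzip module reads the gzip format, whereas .Z files are written by the Unix compress tool and begin with a different magic number. The sketch below only illustrates that second point; the helper name sniff_compression and the fake .Z buffer are made up for this example and do not come from the question.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"      # gzip streams start with these two bytes
COMPRESS_MAGIC = b"\x1f\x9d"  # Unix compress (.Z) streams start with these


def sniff_compression(data: bytes) -> str:
    """Return 'gzip', 'compress', or 'unknown' based on the first two bytes."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data[:2] == COMPRESS_MAGIC:
        return "compress"
    return "unknown"


# A real gzip stream is recognised by its magic number.
gz_bytes = gzip.compress(b"hello")
print(sniff_compression(gz_bytes))

# A buffer carrying the .Z magic number (fake payload, for illustration only)
# is rejected by the gzip module.
fake_z = COMPRESS_MAGIC + b"\x90" + b"\x00" * 8
print(sniff_compression(fake_z))
try:
    gzip.GzipFile(fileobj=io.BytesIO(fake_z)).read()
except OSError as exc:  # the gzip module raises an OSError (sub)class here
    print("gzip module refused it:", type(exc).__name__)
```

Checking the first two bytes of the downloaded file this way is a quick way to confirm which decompressor is actually needed before trying to open it.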
I do not think that you can read .Z data with any module in Python's standard library; you could browse PyPI and see if there is a module for the .Z extension. You could, however, use the command line to process the data.
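import subprocess
from io import StringIO

import pandas as pd

# Download the archive with curl and pipe it straight into tar.
# "-f -" makes tar read from stdin, "-z" decompresses through gzip (which can
# also read .Z data), and "-O" writes the matching members to stdout.
data = subprocess.run(
    """curl https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z |
    tar -xzOvf - --wildcards 'Diabetes-Data/data-*' """,
    shell=True,
    capture_output=True,
    text=True,
).stdout
df = pd.read_csv(StringIO(data), sep="\t", header=None)
df.head()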
            0      1   2    3
0  04-21-1991   9:09  58  100
1  04-21-1991   9:09  33  009
2  04-21-1991   9:09  34  013
3  04-21-1991  17:08  62  119
4  04-21-1991  17:08  33  007
You can read this ebook for more on command line options.