
How can I read a tar.gz file directly from a URL into Pandas?

The dataset I wish to read lives on GitHub as a tar.gz file and is updated every few hours. While I can always download this file, uncompress it, and read the CSV, it would be much better if I could read directly from this URL into a Pandas data frame in a timely manner.

After some Googling, I was able to download the compressed file and then read it into a data frame:

import requests
import tarfile
import pandas as pd

# Download file from GitHub
url = "https://github.com/beoutbreakprepared/nCoV2019/blob/master/latest_data/latestdata.tar.gz?raw=true"
target_path = "latestdata.tar.gz"

response = requests.get(url, stream=True)
if response.status_code == 200:
    with open(target_path, "wb") as f:
        f.write(response.raw.read())

# Read from downloaded file
with tarfile.open(target_path, "r:*") as tar:
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=",")

However, I wonder if there is a way to read the file contents directly into a data frame without first saving them locally. This may be useful if I later want to build a web app and don't have a local machine. Any help would be appreciated. Thanks!

You can use BytesIO (an in-memory stream) to keep the data in memory instead of saving the file to the local machine.

Also, as per the tarfile.open documentation: if fileobj is specified, it is used as an alternative to a file object opened in binary mode for name.

>>> import tarfile
>>> from io import BytesIO
>>>
>>> import requests
>>> import pandas as pd


>>> url = "https://github.com/beoutbreakprepared/nCoV2019/blob/master/latest_data/latestdata.tar.gz?raw=true"
>>> response = requests.get(url, stream=True)
>>> with tarfile.open(fileobj=BytesIO(response.raw.read()), mode="r:gz") as tar_file:
...     for member in tar_file.getmembers():
...         f = tar_file.extractfile(member)
...         df = pd.read_csv(f)
...         print(df)

If you use ParData, this can be done pretty cleanly:

from tempfile import TemporaryDirectory

import pardata

schema = {
    'download_url': 'https://github.com/beoutbreakprepared/nCoV2019/blob/master/latest_data/latestdata.tar.gz?raw=true',
    'subdatasets': {
        'all': {
            'path': 'latestdata.csv',
            'format': {
                'id': 'table/csv'
            }
        }
    }
}

with TemporaryDirectory() as d:
    dataset = pardata.dataset.Dataset(schema=schema, data_dir=d)
    dataset.download(verify_checksum=False)
    my_csv = dataset.load()  # my_csv is a pandas.DataFrame object that stores the CSV file

print(my_csv)

Disclaimer: I'm a primary co-maintainer of ParData.
