How to read all CSV files from a web page into a pandas data frame

I would like to load all the CSV files from the following web page into a data frame:

https://s3.amazonaws.com/tripdata/index.html

I tried glob, the way I would to load all files from a directory, but without success:

import glob
import pandas as pd

path = 'https://s3.amazonaws.com/tripdata'  # use your path
allFiles = glob.glob(path + "/*citibike-tripdata.csv.zip")

list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)

Any suggestions?

The glob module finds pathnames matching a pattern on the same system Python is running on; it has no way to index an arbitrary file-hosting web server (and in general that isn't even possible). In your case, since https://s3.amazonaws.com/tripdata/ provides the desired index, you can parse it to get the relevant file names:

import re
import requests
import pandas as pd

url = 'https://s3.amazonaws.com/tripdata/'
# Fetch the bucket index and extract the matching file names
t = requests.get(url).text
filenames = re.findall(r'[^>]+citibike-tripdata\.csv\.zip', t)
# read_csv can read the zipped CSVs directly from their URLs
frame = pd.concat(pd.read_csv(url + f) for f in filenames)
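
Since the index at that URL is a standard S3 bucket listing (ListBucketResult XML), a more robust variant is to parse the XML rather than run a regex over the raw text. Here is a minimal sketch, assuming the listing follows the standard S3 format; the namespace URI and the <Key> tag below are the usual S3 ones, not something taken from the page itself:

import xml.etree.ElementTree as ET
import requests
import pandas as pd

url = 'https://s3.amazonaws.com/tripdata/'
# Assumption: the index is standard S3 ListBucketResult XML
ns = {'s3': 'http://s3.amazonaws.com/doc/2006-03-01/'}
root = ET.fromstring(requests.get(url).content)
keys = [k.text for k in root.findall('.//s3:Key', ns)
        if k.text.endswith('citibike-tripdata.csv.zip')]
frame = pd.concat(pd.read_csv(url + k) for k in keys)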
