Looping through a URL directory fails - Python 3.x

Question

I'm trying to accomplish a fairly simple task...

I want to iterate over all the .csv files in a given GitHub repository, specifically this one:

import pandas as pd, urllib, requests, os, glob
base_url = 'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'
# https://stackoverflow.com/questions/39065921/what-do-raw-githubusercontent-com-urls-represent
base_raw_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

#base_dir = os.listdir(base_url)
#base_raw_dir = os.listdir(base_raw_url)

# https://stackoverflow.com/questions/61036695/import-multiple-csv-files-from-github-folder-python-covid-19
csv_files = glob.glob(base_raw_url+'/*.csv')
print(csv_files)

[]

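csv_files is an empty list, and both os.listdir() attempts result in:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

How can I simply iterate over the directory? In the end I want the full path (URL) of each .csv file.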

Answer 1

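You can't list files at a web address this way. os.listdir() (and likewise glob.glob()) only works on the file system of your local machine, which is why one raises an OSError and the other returns an empty list. What you are trying to do is called "web scraping", and you will want to try "bs4" (Beautiful Soup) for the task. You will need to parse the page's HTML and pull out the link for each file.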
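
As a rough illustration of the "parse the HTML and collect the links" step, here is a minimal sketch. It uses the standard library's html.parser rather than bs4 so that it runs without extra installs; the same idea carries over to Beautiful Soup. The class name CsvLinkParser, the function extract_csv_links, and the sample HTML and example.com address are all hypothetical. It also assumes the listing page exposes files as ordinary <a href> links, which may not hold for pages that build their file list with JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CsvLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .csv."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".csv"):
            # Resolve relative hrefs against the URL of the page itself.
            full_url = urljoin(self.page_url, href)
            if full_url not in self.links:
                self.links.append(full_url)


def extract_csv_links(html, page_url):
    """Return absolute URLs of all .csv links found in an HTML string."""
    parser = CsvLinkParser(page_url)
    parser.feed(html)
    return parser.links


# Hypothetical HTML standing in for a downloaded listing page.
sample_html = """
<a href="/files/a.csv">a.csv</a>
<a href="/files/README.md">README.md</a>
<a href="b.csv">b.csv</a>
"""
sample_links = extract_csv_links(sample_html, "https://example.com/data/")
print(sample_links)
```

In real use the HTML string would come from something like requests.get(page_url).text, and each returned URL could then be passed to pd.read_csv().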

A handy tutorial on BS4: https://realpython.com/beautiful-soup-web-scraper-python/
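
One detail worth noting about the question's base_raw_url: addresses on raw.githubusercontent.com have the form owner/repo/branch/path, with no /tree/ or /blob/ segment, and that host serves individual files rather than directory listings. The sketch below shows one way to turn a github.com file link into its raw counterpart. The helper name github_to_raw_url and the file name example.csv are hypothetical, used only for illustration.

```python
from urllib.parse import urlsplit


def github_to_raw_url(url):
    """Map a github.com /blob/ or /tree/ URL to raw.githubusercontent.com."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: owner / repo / (blob|tree) / branch / path...
    if (
        parts.netloc != "github.com"
        or len(segments) < 4
        or segments[2] not in ("blob", "tree")
    ):
        raise ValueError(f"not a github.com blob/tree URL: {url}")
    owner, repo, _, *rest = segments
    # Drop the blob/tree segment; everything else carries over unchanged.
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


# example.csv is a made-up file name, not a file known to exist in the repo.
page_link = (
    "https://github.com/CSSEGISandData/COVID-19/blob/master/"
    "csse_covid_19_data/csse_covid_19_time_series/example.csv"
)
raw_link = github_to_raw_url(page_link)
print(raw_link)
```

A raw URL built this way is the kind of address pd.read_csv() can read directly, provided the file actually exists at that path.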

Solution 1
Score 3, accepted, 2021-04-14 01:40:20
