
Downloading a CSV file from a dynamic webpage in Python

A CSV file is periodically uploaded to a known, constant URL (url_variable). I want to automatically download the latest version of that CSV file from within a Python script.

I have tried using pandas, specifically pd.read_csv(url_variable), but I receive "HTTP Error 403: Forbidden".

Next I tried using urllib and passing in spoofed headers (headers_variable), specifically urllib.request.Request(url_variable, headers=headers_variable). This technique works. However, when a new CSV file is uploaded to the URL and the script is run again, the old CSV file is returned.

How can I alter my code to download the new CSV file each time this block is called?

Check whether the URL stays the same for new CSV uploads. If it does, simply downloading it again should work.
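If the URL is unchanged and the old file still comes back, one possibility (an assumption; nothing in the question confirms it) is that a cache between the script and the server is returning a stored copy. Below is a minimal sketch using urllib, as in the question, that asks caches to revalidate by sending Cache-Control: no-cache. The function names and the User-Agent value are made up for illustration, and a server or proxy is free to ignore these headers.

```python
import urllib.request


def build_fresh_request(url, extra_headers=None):
    """Build a GET request that asks intermediate caches to revalidate."""
    headers = {
        "User-Agent": "Mozilla/5.0",   # placeholder value, an assumption
        "Cache-Control": "no-cache",   # ask caches to check with the origin server
        "Pragma": "no-cache",          # older HTTP/1.0 equivalent
    }
    # Caller-supplied headers (e.g. the question's headers_variable) win on conflict.
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def fetch_bytes(url, extra_headers=None, timeout=30):
    """Download the resource and return its raw bytes."""
    request = build_fresh_request(url, extra_headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The returned bytes can be wrapped in io.BytesIO and passed to pd.read_csv, for example pd.read_csv(io.BytesIO(fetch_bytes(url_variable, headers_variable))).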

Here's an example of downloading a CSV file into memory and reading it directly with requests and pandas:

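from io import StringIO

import pandas as pd
import requests

if __name__ == "__main__":
    url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
    # Placeholder header; replace it with whatever the server expects.
    headers = {"Authorization": "Test"}
    response = requests.get(url, headers=headers)
    # Fail loudly on 403 and other HTTP errors instead of parsing an error page.
    response.raise_for_status()
    # Parse the downloaded text in memory, without writing a file to disk.
    df = pd.read_csv(StringIO(response.text))
    print(df.shape)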

Of course, adjust the headers as needed. If the file is large, you could download it to a temporary file and process it from there; see: Generate temporary files and directories
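The temporary-file approach could look roughly like the sketch below. It uses only the standard library (urllib.request, shutil, tempfile, csv) so that the response is copied to disk in chunks rather than held in memory all at once. The function names are hypothetical, and count_rows is only a stand-in for whatever processing the script actually does.

```python
import csv
import shutil
import tempfile
import urllib.request


def save_stream_to_tempfile(stream, suffix=".csv"):
    """Copy a binary file-like object to a temporary file and return its path."""
    # delete=False keeps the file on disk after the handle closes;
    # the caller is responsible for removing it when done.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


def download_to_tempfile(url, headers=None, timeout=30):
    """Stream the response body for `url` into a temporary file."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return save_stream_to_tempfile(response)


def count_rows(path):
    """Count CSV rows (including the header) by reading the file row by row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.reader(handle))
```

The path returned by download_to_tempfile can also be handed straight to pandas, for example pd.read_csv(path), or pd.read_csv(path, chunksize=100_000) to iterate over the file in pieces. Remove the file with os.remove(path) once it is no longer needed.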

