How to read a CSV file from a URL with Python?

When I do a curl to an API call link http://example.com/passkey=wedsmdjsjmdd :

curl 'http://example.com/passkey=wedsmdjsjmdd'

I get the employee output data in a CSV file format, like:

"Steve","421","0","421","2","","","","","","","","","421","0","421","2"

How can I parse through this using Python?

I tried:

import csv 
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd',"rb"))
for row in cr:
    print row

but it didn't work and I got an error:

http://example.com/passkey=wedsmdjsjmdd No such file or directory:

Thanks!

Using pandas it is very simple to read a CSV file directly from a URL:

import pandas as pd
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')

This will read your data in tabular format, which will be very easy to process.
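For example, a quick sketch of the kind of processing this enables (the URL is the question's placeholder, and header=None is assumed because the sample feed above has no header row):

import pandas as pd

# header=None because the sample feed above has no header row
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd', header=None)
print(data.head())       # first few rows
print(data.describe())   # summary statistics for the numeric columns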

You need to replace open with urllib.urlopen or urllib2.urlopen.

e.g.

import csv
import urllib2

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    print row

This would output the following:

Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...

The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.

You could do it with the requests module as well:

import csv
import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')

To increase performance when downloading a large file, the below may work a bit more efficiently:

import requests
from contextlib import closing
import csv

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        # Handle each row here...
        print row   

By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader.

This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
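In Python 3, r.iter_lines() yields bytes, so a sketch of the same streaming idea (assuming the feed is UTF-8; the URL is the placeholder from above) decodes each line before handing it to csv.reader():

import csv
from contextlib import closing
import requests

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    # Decode each byte line lazily so csv.reader receives text
    lines = (line.decode('utf-8') for line in r.iter_lines())
    reader = csv.reader(lines, delimiter=',', quotechar='"')
    for row in reader:
        print(row)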

This question is tagged python-2.x, so it didn't seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here's an updated Python 3 solution.

It's now necessary to decode urlopen's response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:

import csv, urllib.request

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)

for row in cr:
    print(row)

Note the extra line beginning with lines =, the fact that urlopen is now in the urllib.request module, and that print of course requires parentheses.
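If you'd rather not read the whole response into a list first, an alternative sketch (assuming a UTF-8 payload) wraps the byte stream in io.TextIOWrapper so csv.reader gets decoded lines lazily:

import csv, io, urllib.request

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
# Wrap the raw byte stream; newline='' is what the csv module expects
cr = csv.reader(io.TextIOWrapper(response, encoding='utf-8', newline=''))

for row in cr:
    print(row)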

It's hardly advertised, but yes, csv.reader can read from a list of strings.
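For example:

import csv

# csv.reader accepts any iterable of strings, such as a plain list
for row in csv.reader(['Year,City,Sport', '1924,Chamonix,Skating']):
    print(row)
# ['Year', 'City', 'Sport']
# ['1924', 'Chamonix', 'Skating']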

And since someone else mentioned pandas, here's a one-liner to display the CSV in a console-friendly output:

python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'

(Yes, it's three lines, but you can copy-paste it as one command. ;)

import pandas as pd
url='https://raw.githubusercontent.com/juliencohensolal/BankMarketing/master/rawData/bank-additional-full.csv'
data = pd.read_csv(url, sep=";")  # use sep="," for comma separation
data.describe()

(The original answer shows a screenshot of the data.describe() output here.)

I am also using this approach for CSV files (Python 3.6.9):

import csv
import io
import requests

url = 'http://winterolympicsmedals.com/medals.csv'  # any CSV URL; this one appears in earlier answers
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
    print(row)
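If the server does not declare the charset correctly, r.text may be decoded with the wrong encoding. A small variant of the same approach (assuming the feed really is UTF-8) forces the encoding before building the buffer:

import csv
import io
import requests

url = 'http://winterolympicsmedals.com/medals.csv'  # any CSV URL
r = requests.get(url)
r.encoding = 'utf-8'  # assumption: the payload is UTF-8; this overrides whatever requests detected
dr = csv.DictReader(io.StringIO(r.text))
for row in dr:
    print(row)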

What you were trying to do with the curl command was to download the file to your local hard drive (HD). You do, however, need to specify a path on the HD:

curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv',"r"))
for row in cr:
    print row
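A pure-Python sketch of the same download-then-parse idea (Python 3 here; the passkey URL is the question's placeholder):

import csv
import urllib.request

# Save the CSV to a local file first, then parse the local copy
urllib.request.urlretrieve('http://example.com/passkey=wedsmdjsjmdd', './example.csv')

with open('./example.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)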



All the above solutions didn't work for me with Python 3. I got all the "famous" error messages, like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) and _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?. So I was a bit stuck here.

My mistake here was that I used response.text while response is a requests.models.Response class, when I should have used response.content instead (as the first error suggested), so that I could decode its UTF-8 correctly and split the lines afterwards. So here is my solution:

import csv
import requests as reqto  # alias for the requests library used below

response = reqto.get("https://example.org/utf8-data.csv")
# Do some error checks to avoid bad results
if response.ok and len(response.content) > 0:
    reader = csv.DictReader(response.content.decode('utf-8').splitlines(), dialect='unix')
    for row in reader:
        print(f"DEBUG: row={row}")

The above example already gives me back a dict for each row, but with a leading # on each dict key, which I may have to live with.
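If that leading # is unwanted, one possible cleanup (a sketch that continues the snippet above, relying on the fact that csv.DictReader lets you reassign fieldnames) is to strip it from the header names before iterating:

reader = csv.DictReader(response.content.decode('utf-8').splitlines(), dialect='unix')
# Strip a leading '#' and surrounding whitespace from each header name
reader.fieldnames = [name.lstrip('#').strip() for name in reader.fieldnames]
for row in reader:
    print(f"DEBUG: row={row}")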
