How to read a CSV file from a URL with Python?
When I do a curl to an API call link http://example.com/passkey=wedsmdjsjmdd
curl 'http://example.com/passkey=wedsmdjsjmdd'
I get the employee output data in a CSV file format, like:
"Steve","421","0","421","2","","","","","","","","","421","0","421","2"
How can I parse through this using Python?
I tried:
import csv
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd', "rb"))
for row in cr:
    print row
but it didn't work, and I got an error:
http://example.com/passkey=wedsmdjsjmdd: No such file or directory
Thanks!
Using pandas, it is very simple to read a CSV file directly from a URL:
import pandas as pd
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process.
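For example, the resulting DataFrame supports the usual tabular operations. A minimal sketch, using inline sample data in place of the URL (read_csv accepts file-like objects as well as URLs, so the processing is identical):

```python
import io

import pandas as pd

# Inline sample standing in for the URL response; with a real URL you
# would call pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
csv_text = "Name,Total,Gold\nSteve,421,2\nAnna,300,1\n"
data = pd.read_csv(io.StringIO(csv_text))

# Typical tabular operations on the parsed data
print(data.shape)           # (2, 3)
print(data["Total"].sum())  # 721
```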
You need to replace open with urllib.urlopen or urllib2.urlopen.
e.g.
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    print row
This would output the following:
Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.
You could do it with the requests module as well:
import csv
import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
To increase performance when downloading a large file, the below may work a bit more efficiently:
import requests
from contextlib import closing
import csv
url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        # Handle each row here...
        print row
By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader. This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
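The lazy-iteration idea can be sketched without a network call: csv.reader consumes any iterator of strings, so a plain generator stands in for r.iter_lines() here (note that under Python 3 you would pass decode_unicode=True to iter_lines so it yields str rather than bytes):

```python
import csv


def parse_rows(line_iter):
    """Lazily parse CSV rows from any iterator of strings,
    e.g. the generator returned by r.iter_lines()."""
    for row in csv.reader(line_iter, delimiter=',', quotechar='"'):
        yield row


# A generator standing in for r.iter_lines(); rows are parsed one at a time
lines = iter(['Year,City,Sport', '1924,Chamonix,Skating'])
rows = list(parse_rows(lines))
print(rows)  # [['Year', 'City', 'Sport'], ['1924', 'Chamonix', 'Skating']]
```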
This question is tagged python-2.x, so it didn't seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good Google juice for "python csv urllib", so here's an updated Python 3 solution.
It's now necessary to decode urlopen's response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv, urllib.request
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)
for row in cr:
    print(row)
Note the extra line beginning with lines =, the fact that urlopen is now in the urllib.request module, and that print of course requires parentheses.
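An alternative to decoding every line up front with readlines() is to wrap the binary response in io.TextIOWrapper, which decodes the byte stream on the fly. A sketch, using io.BytesIO as a stand-in for the file-like object urlopen returns:

```python
import csv
import io

# BytesIO stands in for the binary file-like object returned by urlopen
binary_response = io.BytesIO(b"Year,City\n1924,Chamonix\n")

# TextIOWrapper decodes bytes to str lazily, so no intermediate list is built
text_stream = io.TextIOWrapper(binary_response, encoding='utf-8')
rows = list(csv.reader(text_stream))
print(rows)  # [['Year', 'City'], ['1924', 'Chamonix']]
```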
It's hardly advertised, but yes, csv.reader can read from a list of strings.
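A quick demonstration; quoting is still handled correctly when reading from a plain list:

```python
import csv

# csv.reader accepts any iterable of strings, not just file objects
rows = list(csv.reader(['Name,Medals', '"Smith, J",3']))
print(rows)  # [['Name', 'Medals'], ['Smith, J', '3']]
```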
And since someone else mentioned pandas, here's a one-liner to display the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
(Yes, it's three lines, but you can copy-paste it as one command. ;)
I am also using this approach for CSV files (Python 3.6.9):
import csv
import io
import requests
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
    print(row)
What you were trying to do with the curl command was to download the file to your local hard drive (HD). However, you need to specify a path on the HD:
curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv',"r"))
for row in cr:
    print row
None of the above solutions worked for me with Python 3. I got all the "famous" error messages, like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) and _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?. So I was a bit stuck here.
My mistake here was that I used response.text while response is a requests.models.Response class, whereas I should have used response.content instead (as the first error suggested), so that I could decode its UTF-8 correctly and split the lines afterwards. So here is my solution:
import csv
import requests as reqto

response = reqto.get("https://example.org/utf8-data.csv")
# Do some error checks to avoid bad results
if response.ok and len(response.content) > 0:
    reader = csv.DictReader(response.content.decode('utf-8').splitlines(), dialect='unix')
    for row in reader:
        print(f"DEBUG: row={row}")
The above example already gives me a dict back for each row, but with a leading # on each dict key, which I may have to live with.
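If the leading # is a nuisance, one workaround (a sketch with hypothetical sample data standing in for response.content.decode('utf-8')) is to strip it from the header fields before handing the lines to DictReader:

```python
import csv

# Hypothetical sample standing in for response.content.decode('utf-8')
raw = "#Name,Score\nSteve,421\n"
lines = raw.splitlines()

# Strip the leading '#' from the header fields, then pass them explicitly
fieldnames = [f.lstrip('#') for f in lines[0].split(',')]
reader = csv.DictReader(lines[1:], fieldnames=fieldnames, dialect='unix')
rows = list(reader)
print(rows[0]['Name'])  # Steve
```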