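Reading a CSV file into Pandas
I have CSV data that looks like the sample below, and I am trying to read it into a pandas DataFrame. Despite the ample documentation online, I have tried various combinations without success. For example, I tried: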
pd.read_csv("https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv", delimiter=',', skiprows=0, low_memory=False)
I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 989
Alternatively, a call like this returns an empty DataFrame:
pd.read_csv('https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv', skiprows=2,
skipfooter=3,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
sep='delimiter')
Out[31]:
Empty DataFrame
Columns: []
Index: []
The first 10 lines of the CSV file look like this:
# Water Supply Monthly Volumes for COLUMBIA - THE DALLES DAM (TDAO3)
# Volumes are in KAF
ID,Calendar Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
TDAO3,1948,,,,,,,,,,6866.8,4307.04,4379.38
TDAO3,1949,3546.71,4615.1,8513.31,15020.45,35251.67,21985.99,11226.06,6966.73,4727.37,4406.29,5266.74,5595.91
TDAO3,1950,4353.86,5540.21,9696.27,12854.81,23359.51,39246.78,23393.23,9676.77,5729.74,6990.31,8300.03,8779.57
TDAO3,1951,8032.32,10295.98,7948.59,16144.8,36000.88,28334.09,19735.49,9308.15,6546.95,8907.1,6461.14,6425.76
TDAO3,1952,4671,6222.25,6551.62,18678.3,34866.91,27120.65,15994.18,7907.55,4810.39,3954.32,3259.29,3231.49
TDAO3,1953,7839.72,7870.96,6527.74,9474.66,23384.47,32668.32,17422.63,8655.16,5220.04,5130.46,5183.5,5915.14
TDAO3,1954,5197.51,5967.07,6718.36,10813.69,29190.37,32673.26,29624.38,13456.13,9165.78,5440.92,5732.22,4973.53
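Thank you,

The URL is not a direct link to a CSV file. It links to a page that displays the data as HTML using tags such as <pre> and <br>, and that markup is what breaks the parser.
You can instead use requests to download the page as text.
Then use standard string functions to take the text between <pre> and </pre> and replace each <br> with '\n'. That gives you well-formed CSV text.
Finally, use io.StringIO to create a file-like object in memory and load it with pd.read_csv(), without saving anything to disk.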
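import io

import pandas as pd
import requests

url = "https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error status

# Take only the text between <pre> and </pre>
start = response.text.find('<pre>') + len('<pre>')
end = response.text.find('</pre>')
pre = response.text[start:end]

# The page separates rows with <br>; turn them into real newlines
text = pre.replace('<br>', '\n')

buf = io.StringIO(text)  # file-like object in memory

# skiprows=2 skips the two leading "#" comment lines
df = pd.read_csv(buf, skiprows=2, low_memory=False)
print(df.to_string())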
Result:
ID Calendar Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 TDAO3 1948 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6866.80 4307.04 4379.38
1 TDAO3 1949 3546.71 4615.10 8513.31 15020.45 35251.67 21985.99 11226.06 6966.73 4727.37 4406.29 5266.74 5595.91
2 TDAO3 1950 4353.86 5540.21 9696.27 12854.81 23359.51 39246.78 23393.23 9676.77 5729.74 6990.31 8300.03 8779.57
3 TDAO3 1951 8032.32 10295.98 7948.59 16144.80 36000.88 28334.09 19735.49 9308.15 6546.95 8907.10 6461.14 6425.76
4 TDAO3 1952 4671.00 6222.25 6551.62 18678.30 34866.91 27120.65 15994.18 7907.55 4810.39 3954.32 3259.29 3231.49
5 TDAO3 1953 7839.72 7870.96 6527.74 9474.66 23384.47 32668.32 17422.63 8655.16 5220.04 5130.46 5183.50 5915.14
6 TDAO3 1954 5197.51 5967.07 6718.36 10813.69 29190.37 32673.26 29624.38 13456.13 9165.78 5440.92 5732.22 4973.53
7 TDAO3 1955 4124.26 3570.41 3843.46 7993.82 18505.47 31619.54 20408.54 8922.94 4983.31 5842.70 6982.45 9076.44
8 TDAO3 1956 8079.70 5366.62 8818.69 19754.46 40600.06 40447.34 19846.89 9726.93 5503.69 5446.20 4988.98 6006.80
9 TDAO3 1957 3940.08 4411.33 9155.00 12271.77 40111.86 27864.70 11585.75 6795.70 4613.31 4767.38 4087.55 4789.04
10 TDAO3 1958 4838.12 8246.89 7303.03 13902.66 33958.88 26239.62 12516.52 6898.78 4968.03 5198.19 6662.24 7616.43
... rest ...
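
If the page markup varies slightly (for example <br/> instead of <br>, or a <pre> tag with attributes), the plain string search above may miss it. The following is only a sketch of a more tolerant extraction step, using the standard library alone. The helper name extract_pre_csv and the sample_page string are made up for illustration; the real page may differ from this fragment.

```python
import csv
import html
import io
import re

# Hypothetical helper and pattern names; not part of pandas or requests.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)
LINE_BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)


def extract_pre_csv(page: str) -> str:
    """Return the text of the first <pre> block with <br> tags turned into newlines."""
    match = PRE_BLOCK.search(page)
    if match is None:
        raise ValueError("no <pre>...</pre> block found in the page")
    text = LINE_BREAK.sub("\n", match.group(1))
    # Decode entities such as &amp; and drop blank lines left by the markup.
    lines = [line.strip() for line in html.unescape(text).splitlines()]
    return "\n".join(line for line in lines if line) + "\n"


# Made-up page fragment that mimics the structure described in the answer.
sample_page = (
    "<html><body><pre># Water Supply Monthly Volumes<br>"
    "# Volumes are in KAF<br>"
    "ID,Calendar Year,Jan,Feb<br>"
    "TDAO3,1949,3546.71,4615.1<br></pre></body></html>"
)

csv_text = extract_pre_csv(sample_page)

# Parse with the csv module, skipping the "#" comment lines.
rows = [row for row in csv.reader(io.StringIO(csv_text)) if not row[0].startswith("#")]
print(rows)
```

The returned string can be wrapped in io.StringIO and passed to pd.read_csv() in the same way as in the answer. Because this helper drops blank lines, passing comment='#' to pd.read_csv() may be a safer way to ignore the two header comments than counting them with skiprows.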