简体   繁体   English

将 CSV 文件读入 Pandas

[英]Reading a CSV file into Pandas

I have csv data that looks like this and I'm trying to read it into a pandas df and I've tired all sorts of combinations given the ample documentation online - I've tried things like:我有 csv 数据,看起来像这样,我正在尝试将其读入 pandas df 并且鉴于在线文档充足,我已经厌倦了各种组合 - 我尝试过类似的东西:

pd.read_csv("https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv", delimiter=',', skiprows=0, low_memory=False)

and I get this error -我得到这个错误 -

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 989

Or, like this but get an empty dataframe:或者,像这样得到一个空的 dataframe:

pd.read_csv('https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv', skiprows=2, 
skipfooter=3,index_col=[0], header=None,
             engine='python', # c engine doesn't have skipfooter
             sep='delimiter')
Out[31]: 
Empty DataFrame
Columns: []
Index: []

The first 10 lines of the csv file look like this: csv 文件的前 10 行如下所示:

# Water Supply Monthly Volumes for COLUMBIA - THE DALLES DAM (TDAO3) 
# Volumes are in KAF 
ID,Calendar Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
TDAO3,1948,,,,,,,,,,6866.8,4307.04,4379.38
TDAO3,1949,3546.71,4615.1,8513.31,15020.45,35251.67,21985.99,11226.06,6966.73,4727.37,4406.29,5266.74,5595.91
TDAO3,1950,4353.86,5540.21,9696.27,12854.81,23359.51,39246.78,23393.23,9676.77,5729.74,6990.31,8300.03,8779.57
TDAO3,1951,8032.32,10295.98,7948.59,16144.8,36000.88,28334.09,19735.49,9308.15,6546.95,8907.1,6461.14,6425.76
TDAO3,1952,4671,6222.25,6551.62,18678.3,34866.91,27120.65,15994.18,7907.55,4810.39,3954.32,3259.29,3231.49
TDAO3,1953,7839.72,7870.96,6527.74,9474.66,23384.47,32668.32,17422.63,8655.16,5220.04,5130.46,5183.5,5915.14
TDAO3,1954,5197.51,5967.07,6718.36,10813.69,29190.37,32673.26,29624.38,13456.13,9165.78,5440.92,5732.22,4973.53

thank you,谢谢你,

It is not link directly to file CSV but to page which displays it as HTML using tags <pre> , <br> , etc. and this makes problem.它不是直接链接到文件 CSV,而是链接到使用标签<pre><br>等将其显示为 HTML 的页面,这会产生问题。

But you can use requests to download it as text.但是您可以使用requests将其下载为文本。

Later you can use standard string -functions to get text between <pre> and </pre> and replace <br> with '\n' - and you will have text with correct CSV.稍后您可以使用标准string函数获取<pre></pre>之间的文本并将<br>替换为'\n' - 您将获得正确的文本 CSV。

And later you can use io.StringIO to create file in memory - to load it with pd.read_csv() without saving on disk.稍后您可以使用io.StringIO在 memory 中创建文件 - 使用pd.read_csv()加载它而不保存在磁盘上。

import pandas as pd
import requests
import io

url = "https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv"

response = requests.get(url)

start = response.text.find('<pre>') + len('<pre>')
end   = response.text.find('</pre>')

pre = response.text[start:end]

text = pre.replace('<br>', '\n')

buf = io.StringIO(text)  # file-like object in memory

df = pd.read_csv(buf, skiprows=2, low_memory=False)

print(df.to_string())

Result结果

      ID  Calendar Year       Jan       Feb       Mar       Apr       May       Jun       Jul       Aug      Sep       Oct      Nov       Dec
0   TDAO3           1948       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN      NaN   6866.80  4307.04   4379.38
1   TDAO3           1949   3546.71   4615.10   8513.31  15020.45  35251.67  21985.99  11226.06   6966.73  4727.37   4406.29  5266.74   5595.91
2   TDAO3           1950   4353.86   5540.21   9696.27  12854.81  23359.51  39246.78  23393.23   9676.77  5729.74   6990.31  8300.03   8779.57
3   TDAO3           1951   8032.32  10295.98   7948.59  16144.80  36000.88  28334.09  19735.49   9308.15  6546.95   8907.10  6461.14   6425.76
4   TDAO3           1952   4671.00   6222.25   6551.62  18678.30  34866.91  27120.65  15994.18   7907.55  4810.39   3954.32  3259.29   3231.49
5   TDAO3           1953   7839.72   7870.96   6527.74   9474.66  23384.47  32668.32  17422.63   8655.16  5220.04   5130.46  5183.50   5915.14
6   TDAO3           1954   5197.51   5967.07   6718.36  10813.69  29190.37  32673.26  29624.38  13456.13  9165.78   5440.92  5732.22   4973.53
7   TDAO3           1955   4124.26   3570.41   3843.46   7993.82  18505.47  31619.54  20408.54   8922.94  4983.31   5842.70  6982.45   9076.44
8   TDAO3           1956   8079.70   5366.62   8818.69  19754.46  40600.06  40447.34  19846.89   9726.93  5503.69   5446.20  4988.98   6006.80
9   TDAO3           1957   3940.08   4411.33   9155.00  12271.77  40111.86  27864.70  11585.75   6795.70  4613.31   4767.38  4087.55   4789.04
10  TDAO3           1958   4838.12   8246.89   7303.03  13902.66  33958.88  26239.62  12516.52   6898.78  4968.03   5198.19  6662.24   7616.43

... rest ...

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM