How can I read only the header column of a CSV file using Python?
I am looking for a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available for each CSV file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems like a gigantic waste to read in the entire file just to pull the header line.
My end goal is to pull out the unique column names. I can do that once I have a list of the column headers in each of these files.
How can I extract only the header row of a CSV file, quickly?
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
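Since the stated goal is a set of unique column names across many files, the `nrows=0` call can be wrapped in a loop. The following is only a minimal sketch: the function name `unique_columns` and the glob pattern argument are made up for illustration, and it assumes every file can be parsed with the default `read_csv` settings (comma separator, header on the first line).

```python
import glob
import os

import pandas as pd


def unique_columns(pattern):
    """Return the set of column names found across all CSV files matching pattern.

    `unique_columns` and `pattern` are hypothetical names, not part of pandas.
    """
    names = set()
    for path in glob.glob(pattern):
        # nrows=0 asks pandas for the header only; no data rows are loaded.
        names.update(pd.read_csv(path, nrows=0).columns)
    return names
```

Calling `unique_columns("*.csv")` would then scan the current directory. If the files use another delimiter or encoding, the matching `sep=` or `encoding=` argument would have to be passed to `read_csv` as well.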
I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because of its Unicode handling.
This is very close to your original suggestion, except that I'm only reading in one row rather than the whole file.
import csv

with open(fpath, 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    # Accessing fieldnames reads only the first row of the file
    fieldnames = reader.fieldnames
Hopefully that helps!
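The same idea can be packaged as a small helper built on `csv.reader` and `next()`. This is a sketch rather than a definitive implementation: the name `read_header` is invented here, and the default `utf-8-sig` encoding is an assumption (it reads plain UTF-8 and also strips a leading byte order mark, which some spreadsheet exports add).

```python
import csv


def read_header(path, encoding="utf-8-sig"):
    """Return the first row of a CSV file as a list of column names.

    `read_header` is a hypothetical helper; the encoding default is an assumption.
    """
    # newline="" is what the csv module documentation recommends for reader input.
    with open(path, newline="", encoding=encoding) as infile:
        reader = csv.reader(infile)
        # next() pulls a single row; the [] default covers an empty file.
        return next(reader, [])
```

Because the reader is lazy, only the header row is parsed, regardless of how large the file is. Using `csv.reader` instead of splitting on commas keeps quoted column names such as `"last, first"` intact.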
Here's one way. You get 1 row.
In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')
In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]:
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505
I've used iglob as an example to search for the .csv files; one way is to use a set, then adjust as necessary, e.g.:
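import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    # On Python 3, open in text mode with newline='' (the 'rb' mode is for Python 2)
    with open(filename, newline='') as fin:
        csvin = csv.reader(fin)
        # next() reads just the header row; [] is the fallback for an empty file
        unique_headers.update(next(csvin, []))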
What about:
pandas.read_csv(PATH_TO_CSV, nrows=1).columns
That'll read the first row only and return the columns found.
You have missed the nrows=1 param to read_csv:
>>> df = pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and super fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:
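import glob
import os

# os.path.join builds the pattern portably instead of hard-coding a backslash
for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        first_line = f.readline()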
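One way to use those raw header strings for comparison is to group the files by their first line, and to split a line into column names only when needed. This is a rough sketch under stated assumptions: `group_by_header` and `split_header` are hypothetical names, and it assumes the header fits on one physical line (a quoted column name containing a newline would be cut off by `readline()`).

```python
import csv
from collections import defaultdict


def group_by_header(paths):
    """Map each distinct header line to the list of files that start with it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="") as f:
            # Only the first line is read; the line ending is stripped for comparison.
            first_line = f.readline().rstrip("\r\n")
        groups[first_line].append(path)
    return dict(groups)


def split_header(first_line, delimiter=","):
    """Turn a raw header line into column names, honouring CSV quoting."""
    return next(csv.reader([first_line], delimiter=delimiter), [])
```

Files that end up under the same key have byte-for-byte identical headers, so `split_header` only needs to run once per distinct key rather than once per file.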
import pandas as pd
get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)
It's easy; you can use this:
df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()
This way you only read a few rows to get your header.
If you are only interested in the headers and want to use pandas, the only extra thing you need to pass besides the CSV file name is "nrows=0":
headers = pd.read_csv("test.csv", nrows=0)