How can I read only the header column of a CSV file using Python?

I am looking for a way to read just the header row of a large number of large CSV files.

Using pandas, I have this method available for each CSV file:

>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could also do this with just the csv module:

>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems like a gigantic waste to read in the entire file just to pull the header line.

My end goal in all of this is to pull out the unique column names. I can do that once I have a list of the column headers in each of these files.

How can I extract only the header row of a CSV file, quickly?

Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')

In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']

pandas can have the advantage that it deals more gracefully with CSV encodings.
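To illustrate the encoding point with the standard library only: one common snag is a UTF-8 byte order mark (BOM) at the start of the file, which ends up glued to the first column name unless the file is opened with the `utf-8-sig` codec. The sketch below is only an illustration; the helper name `read_header` and the file `bom.csv` are made up for the example.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first row of a CSV file, decoded with the given encoding."""
    with open(path, encoding=encoding, newline="") as infile:
        # next() pulls a single row; the [] default covers an empty file.
        return next(csv.reader(infile), [])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom.csv")  # hypothetical sample file
    # Writing with utf-8-sig puts a BOM at the start of the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as out:
        out.write("a,b\r\n1,2\r\n")

    plain = read_header(path, "utf-8")         # BOM stays on the first name
    stripped = read_header(path, "utf-8-sig")  # BOM is removed

print(plain)
print(stripped)
```

If your files come from mixed sources, it may be worth checking the first column name for a leading `\ufeff` before comparing headers across files.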

I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer Python 3 because of its Unicode handling. This is very close to your original suggestion, except that I'm only reading in one row rather than the whole file.

import csv

# fpath is the path to your CSV file
with open(fpath, 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    # accessing fieldnames reads only the first row
    fieldnames = reader.fieldnames

Hopefully that helps!
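If you want to convince yourself that `fieldnames` really pulls just one row, here is a rough check. It wraps the input in a small counting iterator (the class `CountingLines` is invented for this sketch, it is not part of the csv module) and looks at how many lines `csv.DictReader` asked for. Note that a real file object still reads from disk in buffered chunks, so this counts lines handed to the parser, not bytes read.

```python
import csv
import io


class CountingLines:
    """Iterator wrapper that counts how many lines have been consumed."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)
        self.count += 1
        return line


# In-memory stand-in for a large file: one header line plus 1000 data rows.
data = io.StringIO("a,b,c\n" + "1,2,3\n" * 1000)
lines = CountingLines(data)

reader = csv.DictReader(lines)
header = reader.fieldnames  # triggers reading the header row only

print(header, lines.count)
```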

Here's one way. You get 1 row.

In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')

In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]: 
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505

I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, e.g.:

import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
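    # Python 3: open in text mode with newline='' (csv expects str, not bytes)
    with open(filename, 'r', newline='') as fin:
        csvin = csv.reader(fin)
        # next() reads only the first row; the [] default covers empty files
        unique_headers.update(next(csvin, []))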
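A possible variation, if you also want to know which file each column came from: keep the header per file and build the set afterwards. This is just a sketch; the function name `headers_by_file` and the sample file names are made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def headers_by_file(directory, pattern="*.csv"):
    """Map each matching file name to its header row (first row only)."""
    result = {}
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, newline="") as fin:
            # Only the first row is parsed; empty files map to [].
            result[path.name] = next(csv.reader(fin), [])
    return result


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files standing in for the real data.
    Path(tmp, "a.csv").write_text("id,name\n1,x\n")
    Path(tmp, "b.csv").write_text("id,price\n1,9.5\n")
    Path(tmp, "empty.csv").write_text("")
    headers = headers_by_file(tmp)

# Union of all header rows gives the unique column names.
unique_headers = set().union(*headers.values())
print(headers)
print(sorted(unique_headers))
```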

What about:

pandas.read_csv(PATH_TO_CSV, nrows=1).columns

That reads only the header and the first data row, and returns the columns found.

You missed the nrows=1 parameter to read_csv:

>>> df= pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns

It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and very fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:

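import glob
import os

for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        # readline() reads only the first line of the file
        first_line = f.readline()

import pandas as pd

# header of a pipe-delimited file
get_col = list(pd.read_csv("first_test_pipe.csv", sep="|", nrows=1).columns)
print(get_col)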
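For files that use another delimiter, such as the pipe-separated file above, the first-line approach also works with the standard library alone: read one line, then let `csv.reader` split it so quoted names are handled. A minimal sketch, assuming the header fits on a single physical line (a quoted column name containing a newline would break it); `header_from_first_line` and `pipe.csv` are names invented for the example.

```python
import csv
import os
import tempfile


def header_from_first_line(path, delimiter=","):
    """Read one line from the file and split it into column names."""
    with open(path, newline="") as f:
        first_line = f.readline()
    # csv.reader accepts any iterable of lines, so a one-item list works.
    return next(csv.reader([first_line], delimiter=delimiter), [])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipe.csv")  # hypothetical sample file
    with open(path, "w", newline="") as out:
        out.write('id|"full name"|price\n1|x|9.5\n')
    header = header_from_first_line(path, delimiter="|")

print(header)
```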

It is easy; you can use this:

df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()

This way you read only a few rows to get your header.

If you are only interested in the headers and want to use pandas, the only extra thing you need to pass besides the CSV file name is "nrows=0":

headers = pd.read_csv("test.csv", nrows=0)
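Putting this together with the original goal of collecting unique column names, something along these lines might work. It is a sketch only: `unique_columns` and the two sample files are made up, and it assumes the files have no unnamed index column (otherwise pass `index_col=0` as in the earlier answer).

```python
import tempfile
from pathlib import Path

import pandas as pd


def unique_columns(paths):
    """Return the sorted union of column names across the given CSV files."""
    names = set()
    for path in paths:
        # nrows=0 returns an empty DataFrame that still carries the columns.
        names.update(pd.read_csv(path, nrows=0).columns)
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files standing in for the real data.
    first = Path(tmp) / "first.csv"
    second = Path(tmp) / "second.csv"
    first.write_text("id,name\n1,x\n")
    second.write_text("id,price\n1,9.5\n")
    columns = unique_columns([first, second])

print(columns)
```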


