Python Pandas read_csv with nrows=1
I have this code that reads a text file with headers and appends another file with the same headers to it. As the main file is very large, I only want to read part of it to get the column headers. I get the error below if the only line in the file is the header, and I do not know in advance how many rows the file has. What I would like to achieve is to read the file and get its column headers; because I want to append another file to it, I am trying to ensure the columns match.
import csv
import pandas as pd

main = pd.read_csv(main_input, nrows=1)
data = pd.read_csv(file_input)
data = data.reindex(columns=main.columns)  # reindex_axis was removed in later pandas versions
data.to_csv(main_input,
            quoting=csv.QUOTE_ALL,
            mode='a', header=False, index=False)
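If the goal is only to grab the column headers safely, even when the file contains nothing but the header line, then passing nrows=0 sidesteps the StopIteration entirely. A minimal sketch (the in-memory buffer stands in for the real file):

```python
import io
import pandas as pd

# A file that contains only a header line -- the case where nrows=1 fails.
header_only = io.StringIO("colA,colB,colC\n")

# nrows=0 parses the header but requests zero data rows, so it never
# raises StopIteration; the empty DataFrame still carries the columns.
main = pd.read_csv(header_only, nrows=0)
print(list(main.columns))  # ['colA', 'colB', 'colC']
```

The same call works unchanged on a huge file, since only the header line is needed to build the (empty) frame.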
Examine the stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 420, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 221, in _read
return parser.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 626, in read
ret = self._engine.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1070, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7110)
File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7671)
StopIteration
It seems that the whole file may be being read into memory. You can specify chunksize= in read_csv(...), as discussed in the docs here.
I think read_csv's memory usage was overhauled in version 0.10, so your pandas version makes a difference too; see this answer from @WesMcKinney and the associated comments. The changes were also discussed a while ago on Wes' blog.
import pandas as pd
from io import StringIO  # cStringIO on Python 2
csv_data = """\
header, I want
0.47094534, 0.40249001,
0.45562164, 0.37275901,
0.05431775, 0.69727892,
0.24307614, 0.92250565,
0.85728819, 0.31775839,
0.61310243, 0.24324426,
0.669575 , 0.14386658,
0.57515449, 0.68280618,
0.58448533, 0.51793506,
0.0791515 , 0.33833041,
0.34361147, 0.77419739,
0.53552098, 0.47761297,
0.3584255 , 0.40719249,
0.61492079, 0.44656684,
0.77277236, 0.68667805,
0.89155627, 0.88422355,
0.00214914, 0.90743799
"""
tfr = pd.read_csv(StringIO(csv_data), header=None, chunksize=1)
main = tfr.get_chunk()  # reads only the first row (here, the header line)
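Putting this together for the original append problem, here is a sketch (file contents and names are illustrative) that takes the columns from the first chunk of the main file, aligns the second file to them, and writes the aligned rows out for appending:

```python
import csv
import io
import pandas as pd

main_csv = io.StringIO("a,b\n1,2\n3,4\n")    # stand-in for main_input
other_csv = io.StringIO("b,a\n20,10\n")      # same columns, different order

# Read only the first chunk of the (potentially huge) main file
# to learn its column order without loading the whole thing.
reader = pd.read_csv(main_csv, chunksize=1)
main_cols = reader.get_chunk().columns

# Align the second file's columns to the main file's order.
data = pd.read_csv(other_csv).reindex(columns=main_cols)

# Write without a header, ready to be appended (mode='a' on a real path).
out = io.StringIO()
data.to_csv(out, quoting=csv.QUOTE_ALL, header=False, index=False)
```

Because the columns were reindexed to `['a', 'b']`, the row `b=20, a=10` comes out as `"10","20"`, matching the main file's layout.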