Add calculated column to pandas dataframe on the fly while iterating over the lines of a csv file?
I have a large space-separated input file, input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I pass the iterator=True argument to pandas.read_csv, it returns a TextFileReader / TextParser object. This allows me to filter the file on the fly and select only the rows for which column A is greater than 2.
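As a rough illustration of that on-the-fly filtering, here is a minimal sketch. The in-memory `sample` text stands in for input.csv, and `chunksize=1` is an assumption chosen only to make the chunking visible; neither is part of the original setup.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for input.csv so the sketch is self-contained.
sample = StringIO("## Header\n# More header here\nA B\n1 2\n3 4\n")

# chunksize=1 is an assumption for illustration; a real run would use a far larger value.
reader = pd.read_csv(sample, sep=' ', comment='#', chunksize=1)

# Each iteration yields an ordinary DataFrame holding at most `chunksize` rows,
# so the filter is applied to one small piece of the file at a time.
kept = [chunk[chunk['A'] > 2] for chunk in reader]
filtered = pd.concat(kept)
print(filtered)
```

Only the filtered pieces are kept around until the final concat, which is what keeps memory use bounded by the chunk size plus the rows that pass the filter.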
But how do I add a third column to the dataframe on the fly, without having to loop over all of the data once more?
Specifically, I want column C to be equal to column A multiplied by the value in a dictionary d that has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
This prints the following output:
   A  B
1  3  4
How do I get it to print this output (C = A*d[B]):
   A  B  C
1  3  4  9
You can use a generator to work on the chunks one at a time:
Code:
import pandas as pd

def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        # Boolean mask of the rows to keep
        rows_idx = chunk['A'] > 2
        # C = A * d[B], computed only for the kept rows
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
import pandas as pd
df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
   A  B     C
1  3  4   9.0
2  4  4  12.0
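Column C comes out as float here because assigning through .loc to only some rows leaves NaN in the others before they are filtered away. As a possible variation (a sketch, not the answer's original method), the rows can be filtered first and the dictionary lookup done with Series.map instead of a row-wise apply. The function name on_the_fly_vectorized, the d argument, and the chunksize default are assumptions introduced for illustration. chunksize is passed explicitly because, as far as I can tell, iterating a reader created with only iterator=True returns all remaining rows in a single chunk.

```python
from io import StringIO

import pandas as pd


def on_the_fly_vectorized(the_csv, d, chunksize=100_000):
    """Yield filtered chunks with column C = A * d[B] added (hypothetical helper)."""
    # chunksize is an assumed tuning knob; it bounds how many rows are parsed at once.
    reader = pd.read_csv(the_csv, sep=r'\s+', comment='#', chunksize=chunksize)
    for chunk in reader:
        # Filter first and copy, so the new column is added to an independent frame.
        kept = chunk[chunk['A'] > 2].copy()
        # Series.map looks up each B value in d; a B value missing from d
        # becomes NaN instead of raising KeyError as d[x.B] would.
        kept['C'] = kept['A'] * kept['B'].map(d)
        yield kept


data = StringIO("#\nA B\n1 2\n3 4\n4 4\n")
# chunksize=2 here only so that the small sample spans more than one chunk.
df = pd.concat([c for c in on_the_fly_vectorized(data, {2: 2, 4: 3}, chunksize=2)])
print(df)
```

With this sample every B value is present in d, so C should stay an integer column (9 and 12), matching the output asked for in the question. If some keys may be missing, the NaN results would need handling (for example with dropna) before relying on the values.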