
How can I test the value of a column for each row?

I have a dataframe with 10 columns and around 20,000,000 rows. I need to compare the values of the 10 columns row by row and create five new columns from the results. To do this, I defined functions built from if statements and applied them to each row.

For example:

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2,3,4,5], 'b':[11,12,13,14,16], 'c':[21,22,23,24,25], 'd':[31,32,33,34,35]})


>>> def cal1(row):
...     v1 = 0
...     v2 = 0
...     if 0 < row['a'] < 2:
...         v1 = 1
...     if 11 < row['b'] < 14:
...         v2 = 1
...     return v1 + v2


>>> def cal2(row):
...     v1 = 0
...     v2 = 0
...     if 2 <= row['a'] < 4:
...         v1 = -1
...     if 14 <= row['b'] <= 16:
...         v2 = -1
...     return v1 + v2

>>> df['n1'] = df.apply(cal1, axis=1)
>>> df['n2'] = df.apply(cal2, axis=1)

I was able to get the answer this way, but it required five such functions, each with a long list of conditions, and the calculation was too slow. (The actual data has to be tested on all 10 columns, with at least 10 conditions.)

Is there a better way than this to test the data in each column, row by row?

apply() accepts several parameters: func and also args, which are

Positional arguments to pass to function in addition to the array/series

You could pass in, e.g., (0, 2, 11, 14) so that a single, more generic function can compute n1, and (2, 4, 14, 16) for n2. Alternatively, pass in the column name and let the function make decisions based on that.
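For example, a minimal sketch of such a generic scorer (the function name score and the strict < comparisons are assumptions, mirroring cal1 above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [11, 12, 13, 14, 16]})

def score(row, a_lo, a_hi, b_lo, b_hi):
    # one point for each column whose value falls strictly inside its bounds
    v1 = 1 if a_lo < row['a'] < a_hi else 0
    v2 = 1 if b_lo < row['b'] < b_hi else 0
    return v1 + v2

# the extra positional arguments are forwarded to score() by apply()
df['n1'] = df.apply(score, axis=1, args=(0, 2, 11, 14))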

There's a fair amount of CPU overhead and memory footprint in processing 20 M rows through apply. You might find it more performant to read each row to be scored with csv.reader, emit the result with csv.writer, and then have pandas import the augmented CSV file.
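A rough sketch of that streaming approach (the file names, column names, and the single n1 score are assumptions here):

import csv

with open('data.csv', newline='') as src, open('scored.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['n1'])
    writer.writeheader()
    for row in reader:
        a, b = float(row['a']), float(row['b'])
        # same conditions as cal1 above, computed one row at a time
        row['n1'] = (1 if 0 < a < 2 else 0) + (1 if 11 < b < 14 else 0)
        writer.writerow(row)

pd.read_csv('scored.csv') would then load the augmented file.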

IIUC, you only ever assign a small fixed set of values to each row. For example, n1 is either 0, 1, or 2. If that's the case, you can just start n1 at 0 and add with indexing:

df['n1'] = 0

mask1 = df.a.between(0, 2, inclusive=False) 
mask2 = df.b.between(11, 14, inclusive=False)

df.loc[mask1 | mask2, 'n1'] = 1
df.loc[mask1 & mask2, 'n1'] = 2
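The same pattern extends to n2 (a sketch; the bounds are copied from cal2 above, and plain comparisons are used instead of between because on recent pandas versions the inclusive argument takes strings such as 'left' rather than a boolean):

df['n2'] = 0

mask3 = (df.a >= 2) & (df.a < 4)     # 2 <= a < 4
mask4 = (df.b >= 14) & (df.b <= 16)  # 14 <= b <= 16

df.loc[mask3 | mask4, 'n2'] = -1
df.loc[mask3 & mask4, 'n2'] = -2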
