
Pandas DataFrame: Writing values to column depending on a value check of existing column

I want to add a column to a pd.DataFrame in which I write values based on a check in an existing column.

I want to check for values in a dictionary. Let's say I have the following dictionary:

{"<=4":[0,4], "(4,10]":[4,10], ">10":[10,inf]}

Now I want to check whether the values in a DataFrame column fall into any of the intervals in the dictionary. If so, I want to write the matching dictionary key to a second column of the same DataFrame.

So a DataFrame like:

     col_1
  a    3
  b    15
  c    8

will become:

     col_1   col_2
  a    3     "<=4"
  b    15    ">10"
  c    8     "(4,10]"

pd.cut() converts a continuous variable into a categorical one. With the bin edges [0, 4, 10, np.inf] it creates three bins: (0, 4], (4, 10] and (10, inf]. Each bin is closed on the right by default (right=True), so 4 falls into the first bin and 10 into the second. Note that the lowest edge, 0, is excluded unless you pass include_lowest=True.

The labels parameter then names the bins in the same order. Passing ['<=4', '(4,10]', '>10'] names the (0, 4] bin <=4, the (4, 10] bin (4,10], and the (10, inf] bin >10.

In [83]:
df['col_2'] = pd.cut(df.col_1, [0, 4, 10, np.inf], labels=['<=4', '(4,10]', '>10'])
df
Out[83]:
   col_1    col_2
0   3       <=4
1   15      >10
2   8       (4,10]
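
As a minimal sketch of the same idea, the bin edges and labels can be derived from the question's dictionary instead of being typed out by hand. This assumes the intervals are contiguous and listed in ascending order; the names intervals, edges and labels, and the extra rows with values 4 and 0, are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# The question's dictionary: label -> [lower, upper]
intervals = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, np.inf]}

# Derive labels and bin edges from the dictionary (assumes the intervals
# are contiguous and listed in ascending order).
labels = list(intervals)
edges = [intervals[labels[0]][0]] + [upper for _, upper in intervals.values()]

# Rows d and e are added here only to show the boundary behaviour.
df = pd.DataFrame({"col_1": [3, 15, 8, 4, 0]}, index=list("abcde"))

# right=True (the default) closes each bin on the right: (0, 4], (4, 10], (10, inf].
# include_lowest=True also closes the first bin on the left, so 0 gets a label.
df["col_2"] = pd.cut(df["col_1"], bins=edges, labels=labels, include_lowest=True)
print(df)
```

Without include_lowest=True, the row with value 0 would get NaN in col_2, because 0 is not inside (0, 4].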

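Another approach is to turn the dictionary into a lookup DataFrame and map each value to the first key whose interval contains it: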

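# Lookup table: one row per label, column 0 = lower bound, column 1 = upper bound.
dico = pd.DataFrame({"<=4":[0,4], "(4,10]":[4,10], ">10":[10,float('inf')]}).transpose()

# Return the first label whose interval contains x. Intervals are treated as
# [lower, upper), so a value of exactly 4 maps to "(4,10]"; a value that matches
# no interval raises IndexError.
foo = lambda x: dico.index[(dico[1]>x) & (dico[0]<=x)][0]

df['col_1'].map(foo)

#0       <=4
#1       >10
#2    (4,10]
#Name: col_1, dtype: object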
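
A vectorised variant of this lookup can be sketched with pd.IntervalIndex, which is a different technique from the lambda above, not the answer author's method. It closes each interval on the right so that the results agree with the labels, and it writes None when no interval matches. The names labels, intervals and positions are illustrative, and the extra row with value 0 is added only to show the no-match case.

```python
import numpy as np
import pandas as pd

labels = np.array(["<=4", "(4,10]", ">10"], dtype=object)

# Right-closed intervals (0, 4], (4, 10], (10, inf], in the same order as labels.
intervals = pd.IntervalIndex.from_tuples(
    [(0.0, 4.0), (4.0, 10.0), (10.0, np.inf)], closed="right"
)

df = pd.DataFrame({"col_1": [3, 15, 8, 0]}, index=list("abcd"))

# get_indexer returns the position of the interval containing each value,
# or -1 when no interval contains it.
positions = intervals.get_indexer(df["col_1"].to_numpy())

# Map positions to labels; values that matched nothing become None.
df["col_2"] = np.where(positions >= 0, labels[positions], None)
print(df)
```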

This solution defines a function, extract_str, and applies it to col_1. A conditional list comprehension iterates over the dictionary's keys and ranges and keeps every key whose range satisfies lower <= value < upper. If more than one range matches, a ValueError is raised. If exactly one matches, its key is returned; otherwise the function returns None by default. Note that this check treats each range as closed on the left and open on the right, so a value of exactly 4 is labelled (4,10] rather than <=4.

from numpy import inf

d = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, inf]}

def extract_str(val):
    results = [key for key, value_range in d.items()
               if value_range[0] <= val < value_range[1]]
    if len(results) > 1:
        raise ValueError('Multiple ranges satisfied.')
    if results:
        return results[0]

df['col_2'] = df.col_1.apply(extract_str)

>>> df
   col_1   col_2
a      3     <=4
b     15     >10
c      8  (4,10]

On this small DataFrame, this solution is about six to seven times faster than the map(foo) solution provided by @ColonelBeauvel (220 µs versus 1.46 ms).

%timeit df['col_2'] = df.col_1.apply(extract_str)
1000 loops, best of 3: 220 µs per loop

%timeit df['col_2'] = df['col_1'].map(foo)
1000 loops, best of 3: 1.46 ms per loop
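
Timings on a three-row frame say little about how an approach scales. The following is a rough sketch for repeating the comparison on a larger frame, assuming a hypothetical column of 10,000 random integers. It compares apply(extract_str) with pd.cut, using right=False so that both use the same [lower, upper) boundaries. The numbers it prints depend on the machine and the library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

d = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, np.inf]}

def extract_str(val):
    # Half-open check, as in this answer: lower <= val < upper.
    results = [key for key, (lower, upper) in d.items() if lower <= val < upper]
    return results[0] if results else None

# Hypothetical larger frame: 10,000 random integers in [0, 20).
rng = np.random.default_rng(0)
big = pd.DataFrame({"col_1": rng.integers(0, 20, size=10_000)})

def with_apply():
    return big["col_1"].apply(extract_str)

def with_cut():
    # right=False gives bins [0, 4), [4, 10), [10, inf), matching extract_str.
    return pd.cut(big["col_1"], [0, 4, 10, np.inf], labels=list(d), right=False)

t_apply = timeit.timeit(with_apply, number=5)
t_cut = timeit.timeit(with_cut, number=5)
print(f"apply: {t_apply:.4f} s, cut: {t_cut:.4f} s (5 runs each)")
```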

You can also map a function over the column, as in this example. I hope it helps.

import pandas as pd
from numpy import inf

test = pd.DataFrame({'col_1': [3, 15, 8]}, index=['a', 'b', 'c'])
newdict = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, inf]}

def mapDict(num):
    # 0 is the lower edge of the first interval, so handle it explicitly.
    if num == 0:
        return "<=4"
    for key, (lower, upper) in newdict.items():
        # Each interval is open on the left and closed on the right.
        if lower < num <= upper:
            return key
    # Falls through and returns None when no interval matches.

test['col_2'] = test.col_1.map(mapDict)

Then test will become:

   col_1   col_2
a      3     <=4
b     15     >10
c      8  (4,10]

P.S. I would like to know how to write code quickly on Stack Overflow. Can anyone share some tips?
