简体   繁体   中英

Populate column in data frame based on a range found in another dataframe

I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.

df1 looks like:

    a
0   4
1   45
2   7
3   5
4   48
5   44
6   22
7   89
8   45
9   44
10  23

and df2 is:

  START STOP CLASS
0   2   3   1
1   5   7   2
2   8   8   3

what I want would look like:

    a   CLASS
0   4   nan
1   45  nan
2   7   1
3   5   1
4   48  nan
5   44  2
6   22  2
7   89  2
8   45  3
9   44  nan
10  23  nan

The START column in df2 is the minimum value of the range and the STOP column is the max.

You can use IntervalIndex (requires v0.20.0).

First construct the index:

df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')

df2
Out: 
        START  STOP  CLASS
[2, 3]      2     3      1
[5, 7]      5     7      2
[8, 8]      8     8      3

Now if you index into the second DataFrame it will lookup the value in the intervals. For example,

df2.loc[6]
Out: 
START    5
STOP     7
CLASS    2
Name: [5, 7], dtype: int64

returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:

df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])

Note that I first converted the index to a Series to be able to use the Series.map method. This results in

df1
Out: 
     a  CLASS
0    4    NaN
1   45    NaN
2    7    1.0
3    5    1.0
4   48    NaN
5   44    2.0
6   22    2.0
7   89    2.0
8   45    3.0
9   44    NaN
10  23    NaN

Alternative solution:


classdict = df2.set_index("CLASS").to_dict("index")

rangedict = {}

for key,value in classdict.items():

    # get all items in range and assign value (the key)
    for item in list(range(value["START"],value["STOP"]+1)):
        rangedict[item] = key

extract rangedict:

{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}

now map and possibly format(?):

df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)

outputs:

a   CLASS
0   4   nan
1   45  nan
2   7   1
3   5   1
4   48  nan
5   44  2
6   22  2
7   89  2
8   45  3
9   44  nan
10  23  nan
import pandas as pd import numpy as np # Here is your existing dataframe df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # Create a new empty dataframe with specific column names and data types df_new = pd.DataFrame(index=None) columns = ['field01','field02','field03','field04'] dtypes = [str,int,int,int] for c,d in zip(columns, dtypes): df_new[c] = pd.Series(dtype=d) # Set the index on the new dataframe to same as existing df_new['new_index'] = df_existing.index df_new.set_index('new_index', inplace=True) # Fill the new dataframe with specific fields from the existing dataframe df_new[['field02','field03']] = df_existing[['B','C']] print df_new

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM