Populate column in data frame based on a range found in another dataframe
I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.
df1 looks like:
a
0 4
1 45
2 7
3 5
4 48
5 44
6 22
7 89
8 45
9 44
10 23
and df2 is:
START STOP CLASS
0 2 3 1
1 5 7 2
2 8 8 3
What I want would look like:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
The START column in df2 is the minimum value of the range and the STOP column is the maximum; both ends are inclusive.
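For anyone who wants to reproduce this, the two frames can be built roughly as follows. This is only a sketch: the values are copied from the tables above, and the default integer index is assumed.

```python
import pandas as pd

# Values copied from the tables in the question; default RangeIndex 0..10 assumed
df1 = pd.DataFrame({"a": [4, 45, 7, 5, 48, 44, 22, 89, 45, 44, 23]})

# Each row defines an inclusive range [START, STOP] and the class it maps to
df2 = pd.DataFrame({"START": [2, 5, 8], "STOP": [3, 7, 8], "CLASS": [1, 2, 3]})
```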
You can use IntervalIndex (requires pandas v0.20.0 or later).
First construct the index:
df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')
df2
Out:
START STOP CLASS
[2, 3] 2 3 1
[5, 7] 5 7 2
[8, 8] 8 8 3
Now, if you index into the second DataFrame, it will look up the value in the intervals. For example,
df2.loc[6]
Out:
START 5
STOP 7
CLASS 2
Name: [5, 7], dtype: int64
returns the second class. I don't know whether it can be used with merge or merge_asof, but as an alternative you can use map:
df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])
Note that I first converted the index to a Series so that I could use the Series.map method. This results in
df1
Out:
a CLASS
0 4 NaN
1 45 NaN
2 7 1.0
3 5 1.0
4 48 NaN
5 44 2.0
6 22 2.0
7 89 2.0
8 45 3.0
9 44 NaN
10 23 NaN
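On the merge_asof question, one possible route that does not involve IntervalIndex at all is to match each index value to the nearest preceding START and then discard matches that lie beyond STOP. The following is a sketch under the assumption that the ranges do not overlap; the names left, merged and in_range are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4, 45, 7, 5, 48, 44, 22, 89, 45, 44, 23]})
df2 = pd.DataFrame({"START": [2, 5, 8], "STOP": [3, 7, 8], "CLASS": [1, 2, 3]})

# Turn the index into a regular column (named "index") so it can be a merge key
left = df1.reset_index()

# For each index value, pick the last df2 row whose START is <= that value
merged = pd.merge_asof(left, df2.sort_values("START"),
                       left_on="index", right_on="START")

# Keep the class only when the value does not exceed the matched STOP;
# rows with no match have STOP = NaN, so the comparison is False for them
in_range = merged["index"] <= merged["STOP"]
df1["CLASS"] = merged["CLASS"].where(in_range).to_numpy()
```

Both key columns must be sorted for merge_asof, which is why df2 is sorted by START first; df1's index is already in ascending order here.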
Alternative solution:
classdict = df2.set_index("CLASS").to_dict("index")
rangedict = {}
for key, value in classdict.items():
    # get all items in range and assign value (the key)
    for item in range(value["START"], value["STOP"] + 1):
        rangedict[item] = key
The resulting rangedict:
{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}
Now map and, if desired, format the output:
df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)
Output:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
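The dictionary above enumerates every integer in each range, so it suits integer bounds of modest width. If that becomes a limitation, one option is to let an IntervalIndex do the lookup through get_indexer, which returns -1 wherever no interval contains the value. This is a sketch that assumes the ranges do not overlap; intervals, pos and classes are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [4, 45, 7, 5, 48, 44, 22, 89, 45, 44, 23]})
df2 = pd.DataFrame({"START": [2, 5, 8], "STOP": [3, 7, 8], "CLASS": [1, 2, 3]})

# Closed on both sides because START and STOP are inclusive bounds
intervals = pd.IntervalIndex.from_arrays(df2["START"], df2["STOP"], closed="both")

# Position of the containing interval for each index value, or -1 if none
pos = intervals.get_indexer(df1.index)

# Look up the class by position and blank out the unmatched rows
classes = df2["CLASS"].to_numpy()
df1["CLASS"] = np.where(pos >= 0, classes[pos], np.nan)
```

Because unmatched rows are filled with NaN, the CLASS column comes out as float, the same as with the map-based approach.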
import pandas as pd
import numpy as np

# Here is your existing dataframe
df_existing = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# Create a new empty dataframe with specific column names and data types
df_new = pd.DataFrame(index=None)
columns = ['field01', 'field02', 'field03', 'field04']
dtypes = [str, int, int, int]
for c, d in zip(columns, dtypes):
    df_new[c] = pd.Series(dtype=d)

# Set the index on the new dataframe to same as existing
df_new['new_index'] = df_existing.index
df_new.set_index('new_index', inplace=True)

# Fill the new dataframe with specific fields from the existing dataframe
df_new[['field02', 'field03']] = df_existing[['B', 'C']]
print(df_new)