Merge rows where values in df1 fall within values in df2

Question

I have two large datasets similar to the following

DataFrame df1:

P    Y    p_start   p_stop
p1   y1      7         9
p2   y2      6         7
p3   y3      12        14

DataFrame df2:

T    t_start    t_stop 
t1      5          10
t2      11         15

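I want to check whether P lies within region T. If so, I need to append that row of df1 to the corresponding row of df2. If there are multiple matches, I need to add them all to the same row. Ideally, I would like my output to look like this:

Desired output: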

T   t_start  t_stop   P_1   Y_1   p_start_1   p_stop_1  P_2  Y_2  p_start_2  p_stop_2
t1     5       10      p1   y1       7           9       p2   y2      6         7
t2     11      15      p3   y3      12          14

My logic is something like the following, but I'm not sure how to make it actually work

for line in df1:
    if df1['p_start'] >= df2['t_start'] & df1['p_end'] <= df2['t_end']:
        df2 = df1.append(['X', 'Y', 'p_start', 'p_stop'])

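I am using column names because I have more columns that I do not need to append. For simplicity, I left them out of the sample data. I am more concerned with finding the matches and appending to the correct df2 row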

Answer 1

Use:

# STEP 1
df3 = df2.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')

# STEP 2
df3 = df3[df3['t_start'].lt(df3['p_start']) & df3['t_stop'].gt(df3['p_stop'])]

# STEP 3
df3 = df3.melt(['T', 't_start', 't_stop'])

# STEP 4
df3['variable'] += '_' + df3.groupby(['T', 't_start', 't_stop', 'variable']).cumcount().add(1).astype(str)
    
# STEP 5
df3 = (
    df3.set_index(['T', 't_start', 't_stop', 'variable'])
    .unstack().droplevel(0, axis=1).rename_axis(columns=None).reset_index()
)

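Explanation / steps:

Step 1: Use DataFrame.merge to merge the two dataframes on a common temporary column key. Because every row has the same key, the merge produces every possible combination of rows from the two dataframes (a cross join), so that STEP 2 can filter for the rows that satisfy our condition.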

# STEP 1
    T  t_start  t_stop   P   Y  p_start  p_stop
0  t1        5      10  p1  y1        7       9
1  t1        5      10  p2  y2        6       7
2  t1        5      10  p3  y3       12      14
3  t2       11      15  p1  y1        7       9
4  t2       11      15  p2  y2        6       7
5  t2       11      15  p3  y3       12      14
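On pandas 1.2 or later, the same cross join can be written without a temporary key column. The following is a minimal sketch under that version assumption; the sample frames are rebuilt from the question and the name `cross` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    'P': ['p1', 'p2', 'p3'],
    'Y': ['y1', 'y2', 'y3'],
    'p_start': [7, 6, 12],
    'p_stop': [9, 7, 14],
})
df2 = pd.DataFrame({
    'T': ['t1', 't2'],
    't_start': [5, 11],
    't_stop': [10, 15],
})

# Assumes pandas >= 1.2, where merge accepts how='cross'
cross = df2.merge(df1, how='cross')
print(cross.shape)  # 2 rows of df2 x 3 rows of df1
```

Note that a cross join materialises len(df1) * len(df2) rows, so it may need a lot of memory when both frames are large.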

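Step 2: Filter the rows of the merged dataframe df3 so that p_start is greater than t_start and t_stop is greater than p_stop, i.e. p_start and p_stop lie inside the region between t_start and t_stop. The comparisons are strict; use .le and .ge in place of .lt and .gt if ranges that touch the boundaries should also count as inside.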

# STEP 2
    T  t_start  t_stop   P   Y  p_start  p_stop
0  t1        5      10  p1  y1        7       9
1  t1        5      10  p2  y2        6       7
5  t2       11      15  p3  y3       12      14
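The question's own attempt used >= and <=, so whether the boundaries count can matter. The sketch below contrasts the strict and the inclusive filter on a small frame of already paired rows. The row p4 and the widened p3 are made-up values added only to show the difference; they are not part of the original data.

```python
import pandas as pd

# Already cross-joined rows; p4 touches both boundaries of t1 (hypothetical row),
# and p3 here is stretched past the end of t2 (hypothetical value)
pairs = pd.DataFrame({
    'T': ['t1', 't1', 't2'],
    't_start': [5, 5, 11],
    't_stop': [10, 10, 15],
    'P': ['p1', 'p4', 'p3'],
    'p_start': [7, 5, 12],
    'p_stop': [9, 10, 16],
})

# Strict containment, as in STEP 2
strict = pairs[pairs['t_start'].lt(pairs['p_start']) & pairs['t_stop'].gt(pairs['p_stop'])]

# Inclusive containment: a range that touches a boundary still matches
inclusive = pairs[pairs['t_start'].le(pairs['p_start']) & pairs['t_stop'].ge(pairs['p_stop'])]

print(strict['P'].tolist())
print(inclusive['P'].tolist())
```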

Step 3: Use DataFrame.melt to melt the dataframe, i.e. convert the columns P, Y, p_start, p_stop into rows.

# STEP 3
     T  t_start  t_stop variable value
0   t1        5      10        P    p1
1   t1        5      10        P    p2
2   t2       11      15        P    p3
3   t1        5      10        Y    y1
4   t1        5      10        Y    y2
5   t2       11      15        Y    y3
6   t1        5      10  p_start     7
7   t1        5      10  p_start     6
8   t2       11      15  p_start    12
9   t1        5      10   p_stop     9
10  t1        5      10   p_stop     7
11  t2       11      15   p_stop    14

Step 4: Use DataFrame.groupby on the given columns together with cumcount to build a sequential counter within each group, and append that counter to the variable column.

# STEP 4
     T  t_start  t_stop   variable value
0   t1        5      10        P_1    p1
1   t1        5      10        P_2    p2
2   t2       11      15        P_1    p3
3   t1        5      10        Y_1    y1
4   t1        5      10        Y_2    y2
5   t2       11      15        Y_1    y3
6   t1        5      10  p_start_1     7
7   t1        5      10  p_start_2     6
8   t2       11      15  p_start_1    12
9   t1        5      10   p_stop_1     9
10  t1        5      10   p_stop_2     7
11  t2       11      15   p_stop_1    14
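To see the counter in isolation, here is a reduced sketch with only the T, variable and value columns. The frame name `long` and the shortened grouping key are simplifications for illustration, not the answer's exact code.

```python
import pandas as pd

long = pd.DataFrame({
    'T': ['t1', 't1', 't2', 't1', 't1', 't2'],
    'variable': ['P', 'P', 'P', 'Y', 'Y', 'Y'],
    'value': ['p1', 'p2', 'p3', 'y1', 'y2', 'y3'],
})

# cumcount numbers the rows 0, 1, 2, ... inside each (T, variable) group,
# in row order; add(1) shifts the numbering to start at 1
counter = long.groupby(['T', 'variable']).cumcount().add(1)
long['variable'] = long['variable'] + '_' + counter.astype(str)

print(long['variable'].tolist())
```

Each (T, variable) pair now has a unique label, which is what allows the next step to turn the labels into columns without duplicate index entries.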

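Step 5: Use set_index and DataFrame.unstack to pivot the dataframe so that each item in the variable column becomes its own column. Regions with fewer matches get NaN in the unused columns.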

# STEP 5
    T  t_start  t_stop P_1  P_2 Y_1  Y_2 p_start_1 p_start_2 p_stop_1 p_stop_2
0  t1        5      10  p1   p2  y1   y2         7         6        9        7
1  t2       11      15  p3  NaN  y3  NaN        12       NaN       14      NaN
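The reshaping can be tried on a reduced frame. This sketch is a slight variation rather than the answer's exact code: it selects the value column as a Series before unstacking, so there is no extra column level to drop afterwards.

```python
import pandas as pd

long = pd.DataFrame({
    'T': ['t1', 't1', 't2'],
    'variable': ['P_1', 'P_2', 'P_1'],
    'value': ['p1', 'p2', 'p3'],
})

wide = (
    long.set_index(['T', 'variable'])['value']  # Series with a 2-level index
    .unstack()                                  # innermost level 'variable' becomes columns
    .rename_axis(columns=None)                  # drop the leftover 'variable' axis name
    .reset_index()                              # turn T back into an ordinary column
)
print(wide)
```

Here t2 has no second match, so its P_2 cell is filled with NaN.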

Step 6: An optional step, in case you want to reorder the columns in the dataframe.

# OPTIONAL for reordering columns
import re

def sort_fx(col):
    # Sort by the numeric suffix; sorted() is stable, so the existing
    # P, Y, p_start, p_stop order is kept within each suffix group
    return int(re.search(r'_(\d+)$', col).group(1))

df3 = df3.reindex(df3.columns[:3].tolist() + sorted(df3.columns[3:], key=sort_fx), axis=1)

# STEP 6
    T  t_start  t_stop P_1 Y_1 p_start_1 p_stop_1  P_2  Y_2 p_start_2 p_stop_2
0  t1        5      10  p1  y1         7        9   p2   y2         6        7
1  t2       11      15  p3  y3        12       14  NaN  NaN       NaN      NaN
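For convenience, the steps can be collected into one function. This is only a sketch under a few assumptions: the helper name `attach_contained_rows` is hypothetical, it relies on `how='cross'` (pandas 1.2 or later) instead of the temporary key, it keeps the strict comparison from STEP 2, and it hard-codes the column names of the sample data.

```python
import pandas as pd

def attach_contained_rows(df1, df2):
    """Hypothetical helper: widen df2 with every df1 row lying strictly inside its region."""
    ids = ['T', 't_start', 't_stop']

    # Steps 1-2: all row combinations, then keep the contained ones
    pairs = df2.merge(df1, how='cross')  # assumes pandas >= 1.2
    inside = pairs['t_start'].lt(pairs['p_start']) & pairs['t_stop'].gt(pairs['p_stop'])

    # Steps 3-4: long format, then number repeated variables within each region
    long = pairs[inside].melt(ids)
    n = long.groupby(ids + ['variable']).cumcount().add(1)
    long['variable'] = long['variable'] + '_' + n.astype(str)

    # Step 5: one column per numbered variable
    wide = long.set_index(ids + ['variable'])['value'].unstack()

    # Step 6: group columns by numeric suffix (stable sort keeps P, Y, p_start, p_stop order)
    ordered = sorted(wide.columns, key=lambda c: int(c.rsplit('_', 1)[1]))
    return wide[ordered].rename_axis(columns=None).reset_index()

df1 = pd.DataFrame({
    'P': ['p1', 'p2', 'p3'],
    'Y': ['y1', 'y2', 'y3'],
    'p_start': [7, 6, 12],
    'p_stop': [9, 7, 14],
})
df2 = pd.DataFrame({
    'T': ['t1', 't2'],
    't_start': [5, 11],
    't_stop': [10, 15],
})

result = attach_contained_rows(df1, df2)
print(result)
```

Regions in df2 that contain no df1 row are dropped by the filter; if they should be kept, merging the result back onto df2 with how='left' would restore them with NaN in the added columns.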
