Filling NaN values with those contained in a different DataFrame

The problem looks like this:

Problem

I have a dataframe left with a 2-level MultiIndex, representing events tpc occurring at point onset in the time region mc. Every event occurs in a layer defined by (staff, voice):

            mc onset  staff  voice  tpc  dynamics  chords
section ix                                               
0       0    0     0      2      1    0       NaN     NaN
        1    0     0      2      1    0       NaN     NaN
        2    0     0      1      1    0       NaN     NaN
        3    0     0      1      1    4       NaN     NaN
        4    0     0      1      1    1       NaN     NaN
        5    0     0      1      1    0       NaN     NaN
        6    0   3/4      2      2    1       NaN     NaN
        7    0   3/4      2      1    1       NaN     NaN

Then there is the dataframe right with other events ('dynamics', 'chords'), which need to be filled into left:

   mc onset  staff  voice dynamics chords
0   0     0      1      1        f    NaN
1   0     0      1      1      NaN      I
2   0   1/2      2      1        p    NaN
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64

The rules for filling are as follows:

  1. All events from right need to appear in left.
  2. If they co-occur with left events in the same layer, fill in the respective column of left for those events (i.e., join on ['mc', 'onset', 'staff', 'voice']; e.g. rows 0, 1, 4).
  3. Else, if they co-occur with left events in the same staff, fill in the respective column of left for those events (i.e., join on ['mc', 'onset', 'staff']; e.g. row 4).
  4. Else, if they co-occur with left events in some other layer, fill in the respective column of left for those events (i.e., join on ['mc', 'onset']; e.g. row 3).
  5. Else, if they don't co-occur with any left events, throw a warning and keep them for further treatment (e.g. row 2).
  6. If two events of the same type within right occur simultaneously, throw a warning and concatenate the values (e.g. rows 3 & 4).
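To make rules 2 to 5 concrete, here is a minimal sketch that only labels each row of right with the most specific key set that finds a partner in left. The toy frames, the float onsets (instead of Fraction) and the helper name match_level are assumptions for illustration, not part of the original data or of pandas.

```python
import pandas as pd

# Toy data (hypothetical; onsets written as floats instead of Fractions).
left = pd.DataFrame({
    "mc":    [0, 0, 0, 0],
    "onset": [0.0, 0.0, 0.75, 0.75],
    "staff": [2, 1, 2, 2],
    "voice": [1, 1, 2, 1],
})
right = pd.DataFrame({
    "mc":    [0, 0, 0, 0],
    "onset": [0.0, 0.5, 0.75, 0.75],
    "staff": [1, 2, 1, 2],
    "voice": [1, 1, 1, 1],
})

# Key sets from most specific (same layer) to least specific (same onset).
JOIN_ON = [["mc", "onset", "staff", "voice"],
           ["mc", "onset", "staff"],
           ["mc", "onset"]]

def match_level(left, right, join_on=JOIN_ON):
    """Label every row of right with the most specific matching key set."""
    level = pd.Series("unmatched", index=right.index)
    # Go from the loosest to the strictest keys so that stricter hits overwrite.
    for keys in reversed(join_on):
        flagged = right[keys].merge(left[keys].drop_duplicates(),
                                    on=keys, how="left", indicator=True)
        hit = (flagged["_merge"] == "both").to_numpy()
        level[hit] = "+".join(keys)
    return level

levels = match_level(left, right)
print(levels)
```

Rows labelled "unmatched" are the ones rule 5 asks to warn about and keep for later treatment.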

Expected result

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0      NaN    NaN
  1   0     0      2      1    0      NaN    NaN
  2   0     0      1      1    0      f        I
  3   0     0      1      1    4      f        I
  4   0     0      1      1    1      f        I
  5   0     0      1      1    0      f        I
  6   0   3/4      2      2    1      NaN     I6
  7   0   3/4      2      1    1      NaN  I6I64
WARNING: These events could not be attached:
   mc onset  staff  voice dynamics chords
2   0   1/2      2      1        p    NaN
WARNING: These events are simultaneous:
   mc onset  staff  voice dynamics chords
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64

Attempt 1

Since I would like to avoid an approach where I iterate through right, I tried the following:

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
for on in join_on:
    match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) == 0:
        break
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)

This approach does not work because after the first merge, match looks like this:

     mc onset  staff  voice dynamics chords  tpc
0 2   0     0      1      1        f    NaN    0
  3   0     0      1      1        f    NaN    4
  4   0     0      1      1        f    NaN    1
  5   0     0      1      1        f    NaN    0
  2   0     0      1      1      NaN      I    0
  3   0     0      1      1      NaN      I    4
  4   0     0      1      1      NaN      I    1
  5   0     0      1      1      NaN      I    0
  7   0   3/4      2      1      NaN    I64    1

Since the index of match is not unique, assigning match to left does not fully work (dynamics are missing in the result), and the commented-out approach with fillna silently does nothing. It also bothers me that I have to do the same merge twice: once to get the left index for the assignment, and once to get the right index for dropping the matched rows.
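One possible way around the double merge is to carry both indexes through a single merge as ordinary columns. The following is only a sketch on toy data; the column names left_ix and right_ix are made up for illustration, and a MultiIndex on left would need its level names handled accordingly.

```python
import pandas as pd

# Toy data (hypothetical); left deliberately has a non-default index.
left = pd.DataFrame({"mc": [0, 0, 0], "onset": [0.0, 0.0, 0.75], "tpc": [0, 4, 1]},
                    index=[10, 11, 12])
right = pd.DataFrame({"mc": [0, 0], "onset": [0.0, 0.5], "dynamics": ["f", "p"]})

on = ["mc", "onset"]

# Turn each index into a named column so that it survives the merge.
left_keys = left[on].rename_axis("left_ix").reset_index()
right_keys = right[on].rename_axis("right_ix").reset_index()

# One inner merge yields every (right row, left row) pair that shares the keys.
pairs = right_keys.merge(left_keys, on=on, how="inner")

matched_left = pairs["left_ix"].unique()    # rows of left to fill
matched_right = pairs["right_ix"].unique()  # rows of right that found a partner
unmatched = right.drop(matched_right)       # rows of right still to be treated
print(pairs)
```

Because pairs has one row per combination, a right event that matches several left rows shows up several times, which is the same duplication seen in match above, only with explicit labels for both sides.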

Attempt 2

Facing these problems, I preprocess the right dataframe before the join so that simultaneous events are united in one row:

def unite_vals(df):
    r = pd.Series(index=right_features, dtype=object)
    for col in right_features:
        u = df[col][df[col].notna()].unique()
        if len(u) > 1:
            r[col] = ''.join(str(val) for val in u)
            print(f"WARNING:Two simultaneous events in row {df.iloc[0].name}")
        elif len(u) == 1:
            r[col] = u[0]
    return r

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
on = ['mc', 'onset']
right = right.groupby(on).apply(unite_vals).reset_index()
match = right.merge(left[left_features], on=on, left_index=True)
left_ix = match.index
left.loc[left_ix, match.columns] = match
# left.loc[left_ix].fillna(match, inplace=True)
right_ix = right.merge(left[left_features], on=on, right_index=True).index
right.drop(right_ix, inplace=True)
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)
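The uniting step performed by unite_vals can also be sketched with groupby(...).agg(...), plus a count to find the groups that rule 6 wants a warning for. This is an illustrative variant on toy data with float onsets, not a drop-in replacement; join_unique is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy copy of right (hypothetical; onsets as floats instead of Fractions).
right = pd.DataFrame({
    "mc":       [0, 0, 0, 0, 0],
    "onset":    [0.0, 0.0, 0.5, 0.75, 0.75],
    "staff":    [1, 1, 2, 1, 2],
    "voice":    [1, 1, 1, 1, 1],
    "dynamics": ["f", np.nan, "p", np.nan, np.nan],
    "chords":   [np.nan, "I", np.nan, "I6", "I64"],
})
on = ["mc", "onset"]
features = ["dynamics", "chords"]

def join_unique(values):
    """Concatenate the distinct non-missing values of one group, or return NaN."""
    present = values.dropna().unique()
    return "".join(str(v) for v in present) if len(present) else np.nan

# One row per (mc, onset), with simultaneous values concatenated.
united = right.groupby(on, as_index=False)[features].agg(join_unique)

# Groups holding more than one non-missing value in some feature column.
counts = right.groupby(on)[features].count()
clashes = counts[(counts > 1).any(axis=1)]
print(united)
print(clashes)
```

Like Attempt 2, this groups on ['mc', 'onset'] only, so it also discards the layer information.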

(For some unknown reason, the commented-out approach with fillna again doesn't do anything, and the issue of doing the same merge twice remains.) I could live with the result; however, it does not differentiate between the layers of right and therefore looks like this:

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0        f      I
  1   0     0      2      1    0        f      I
  2   0     0      1      1    0        f      I
  3   0     0      1      1    4        f      I
  4   0     0      1      1    1        f      I
  5   0     0      1      1    0        f      I
  6   0   3/4      2      2    1      NaN  I6I64
  7   0   3/4      2      1    1      NaN  I6I64
WARNING:Two simultaneous events at:
   mc onset
3   0   3/4
WARNING: These events could not be attached:
   mc onset dynamics chords
1   0   1/2        p    NaN
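Regarding the commented-out fillna line in both attempts: a likely explanation (not confirmed in the post) is that left.loc[left_ix] with a list-like of labels returns a new object, so an in-place fill changes only that temporary object and never reaches left. A minimal sketch of the effect and of one way to write the result back, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical).
left = pd.DataFrame({"tpc": [0, 4, 1], "dynamics": [np.nan, np.nan, "p"]})
rows = [0, 1]

# Selecting rows by a list of labels creates a new DataFrame;
# filling it leaves the original untouched.
filled_copy = left.loc[rows].fillna({"dynamics": "f"})
still_missing = int(left["dynamics"].isna().sum())

# Assigning the filled values back through .loc does modify left.
left.loc[rows, "dynamics"] = left.loc[rows, "dynamics"].fillna("f")
print(left)
```

This write-back relies on the selected labels being unique, which is not the case for the match frame in Attempt 1.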

How would this typically be solved?

Here is the source code for reproduction:

import pandas as pd
import numpy as np
from fractions import Fraction
left_dict = {'mc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 0,
  (0, 4): 0,
  (0, 5): 0,
  (0, 6): 0,
  (0, 7): 0},
 'onset': {(0, 0): Fraction(0, 1),
  (0, 1): Fraction(0, 1),
  (0, 2): Fraction(0, 1),
  (0, 3): Fraction(0, 1),
  (0, 4): Fraction(0, 1),
  (0, 5): Fraction(0, 1),
  (0, 6): Fraction(3, 4),
  (0, 7): Fraction(3, 4)},
 'staff': {(0, 0): 2,
  (0, 1): 2,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 2},
 'voice': {(0, 0): 1,
  (0, 1): 1,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 1},
 'tpc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 4,
  (0, 4): 1,
  (0, 5): 0,
  (0, 6): 1,
  (0, 7): 1},
 'dynamics': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan},
 'chords': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan}}
left = pd.DataFrame.from_dict(left_dict)

right_dict = {'mc': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
 'onset': {0: Fraction(0, 1),
  1: Fraction(0, 1),
  2: Fraction(1, 2),
  3: Fraction(3, 4),
  4: Fraction(3, 4)},
 'staff': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2},
 'voice': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 'dynamics': {0: 'f', 1: np.nan, 2: 'p', 3: np.nan, 4: np.nan},
 'chords': {0: np.nan, 1: 'I', 2: np.nan, 3: 'I6', 4: 'I64'}}
right = pd.DataFrame.from_dict(right_dict)

attempt1 = True
if attempt1:
    left_features = ['mc', 'onset', 'staff', 'voice', 'tpc']
    right_features = ['dynamics', 'chords']
    join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
    for on in join_on:
        match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
        left_ix = match.index
        left.loc[left_ix, match.columns] = match
        #left.loc[left_ix].fillna(match, inplace=True)
        right_ix = right.merge(left[left_features], on=on, right_index=True).index
        right.drop(right_ix, inplace=True)
        if len(right) == 0:
            break
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)
else:
    def unite_vals(df):
        r = pd.Series(index=right_features, dtype=object)
        for col in right_features:
            u = df[col][df[col].notna()].unique()
            if len(u) > 1:
                r[col] = ''.join(str(val) for val in u)
                print("WARNING:Two simultaneous events at:")
                print(df.iloc[:1][['mc', 'onset']])
            elif len(u) == 1:
                r[col] = u[0]
        return r

    left_features = ['mc', 'onset', 'staff', 'voice']
    right_features = ['dynamics', 'chords']
    on = ['mc', 'onset']
    right = right.groupby(on).apply(unite_vals).reset_index()
    match = right.merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)

In the end, it turned out that the easiest way to solve my problem is to use loops:

isnan = lambda num:  num != num
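right_features = ['dynamics', 'chords']
# Cast the all-NaN (float) feature columns to object so they can hold strings;
# recent pandas versions warn about or reject the implicit upcast on assignment.
left = left.astype({f: object for f in right_features})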
for i, r in right.iterrows():
    same_os = left.loc[(left.mc == r.mc) & (left.onset == r.onset)]
    if len(same_os) > 0:
        same_staff = same_os.loc[same_os.staff == r.staff]
        same_voice = same_staff.loc[same_staff.voice == r.voice]
        if len(same_voice) > 0:
            fill = same_voice
        elif len(same_staff) > 0:
            fill = same_staff
        else:
            fill = same_os

        for f in right_features:
            if not isnan(r[f]):
                F = left.loc[fill.index, f]
                notna = F.notna()
                if notna.any():
                    print(f"WARNING:Feature existed and was concatenated: {F[notna]}")
                    left.loc[F[notna].index, f] += r[f]
                    left.loc[F[~notna].index, f] = r[f]
                else:
                    left.loc[fill.index, f] = r[f]
    else:
        print(f"WARNING:Event could not be attached: {r}")
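For comparison, the same cascade can be sketched without iterating over rows. This is an illustration under stated assumptions, not the approach the post settled on: attach is a hypothetical helper, onsets are floats instead of Fractions, the feature values are strings, the feature columns of left start out empty (existing values would be overwritten rather than concatenated), and the warnings are left out.

```python
import numpy as np
import pandas as pd

JOIN_ON = [["mc", "onset", "staff", "voice"], ["mc", "onset", "staff"], ["mc", "onset"]]

def attach(left, right, features, join_on=JOIN_ON):
    """Return (filled copy of left, rows of right that matched nothing)."""
    left = left.copy()
    features = list(features)
    # Row positions stand in for the indexes, so any index type works.
    lkeys = left[join_on[0]].assign(left_pos=np.arange(len(left)))
    todo = right.assign(right_pos=np.arange(len(right)))
    pairs = []
    for keys in join_on:  # most specific key set first
        hit = todo.merge(lkeys[keys + ["left_pos"]], on=keys, how="inner")
        pairs.append(hit[["left_pos", "right_pos"] + features])
        # Rows matched at this level are not offered to the looser levels.
        todo = todo[~todo["right_pos"].isin(hit["right_pos"])]
    # Sort so that values are concatenated in the row order of right.
    pairs = pd.concat(pairs).sort_values(["left_pos", "right_pos"])
    for f in features:
        vals = pairs.dropna(subset=[f]).groupby("left_pos")[f].agg("".join)
        col = left[f].to_numpy(dtype=object, copy=True)
        col[vals.index.to_numpy()] = vals.to_numpy()
        left[f] = col
    return left, todo.drop(columns="right_pos")

# Toy version of the data from the question.
idx = pd.MultiIndex.from_product([[0], range(8)], names=["section", "ix"])
left = pd.DataFrame({
    "mc": 0,
    "onset": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.75, 0.75],
    "staff": [2, 2, 1, 1, 1, 1, 2, 2],
    "voice": [1, 1, 1, 1, 1, 1, 2, 1],
    "tpc":   [0, 0, 0, 4, 1, 0, 1, 1],
    "dynamics": np.nan,
    "chords": np.nan,
}, index=idx)
right = pd.DataFrame({
    "mc": 0,
    "onset": [0.0, 0.0, 0.5, 0.75, 0.75],
    "staff": [1, 1, 2, 1, 2],
    "voice": [1, 1, 1, 1, 1],
    "dynamics": ["f", np.nan, "p", np.nan, np.nan],
    "chords": [np.nan, "I", np.nan, "I6", "I64"],
})

filled, unattached = attach(left, right, ["dynamics", "chords"])
print(filled)
print(unattached)
```

Whether this is preferable to the loop depends on the size of right; for a handful of events the loop is arguably easier to read and to extend with warnings.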
