
Filling NaN values with those contained in a different DataFrame

The problem looks like this:

Problem

I have a DataFrame left with a 2-level MultiIndex, representing events tpc occurring at point onset in the time region mc. Every event occurs in a layer defined by (staff, voice):

            mc onset  staff  voice  tpc  dynamics  chords
section ix                                               
0       0    0     0      2      1    0       NaN     NaN
        1    0     0      2      1    0       NaN     NaN
        2    0     0      1      1    0       NaN     NaN
        3    0     0      1      1    4       NaN     NaN
        4    0     0      1      1    1       NaN     NaN
        5    0     0      1      1    0       NaN     NaN
        6    0   3/4      2      2    1       NaN     NaN
        7    0   3/4      2      1    1       NaN     NaN

Then there is the DataFrame right with other events ('dynamics', 'chords'), which need to be filled into left:

   mc onset  staff  voice dynamics chords
0   0     0      1      1        f    NaN
1   0     0      1      1      NaN      I
2   0   1/2      2      1        p    NaN
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64

The rules for filling are as follows:

  1. All events from right need to appear in left
  2. If they co-occur with left events in the same layer, fill in the respective column of left for those events (i.e., join on ['mc', 'onset', 'staff', 'voice']; e.g., rows 0, 1, 4)
  3. Else if they co-occur with left events in the same staff, fill in the respective column of left for those events (i.e., join on ['mc', 'onset', 'staff']; e.g., row 4)
  4. Else if they co-occur with left events in some other layer, fill in the respective column of left for those events (i.e., join on ['mc', 'onset']; e.g., row 3)
  5. Else if they don't co-occur with any left events, throw a warning and keep them for further treatment (e.g., row 2)
  6. If two events of the same type within right occur simultaneously, throw a warning and concatenate their values (e.g., rows 3 & 4)
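Rules 2 to 5 amount to a cascade of joins from the most to the least specific key set. The following is only a sketch of that idea, not a tested solution for the full data: the helper name cascade_match and the columns left_pos and right_pos are made up for illustration, the frames are shortened, and onsets are plain floats instead of Fractions.

```python
import pandas as pd

# Toy data; onsets are plain floats here instead of Fractions (an assumption).
left = pd.DataFrame({
    "mc":    [0, 0, 0, 0],
    "onset": [0.0, 0.0, 0.75, 0.75],
    "staff": [2, 1, 2, 2],
    "voice": [1, 1, 2, 1],
})
right = pd.DataFrame({
    "mc":     [0, 0, 0, 0],
    "onset":  [0.0, 0.5, 0.75, 0.75],
    "staff":  [1, 2, 1, 2],
    "voice":  [1, 1, 1, 1],
    "chords": ["I", "V", "I6", "I64"],
})

# Join keys from most to least specific (rules 2, 3, 4).
LEVELS = [["mc", "onset", "staff", "voice"], ["mc", "onset", "staff"], ["mc", "onset"]]

def cascade_match(left, right, levels=LEVELS):
    """Pair each right row with left rows at the most specific level that matches.

    Returns (pairs, unmatched): `pairs` has the positional columns
    left_pos/right_pos, `unmatched` holds the right rows without any match (rule 5).
    """
    lpos = left.assign(left_pos=range(len(left)))
    remaining = right.assign(right_pos=range(len(right)))
    found = []
    for on in levels:
        # One inner merge yields both the left and the right positions.
        m = remaining[on + ["right_pos"]].merge(lpos[on + ["left_pos"]], on=on)
        found.append(m[["left_pos", "right_pos"]])
        # Rows matched at this level are excluded from the less specific levels.
        remaining = remaining[~remaining["right_pos"].isin(m["right_pos"])]
    pairs = pd.concat(found, ignore_index=True)
    return pairs, remaining.drop(columns="right_pos")

pairs, unmatched = cascade_match(left, right)
if len(unmatched) > 0:
    print("WARNING: These events could not be attached:")
    print(unmatched)
```

Because positions rather than index labels are carried through the merge, a MultiIndex on left is not lost: left.index[pairs["left_pos"]] recovers the labels afterwards.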

Expected result

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0      NaN    NaN
  1   0     0      2      1    0      NaN    NaN
  2   0     0      1      1    0      f        I
  3   0     0      1      1    4      f        I
  4   0     0      1      1    1      f        I
  5   0     0      1      1    0      f        I
  6   0   3/4      2      2    1      NaN     I6
  7   0   3/4      2      1    1      NaN  I6I64
WARNING: These events could not be attached:
   mc onset  staff  voice dynamics chords
2   0   1/2      2      1        p    NaN
WARNING: These events are simultaneous:
   mc onset  staff  voice dynamics chords
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64
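The second warning (rule 6) can be produced without touching left at all, by looking for rows of right that carry the same feature at the same position. A minimal sketch, assuming that "simultaneous" means equal mc and onset; the function name simultaneous_events is made up, and floats stand in for the Fraction onsets:

```python
import numpy as np
import pandas as pd

right = pd.DataFrame({
    "mc":       [0, 0, 0, 0, 0],
    "onset":    [0.0, 0.0, 0.5, 0.75, 0.75],   # floats stand in for Fractions
    "staff":    [1, 1, 2, 1, 2],
    "voice":    [1, 1, 1, 1, 1],
    "dynamics": ["f", np.nan, "p", np.nan, np.nan],
    "chords":   [np.nan, "I", np.nan, "I6", "I64"],
})

def simultaneous_events(right, feature, on=("mc", "onset")):
    """Rows of `right` that carry `feature` and share their position with another such row."""
    has_value = right[right[feature].notna()]
    # keep=False marks every member of a duplicated group, not just the repeats.
    return has_value[has_value.duplicated(subset=list(on), keep=False)]

for feature in ["dynamics", "chords"]:
    clash = simultaneous_events(right, feature)
    if len(clash) > 0:
        print("WARNING: These events are simultaneous:")
        print(clash)
```

Rows 0 and 1 share a position but carry different features, so only rows 3 and 4 are reported.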

Attempt 1

Since I would like to avoid an approach where I iterate through right, I tried the following:

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
for on in join_on:
    match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) == 0:
        break
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)

This approach does not work because after the first merge, match looks like this:

     mc onset  staff  voice dynamics chords  tpc
0 2   0     0      1      1        f    NaN    0
  3   0     0      1      1        f    NaN    4
  4   0     0      1      1        f    NaN    1
  5   0     0      1      1        f    NaN    0
  2   0     0      1      1      NaN      I    0
  3   0     0      1      1      NaN      I    4
  4   0     0      1      1      NaN      I    1
  5   0     0      1      1      NaN      I    0
  7   0   3/4      2      1      NaN    I64    1

Since the index of match is not unique, the assignment left.loc[left_ix, match.columns] = match does not fully work (dynamics are missing in the result), and the commented-out approach with fillna silently does nothing. It also bothers me to run the same merge twice: once to get the left index for correct assignment, and once to get the right index for dropping the matched rows.
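One way around the non-unique index is to collapse match to one row per left label before assigning. The sketch below uses a shortened stand-in for match with a flat index; for the 2-level index shown above, groupby(level=[0, 1]) should play the same role. Note that first() keeps only the first non-null value per column, so it suits complementary rows like these but would silently drop a second, different value.

```python
import numpy as np
import pandas as pd

# A shortened stand-in for `match`: each left label appears once per right event.
match = pd.DataFrame(
    {"dynamics": ["f", np.nan, "f", np.nan], "chords": [np.nan, "I", np.nan, "I"]},
    index=[2, 2, 3, 3],
)
left = pd.DataFrame({"tpc": [0, 0, 4, 1]}, index=[0, 1, 2, 3])
left["dynamics"] = pd.Series(np.nan, index=left.index, dtype=object)
left["chords"] = pd.Series(np.nan, index=left.index, dtype=object)

# first() keeps the first non-null value per column within each index label,
# so the two partial rows per label become one complete row.
collapsed = match.groupby(level=0).first()
assert collapsed.index.is_unique

# With a unique index, label-based assignment is unambiguous.
left.loc[collapsed.index, collapsed.columns] = collapsed
print(left)
```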

Attempt 2

Facing these problems, I preprocess right before the join to unite simultaneous events in one row:

def unite_vals(df):
    r = pd.Series(index=right_features, dtype=object)  # explicit dtype: values are strings or NaN
    for col in right_features:
        u = df[col][df[col].notna()].unique()
        if len(u) > 1:
            r[col] = ''.join(str(val) for val in u)
            print(f"WARNING:Two simultaneous events in row {df.iloc[0].name}")
        elif len(u) == 1:
            r[col] = u[0]
    return r

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
on = ['mc', 'onset']
right = right.groupby(on).apply(unite_vals).reset_index()
match = right.merge(left[left_features], on=on, left_index=True)
left_ix = match.index
left.loc[left_ix, match.columns] = match
# left.loc[left_ix].fillna(match, inplace=True)
right_ix = right.merge(left[left_features], on=on, right_index=True).index
right.drop(right_ix, inplace=True)
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)

(For some unknown reason, the commented-out approach with fillna again does nothing, and the issue of running the same merge twice remains.) The result is one I could live with; however, it does not differentiate between the layers of right and therefore looks like this:

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0        f      I
  1   0     0      2      1    0        f      I
  2   0     0      1      1    0        f      I
  3   0     0      1      1    4        f      I
  4   0     0      1      1    1        f      I
  5   0     0      1      1    0        f      I
  6   0   3/4      2      2    1      NaN  I6I64
  7   0   3/4      2      1    1      NaN  I6I64
WARNING:Two simultaneous events at:
   mc onset
3   0   3/4
WARNING: These events could not be attached:
   mc onset dynamics chords
1   0   1/2        p    NaN
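A likely explanation for the fillna line having no effect in both attempts: left.loc[left_ix] with a list-like of labels returns a new DataFrame, so inplace=True fills a temporary object that is discarded right away. A minimal sketch with made-up data (the names fill and cols are illustrative, not from the code above):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"tpc": [0, 0, 4, 1]})
left["dynamics"] = pd.Series(np.nan, index=left.index, dtype=object)
fill = pd.DataFrame({"dynamics": ["f", "f"]}, index=[2, 3])
cols = ["dynamics"]

# Selecting rows with a list of labels builds a new DataFrame, so filling
# that selection never reaches `left`.
subset = left.loc[[2, 3]]
subset = subset.fillna(fill)
assert left["dynamics"].isna().all()      # `left` is untouched

# Filling the columns of `left` itself and assigning the result back persists.
# fillna aligns `fill` on index and columns, so rows 0 and 1 stay NaN.
left[cols] = left[cols].fillna(fill)
print(left)
```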

How would this typically be solved?

Here is the source code for reproduction:

import pandas as pd
import numpy as np
from fractions import Fraction
left_dict = {'mc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 0,
  (0, 4): 0,
  (0, 5): 0,
  (0, 6): 0,
  (0, 7): 0},
 'onset': {(0, 0): Fraction(0, 1),
  (0, 1): Fraction(0, 1),
  (0, 2): Fraction(0, 1),
  (0, 3): Fraction(0, 1),
  (0, 4): Fraction(0, 1),
  (0, 5): Fraction(0, 1),
  (0, 6): Fraction(3, 4),
  (0, 7): Fraction(3, 4)},
 'staff': {(0, 0): 2,
  (0, 1): 2,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 2},
 'voice': {(0, 0): 1,
  (0, 1): 1,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 1},
 'tpc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 4,
  (0, 4): 1,
  (0, 5): 0,
  (0, 6): 1,
  (0, 7): 1},
 'dynamics': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan},
 'chords': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan}}
left = pd.DataFrame.from_dict(left_dict)

right_dict = {'mc': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
 'onset': {0: Fraction(0, 1),
  1: Fraction(0, 1),
  2: Fraction(1, 2),
  3: Fraction(3, 4),
  4: Fraction(3, 4)},
 'staff': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2},
 'voice': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 'dynamics': {0: 'f', 1: np.nan, 2: 'p', 3: np.nan, 4: np.nan},
 'chords': {0: np.nan, 1: 'I', 2: np.nan, 3: 'I6', 4: 'I64'}}
right = pd.DataFrame.from_dict(right_dict)

attempt1 = True
if attempt1:
    left_features = ['mc', 'onset', 'staff', 'voice', 'tpc']
    right_features = ['dynamics', 'chords']
    join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
    for on in join_on:
        match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
        left_ix = match.index
        left.loc[left_ix, match.columns] = match
        #left.loc[left_ix].fillna(match, inplace=True)
        right_ix = right.merge(left[left_features], on=on, right_index=True).index
        right.drop(right_ix, inplace=True)
        if len(right) == 0:
            break
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)
else:
    def unite_vals(df):
        r = pd.Series(index=right_features, dtype=object)  # explicit dtype: values are strings or NaN
        for col in right_features:
            u = df[col][df[col].notna()].unique()
            if len(u) > 1:
                r[col] = ''.join(str(val) for val in u)
                print("WARNING:Two simultaneous events at:")
                print(df.iloc[:1][['mc', 'onset']])
            elif len(u) == 1:
                r[col] = u[0]
        return r

    left_features = ['mc', 'onset', 'staff', 'voice']
    right_features = ['dynamics', 'chords']
    on = ['mc', 'onset']
    right = right.groupby(on).apply(unite_vals).reset_index()
    match = right.merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)

In the end, it turned out that the easiest way to solve my problem is to use loops:

isnan = lambda num: num != num  # NaN is the only value that is not equal to itself
right_features = ['dynamics', 'chords']
for i, r in right.iterrows():
    same_os = left.loc[(left.mc == r.mc) & (left.onset == r.onset)]
    if len(same_os) > 0:
        same_staff = same_os.loc[same_os.staff == r.staff]
        same_voice = same_staff.loc[same_staff.voice == r.voice]
        if len(same_voice) > 0:
            fill = same_voice
        elif len(same_staff) > 0:
            fill = same_staff
        else:
            fill = same_os

        for f in right_features:
            if not isnan(r[f]):
                F = left.loc[fill.index, f]
                notna = F.notna()
                if notna.any():
                    print(f"WARNING:Feature existed and was concatenated: {F[notna]}")
                    left.loc[F[notna].index, f] += r[f]
                    left.loc[F[~notna].index, f] = r[f]
                else:
                    left.loc[fill.index, f] = r[f]
    else:
        print(f"WARNING:Event could not be attached: {r}")
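The concatenation branch of the loop can also be written as a grouped string join, once it is known which right rows attach to which left rows. This is a sketch under assumptions: the pairs frame below is a hypothetical, shortened matching result written out by hand, and join_values is a made-up helper.

```python
import numpy as np
import pandas as pd

right = pd.DataFrame({
    "dynamics": ["f", np.nan, np.nan, np.nan],
    "chords":   [np.nan, "I", "I6", "I64"],
})
# Hypothetical matching result: which right row attaches to which left row.
pairs = pd.DataFrame({"left_pos": [2, 2, 6, 7, 7], "right_pos": [0, 1, 2, 2, 3]})

def join_values(values):
    """Concatenate the non-null values of one feature; NaN if there are none."""
    present = [str(v) for v in values.dropna()]
    if len(present) > 1:
        print(f"WARNING:Simultaneous events were concatenated: {present}")
    return "".join(present) if present else np.nan

features = ["dynamics", "chords"]
# Look up the feature values of each attached right row ...
attached = pairs.join(right[features], on="right_pos")
# ... and combine them per left row, one column at a time.
fill = attached.groupby("left_pos")[features].agg(join_values)
print(fill)
```

The resulting fill frame has one row per affected left position, so it can be written into left without running into duplicate labels.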
