
Append Pandas disjunction of 2 dataframes to first dataframe

Given two pandas tables, each with the three columns id, x and y. All rows sharing the same id together form one graph (path), described by its x-y values. How can I find the paths that exist in the second table but not in the first, and append them to the first table? The key problem is that the graphs can appear in a different order (under different ids) in the two tables.

Example:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

df1             df2             df1 (expected result)
id   x  y       id   x  y       id   x  y
1    1  1       1    1  4       1    1  1
1    1  2       1    1  5       1    1  2
2    5  4       1    1  6       2    5  4
2    4  4       2    1  1       2    4  4
2    4  3       2    1  2       2    4  3
3    1  4       3    5  4       3    1  4
3    1  5       3    4  4       3    1  5
3    1  6       3    4  3       3    1  6
                4   10  1       4   10  1
                4   10  2       4   10  2
                4    9  2       4    9  2
Should become:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})

As you can see, df1 and df2 contain the same graphs up to id = 3, but in a different order. For example, the first graph in df1 is the second graph in df2. In addition, df2 has a fourth path that is not in df1. That fourth path should be detected and appended to df1. In other words, I want to match the graphs the two tables have in common and append the ones found only in df2 to the first table, given that the ids (that is, the order of the paths) can differ from one table to the other.

Imports:

import pandas as pd

Set starting DataFrames:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 
                    'x':[1,1,5,4,4,1,1,1], 
                    'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 
                    'x':[1,1,1,1,1,5,4,4,10,10,9], 
                    'y':[4,5,6,1,2,4,4,3,1,2,2]})

Outer Merge:

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')

produces:

df_merged =

   id_x  x  y   id_y
0   1.0  1  1   2
1   1.0  1  2   2
2   2.0  5  4   3
3   2.0  4  4   3
4   2.0  4  3   3
5   3.0  1  4   1
6   3.0  1  5   1
7   3.0  1  6   1
8   NaN  10 1   4
9   NaN  10 2   4
10  NaN  9  2   4
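To see directly which points come only from df2, the merge can also be run with `indicator=True`. The following is a minimal sketch of that variant, not part of the original answer; the names `merged`, `only_in_df2` and `new_points` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 3, 3],
                    'x': [1, 1, 5, 4, 4, 1, 1, 1],
                    'y': [1, 2, 4, 4, 3, 4, 5, 6]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
                    'x': [1, 1, 1, 1, 1, 5, 4, 4, 10, 10, 9],
                    'y': [4, 5, 6, 1, 2, 4, 4, 3, 1, 2, 2]})

# indicator=True adds a "_merge" column holding "both", "left_only" or "right_only"
merged = df1.merge(df2, on=['x', 'y'], how='outer', indicator=True)

# Keep the points that exist in df2 but not in df1
only_in_df2 = merged[merged['_merge'] == 'right_only']
new_points = sorted(only_in_df2[['x', 'y']].values.tolist())
print(new_points)  # [[9, 2], [10, 1], [10, 2]]
```

The rows flagged `right_only` are exactly the ones whose id_x is NaN in the table above.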

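Note: id_x becomes float because the outer merge inserts NaN for the rows that have no match in df1, and a regular integer column cannot hold NaN.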
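To keep integer ids despite the missing values, the nullable `Int64` dtype is one option. This is a small sketch and an alternative to the fillna step that follows, not a replacement for it; `id_nullable` is a name introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 3, 3],
                    'x': [1, 1, 5, 4, 4, 1, 1, 1],
                    'y': [1, 2, 4, 4, 3, 4, 5, 6]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
                    'x': [1, 1, 1, 1, 1, 5, 4, 4, 10, 10, 9],
                    'y': [4, 5, 6, 1, 2, 4, 4, 3, 1, 2, 2]})

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
print(df_merged['id_x'].dtype)        # float64

# "Int64" (capital I) is pandas' nullable integer dtype; NaN becomes <NA>
id_nullable = df_merged['id_x'].astype('Int64')
print(id_nullable.dtype)              # Int64
print(int(id_nullable.isna().sum()))  # 3
```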

Fill NaN:

df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')

produces:

df_merged = 

 id_x   x   y   id_y
0   1   1   1   2
1   1   1   2   2
2   2   5   4   3
3   2   4   4   3
4   2   4   3   3
5   3   1   4   1
6   3   1   5   1
7   3   1   6   1
8   4   10  1   4
9   4   10  2   4
10  4   9   2   4

Drop id_y :

df_merged = df_merged.drop(['id_y'], axis=1)

produces:

df_merged = 

    id_x    x   y
0      1    1   1
1      1    1   2
2      2    5   4
3      2    4   4
4      2    4   3
5      3    1   4
6      3    1   5
7      3    1   6
8      4    10  1
9      4    10  2
10     4    9   2

Rename id_x to id :

df_merged = df_merged.rename(columns={'id_x': 'id'})

produces:

df_merged = 

    id  x   y
0   1   1   1
1   1   1   2
2   2   5   4
3   2   4   4
4   2   4   3
5   3   1   4
6   3   1   5
7   3   1   6
8   4   10  1
9   4   10  2
10  4   9   2

Final Program is 4 lines of code:

import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 
                    'x':[1,1,5,4,4,1,1,1], 
                    'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 
                    'x':[1,1,1,1,1,5,4,4,10,10,9], 
                    'y':[4,5,6,1,2,4,4,3,1,2,2]})

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})
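One caveat worth checking: the merge matches individual (x, y) points, not whole graphs. The sketch below reuses the extended df2 from the other answer on this page, which adds a graph 5 made of the single point (1, 2). That point already belongs to graph 1 in df1. The same four steps are applied unchanged.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 3, 3],
                    'x': [1, 1, 5, 4, 4, 1, 1, 1],
                    'y': [1, 2, 4, 4, 3, 4, 5, 6]})
# df2 with an extra graph 5 whose only point (1, 2) also occurs in df1
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5],
                    'x': [1, 1, 1, 1, 1, 5, 4, 4, 10, 10, 9, 1],
                    'y': [4, 5, 6, 1, 2, 4, 4, 3, 1, 2, 2, 2]})

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})

# The point (1, 2) matched two rows of df2, so it now appears twice
print(len(df_merged))                    # 12
# Graph 5 was absorbed into graph 1 instead of being appended
print(sorted(df_merged['id'].unique()))
```

In this case the result has 12 rows instead of the expected 12 distinct graph rows: the point (1, 2) is duplicated under id 1 and no graph with id 5 appears. When graphs can share points, comparing whole graphs as sets of points avoids this.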

Please remember to put a check next to the selected answer.

Mauritius, try this code:

import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5], 'x':[1,1,1,1,1,5,4,4,10,10,9,1], 'y':[4,5,6,1,2,4,4,3,1,2,2,2]})

# One set of (x, y) points per graph in df1; sets ignore both row order and ids
df1_s = [{(x, y) for x, y in df1.loc[df1.id == i, ['x', 'y']].values} for i in df1.id.unique()]

def f(group):
    # True if this df2 graph matches none of the graphs in df1
    data = {(x, y) for x, y in group[['x', 'y']].values}
    return data not in df1_s

# Boolean Series indexed by id; keep only the ids of the new graphs
check = df2.groupby('id').apply(f)
ids = check[check].index.values
df2 = df2.set_index('id').loc[ids].reset_index()

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df1 = pd.concat([df1, df2])

OUT:

   id   x  y
0   1   1  1
1   1   1  2
2   2   5  4
3   2   4  4
4   2   4  3
5   3   1  4
6   3   1  5
7   3   1  6
0   4  10  1
1   4  10  2
2   4   9  2
3   5   1  2

I suspect this can be done in a simpler and more pythonic way, but after a lot of thought I still don't see how =)

Also, before appending one DataFrame to the other at the end, it would be wise to check that the ids in df1 and df2 do not collide. I might add this later.
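One possible way to handle colliding ids, sketched here under the assumption that the ids are plain integers and that new graphs may simply be renumbered: give every incoming graph a fresh id above the current maximum in df1. The frame `new_graphs` and the names `next_id` and `id_map` are hypothetical and only serve the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1, 1, 5, 4], 'y': [1, 2, 4, 4]})
# Hypothetical new graphs whose ids (1 and 2) collide with ids already in df1
new_graphs = pd.DataFrame({'id': [1, 1, 2], 'x': [10, 10, 7], 'y': [1, 2, 7]})

# Map each incoming id to a fresh id above the current maximum in df1
next_id = df1['id'].max() + 1
id_map = {old: next_id + i for i, old in enumerate(new_graphs['id'].unique())}
new_graphs = new_graphs.assign(id=new_graphs['id'].map(id_map))

result = pd.concat([df1, new_graphs], ignore_index=True)
print(result['id'].tolist())  # [1, 1, 2, 2, 3, 3, 4]
```

Renumbering unconditionally keeps the logic short. If the original ids carry meaning, the map could instead be limited to the ids that actually collide.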

Does this code do what you want?

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 