Combine and complete the values of two pandas DataFrames from each other

Question

I have 2 DataFrames with missing values that I want to merge, so that each one completes the other's data.

A simple visualization:

df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3 

df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2

The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4

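Basically, I want to merge the DataFrames without duplicating values in column "A", and fill in any missing values in a row by comparing the rows that share the same "A" value across the two DataFrames.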
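For anyone who wants to reproduce the problem, here is a minimal sketch of the two frames above. It assumes the data really is comma-separated text; the names `DF1_CSV` and `DF2_CSV` are made up for illustration. With `pd.read_csv`, the empty fields are parsed as `NaN`, which is what most of the pandas fill methods expect.

```python
import io

import pandas as pd

# Hypothetical in-memory copies of the two tables shown in the question.
DF1_CSV = """A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3
"""

DF2_CSV = """A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2
"""

# read_csv turns empty fields into NaN by default.
df1 = pd.read_csv(io.StringIO(DF1_CSV))
df2 = pd.read_csv(io.StringIO(DF2_CSV))

print(df1)
print(df2)
```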

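I have tried a lot of things from the pandas documentation and from solutions on Stack Exchange, but they failed every time.

These are all the different things I tried: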

pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))

(You can see I was pretty desperate by the end.)

Any idea how to do this?

Answer 1

Have you tried combine_first with A as the index?

df1.set_index('A').combine_first(df2.set_index('A')).reset_index()

    A   B   C
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
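A short sketch of why the index matters here, assuming the missing cells are real `NaN` values (as they are when the data is read with `read_csv`). Called on the frames as they are, `combine_first` aligns rows on the default 0, 1, 2 index, so it pairs unrelated rows. With A as the index, rows are paired by key.

```python
import io

import pandas as pd

df1 = pd.read_csv(io.StringIO("A,B,C\nA1,B1,C1\nA2,B2,\nA3,B3,C3\n"))
df2 = pd.read_csv(io.StringIO("A,B,C\nA1,,C1\nA4,B4,C4\nA2,B2,C2\n"))

# Aligned by row position: df1's A2 row is filled from df2's A4 row,
# and A4 never appears in the result.
by_position = df1.combine_first(df2)

# Aligned by key: each A value is filled from the row with the same A.
by_key = (
    df1.set_index("A")
    .combine_first(df2.set_index("A"))
    .reset_index()
)

print(by_position)
print(by_key)
```

Where both frames hold a value for the same cell, `combine_first` keeps the one from the calling frame (`df1` here).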

Answer 2

Or you can use first:

pd.concat([df1,df2]).replace('',np.nan).groupby('A',as_index=False).first()
Out[53]: 
    A   B   C
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
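A self-contained sketch of the same idea, assuming the missing cells are empty strings (as they would be if the frames were built by hand). `groupby(...).first()` returns the first non-null value per column within each group, which is why the empty strings have to be turned into `NaN` first.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A3"],
                    "B": ["B1", "B2", "B3"],
                    "C": ["C1", "", "C3"]})
df2 = pd.DataFrame({"A": ["A1", "A4", "A2"],
                    "B": ["", "B4", "B2"],
                    "C": ["C1", "C4", "C2"]})

# Stack the frames and mark empty strings as missing.
stacked = pd.concat([df1, df2], ignore_index=True).replace("", np.nan)

# For each A, take the first non-null value found in every other column.
result = stacked.groupby("A", as_index=False).first()

print(result)
```

Because `df1` comes first in the `concat`, its values win whenever both frames have a value for the same cell.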

Answer 3

Setup
Since you wrote them out as CSVs, I'm going to assume they are CSVs.

df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)

Solution
Use fillna after having used align

pd.DataFrame.fillna(*df1.align(df2))

     B   C
A         
A1  B1  C1
A2  B2  C2
A3  B3  C3
A4  B4  C4

You can use reset_index if you insist, but I think it's prettier to leave A as the index.
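To make the one-liner easier to follow, here is a sketch that spells out the two steps, using in-memory CSV text in place of the `df1.csv` and `df2.csv` files (an assumption for illustration). `align` returns both frames reindexed to the union of their row and column labels, and `fillna` then fills the gaps in the first from the second.

```python
import io

import pandas as pd

df1 = pd.read_csv(io.StringIO("A,B,C\nA1,B1,C1\nA2,B2,\nA3,B3,C3\n"), index_col=0)
df2 = pd.read_csv(io.StringIO("A,B,C\nA1,,C1\nA4,B4,C4\nA2,B2,C2\n"), index_col=0)

# Outer-align both frames: each gets a row for every A seen in either frame.
left, right = df1.align(df2)

# Fill the holes in the aligned df1 with the matching cells of the aligned df2.
filled = left.fillna(right)

print(filled)
```

`pd.DataFrame.fillna(*df1.align(df2))` is the same call with the tuple returned by `align` unpacked into `fillna`'s first two arguments.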

Answer 4

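You can use the pandas categorical data type to set up an ordered list of categories, sort on those ordered categories, and drop the rows with null values to get the desired result: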

import pandas as pd
from pandas.api.types import CategoricalDtype
# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', '', 'C3']})

# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': ['', 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})

# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])

# specify the letters in your dataset  
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}

# iterate over the letters
for let in letters:
    # create the ordered categories - set the range to the max # of values
    cats = ['{}{}'.format(let, num) for num in range(1000)]
    # create ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into placeholder
    cat_dict[let] = cat_type

# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')

# which outputs desired results
    A   B   C
0  A1  B1  C1
2  A2  B2  C2
2  A3  B3  C3
1  A4  B4  C4

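Although this is a bit long, one benefit of doing it this way is that your data can be in any order on input. It gives the values in each column an inherent rank, so A1 < A2 < A3 and so on. It also lets you sort on the columns.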
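A small sketch of the two behaviours this approach relies on, using made-up labels. An ordered categorical sorts by category position rather than alphabetically, and any value that is not in the category list (such as an empty string) becomes `NaN` on conversion, which is what lets `dropna` remove the incomplete rows.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = pd.Series(["A10", "A2", "A1"])

# Plain strings sort character by character, so "A10" lands before "A2".
plain_order = labels.sort_values().tolist()

# Ordered categories A0, A1, ..., A999 sort by their position in the list.
cat_type = CategoricalDtype([f"A{n}" for n in range(1000)], ordered=True)
ranked_order = labels.astype(cat_type).sort_values().tolist()

# A value outside the category list turns into NaN.
unknown_is_missing = bool(pd.Series(["", "A1"]).astype(cat_type).isna().iloc[0])

print(plain_order)
print(ranked_order)
print(unknown_is_missing)
```

Note that `dropna(how='any')` keeps only rows that are already complete. If no single row for a given A value has every column filled in either frame, that A value would be dropped from the result rather than completed.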

4 answers

Answer 1
4, accepted, 2018-02-17 03:26:13

Answer 2
4, 2018-02-17 03:48:51

Answer 3
4, 2018-02-17 04:08:45

Answer 4
1, 2018-02-17 03:25:45
