
How do I find whether a value in column B exists in column A of a dataframe and, if so, replace the value in column B with column A's value?

I have a dataframe:

import pandas as pd

df = pd.DataFrame({
    'Col A': ['A:A', 'A:A', 'B:B', 'C:C', 'D:D', 'E:E', 'F:F', 'F:F', 'G:G', 'H:H'],
    'Col B': ['A:A', 'F:F', 'B:B', 'C:C', 'D:D', 'E:E', 'E:E', 'F:F', 'G:G', 'H:H'],
})

My end goal is to combine all duplicate values in Col A and check whether each row's Col B value also exists in Col A. If it does, I want to add the Col B values recorded for that entry to the current row's Col B. Example of the desired result:

Index Col A Col B
0 A:A A:A, F:F, E:E
1 B:B B:B
2 C:C C:C
3 D:D D:D
4 E:E E:E
5 F:F F:F, E:E
6 G:G G:G
7 H:H H:H

I've tried applying a depth-first search:

visited = set()
def dfs(visited, graph, node):
    if node not in visited:
        print (node)
        visited.add(node)
        for neighbour in graph[node]:
            dfs(visited, graph, neighbour)

However, I get a KeyError when I try it:

data = np.array(['A:A', 'B:B', 'C:C', 'D:D', 'E:E', 'F:F', 'G:G', 'H:H'])
ser = {'A:A', 'B:B', 'C:C', 'D:D', 'E:E', 'F:F', 'G:G', 'H:H'}
ser = pd.Series(data)

df = df.groupby(['Col A'])['Col B'].apply(' , '.join).reset_index()
for i in df:
    dfs(visited, df, i)



KeyError: 'A:A'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-66-e47c7e4cac17> in <module>
     26 print(df)
     27 for i in df:
---> 28     dfs(visited, df, i)

<ipython-input-66-e47c7e4cac17> in dfs(visited, graph, node)
     17         visited.add(node)
     18         for neighbour in graph[node]:
---> 19             dfs(visited, graph, neighbour)
     20 
     21 data = np.array(['A:A', 'B:B', 'C:C', 'D:D', 'E:E', 'F:F', 'G:G', 'H:H'])

<ipython-input-66-e47c7e4cac17> in dfs(visited, graph, node)
     16         print (node)
     17         visited.add(node)
---> 18         for neighbour in graph[node]:
     19             dfs(visited, graph, neighbour)
     20 

Unfortunately, my experience with Python is limited. What is the best way to achieve my goal here?

A short, fast solution would be to group by Col A, aggregate Col B into a list, and join each list into a single string:

new_df = df.groupby('Col A')['Col B'].agg(list).str.join(', ').reset_index()

Output:

>>> new_df
  Col A     Col B
0   A:A  A:A, F:F
1   B:B       B:B
2   C:C       C:C
3   D:D       D:D
4   E:E       E:E
5   F:F  E:E, F:F
6   G:G       G:G
7   H:H       H:H
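
Note that this output only collects the direct Col B values, so the A:A row does not pick up E:E through F:F as the question's desired result does. The KeyError in the question comes from passing the DataFrame itself as the graph: graph[node] on a DataFrame looks up a column label, and 'A:A' is not a column. Below is a minimal sketch of one way to follow the chain, assuming an adjacency dict built from the grouped data is acceptable. The names graph and reachable are made up for illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Col A': ['A:A', 'A:A', 'B:B', 'C:C', 'D:D', 'E:E', 'F:F', 'F:F', 'G:G', 'H:H'],
    'Col B': ['A:A', 'F:F', 'B:B', 'C:C', 'D:D', 'E:E', 'E:E', 'F:F', 'G:G', 'H:H'],
})

# Adjacency dict: each Col A value maps to the list of Col B values next to it.
# sort=False keeps the keys in order of first appearance.
graph = df.groupby('Col A', sort=False)['Col B'].agg(list).to_dict()


def reachable(graph, start):
    """Return every node reachable from start, in depth-first discovery order."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        # Values that never appear in Col A have no outgoing edges.
        neighbours = graph.get(node, [])
        # Reversed so the first neighbour is popped (visited) first.
        stack.extend(reversed(neighbours))
    return seen


new_df = pd.DataFrame({
    'Col A': list(graph),
    'Col B': [', '.join(reachable(graph, key)) for key in graph],
})
print(new_df)
```

With the sample data, the A:A row becomes A:A, F:F, E:E and the F:F row becomes F:F, E:E. The explicit stack avoids Python's recursion limit on long chains, and graph.get(node, []) avoids a KeyError for values that appear only in Col B.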

