Replace comma-separated values in a dataframe with values from another dataframe
This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.
I have two pandas dataframes formatted as follows:
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
Basically, Ref_ID in df2 contains the IDs that make up the comma-separated strings in the References field of df1.
What I would like to do is replace the values in the References field of df1 so that it looks like this:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
So far, I have only had to deal with columns and IDs in a 1-1 relationship, and for that the approach from "Pandas - Replacing Values by Looking Up in an Another Dataframe" works perfectly.
But I cannot get my mind around this slightly different problem. The only solution I could think of is to nest for and if loops that compare every string of df1 to df2 and make the substitution.
I am afraid this would be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation in several columns similar to the References one.
Would anyone be willing to point me in the right direction?
Many thanks in advance.
You can use list comprehensions and dict lookups; I don't think this will be too slow.
First, build a fast-to-access mapping from Ref_ID to ShortRef. The keys are cast to str so that they match the pieces produced by split below:
mapping_dict = df2.assign(Ref_ID=df2['Ref_ID'].astype(str)).set_index('Ref_ID')['ShortRef'].to_dict()
Then, let's split the references on commas:
df1_values = [v.split(',') for v in df1['References']]
Finally, we can iterate over the lists and do the dictionary lookups before joining back into strings. Assigning a plain list (rather than a new pd.Series) avoids any index-alignment surprises if df1 does not have a default index:
df1['References'] = [';'.join(mapping_dict[v] for v in values) for values in df1_values]
Is this usable, or is it too slow?
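For reference, here is a self-contained sketch of the dict-lookup approach above, run on the sample data from the question. The helper name `replace_ids` and its `sep_in`/`sep_out` parameters are made up for illustration, and keeping unknown IDs unchanged is an assumption rather than something the question asks for.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Key the lookup by string so it matches the pieces produced by str.split.
mapping_dict = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))

def replace_ids(cell, mapping, sep_in=",", sep_out=";"):
    """Hypothetical helper: map each ID in a delimited string to its short reference."""
    ids = [part.strip() for part in cell.split(sep_in)]
    # Assumption: IDs missing from the mapping are kept as-is instead of raising KeyError.
    return sep_out.join(mapping.get(i, i) for i in ids)

result = df1["References"].map(lambda cell: replace_ids(cell, mapping_dict))
print(result.tolist())
```

Each cell costs one split plus one dict lookup per ID, so with about 2000 unique IDs the work grows with the number of cited references rather than with the size of df2.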
Let's try this (Series.explode requires pandas 0.25 or later):
import pandas as pd
df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                        'Mike (2009)',
                                                        'John (2014)',
                                                        'Cole (2007)',
                                                        'Jill (2019)',
                                                        'Tom (2007)']})
# Split each cell into a list, explode to one ID per row, map the IDs to
# ShortRef (with Ref_ID cast to str), then regroup by the original row index.
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                             .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(list))
Output:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
@Datanovice thanks for the update. To get a semicolon-separated string instead of a list, aggregate with ';'.join:
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                             .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(';'.join))
Output:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
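Since the question mentions several columns like References, the explode/map/groupby chain above could be wrapped in a small function and applied per column. This is only a sketch: the helper `ids_to_refs` and the second column `SeeAlso` are invented for illustration, and it assumes every cell is a non-empty string whose IDs all exist in df2 (an unmapped ID becomes NaN, which would make the final join fail).

```python
import pandas as pd

def ids_to_refs(column, lookup, sep_in=",", sep_out=";"):
    """Hypothetical helper: explode a delimited ID column, map, and re-join per row."""
    return (column.str.split(sep_in)
                  .explode()              # one ID per row, original index repeated
                  .str.strip()            # tolerate "1, 2" style spacing
                  .map(lookup)            # ID string -> short reference
                  .groupby(level=0)       # regroup by original row index
                  .agg(sep_out.join))

df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})
# Series indexed by the string form of Ref_ID, built once and reused for every column.
lookup = df2.assign(Ref_ID=df2["Ref_ID"].astype(str)).set_index("Ref_ID")["ShortRef"]

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "SeeAlso": ["4", "5,6", "1"],          # hypothetical second ID column
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})

for col in ["References", "SeeAlso"]:
    df1[col] = ids_to_refs(df1[col], lookup)

print(df1)
```

Building `lookup` once outside the loop keeps the per-column cost down to the split, explode and map steps.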
Another solution is to use str.get_dummies and dot:
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())
Output:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)
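To make the get_dummies/dot idea easier to follow, here is a self-contained sketch on the question's sample data. It is a slight variant of the snippet above, not the answerer's exact code: it keeps df1's default index instead of indexing by Description, and the variable names `indicators`, `labels` and `joined` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

ids = df2["Ref_ID"].astype(str).to_numpy()
# One 0/1 indicator column per ID, in df2's order; IDs that are never cited get all zeros.
indicators = df1["References"].str.get_dummies(",").reindex(columns=ids, fill_value=0)

# Multiplying a str by 0 or 1 gives "" or the str itself, so the dot product
# concatenates the short references of the IDs present in each row.
labels = (df2["ShortRef"] + ";").to_numpy(dtype=object)
joined = indicators.dot(labels).str.rstrip(";")
print(joined.tolist())
```

Two properties of this approach are worth keeping in mind: the references come out in df2's order rather than the order written in the cell, and an ID repeated within a cell appears only once, because the indicator matrix records presence only.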