Replace comma-separated values in a dataframe with values from another dataframe

Question

This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, Ref_ID in df2 contains the IDs that make up the comma-separated string in the References field of df1.

What I would like to do is replace the values in the References field of df1 so that it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006);Mike (2009)            | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far, I have only had to deal with columns and IDs in a 1-1 relationship, and this works perfectly: Pandas - Replacing Values by Looking Up in an Another Dataframe

But I cannot get my mind around this slightly different problem. The only solution I could think of is to nest for and if loops that compare every string of df1 to df2 and make the substitution.

This would be, I am afraid, very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation in several columns similar to the References one.

Is anyone willing to point me in the right direction?

Many thanks in advance.

Answer 1

You can use a list comprehension and dict lookups, and I don't think this will be too slow.

First, make a fast-to-access mapping from ID to short reference:

# Cast Ref_ID to str so the keys match the strings produced by split(',')
mapping_dict = dict(zip(df2['Ref_ID'].astype(str), df2['ShortRef']))

Then, let's split the references by commas:

df1_values = [v.split(',') for v in df1['References']]

Finally, we can iterate over the lists and do dictionary lookups before joining the results back into strings:

# Assign a plain list so the result is placed by position, whatever df1's index is
df1['References'] = [';'.join([mapping_dict[v] for v in values]) for values in df1_values]

Is this usable, or is it too slow?
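For reference, here is a minimal self-contained sketch of the same dictionary approach, using the sample data from the question. The helper name replace_ids and the choice to keep unknown IDs unchanged (rather than raise a KeyError) are assumptions added for illustration, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'References': ['1,2', '3', '2,3,5'],
    'Description': ['Descr 1', 'Descr 2', 'Descr 3'],
})
df2 = pd.DataFrame({
    'Ref_ID': [1, 2, 3, 4, 5, 6],
    'ShortRef': ['Smith (2006)', 'Mike (2009)', 'John (2014)',
                 'Cole (2007)', 'Jill (2019)', 'Tom (2007)'],
})

# Keys are cast to str because splitting '1,2' yields the strings '1' and '2'.
mapping_dict = dict(zip(df2['Ref_ID'].astype(str), df2['ShortRef']))


def replace_ids(cell, sep_in=',', sep_out=';'):
    """Hypothetical helper: map each ID in a separated cell to its ShortRef."""
    ids = [part.strip() for part in cell.split(sep_in)]
    # Assumption: IDs missing from df2 are kept as-is instead of raising KeyError.
    return sep_out.join(mapping_dict.get(i, i) for i in ids)


# Series.map keeps df1's index, so no realignment is needed.
df1['References'] = df1['References'].map(replace_ids)
print(df1)
```

Since each lookup is a plain dict access, about 2000 IDs should not be a problem; the same helper can be mapped over the other ID columns.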

Answer 2

Let's try this:

import pandas as pd

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

@Datanovice thanks for the update.

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
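Since the question mentions several columns that need the same treatment, the split/explode/map/groupby chain can be wrapped in a small function and applied per column. This is only a sketch: the function name ids_to_refs and the second column SeeAlso are made up for illustration. Note that an ID missing from df2 would map to NaN and make ';'.join fail, so this version assumes every ID exists in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    'References': ['1,2', '3', '2,3,5'],
    'SeeAlso': ['4', '5,6', '1'],  # hypothetical second ID column
    'Description': ['Descr 1', 'Descr 2', 'Descr 3'],
})
df2 = pd.DataFrame({
    'Ref_ID': [1, 2, 3, 4, 5, 6],
    'ShortRef': ['Smith (2006)', 'Mike (2009)', 'John (2014)',
                 'Cole (2007)', 'Jill (2019)', 'Tom (2007)'],
})

# Lookup Series indexed by the string form of Ref_ID, built once and reused.
lookup = (df2.assign(Ref_ID=df2['Ref_ID'].astype(str))
             .set_index('Ref_ID')['ShortRef'])


def ids_to_refs(col):
    """Hypothetical helper: replace comma-separated IDs in a column by ShortRefs."""
    return (col.str.split(',')      # '1,2' -> ['1', '2']
               .explode()           # one row per ID, original index repeated
               .map(lookup)         # ID -> ShortRef
               .groupby(level=0)    # regroup by the original row index
               .agg(';'.join))      # back to a single string per row


for name in ['References', 'SeeAlso']:
    df1[name] = ids_to_refs(df1[name])

print(df1)
```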

Answer 3

Another solution is to use str.get_dummies and dot:

df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())

Out[462]:
  Description                           References
0     Descr 1             Smith (2006);Mike (2009)
1     Descr 2                          John (2014)
2     Descr 3  Mike (2009);John (2014);Jill (2019)
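To see what the indicator-matrix idea does, here is a rough sketch that builds the same 0/1 table with str.get_dummies but joins the labels with a plain loop instead of dot, which makes the mechanics easier to follow. The third row is deliberately written as '5,2' (my own change to the sample data) to show one consequence of this approach: the output follows the order of df2, not the order of the IDs in the cell.

```python
import pandas as pd

df1 = pd.DataFrame({
    'References': ['1,2', '3', '5,2'],  # '5,2' is intentionally out of order
    'Description': ['Descr 1', 'Descr 2', 'Descr 3'],
})
df2 = pd.DataFrame({
    'Ref_ID': [1, 2, 3, 4, 5, 6],
    'ShortRef': ['Smith (2006)', 'Mike (2009)', 'John (2014)',
                 'Cole (2007)', 'Jill (2019)', 'Tom (2007)'],
})

ids = df2['Ref_ID'].astype(str).tolist()
labels = df2['ShortRef'].tolist()

# One indicator column per ID in df2, in df2's order; IDs never used get 0.
indicators = (df1['References'].str.get_dummies(',')
                               .reindex(columns=ids, fill_value=0))

# For each row, keep the labels whose indicator is set and join them.
df1['References'] = [
    ';'.join(label for label, flag in zip(labels, row) if flag)
    for row in indicators.to_numpy()
]
print(df1)
```

If the original order of the IDs within each cell matters, or an ID can appear twice in one cell, the dictionary or explode-based answers are probably a better fit.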


Answer 1
3 2020-01-06 18:35:45

Answer 2
3 Accepted 2020-01-06 18:36:31

Answer 3
1 2020-01-06 19:15:40
