Replace comma-separated values in a dataframe with values from another dataframe

This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, Ref_ID in df2 contains the IDs that make up the comma-separated string stored in the References field of df1.

What I would like to do is replace the values in the References field of df1 so that it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009)           | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far, I have only had to deal with columns and IDs in a 1-1 relationship, and for that this works perfectly: "Pandas - Replacing Values by Looking Up in an Another Dataframe".

But I cannot get my mind around this slightly different problem. The only solution I could think of is to nest for loops and if statements that compare every string in df1 against df2 and make the substitution.

I am afraid this would be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to References.

Is anyone willing to point me in the right direction?

Many thanks in advance.

You can use list comprehensions and dict lookups; I don't think this will be too slow.

First, make a fast-access mapping from Ref_ID to ShortRef:

# Keys are cast to str so they match the pieces produced by str.split below
mapping_dict = dict(zip(df2['Ref_ID'].astype(str), df2['ShortRef']))

Then, let's split the references by commas:

df1_values = [v.split(',') for v in df1['References']]

Finally, we can iterate over the split values and do the dictionary lookups before joining back into strings:

# Assign a plain list so the result is positional and does not depend on df1's index
df1['References'] = [';'.join(mapping_dict[v] for v in values) for values in df1_values]

Is this usable, or is it too slow?

Let's try this:
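
Since the question mentions repeating the operation on several columns, the dict approach above could be wrapped in a small helper. This is only a sketch under assumptions: `replace_ids` and `columns_to_convert` are hypothetical names, the sample frames mirror the ones in the question, and keeping unknown IDs unchanged (instead of raising `KeyError`) is a design choice not stated in the original.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Keys are strings because splitting "1,2" yields "1" and "2", not integers.
mapping = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))


def replace_ids(cell, mapping, sep_in=",", sep_out=";"):
    """Hypothetical helper: map each ID in a delimited string to its label."""
    ids = [part.strip() for part in cell.split(sep_in)]
    # Assumption: IDs missing from the mapping are kept as-is.
    return sep_out.join(mapping.get(i, i) for i in ids)


columns_to_convert = ["References"]  # add other ID columns here
for col in columns_to_convert:
    df1[col] = [replace_ids(cell, mapping) for cell in df1[col]]

print(df1)
```

The mapping is built once and reused for every column, so each cell costs only a split and a few dict lookups.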

import pandas as pd

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

@Datanovice thanks for the update.

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
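
One case the explode/map chain does not cover is an ID that has no row in df2: `map` leaves a missing value there, which `';'.join` cannot concatenate. The sketch below is one possible way to handle that, not part of the original answer. The ID `99` is made up for illustration, and falling back to the raw ID is an assumption about the desired behaviour.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Reference": ["1,2", "3", "1,3,99"],  # 99 is a made-up ID absent from df2
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)"],
})

# String keys, so they match the pieces produced by str.split.
lookup = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))

# One row per ID; the original row label is repeated in the index.
ids = df1["Reference"].str.split(",").explode().str.strip()

# Fall back to the raw ID when there is no match (an assumption).
labels = ids.map(lambda i: lookup.get(i, i))

# Regroup by the original row label and join back into one string per row.
df1["Reference2"] = labels.groupby(level=0).agg(";".join)

print(df1)
```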

Another solution is to use str.get_dummies and dot:

df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())

Out[462]:
  Description                           References
0     Descr 1             Smith (2006);Mike (2009)
1     Descr 2                          John (2014)
2     Descr 3  Mike (2009);John (2014);Jill (2019)
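
To see what the first step of this approach produces, the sketch below prints only the intermediate indicator matrix. The row `"5,2"` is an assumption added for illustration: it shows that the matrix records which IDs are present but not the order in which they were written, so the joined result follows the order of df2 rather than the order of the original string.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Reference": ["1,2", "3", "5,2"],  # "5,2" is deliberately out of order
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
ref_ids = ["1", "2", "3", "4", "5", "6"]  # df2.Ref_ID cast to str

# One column per known ID, 1 where the row mentions that ID, 0 otherwise.
indicators = (df1.set_index("Description")["Reference"]
                 .str.get_dummies(",")
                 .reindex(columns=ref_ids, fill_value=0))

print(indicators)
```

In the printed matrix, "Descr 3" is flagged in columns 2 and 5, so the later dot step would list the reference for ID 2 before the one for ID 5.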
