
Pandas: if a value in a dataframe contains a string from another dataframe, append columns

Let's say I have two dataframes, df1 and df2. I want to append some columns of df2 to df1 wherever the value in a specific column of df1 contains the string from a specific column of df2, and NaN otherwise.

A small example:

import pandas as pd
df1 = pd.DataFrame({'col': ['abc', 'def', 'abg', 'xyz']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'], 'col2': ['match1', 'match2'], 'col3': [1, 2]})

df1:
   col
0  abc
1  def
2  abg
3  xyz

df2:

  col1    col2    col3
0   ab  match1       1
1   ef  match2       2

I want:

   col   col2_match   col3_match
0  abc       match1            1
1  def       match2            2
2  abg       match1            1
3  xyz          NaN          NaN

I managed to do it in a dirty and inefficient way, but in my case df1 contains around 100K rows and it takes forever...

Thanks in advance!

EDIT

A bit dirty, but it gets the work done relatively quickly (I still think there is a smarter way, though...):

import pandas as pd
import numpy as np


df1 = pd.DataFrame({'col': ['abc', 'def', 'abg']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'],
                    'col2': ['match1', 'match2'],
                    'col3': [1, 2]})


def return_nan(tup):
    # tup is the tuple returned by np.where: take the first matching
    # position, or NaN when nothing matched.
    return np.nan if len(tup[0]) == 0 else tup[0][0]


def get_indexes_match(l1, l2):
    # For each string e in l1, find the position of the first string
    # in l2 that is a substring of e.
    return [return_nan(np.where([x in e for x in l2])) for e in l1]


def merge(df1, df2, left_on, right_on):
    # Note: this adds a helper column 'idx' to both input frames in place.
    df1.loc[:, 'idx'] = get_indexes_match(df1[left_on].values,
                                          df2[right_on].values)
    df2.loc[:, 'idx'] = np.arange(len(df2))
    return pd.merge(df1, df2, how='left', on='idx')


merge(df1, df2, left_on='col', right_on='col1')
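As a possible alternative to the Python-level loop above, here is a minimal sketch that builds a single regular expression from df2's col1, extracts the matching substring with `Series.str.extract`, and then does an ordinary left merge. The helper column name `key` is made up for illustration, and the `_match` suffixes are taken from the desired output in the question. It assumes the values in col1 are unique; when several substrings occur in the same value, the regex picks the one that appears earliest in the string, which may differ from "first matching row of df2".

```python
import re

import pandas as pd

df1 = pd.DataFrame({'col': ['abc', 'def', 'abg', 'xyz']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'],
                    'col2': ['match1', 'match2'],
                    'col3': [1, 2]})

# One alternation pattern with a single capture group; re.escape protects
# against regex metacharacters in the substrings.
pattern = '(' + '|'.join(re.escape(s) for s in df2['col1']) + ')'

# First substring found in each value of df1['col'], NaN when none is found.
key = df1['col'].str.extract(pattern, expand=False)

result = (
    df1.assign(key=key)  # 'key' is a hypothetical helper column
       .merge(df2, how='left', left_on='key', right_on='col1')
       .drop(columns=['key', 'col1'])
       .rename(columns={'col2': 'col2_match', 'col3': 'col3_match'})
)
print(result)
```

Because the merge is a left join, the rows keep the order of df1, and rows without a match get NaN in the appended columns.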

You can use Python's difflib module for fuzzy matching, like this:

import difflib
df1.col = df1.col.map(lambda x: difflib.get_close_matches(x, df2.col1)[0])

So now your df1 is:

    col
0   ab
1   ef
2   ab

You can call it df3 if you wish to keep df1 unaltered.

Now you can merge:

merged = df1.merge(df2, left_on='col', right_on='col1', how='outer').drop('col1', axis=1)

The merged dataframe looks like:

   col    col2  col3
0   ab  match1     1
1   ab  match1     1
2   ef  match2     2

EDIT: If some values have no match, as in the new example, you just need to put a conditional in the lambda:

df1.col = df1.col.map(lambda x: difflib.get_close_matches(x, df2.col1)[0] if difflib.get_close_matches(x, df2.col1) else x)

Now after the merge you get:

   col    col2  col3
0   ab  match1     1
1   ab  match1     1
2   ef  match2     2
3  xyz     NaN   NaN
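For reference, the difflib approach can also be written so that df1 keeps its original values and row order. This is only a sketch: the helper `closest` and the temporary column `key` are made-up names, and `cutoff=0.6` is simply difflib's default. Keep in mind that `get_close_matches` scores overall similarity rather than testing for a substring, so a long value that contains a short string from df2 can still fall below the cutoff.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({'col': ['abc', 'def', 'abg', 'xyz']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'],
                    'col2': ['match1', 'match2'],
                    'col3': [1, 2]})

candidates = df2['col1'].tolist()


def closest(value, cutoff=0.6):
    # Hypothetical helper: best fuzzy match among the candidates, or None.
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


result = (
    df1.assign(key=df1['col'].map(closest))  # 'key' is a temporary column
       .merge(df2, how='left', left_on='key', right_on='col1')
       .drop(columns=['key', 'col1'])
)
print(result)

# Similarity is not containment: 'abcdefgh' contains 'ab', but the
# similarity ratio is below the default cutoff, so nothing is returned.
no_match = difflib.get_close_matches('abcdefgh', ['ab'])
print(no_match)
```

Calling `get_close_matches` once per value, instead of twice as in the one-line lambda, also halves the matching work on a large df1.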


