Sort values into columns based on match in other column set in Pandas

I have data like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
   'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
   'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
   'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
   'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
   'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
   'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
   'nameset2_val_0': [6, 76,  7, 34, 30],
   'nameset2_val_1': [33, 97, 73, 21, 45],
   'nameset2_val_2': [53, 28, 47, 94, 34]
})

For nameset2, the values in each of the nameset2_val_ columns (suffixes _0, _1, _2) correspond to the name/label in the nameset2_ column with the same suffix.

The nameset1_ and nameset2_ columns contain the same set of names in each row, but the names are shuffled differently in each row.
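The claim that both column sets hold the same names in every row can be checked before attempting any matching. The following is only a sketch of such a check on the sample data; the helper names set1, set2 and same_names are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Only the name columns from the question are needed for this check
df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
})

# The regex keeps nameset2_0..2 but would exclude any nameset2_val_* columns
set1 = df.filter(regex=r'^nameset1_\d+$')
set2 = df.filter(regex=r'^nameset2_\d+$')

# Compare the non-missing names of each row as unordered sets
same_names = [
    set(a.dropna()) == set(b.dropna())
    for (_, a), (_, b) in zip(set1.iterrows(), set2.iterrows())
]
print(all(same_names))
```

If any entry of same_names is False, that row has a name in one set with no counterpart in the other, and a name-based lookup would leave a gap there.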

What I need to do is create a set of value columns for nameset1 that match the nameset2 values to the appropriate name in nameset1_. The output should look like this (I'm being as careful as I can, but if you think there's an error here please drop a comment):

df = pd.DataFrame({
   'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
   'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
   'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
   'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
   'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
   'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
   'nameset2_val_0': [6, 76,  7, 34, np.nan],
   'nameset2_val_1': [33, np.nan, np.nan, 21, 45],
   'nameset2_val_2': [np.nan, np.nan, 47, np.nan, np.nan],
   'nameset1_val_0': [np.nan, 76, 47, 21, np.nan],
   'nameset1_val_1': [33, np.nan, 7, 34, 45],
   'nameset1_val_2': [6, np.nan, np.nan, np.nan, np.nan]
})
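Compared with the input, the expected output above also appears to blank out every nameset2_val_ entry whose nameset2_ name is missing. The question does not state this as a requirement, so the following is only a sketch of how that part could be reproduced, assuming the name and value columns pair up by suffix; name_cols, val_cols and masked are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, 30],
    'nameset2_val_1': [33, 97, 73, 21, 45],
    'nameset2_val_2': [53, 28, 47, 94, 34],
})

name_cols = [f'nameset2_{i}' for i in range(3)]
val_cols = [f'nameset2_val_{i}' for i in range(3)]

# Keep a value only where the name with the same suffix is present;
# the mask is passed as a plain array so it is applied by position
masked = df[val_cols].where(df[name_cols].notna().to_numpy())
print(masked)
```

Assigning the result back with df[val_cols] = masked would overwrite the original columns, which become floating point because they now contain NaN.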

My insanely clunky code to try to handle this currently looks something like this, but it works inconsistently or not at all:

for i in list(range(3)):
    df['nameset1_val_'+str(i)] = df[
        ['nameset1_'+str(i)]
        +['nameset2_val_'+str(j) for j in list(range(3))]
    ].apply(
        lambda row: [i for i,e in enumerate(row[1:]) if e==row[0]],
        axis=1
    ).apply(lambda lst: lst.pop() if len(lst)==1 else np.nan)
    
    prefix='nameset2_val_'
    df['nameset1_val_'+str(i)] = df[
         ['nameset2_val_'+str(i) for i in list(range(3))]
    ].to_numpy()[df.index,
                 df.columns.get_indexer(
                     df['nameset1_val_'+str(i)].fillna(-1).astype(int).astype(str).radd(prefix)
                 )]

I believe this gives what you need. The nameset_dict maps each name (a character) to its integer value, and then we create the new columns by using replace:

nameset_dict = {}
# Collect name -> value pairs from every nameset2 column (rows with a missing name are skipped)
for col in range(3):
    key_col = f"nameset2_{col}"
    for _, row in df.loc[df[key_col].notna()].iterrows():
        nameset_dict[row[key_col]] = row[f"nameset2_val_{col}"]

# Look up each nameset1 name in the completed dictionary
for col in range(3):
    df[f"nameset1_val_{col}"] = df[f"nameset1_{col}"].replace(nameset_dict)

This is the result that I am getting:

nameset1_val_0 nameset1_val_1 nameset1_val_2
1    76.0        33.0             6.0
2    47.0        NaN              NaN
3    21.0        7.0              NaN
4     NaN        34.0             NaN
5     NaN        45.0             NaN

You could do:

df1 = df.select_dtypes(include=['object']).melt()
df1 = df1.assign(grp = df1.groupby('variable').cumcount()).dropna()
df1['grp2'] = df1.variable.str.extract('(\\d+$)')

df2= df.select_dtypes(include=['int64','float64']).melt(var_name='var1', value_name='val')
df2['grp'] =  df2.groupby('var1').cumcount()
df2['grp2'] = df2.var1.str.extract('(\\d+$)')

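# merge() joins on the shared columns: grp (row position) and grp2 (numeric suffix)
df3 = df1.merge(df2).drop(['value', 'grp2', 'var1'], axis=1)
# regex=True is required on recent pandas, where str.replace is literal by default
df3['variable'] = df3.variable.str.replace('(_.*)', '_val\\1', regex=True)

# pivot() takes keyword arguments only on recent pandas
df3.pivot(index='grp', columns='variable') # Is what you are looking for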

Another option:

# Create Mapper For nameset2 Keys and Values
m = df.filter(like='nameset2')
m.columns = m.columns \
    .str.replace(r'val_(\d+)$', r'\1_val', regex=True) \
    .str.replace(r'_(\d+)$', r'_\1_key', regex=True) \
    .str.split('_', expand=True).droplevel(0)
m = m.stack(level=0).dropna() \
    .droplevel(1).reset_index() \
    .set_index(['index', 'key'])

# Join with nameset1 values and pivot to wide format
vals = df.filter(like='nameset1') \
    .stack() \
    .reset_index() \
    .join(m, on=['level_0', 0]) \
    .pivot(columns='level_1', index='level_0') \
    .rename_axis(None)

# Fix Column Names
vals.columns = vals.columns.map(
    lambda s: '{}_val_{}'.format(*s[1].split('_'))
    if s[0] == 'val' else
    s[1]
)

# Join vals with nameset2
new_df = vals.join(df.filter(like='nameset2'))

print(new_df.to_string())

  1. Create a mapper m that associates index key value pairs:
           val
index key     
0     H      6
      D     33
1     A     76
2     E      7
      B     47
3     F     34
      C     21
4     G     45
  2. Join this mapper with nameset1 to get values and pivot to wide format:
                 0                              val                      
level_1 nameset1_0 nameset1_1 nameset1_2 nameset1_0 nameset1_1 nameset1_2
0              NaN          D          H        NaN       33.0        6.0
1                A        NaN        NaN       76.0        NaN        NaN
2                B          E        NaN       47.0        7.0        NaN
3                C          F        NaN       21.0       34.0        NaN
4              NaN          G        NaN        NaN       45.0        NaN
  3. Clean up the MultiIndex columns:
nameset1_0 nameset1_1 nameset1_2  nameset1_val_0  nameset1_val_1  nameset1_val_2
  4. Join with the nameset2 values:
  nameset1_0 nameset1_1 nameset1_2  nameset1_val_0  nameset1_val_1  nameset1_val_2 nameset2_0 nameset2_1 nameset2_2  nameset2_val_0  nameset2_val_1  nameset2_val_2
0        NaN          D          H             NaN            33.0             6.0          H          D        NaN               6              33              53
1          A        NaN        NaN            76.0             NaN             NaN          A        NaN        NaN              76              97              28
2          B          E        NaN            47.0             7.0             NaN          E        NaN          B               7              73              47
3          C          F        NaN            21.0            34.0             NaN          F          C        NaN              34              21              94
4        NaN          G        NaN             NaN            45.0             NaN        NaN          G        NaN              30              45              34
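The same mapper-and-join idea can also be sketched with plain arrays and a single merge on (row, name). This is not the code from the answer above, only an illustration under two assumptions: the nameset2_ and nameset2_val_ columns appear in the same suffix order, and a name occurs at most once per row in nameset2. The names lookup, target, matched, wide and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, 30],
    'nameset2_val_1': [33, 97, 73, 21, 45],
    'nameset2_val_2': [53, 28, 47, 94, 34],
})

names1 = df.filter(regex=r'^nameset1_\d+$')
names2 = df.filter(regex=r'^nameset2_\d+$')
vals2 = df.filter(regex=r'^nameset2_val_\d+$')
rows = df.index.to_numpy()

# Long table of (row, name, val) taken from nameset2; missing names are dropped
lookup = pd.DataFrame({
    'row': np.repeat(rows, names2.shape[1]),
    'name': names2.to_numpy().ravel(),
    'val': vals2.to_numpy().ravel(),
}).dropna(subset=['name'])

# Long table of (row, source column, name) taken from nameset1
target = pd.DataFrame({
    'row': np.repeat(rows, names1.shape[1]),
    'col': np.tile(names1.columns.to_numpy(), len(df)),
    'name': names1.to_numpy().ravel(),
})

# Match on row and name, then go back to one column per nameset1 slot
matched = target.merge(lookup, on=['row', 'name'], how='left')
wide = matched.pivot(index='row', columns='col', values='val')
wide.columns = wide.columns.str.replace('nameset1_', 'nameset1_val_', regex=False)

result = df.join(wide)
print(result.filter(like='nameset1_val').to_string())
```

If a name were repeated within a row of nameset2, the merge would produce duplicate (row, col) pairs and pivot would raise an error, so that case would need a deduplication rule first.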

I found an answer (still using .loc assignment) that's pretty simple and appears to be producing the correct output.

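import itertools

for i, j in itertools.product(range(3), range(3)):
    # Rows where the name in nameset2_{i} equals the name in nameset1_{j}
    match = df[f'nameset2_{i}'] == df[f'nameset1_{j}']
    # Copy the value paired with nameset2_{i} into the value column for nameset1_{j}
    df.loc[match, f'nameset1_val_{j}'] = df.loc[match, f'nameset2_val_{i}']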

Note: it doesn't handle cases where the same name appears multiple times in the same column set in a given row.
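For the repeated-name case mentioned in the note, one possible approach is a row-wise dictionary lookup, which forces an explicit choice about which occurrence wins. The sketch below is not from the post; it assumes three slots per set and arbitrarily keeps the first occurrence of a repeated name in nameset2. The function lookup_row and the variables new_vals and out are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, 30],
    'nameset2_val_1': [33, 97, 73, 21, 45],
    'nameset2_val_2': [53, 28, 47, 94, 34],
})

N_SLOTS = 3  # assumption: suffixes run from 0 to 2 as in the sample data


def lookup_row(row):
    # Build a name -> value mapping for this row only
    mapping = {}
    for i in range(N_SLOTS):
        name = row[f'nameset2_{i}']
        if pd.notna(name):
            # setdefault keeps the first value if a name is repeated
            mapping.setdefault(name, row[f'nameset2_val_{i}'])
    # Missing or unmatched nameset1 names fall back to NaN
    return pd.Series({
        f'nameset1_val_{j}': mapping.get(row[f'nameset1_{j}'], np.nan)
        for j in range(N_SLOTS)
    })


new_vals = df.apply(lookup_row, axis=1)
out = df.join(new_vals)
print(out.filter(like='nameset1_val').to_string())
```

This runs one Python function call per row, so it is likely slower than the .loc loop on large frames; its advantage is that the mapping never leaks between rows and the duplicate policy is visible in one line.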
