Sort values into columns based on match in other column set in Pandas
I have data like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, 30],
    'nameset2_val_1': [33, 97, 73, 21, 45],
    'nameset2_val_2': [53, 28, 47, 94, 34]
})
For nameset2, the value in each nameset2_val_ column (suffix _0, _1, _2) corresponds to the name/label in the nameset2_ column with the same suffix.
Each row contains the same set of names in the nameset1_ and nameset2_ columns, but the names are shuffled differently in each row.
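The claim that both column sets hold the same names in every row can be checked before doing any matching. The following is a minimal sketch using only the name columns from the example data; the variable names `names1`, `names2` and `same_names` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
})

# Select only the name columns (the regex excludes the *_val_* columns).
names1 = df.filter(regex=r'^nameset1_\d+$')
names2 = df.filter(regex=r'^nameset2_\d+$')

# Compare the non-missing names of each row as sets, ignoring order.
same_names = pd.Series(
    [set(a.dropna()) == set(b.dropna())
     for (_, a), (_, b) in zip(names1.iterrows(), names2.iterrows())],
    index=df.index,
)

print(same_names.all())
```

If any entry of `same_names` is False, that row has a name in one set with no counterpart in the other, and its value column would stay empty after matching.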
What I need to do is create a set of value columns for nameset1 that correctly match the nameset2 values to the appropriate name in nameset1_. The output should look like this (I'm being as careful as I can, but if you think there's an error here please drop a comment):
df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, np.nan],
    'nameset2_val_1': [33, np.nan, np.nan, 21, 45],
    'nameset2_val_2': [np.nan, np.nan, 47, np.nan, np.nan],
    'nameset1_val_0': [np.nan, 76, 47, 21, np.nan],
    'nameset1_val_1': [33, np.nan, 7, 34, 45],
    'nameset1_val_2': [6, np.nan, np.nan, np.nan, np.nan]
})
My insanely clunky code to try to handle this currently looks something like this, but it works inconsistently or not at all:
for i in list(range(3)):
    df['nameset1_val_'+str(i)] = df[
        ['nameset1_'+str(i)]
        +['nameset2_val_'+str(j) for j in list(range(3))]
    ].apply(
        lambda row: [i for i,e in enumerate(row[1:]) if e==row[0]],
        axis=1
    ).apply(lambda lst: lst.pop() if len(lst)==1 else np.nan)
    prefix='nameset2_val_'
    df['nameset1_val_'+str(i)] = df[
        ['nameset2_val_'+str(i) for i in list(range(3))]
    ].to_numpy()[df.index,
        df.columns.get_indexer(
            df['nameset1_val_'+str(i)].fillna(-1).astype(int).astype(str).radd(prefix)
    )]
I believe this gives what you need. nameset_dict maps each name (a character) to its value (an integer), and the new columns are then created with replace.
# Map every nameset2 name to the value in the matching nameset2_val_ column
nameset_dict = {}
for col in range(3):
    for _, row in df.loc[df[f"nameset2_{col}"].notna()].iterrows():
        nameset_dict[row[f"nameset2_{col}"]] = row[f"nameset2_val_{col}"]

# Look up each nameset1 name in the dictionary
for col in range(3):
    df[f"nameset1_val_{col}"] = df[f"nameset1_{col}"].replace(nameset_dict)
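Because nameset_dict is keyed by name alone, it relies on each name carrying one value across the whole frame; if the same name appeared in two rows with different values, the later row would overwrite the earlier one. A per-row variant avoids that assumption. The following is only a sketch: the helper `row_values`, the constant `n` and the result name `vals` are illustrative names, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nameset1_0': [np.nan, 'A', 'B', 'C', np.nan],
    'nameset1_1': ['D', np.nan, 'E', 'F', 'G'],
    'nameset1_2': ['H', np.nan, np.nan, np.nan, np.nan],
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
    'nameset2_val_0': [6, 76, 7, 34, 30],
    'nameset2_val_1': [33, 97, 73, 21, 45],
    'nameset2_val_2': [53, 28, 47, 94, 34],
})

n = 3  # number of suffixes (_0, _1, _2), as in the example data


def row_values(row):
    """Build a name -> value lookup from one row and apply it to nameset1."""
    lookup = {
        row[f'nameset2_{k}']: row[f'nameset2_val_{k}']
        for k in range(n)
        if pd.notna(row[f'nameset2_{k}'])  # skip missing names
    }
    # Missing or unmatched nameset1 names fall back to NaN.
    return pd.Series({
        f'nameset1_val_{k}': lookup.get(row[f'nameset1_{k}'], np.nan)
        for k in range(n)
    })


vals = df.apply(row_values, axis=1)
df = df.join(vals)
print(df.filter(like='nameset1').to_string())
```

This trades speed for clarity: `apply` with `axis=1` runs Python code per row, which is fine for small frames but slower than a vectorized approach on large ones.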
This is the result that I am getting:
   nameset1_val_0  nameset1_val_1  nameset1_val_2
1            76.0            33.0             6.0
2            47.0             NaN             NaN
3            21.0             7.0             NaN
4             NaN            34.0             NaN
5             NaN            45.0             NaN
You could do:
df1 = df.select_dtypes(include=['object']).melt()
df1 = df1.assign(grp=df1.groupby('variable').cumcount()).dropna()
df1['grp2'] = df1.variable.str.extract(r'(\d+$)')
df2 = df.select_dtypes(include=['int64', 'float64']).melt(var_name='var1', value_name='val')
df2['grp'] = df2.groupby('var1').cumcount()
df2['grp2'] = df2.var1.str.extract(r'(\d+$)')
df3 = df1.merge(df2).drop(['value', 'grp2', 'var1'], axis=1)
# regex=True is required in recent pandas versions for pattern replacement
df3['variable'] = df3.variable.str.replace(r'(_.*)', r'_val\1', regex=True)
# pivot takes keyword arguments only in recent pandas versions
df3.pivot(index='grp', columns='variable')  # Is what you are looking for
Another option:
# Create Mapper For nameset2 Keys and Values
m = df.filter(like='nameset2')
m.columns = m.columns \
.str.replace(r'val_(\d+)$', r'\1_val', regex=True) \
.str.replace(r'_(\d+)$', r'_\1_key', regex=True) \
.str.split('_', expand=True).droplevel(0)
m = m.stack(level=0).dropna() \
.droplevel(1).reset_index() \
.set_index(['index', 'key'])
# Join with nameset1 values and pivot to wide format
vals = df.filter(like='nameset1') \
.stack() \
.reset_index() \
.join(m, on=['level_0', 0]) \
.pivot(columns='level_1', index='level_0') \
.rename_axis(None)
# Fix Column Names
vals.columns = vals.columns.map(
lambda s: '{}_val_{}'.format(*s[1].split('_'))
if s[0] == 'val' else
s[1]
)
# Join vals with nameset2
new_df = vals.join(df.filter(like='nameset2'))
print(new_df.to_string())
m, the mapper that associates (index, key) pairs with values:
           val
index key
0     H      6
      D     33
1     A     76
2     E      7
      B     47
3     F     34
      C     21
4     G     45
vals after the join and pivot, before the column names are fixed:
                 0                              val
level_1 nameset1_0 nameset1_1 nameset1_2 nameset1_0 nameset1_1 nameset1_2
0              NaN          D          H        NaN       33.0        6.0
1                A        NaN        NaN       76.0        NaN        NaN
2                B          E        NaN       47.0        7.0        NaN
3                C          F        NaN       21.0       34.0        NaN
4              NaN          G        NaN        NaN       45.0        NaN
Column names of vals after the fix:
nameset1_0 nameset1_1 nameset1_2 nameset1_val_0 nameset1_val_1 nameset1_val_2
new_df:
  nameset1_0 nameset1_1 nameset1_2 nameset1_val_0 nameset1_val_1 nameset1_val_2 nameset2_0 nameset2_1 nameset2_2 nameset2_val_0 nameset2_val_1 nameset2_val_2
0        NaN          D          H            NaN           33.0            6.0          H          D        NaN              6             33             53
1          A        NaN        NaN           76.0            NaN            NaN          A        NaN        NaN             76             97             28
2          B          E        NaN           47.0            7.0            NaN          E        NaN          B              7             73             47
3          C          F        NaN           21.0           34.0            NaN          F          C        NaN             34             21             94
4        NaN          G        NaN            NaN           45.0            NaN        NaN          G        NaN             30             45             34
I found an answer (still using .loc assignment) that's pretty simple and appears to be producing the correct output.
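import itertools

for i, j in itertools.product(range(3), range(3)):
    # Rows where the name in nameset2_{i} equals the name in nameset1_{j}
    match = df[f'nameset2_{i}'] == df[f'nameset1_{j}']
    # The value for that name lives in nameset2_val_{i} and belongs in nameset1_val_{j}
    df.loc[match, f'nameset1_val_{j}'] = df.loc[match, f'nameset2_val_{i}']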
Note: it doesn't handle cases where the same name appears multiple times in the same column set in a given row.
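Whether that caveat applies to a given dataset can be checked up front. The sketch below flags rows in which a name is repeated within one column set; the helper `rows_with_repeated_names` and the frame `bad` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

names2 = pd.DataFrame({
    'nameset2_0': ['H', 'A', 'E', 'F', np.nan],
    'nameset2_1': ['D', np.nan, np.nan, 'C', 'G'],
    'nameset2_2': [np.nan, np.nan, 'B', np.nan, np.nan],
})


def rows_with_repeated_names(frame):
    """Return a boolean Series: True where a row repeats a non-missing name."""
    return frame.apply(lambda row: row.dropna().duplicated().any(), axis=1)


# The example data has no repeated names in any row.
has_dupes = rows_with_repeated_names(names2)
print(has_dupes.any())

# A made-up row where 'A' appears twice, to show what gets flagged.
bad = pd.DataFrame({
    'nameset2_0': ['A'],
    'nameset2_1': ['A'],
    'nameset2_2': [np.nan],
})
bad_dupes = rows_with_repeated_names(bad)
print(bad_dupes.iloc[0])
```

Rows flagged this way would need a separate rule (for example, keeping the first match) before any of the matching approaches in this thread can be applied to them.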