Python Pandas: applying a specific function to each row
I am trying to apply a form of normalization to my data. I want to subtract each row's median from every value in that row of the dataframe. What I have so far:
import pandas as pd

# Generate sample data
data = {
    "sample_name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "group_name": ["g1", "g1", "g1", "g2", "g2", "g2"],
    "col1": [1, 22, 3, 45, 31, 53],
    "col2": [30, 21, 10, 42, 56, 20],
    "col3": [78, 25, 33, 87, 20, 19],
    "col4": [11, 23, 14, 98, 55, 66],
    "col5": [19, 29, 39, 49, 59, 69],
}
df = pd.DataFrame(data)

# Calculate the median of each row (numeric columns only)
median_ls = list(df.median(axis=1, numeric_only=True))
# [19.0, 23.0, 14.0, 49.0, 55.0, 53.0]
The expected result is:
-18,11,59,-8,0
-1,-2,2,0,6
-11,-4,19,0,25
-4,-7,38,49,0
-24,1,-35,0,4
0,-33,-34,13,16
I looked at df.apply(<function>, axis=1), but I could not work out the syntax for applying a row-specific function to each row in turn.
Use DataFrame.select_dtypes to get the numeric columns, then subtract the row medians (computed with axis=1) using DataFrame.sub with axis=0:
import numpy as np
df1 = df.select_dtypes(np.number).sub(df.median(axis=1, numeric_only=True), axis=0)
print(df1)
   col1  col2  col3  col4  col5
0 -18.0  11.0  59.0  -8.0   0.0
1  -1.0  -2.0   2.0   0.0   6.0
2 -11.0  -4.0  19.0   0.0  25.0
3  -4.0  -7.0  38.0  49.0   0.0
4 -24.0   1.0 -35.0   0.0   4.0
5   0.0 -33.0 -34.0  13.0  16.0
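The choice of axis is easy to mix up here, so the following is a minimal sketch of what the two axis arguments do. It uses a small made-up frame (not the data from the question) and also shows what I believe is the equivalent NumPy broadcasting, where the medians are kept as a column vector.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, chosen only to keep the arithmetic easy to follow
df_demo = pd.DataFrame({"a": [1, 10], "b": [2, 20], "c": [6, 60]})

# axis=1 reduces across the columns, giving one median per row
row_median = df_demo.median(axis=1)

# axis=0 aligns the Series index with the row labels, so each row
# has its own median subtracted
centered = df_demo.sub(row_median, axis=0)

# The same operation with NumPy: keepdims=True keeps the medians as a
# (n_rows, 1) column so they broadcast across the columns
values = df_demo.to_numpy()
centered_np = values - np.median(values, axis=1, keepdims=True)

print(centered)
```

With the default axis for DataFrame.sub (the columns), pandas would instead try to match the Series labels 0 and 1 against the column names a, b, c, which is not what is wanted here.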
If you need to assign the result back to the original DataFrame, use:
cols = df.select_dtypes(np.number).columns
df[cols] = df[cols].sub(df.median(axis=1, numeric_only=True), axis=0)
print(df)
  sample_name group_name  col1  col2  col3  col4  col5
0          s1         g1 -18.0  11.0  59.0  -8.0   0.0
1          s2         g1  -1.0  -2.0   2.0   0.0   6.0
2          s3         g1 -11.0  -4.0  19.0   0.0  25.0
3          s4         g2  -4.0  -7.0  38.0  49.0   0.0
4          s5         g2 -24.0   1.0 -35.0   0.0   4.0
5          s6         g2   0.0 -33.0 -34.0  13.0  16.0
Another idea is to select all columns except the first 2 with DataFrame.iloc:
df.iloc[:, 2:] = df.iloc[:, 2:].sub(df.median(axis=1, numeric_only=True), axis=0)
print(df)
  sample_name group_name  col1  col2  col3  col4  col5
0          s1         g1 -18.0  11.0  59.0  -8.0   0.0
1          s2         g1  -1.0  -2.0   2.0   0.0   6.0
2          s3         g1 -11.0  -4.0  19.0   0.0  25.0
3          s4         g2  -4.0  -7.0  38.0  49.0   0.0
4          s5         g2 -24.0   1.0 -35.0   0.0   4.0
5          s6         g2   0.0 -33.0 -34.0  13.0  16.0
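Both assign-back variants overwrite df, so running them twice subtracts the medians twice. If that is a concern, one option is to wrap the logic in a function that works on a copy. The sketch below assumes that; the name subtract_row_median is my own and is not part of pandas.

```python
import pandas as pd

def subtract_row_median(frame):
    """Return a copy of frame with each row's median subtracted
    from its numeric columns; non-numeric columns are left as they are.

    Hypothetical helper for illustration, not a pandas function.
    """
    out = frame.copy()
    cols = out.select_dtypes("number").columns
    # Row medians are computed from the numeric columns only
    out[cols] = out[cols].sub(out[cols].median(axis=1), axis=0)
    return out

# Small made-up frame with one label column
demo = pd.DataFrame({"name": ["x", "y"], "a": [1, 10], "b": [2, 20], "c": [6, 60]})
result = subtract_row_median(demo)
print(result)
print(demo)  # the input frame is unchanged
```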
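Try (on the numeric columns only, since a number cannot be subtracted from the string columns):
df.select_dtypes("number").sub(df.median(axis=1, numeric_only=True), axis=0)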
I allowed myself to use only the numeric part:
import pandas as pd

# Generate sample data
data = {
    "sample_name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "group_name": ["g1", "g1", "g1", "g2", "g2", "g2"],
    "col1": [1, 22, 3, 45, 31, 53],
    "col2": [30, 21, 10, 42, 56, 20],
    "col3": [78, 25, 33, 87, 20, 19],
    "col4": [11, 23, 14, 98, 55, 66],
    "col5": [19, 29, 39, 49, 59, 69],
}
keys = ["col1", "col2", "col3", "col4", "col5"]
df = pd.DataFrame(data)
print(df)

# Calculate the median of each row over the numeric columns
median_ls = list(df[keys].median(axis=1))
# [19.0, 23.0, 14.0, 49.0, 55.0, 53.0]
print(median_ls)

# Subtract each row's median from that row's values
print(df[keys].subtract(median_ls, axis=0))
Result:
   col1  col2  col3  col4  col5
0 -18.0  11.0  59.0  -8.0   0.0
1  -1.0  -2.0   2.0   0.0   6.0
2 -11.0  -4.0  19.0   0.0  25.0
3  -4.0  -7.0  38.0  49.0   0.0
4 -24.0   1.0 -35.0   0.0   4.0
5   0.0 -33.0 -34.0  13.0  16.0
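For completeness, the df.apply(<function>, axis=1) syntax the question asked about can be sketched as follows. With axis=1 the function receives one row at a time as a Series, so the row's own median is available inside it. The frame is a reduced version of the question's sample data, and the comparison with the vectorized sub form is only there to show that the two give the same values.

```python
import pandas as pd

# Reduced sample: first three rows and three numeric columns of the question's data
df_small = pd.DataFrame({
    "sample_name": ["s1", "s2", "s3"],
    "col1": [1, 22, 3],
    "col2": [30, 21, 10],
    "col3": [78, 25, 33],
})
num_cols = df_small.select_dtypes("number").columns

# Row-wise: each row is passed in as a Series and returned with its median removed
by_apply = df_small[num_cols].apply(lambda row: row - row.median(), axis=1)

# Vectorized equivalent, as used in the answers
by_sub = df_small[num_cols].sub(df_small[num_cols].median(axis=1), axis=0)

print(by_apply)
```

The apply version calls a Python function once per row, so on large frames the sub form is generally the faster of the two.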