Get the mean of a column after appending one column to the end of another in pandas
I have a dataset that looks like this:
Interactor A Interactor B Interaction Score score2
0 P02574 P39205 0.928736 0.375000
1 P02574 Q6NR18 0.297354 0.166667
2 P02574 Q7KML4 0.297354 0.142857
3 P02574 Q9BP34 0.297354 0.166667
4 P02574 Q9BP35 0.297354 0.16666
data.shape = (112049, 5)
I want to append the Interactor B column to the end of the Interactor A column, keep only the unique values, and add a column that shows their Rank. I did this with:
cols = [data[col].squeeze() for col in data[['Interactor A', 'Interactor B']]]
n = pd.concat(cols, ignore_index=True)
n = pd.DataFrame(n, columns=['AB'])
To make the column unique:
t = pd.unique(n['AB'])
t = pd.DataFrame(t, columns=['AB'])
Then:
t2 = n.groupby(['AB'],sort=False).size()
t2 = pd.DataFrame(t2)
Finally, by concatenating t2 and t:
data_1 = pd.concat([t, t2.reset_index(drop=True)], axis=1)
AB Rank
0 P02574 4
data.shape = (13631, 2)
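The stacking and counting steps above can be sketched in a single pass. This is only an illustration on the five sample rows. It assumes Rank means the number of times an identifier appears across both columns, and the names stacked and counts are made up for the example.

```python
import pandas as pd

# The five sample rows shown above (interactor columns only).
data = pd.DataFrame({
    "Interactor A": ["P02574"] * 5,
    "Interactor B": ["P39205", "Q6NR18", "Q7KML4", "Q9BP34", "Q9BP35"],
})

# Stack column B underneath column A.
stacked = pd.concat([data["Interactor A"], data["Interactor B"]],
                    ignore_index=True)

# Count occurrences per identifier, keeping first-appearance order.
counts = (stacked.groupby(stacked, sort=False).size()
          .rename_axis("AB")
          .reset_index(name="Rank"))
print(counts)
```

Because the counts come straight from the groupby, there is no separate unique table to line up with them afterwards.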
Now I want to add the Interaction Score and score2 columns to this DataFrame. If an interactor appears more than once, take the mean of its Interaction Score, delete the duplicates, and replace the Interaction Score value with the mean.
I used:
score2 = data.groupby(['Interactor A','Interactor B'])['score2'].mean()
score2 = pd.DataFrame(score2, columns=['score2'])
The output in this case looks like:
score2
Interactor A Interactor B
A0A023GPK8 Q9VQW1 0.200000
A0A076NAB7 Q9VYN8 0.000000
A0A0B4JD97 Q400N2 0.000000
Q9VC64 0.090909
Q9VNE4 0.307692
112049 rows × 1 columns
But what I want is to add columns holding the mean of the score2 and Interaction Score columns for the 13631 unique values that I made. How can I achieve this? Please help. The final df should look like:
Interactor Rank Interaction Score score2
P02574 5 0.928736 0.44
i.e. score2 is the average of all the P02574 scores that appear in the dataset.
IIUC - You simply need to reshape your data from wide to long and then run the aggregation, assuming scores pair with interactors one for one. Consider wide_to_long for the reshape after setting up stub names and an id field. Then, run groupby().agg() for counts and means.
Data
from io import StringIO
import pandas as pd
txt = ''' "Interactor A" "Interactor B" "Interaction Score" "score2"
0 P02574 P39205 0.928736 0.375000
1 P02574 Q6NR18 0.297354 0.166667
2 P02574 Q7KML4 0.297354 0.142857
3 P02574 Q9BP34 0.297354 0.166667
4 P02574 Q9BP35 0.297354 0.16666'''
data = pd.read_csv(StringIO(txt), sep=r"\s+")
Reshape
# FOR id FIELD
data["id"] = data.index
# FOR STUB NAMES
data = data.rename(columns={"Interaction Score": "score A",
                            "score2": "score B"})
df_long = pd.wide_to_long(data, ["Interactor", "score"], i="id",
                          j="score_type", sep=" ", suffix="(A|B)")
df_long
# Interactor score
# id score_type
# 0 A P02574 0.928736
# 1 A P02574 0.297354
# 2 A P02574 0.297354
# 3 A P02574 0.297354
# 4 A P02574 0.297354
# 0 B P39205 0.375000
# 1 B Q6NR18 0.166667
# 2 B Q7KML4 0.142857
# 3 B Q9BP34 0.166667
# 4 B Q9BP35 0.166660
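To make the pairing explicit, here is a rough hand-built equivalent of the reshape above: each interactor column is matched with one score column and the two blocks are stacked. It is a sketch on the same five sample rows with the original column names. The names pairs, blocks and manual_long are illustrative only, and the id index level is dropped for brevity.

```python
import pandas as pd

data = pd.DataFrame({
    "Interactor A": ["P02574"] * 5,
    "Interactor B": ["P39205", "Q6NR18", "Q7KML4", "Q9BP34", "Q9BP35"],
    "Interaction Score": [0.928736, 0.297354, 0.297354, 0.297354, 0.297354],
    "score2": [0.375000, 0.166667, 0.142857, 0.166667, 0.16666],
})

# Suffix -> (interactor column, score column it is assumed to pair with).
pairs = {"A": ("Interactor A", "Interaction Score"),
         "B": ("Interactor B", "score2")}

# Build one two-column block per suffix, tagged with its score_type.
blocks = [
    data[[name_col, score_col]]
    .rename(columns={name_col: "Interactor", score_col: "score"})
    .assign(score_type=suffix)
    for suffix, (name_col, score_col) in pairs.items()
]

# Stack the blocks into one long table.
manual_long = pd.concat(blocks, ignore_index=True)
print(manual_long)
```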
Interactor Aggregation
df_long.groupby(["Interactor"])["score"].agg(["count", "mean"])
# count mean
# Interactor
# P02574 5 0.423630
# P39205 1 0.375000
# Q6NR18 1 0.166667
# Q7KML4 1 0.142857
# Q9BP34 1 0.166667
# Q9BP35 1 0.166660
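If you would rather have the count come out already labelled as Rank, pandas named aggregation can set the output column names directly. A minimal sketch, rebuilding a flat version of the long table so that it runs on its own. The names summary and mean_score are made up for the example.

```python
import pandas as pd

# Flat stand-in for df_long above (the id / score_type index levels are dropped).
df_long = pd.DataFrame({
    "Interactor": ["P02574"] * 5
                  + ["P39205", "Q6NR18", "Q7KML4", "Q9BP34", "Q9BP35"],
    "score": [0.928736, 0.297354, 0.297354, 0.297354, 0.297354,
              0.375000, 0.166667, 0.142857, 0.166667, 0.166660],
})

# Named aggregation: keyword = output column name, value = aggregation function.
summary = (df_long.groupby("Interactor")["score"]
           .agg(Rank="count", mean_score="mean")
           .reset_index())
print(summary)
```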
Interactor + Score Groupby Aggregation
df_long.groupby(["Interactor", "score_type"])['score'].agg(["count", "mean"])
# count mean
# Interactor score_type
# P02574 A 5 0.423630
# P39205 B 1 0.375000
# Q6NR18 B 1 0.166667
# Q7KML4 B 1 0.142857
# Q9BP34 B 1 0.166667
# Q9BP35 B 1 0.166660
Interactor + Score Pivot Aggregation
df_long.pivot_table(index="Interactor", columns="score_type", values='score',
                    aggfunc=["count", "mean"])
# count mean
# score_type A B A B
# Interactor
# P02574 5.0 NaN 0.42363 NaN
# P39205 NaN 1.0 NaN 0.375000
# Q6NR18 NaN 1.0 NaN 0.166667
# Q7KML4 NaN 1.0 NaN 0.142857
# Q9BP34 NaN 1.0 NaN 0.166667
# Q9BP35 NaN 1.0 NaN 0.166660
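If the one-for-one pairing is not what you meant, and instead both scores on a row belong to both interactors on that row, a melt followed by a named aggregation gets closer to the layout in the question. This is a sketch under that assumption only, starting from the original column names. The names stacked and result are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Interactor A": ["P02574"] * 5,
    "Interactor B": ["P39205", "Q6NR18", "Q7KML4", "Q9BP34", "Q9BP35"],
    "Interaction Score": [0.928736, 0.297354, 0.297354, 0.297354, 0.297354],
    "score2": [0.375000, 0.166667, 0.142857, 0.166667, 0.16666],
})

# Stack both interactor columns into one, carrying both scores along.
stacked = data.melt(id_vars=["Interaction Score", "score2"],
                    value_vars=["Interactor A", "Interactor B"],
                    value_name="Interactor")

# One row per interactor: occurrence count plus the mean of each score.
result = (stacked.groupby("Interactor", sort=False)
          .agg(**{"Rank": ("score2", "count"),
                  "Interaction Score": ("Interaction Score", "mean"),
                  "score2": ("score2", "mean")})
          .reset_index())
print(result)
```

On the five sample rows P02574 gets Rank 5. The means depend on which pairing assumption matches your data, so check a few identifiers by hand before running it on the full 112049 rows.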