Average of embeddings array per group in pandas
I have a pandas DataFrame in which 32-dimensional embeddings are stored in a column named Embeddings (type pandas.core.series.Series).
0,8431,-7.73677,110.372814,ID_YGK179,Indonesia,96,Yogyakarta,625,"[-0.08708319 -0.9635474
-1.075278 -0.8778672 1.0672983 0.21834892
0.10251518 -1.4207497 -1.3847003 -0.7889203 -0.58245313 -1.2558284
-0.44232526 -2.44585 1.3060646 -0.6015553 0.21264891 -0.62279683
-0.4118958 -0.10933076 0.2864734 0.42591774 0.35520273 -1.2562522
-1.3118799 0.1367726 0.89168227 0.08609396 -0.7965635 0.03220405
-1.2149535 0.06975704]"
1,8425,-8.82022551263183,115.171107687056,ID_BLI079,Indonesia,96,Bali,623,"[ 0.20398486 -0.3435272 -1.8947698 -1.0723802 1.2999498 0.211587
0.16329497 -0.09804655 -0.41587254 -0.09957021 0.8152087 -0.6022888
-0.10874949 -1.4237555 -0.02137504 -0.60817945 0.81695604 -0.0106029
1.2845753 0.18705958 0.5555717 0.53619224 1.6209115 1.3571581
-0.1660664 0.12530853 -0.12268435 -0.19951908 0.27602577 -0.66749376
-0.09328692 -0.07952076]"
2,8431,-8.23575827574888,114.351026639342,ID_BWI026,Indonesia,96,Banyuwangi,770,"[-0.14250259 -0.60264546 0.39676255 -0.24801618 0.61574996 -0.5373072
0.97321934 -0.22758694 -0.8498406 -0.86897266 0.565802 -1.383025
-0.16449492 -1.6958055 -0.25523412 -0.50068396 0.36182633 -1.5886943
0.56873196 -0.42583758 -0.16461776 0.12368935 1.470881 0.23292007
-1.2004089 0.34835646 0.48000658 0.27867964 -0.35181814 0.20428348
0.04278001 -0.16710897]"
Embeddings is the last column of the three sample rows above. I want to group the data by column 2 (guest_id; here 8431, 8425, 8431) and calculate the average of the embedding arrays per group.
I tried the following code, but the variable a contains only a single NumPy array, so the zip call subsequently fails.
import numpy as np
import pandas as pd

# Get the average of n 32-dimensional embeddings
def get_average(values):
    a = np.array(values.values)
    a = np.array(a[0].split()[1:-1]).astype(float)
    print(a.shape)  # Returns (32,) n number of times
    return ([float(sum(col))/len(col) for col in zip(*a)])

# Read embeddings CSV file
hotelFrame = pd.read_csv('96_embedding')
hotelFrame = hotelFrame.iloc[:, [1, -1]]  # select only 2 columns, guest_id and embedding
hotelFrame.columns = ['guest_id', 'embedding']
print(type(hotelFrame.embedding))  # Returns <class 'pandas.core.series.Series'>
average_embeddings = hotelFrame.groupby("guest_id").embedding.agg(get_average).to_frame()
Error: TypeError: zip argument #1 must support iteration
How do I get a DataFrame of guest_id and embeddings_average as output? What am I doing wrong?
You can create an auxiliary column avg_embedding, holding the mean of the values in each row's embedding, and then do a regular groupby:
# pd.np was removed in pandas 2.0, so call NumPy directly (requires: import numpy as np)
df['avg_embedding'] = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' ').mean())
df.groupby("guest_id").avg_embedding.mean()
Result:
guest_id
8425 0.044565
8431 -0.268278
Name: avg_embedding, dtype: float64
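Note that the result above is one scalar per guest_id (the mean over all 32 components), not a 32-dimensional vector. The original get_average fails because a is a 1-D array of floats, so zip(*a) tries to unpack scalars, which are not iterable. If what you want is the element-wise average vector per group, a minimal sketch could look like the following. The helper names parse_embedding and mean_embedding_per_group are made up for illustration, and it assumes every embedding string has the form "[v1 v2 ... vd]" with the same number of values.

```python
import numpy as np
import pandas as pd

def parse_embedding(text):
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines).
    return np.array(text.strip().strip("[]").split(), dtype=float)

def mean_embedding_per_group(df, key="guest_id", col="embedding"):
    # Hypothetical helper: assumes all embeddings share the same length d.
    # Stack the parsed vectors into an (n_rows, d) matrix, one column per dimension.
    matrix = np.vstack(df[col].map(parse_embedding).to_list())
    wide = pd.DataFrame(matrix, index=df.index)
    # Averaging each column within a group gives the element-wise mean vector.
    return wide.groupby(df[key]).mean()

# Small made-up example with 3-dimensional embeddings
df = pd.DataFrame({
    "guest_id": [8431, 8425, 8431],
    "embedding": ["[1.0 2.0\n 3.0]", "[ 0.5 0.5 0.5]", "[3.0 4.0 5.0]"],
})
averages = mean_embedding_per_group(df)
print(averages)
```

The returned frame has one row per guest_id and one column per embedding dimension; averages.loc[8431].to_numpy() gives that group's average vector as a NumPy array.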