Average of embeddings array per group in pandas

I have a pandas dataframe in which 32-dimensional embeddings are stored in a column named Embeddings (a pandas.core.series.Series). Each value is the string representation of an array.

0,8431,-7.73677,110.372814,ID_YGK179,Indonesia,96,Yogyakarta,625,"[-0.08708319 -0.9635474 
 -1.075278   -0.8778672   1.0672983   0.21834892
  0.10251518 -1.4207497  -1.3847003  -0.7889203  -0.58245313 -1.2558284
 -0.44232526 -2.44585     1.3060646  -0.6015553   0.21264891 -0.62279683
 -0.4118958  -0.10933076  0.2864734   0.42591774  0.35520273 -1.2562522
 -1.3118799   0.1367726   0.89168227  0.08609396 -0.7965635   0.03220405
 -1.2149535   0.06975704]"

1,8425,-8.82022551263183,115.171107687056,ID_BLI079,Indonesia,96,Bali,623,"[ 0.20398486 -0.3435272  -1.8947698  -1.0723802   1.2999498   0.211587
  0.16329497 -0.09804655 -0.41587254 -0.09957021  0.8152087  -0.6022888
 -0.10874949 -1.4237555  -0.02137504 -0.60817945  0.81695604 -0.0106029
  1.2845753   0.18705958  0.5555717   0.53619224  1.6209115   1.3571581
 -0.1660664   0.12530853 -0.12268435 -0.19951908  0.27602577 -0.66749376
 -0.09328692 -0.07952076]"

2,8431,-8.23575827574888,114.351026639342,ID_BWI026,Indonesia,96,Banyuwangi,770,"[-0.14250259 -0.60264546  0.39676255 -0.24801618  0.61574996 -0.5373072
  0.97321934 -0.22758694 -0.8498406  -0.86897266  0.565802   -1.383025
 -0.16449492 -1.6958055  -0.25523412 -0.50068396  0.36182633 -1.5886943
  0.56873196 -0.42583758 -0.16461776  0.12368935  1.470881    0.23292007
 -1.2004089   0.34835646  0.48000658  0.27867964 -0.35181814  0.20428348
  0.04278001 -0.16710897]"

Embeddings is the last column of the three sample rows above. I want to group the data by column 2 (guest_id; here 8431, 8425, 8431) and calculate the average of the embedding arrays per group.

I tried the following code, but the variable a contains only a single numpy array, so the zip call fails.

#Get the average of n 32 dimension embeddings
def get_average(values):
    a = np.array(values.values)
    a = np.array(a[0].split()[1:-1]).astype(float)
    print(a.shape)  # Returns (32,) n number of times
    return ([float(sum(col))/len(col) for col in zip(*a)])

#Read embeddings CSV file
hotelFrame = pd.read_csv('96_embedding')
hotelFrame = hotelFrame.iloc[: , [1, -1]] # select only 2 columns, guest_id and embedding
hotelFrame.columns = ['guest_id', 'embedding']
print(type(hotelFrame.embedding)) # Returns <class 'pandas.core.series.Series'>

average_embeddings = hotelFrame.groupby("guest_id").embedding.agg(get_average).to_frame()

Error: TypeError: zip argument #1 must support iteration

How do I get a dataframe of guest_id and embeddings_average as output? What am I doing wrong?

You can create an auxiliary column avg_embedding holding the mean of each row's embedding values, and then do a regular groupby:

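import numpy as np  # pd.np was removed in pandas 2.0; use numpy directly
# Parse each embedding string (brackets stripped) and reduce it to its mean
df['avg_embedding'] = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' ').mean())
df.groupby("guest_id").avg_embedding.mean()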

Result:

guest_id
8425    0.044565
8431   -0.268278
Name: avg_embedding, dtype: float64
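
Note that this result is a single number per guest_id, because each embedding is reduced to its own mean before grouping. The TypeError in the question arises because get_average parses only the first string of each group (a[0]), so a is a 1-D array of floats and zip(*a) is handed scalars, which are not iterable. If what is wanted is an element-wise average (one vector per guest_id), a minimal sketch could look like the one below. The helper parse_embedding, the names vectors and per_group, and the 3-dimensional toy values are assumptions made for illustration; they are not part of the original data or code.

```python
import numpy as np
import pandas as pd


def parse_embedding(text):
    """Parse a string such as '[ 0.1 -0.2\n 0.3]' into a 1-D float array."""
    # Drop surrounding whitespace and brackets, then split on any whitespace
    # (this also handles the line breaks inside the stored strings).
    return np.array(text.strip().strip("[]").split(), dtype=float)


# Toy frame with made-up 3-dimensional embeddings (hypothetical values)
df = pd.DataFrame({
    "guest_id": [8431, 8425, 8431],
    "embedding": ["[1.0 2.0\n 3.0]", "[ 0.5 0.5 0.5]", "[3.0 4.0 5.0]"],
})

# One row per sample, one column per embedding dimension
vectors = pd.DataFrame(
    np.vstack(df["embedding"].map(parse_embedding).tolist()),
    index=df.index,
)

# Column-wise mean within each guest_id gives the element-wise average
per_group = vectors.groupby(df["guest_id"]).mean()

# Optional: pack each averaged row back into a single array-valued column
average_embeddings = pd.DataFrame({
    "guest_id": per_group.index,
    "embeddings_average": list(per_group.to_numpy()),
})
print(average_embeddings)
```

For the toy data, guest 8431 averages [1, 2, 3] and [3, 4, 5] to [2, 3, 4]. Keeping the wide per_group frame (one column per dimension) is usually more convenient for further numeric work than an object column of arrays; the last step is only there to match the guest_id, embeddings_average layout asked for in the question.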
