Pandas join grouped and normal dataframe
I'm using Pandas (0.9.1) to write physics code. I have two dataframes:
Levels:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37331 entries, 0 to 37330
Data columns:
atomic_number 37331 non-null values
ion_number 37331 non-null values
level_number 37331 non-null values
energy 37331 non-null values
g 37331 non-null values
metastable 37331 non-null values
Lines:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 314338 entries, 0 to 314337
Data columns:
id 314338 non-null values
wavelength 314338 non-null values
atomic_number 314338 non-null values
ion_number 314338 non-null values
f_ul 314338 non-null values
f_lu 314338 non-null values
level_number_lower 314338 non-null values
level_number_upper 314338 non-null values
dtypes: float64(3), int64(7)
There are a couple of things I need to do. I need to join levels with lines on (atom, ion, level): first on atomic_number, ion_number, level_number_upper, and then on atomic_number, ion_number, level_number_lower. Is there a way to precompute the join? Memory is not an issue, but speed is.
I also need to group levels (on atom, ion) and do an operation on each group. I have done this already (it is incredibly fast), but then had trouble joining the resulting Series with the lines dataframe.
How do I do this?
Cheers, Wolfgang
Update v1:
To show what I want to join/merge, here is a code snippet:
import numpy as np

def calc_group_func(group):
    # Sum of g * exp(-energy) over the levels in one (atom, ion) group
    return np.sum(group['g'] * np.exp(-group['energy']))

grouped_data = levels.groupby(['atomic_number', 'ion_number'])
grouped_data.apply(calc_group_func)
and then I want to join/merge the grouped data with lines on atomic_number and ion_number.
There may be a better way, but perhaps df.merge() would work here. df.merge() works on two DataFrames, so the values computed for each (atom, ion) pair, which are in a Series after apply(), need to be placed in a DataFrame first, at which time the final column name can also be specified.
In [9]: grouped_vals = grouped_data.apply(calc_group_func)

In [10]: grouped_vals
Out[10]:
atomic_number  ion_number
0              0             0.517541
               1             0.046833
1              0             0.253188
               1             0.440194

In [11]: lines.merge(pd.DataFrame({'group_val': grouped_vals}),
   ....:             left_on=['atomic_number', 'ion_number'],
   ....:             right_index=True)
Out[11]:
    atomic_number  ion_number  group_val
id
a               0           0   0.517541
b               0           0   0.517541
c               0           1   0.046833
d               0           1   0.046833
e               1           0   0.253188
f               1           0   0.253188
g               1           1   0.440194
h               1           1   0.440194
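
The same idea might also cover the first part of the question (joining levels onto lines via level_number_lower and level_number_upper). Below is a minimal, self-contained sketch, not a definitive implementation. It assumes a recent pandas release rather than 0.9.1, and the toy values, the intermediate names `levels_idx`, `group_val` and `joined`, and the `_lower`/`_upper` suffixes are made up for illustration. Indexing levels by (atom, ion, level) once is one possible reading of "precomputing" the join; whether it is actually faster on the real data would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names follow the question, values are invented.
levels = pd.DataFrame({
    "atomic_number": [1, 1, 2, 2],
    "ion_number":    [0, 0, 0, 0],
    "level_number":  [0, 1, 0, 1],
    "energy":        [0.0, 1.0, 0.0, 2.0],
    "g":             [2, 8, 1, 3],
})
lines = pd.DataFrame({
    "id":                 [10, 11],
    "atomic_number":      [1, 2],
    "ion_number":         [0, 0],
    "level_number_lower": [0, 0],
    "level_number_upper": [1, 1],
})

# Per-(atom, ion) sum of g * exp(-energy), computed column-wise instead of
# through apply(); the result is a Series indexed by (atomic_number, ion_number).
weights = levels["g"] * np.exp(-levels["energy"])
group_val = (
    weights.groupby([levels["atomic_number"], levels["ion_number"]])
    .sum()
    .rename("group_val")
)

# Index levels by (atom, ion, level) once so both level lookups reuse it.
levels_idx = levels.set_index(
    ["atomic_number", "ion_number", "level_number"]
)[["energy", "g"]]

joined = (
    lines
    # Attach the lower level's columns as energy_lower / g_lower.
    .merge(levels_idx.add_suffix("_lower"),
           left_on=["atomic_number", "ion_number", "level_number_lower"],
           right_index=True, how="left")
    # Attach the upper level's columns as energy_upper / g_upper.
    .merge(levels_idx.add_suffix("_upper"),
           left_on=["atomic_number", "ion_number", "level_number_upper"],
           right_index=True, how="left")
    # Attach the grouped value, as in the merge shown above.
    .merge(group_val.to_frame(),
           left_on=["atomic_number", "ion_number"],
           right_index=True, how="left")
)
print(joined)
```

Using how="left" keeps every row of lines even when a level is missing from levels, in which case the corresponding columns would be NaN.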