
Pandas has two dataframes, want the average of the divisions between each group

I have two dataframes like this:

import pandas as pd

dataA = [["A1", "t1", 5], ["A1", "t2", 8], ["A1", "t3", 7],
    ["A1", "t4", 4], ["A1", "t5", 2], ["A1", "t6", 2],
    ["A2", "t1", 15], ["A2", "t2", 6], ["A2", "t3", 1],
    ["A2", "t4", 11], ["A2", "t5", 12], ["A2", "t6", 7],
    ["A3", "t1", 12], ["A3", "t2", 8], ["A3", "t3", 3],
    ["A3", "t4", 7], ["A3", "t5", 15], ["A3", "t6", 14]]

# note: B2 has no t6 observation
dataB = [["B1", "t1", 2], ["B1", "t2", 9], ["B1", "t3", 17],
    ["B1", "t4", 14], ["B1", "t5", 32], ["B1", "t6", 3],
    ["B2", "t1", 44], ["B2", "t2", 36], ["B2", "t3", 51],
    ["B2", "t4", 81], ["B2", "t5", 82]]

data1 = pd.DataFrame(data=dataA, columns=["An", "colA", "Val"])

data2 = pd.DataFrame(data=dataB, columns=["Bm", "colA", "Val"])

How can I get this result?

| GroupA | GroupB | result |
|--------|--------|--------|
| A1     | B1     | val_11 |
| A1     | B2     | val_12 |
| A2     | B1     | val_21 |
| A2     | B2     | val_22 |
| A3     | B1     | val_31 |
| A3     | B2     | val_32 |
| ...    | ...    | ...    |
| An     | Bm     | val_nm |

val_nm is calculated as follows. Divide the values of An by the values of Bm element by element, matching on colA. Wherever a quotient is greater than 1, take its reciprocal. val_nm is the mean of the resulting values. Because of the reciprocal step, the result is the same whether A1 is divided by B1 or B1 is divided by A1.

A function may be needed to calculate val. Every Val is greater than 0, so there is no division by zero.

Taking val_11 as an example:

A1 = [5, 8, 7, 4, 2, 2], B1 = [2, 9, 17, 14, 32, 3]

val_11 = avg(A1/B1) = avg(2/5, 8/9, 7/17, 4/14, 2/32, 2/3) ≈ 0.4526

Here 5/2 is greater than 1, so its reciprocal 2/5 is used instead.

So whether A1/B1 or B1/A1 is used, the result is the same.
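The worked example can be checked with a few lines of NumPy. This is a minimal sketch rather than part of the original post, and the helper name `sym_ratio_mean` is made up for illustration. For positive values, taking the smaller of a/b and b/a is the same as "take the reciprocal if the quotient is greater than 1", which is why the order of the operands does not matter.

```python
import numpy as np

def sym_ratio_mean(a, b):
    """Mean of element-wise ratios, each folded into the interval (0, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ratio = a / b
    # For positive values, min(r, 1/r) equals "take the reciprocal if r > 1".
    return np.minimum(ratio, 1.0 / ratio).mean()

A1 = [5, 8, 7, 4, 2, 2]
B1 = [2, 9, 17, 14, 32, 3]

val_11 = sym_ratio_mean(A1, B1)
val_11_swapped = sym_ratio_mean(B1, A1)

# Both calls give approximately 0.452589.
print(round(val_11, 6), round(val_11_swapped, 6))
```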

Please help me calculate the result.

Taking the straight definition of what you want to calculate:

  • shape the data frames first: the data is key/value pairs, so create tables using pivot()
  • do a Cartesian product between the two tables with merge() on a synthetic column foo
  • complete the calculation you specified
  • filter down the columns to get your required output

import numpy as np
import pandas as pd

def meanofdiv(dfa):
    # values of the A group (columns suffixed "_A") and of the B group ("_B")
    a = dfa.loc[:, [c for c in dfa.columns if "_A" in c]].values
    b = dfa.loc[:, [c for c in dfa.columns if "_B" in c]].values
    # take the reciprocal wherever the ratio exceeds 1, then average per row
    return np.where((a / b) > 1, b / a, a / b).mean(axis=1)

# pivot key/value pair data to tables
# Cartesian product of the tables via the constant column foo
# row-wise mean of the folded ratios of the A columns to the B columns
dfr = pd.merge(
    data1.pivot(index="An", columns="colA", values="Val").reset_index().assign(foo=1),
    data2.pivot(index="Bm", columns="colA", values="Val").reset_index().assign(foo=1),
    on="foo",
    suffixes=("_A", "_B")
).assign(resname=lambda dfa: dfa["An"] + dfa["Bm"],
         res=meanofdiv)

dfr.loc[:, ["An", "Bm", "res"]]

   An  Bm       res
0  A1  B1  0.452589
1  A1  B2  0.202259
2  A2  B1  0.408018
3  A2  B2  0.206316
4  A3  B1  0.402510
5  A3  B2  0.172901
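On pandas 1.2 or later, the Cartesian product can also be written with `how="cross"` instead of a synthetic key column. The sketch below is an illustration under that version assumption, not the answer's own code. It rebuilds a subset of the question's data (A1, A2 and B1, all with six time points) so that it runs on its own; the names `times`, `to_long`, `wide_a`, `wide_b` and `pairs` are hypothetical.

```python
import numpy as np
import pandas as pd

times = ["t1", "t2", "t3", "t4", "t5", "t6"]
a_vals = {"A1": [5, 8, 7, 4, 2, 2], "A2": [15, 6, 1, 11, 12, 7]}
b_vals = {"B1": [2, 9, 17, 14, 32, 3]}

def to_long(groups, key):
    # Build the key/value (long) layout used in the question.
    rows = [(g, t, v) for g, vals in groups.items() for t, v in zip(times, vals)]
    return pd.DataFrame(rows, columns=[key, "colA", "Val"])

data1 = to_long(a_vals, "An")
data2 = to_long(b_vals, "Bm")

# One row per group, one column per time point.
wide_a = data1.pivot(index="An", columns="colA", values="Val").reset_index()
wide_b = data2.pivot(index="Bm", columns="colA", values="Val").reset_index()

# Cross join: every A group paired with every B group (requires pandas >= 1.2).
pairs = pd.merge(wide_a, wide_b, how="cross", suffixes=("_A", "_B"))

a = pairs.filter(like="_A").to_numpy(dtype=float)
b = pairs.filter(like="_B").to_numpy(dtype=float)
# Take the reciprocal where the ratio exceeds 1, then average across time points.
pairs["res"] = np.where(a / b > 1, b / a, a / b).mean(axis=1)

result = pairs[["An", "Bm", "res"]]
print(result)
```

The A1/B1 and A2/B1 rows should agree with the corresponding rows of the table above.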

Ragged data sets

  • this deals with the A and B sets having different lengths, stopping the calculation at the last B observation (with the first version, a pair whose B group lacks a time point, such as B2 without t6, comes out as NaN)
  • changed to work row by row with apply(axis=1)
  • the arrays are trimmed to the same size by looking at NaN in B

# assumes numpy is imported as np, pandas as pd, and data1 / data2 are defined as in the question
def meanofdiv(dfa):
    # apply(axis=1) passes each row as a Series; turn it back into a one-row frame
    dfa = dfa.to_frame().T
    a = dfa.loc[:, [c for c in dfa.columns if "_A" in c]].astype(float).values[0]
    b = dfa.loc[:, [c for c in dfa.columns if "_B" in c]].astype(float).values[0]
    # keep only the positions where B has an observation
    a = a[~np.isnan(b)]
    b = b[~np.isnan(b)]
    return np.where((a / b) > 1, b / a, a / b).mean()

# pivot key/value pair data to tables
# Cartesian product of the tables via the constant column foo
# row-by-row mean of the folded ratios of the A columns to the B columns
dfr = pd.merge(
    data1.pivot(index="An", columns="colA", values="Val").reset_index().assign(foo=1),
    data2.pivot(index="Bm", columns="colA", values="Val").reset_index().assign(foo=1),
    on="foo",
    suffixes=("_A", "_B")
).assign(resname=lambda dfa: dfa["An"] + dfa["Bm"],
         res=lambda dfa: dfa.apply(meanofdiv, axis=1))
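
A different way to handle ragged groups, sketched here as an alternative rather than as the answer's method, is to stay in the long key/value layout: an inner join on `colA` only produces rows for time points present on both sides, and a `groupby` then averages the folded ratios per pair. The example rebuilds A1, B1 and the five-point B2 from the question; the names `times`, `to_long`, `pairs_long` and `sym` are hypothetical.

```python
import numpy as np
import pandas as pd

times = ["t1", "t2", "t3", "t4", "t5", "t6"]
a_vals = {"A1": [5, 8, 7, 4, 2, 2]}
b_vals = {"B1": [2, 9, 17, 14, 32, 3], "B2": [44, 36, 51, 81, 82]}  # B2 has no t6

def to_long(groups, key):
    # zip() stops at the shorter sequence, so B2 only gets t1..t5.
    rows = [(g, t, v) for g, vals in groups.items() for t, v in zip(times, vals)]
    return pd.DataFrame(rows, columns=[key, "colA", "Val"])

data1 = to_long(a_vals, "An")
data2 = to_long(b_vals, "Bm")

# Inner join on the time point: a row exists only where both sides have a value.
pairs_long = data1.merge(data2, on="colA", suffixes=("_A", "_B"))

ratio = pairs_long["Val_A"] / pairs_long["Val_B"]
# For positive values, min(r, 1/r) equals "take the reciprocal if r > 1".
pairs_long["sym"] = np.minimum(ratio, 1 / ratio)

res = (
    pairs_long.groupby(["An", "Bm"], as_index=False)["sym"]
    .mean()
    .rename(columns={"sym": "res"})
)
print(res)
```

One difference to keep in mind: this variant also drops time points that are missing from A, whereas the function above only masks positions where B is NaN.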
