Pandas Pivot Table: Error when filtering by condition
I have a dataframe that I pivoted, and I am trying to create an updated dataframe from the values that meet a certain condition. The problem is that each value in the columns is structured as two lines, and the comparison needs to be done on the first line of the value. For example, if the col7 value is '100.2\n11', I need to compare 100.2 against the condition, and if it satisfies the condition, the final dataframe should contain the full value ('100.2\n11') and not just 100.2.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
'test1', 'test1'],
'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
't1', 't1', 't1', 't1', 't1'],
'col4': ['input1', 'input2', 'input3', 'input4', 'input1', 'input2',
'input3', 'input4', 'input1', 'input2', 'input3', 'input5',
'input2', 'input6', 'input1', 'input1'],
'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
'result2', 'result1', 'result2', 'result6'],
'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
'col7': ['100.2\n11','101.2\n21','102.3\n34','101.4\n41','100.0\n10','103.0\n20.6','104.0\n31.2','105.0\n42','102.0\n10.2',
'87.0\n15','107.0\n32.1','110.2\n61.2','120.0\n22.4','88.0\n90','106.2\n16.2','101.1\n10.1']})
df1=df.pivot_table(values = 'col7', index = ['col4', 'col5', 'col6'], columns = ['col2'], aggfunc = 'max')
df2 = df1[((df1.groupby(level='col4').rank(ascending=False) == 1.).any(axis=1)) & (df1 >= 105).any(axis=1)]
print(df2)
I am getting the following error:
File "pandas\_libs\ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'
The final pivot table output after the condition is applied should be as follows:
col2 test1 test2 test3 test4 test5
col4 col5 col6
input1 result2 10 106.2\n16.2 NaN NaN NaN NaN
input2 result2 20 101.2\n21 103.0\n20.6 87.0\n15 120.0\n22.4 NaN
input3 result3 30 102.3\n34 104.0\n31.2 107.0\n32.1 NaN NaN
input4 result4 40 101.4\n41 105.0\n42 NaN NaN NaN
input5 result4 50 NaN NaN 110.2\n61.2 NaN NaN
Any guidance is much appreciated. Thanks in advance.
You could use pandas applymap to create an auxiliary dataframe that contains only the first-line values from df1 converted to floats, and then use that auxiliary dataframe in the filter conditions while still selecting rows from df1, so the full two-line strings are preserved.
...
...
df1=df.pivot_table(values = 'col7', index = ['col4', 'col5', 'col6'], columns = ['col2'], aggfunc = 'max')
df_tmp = df1.applymap(lambda x: float(str(x).split('\n')[0]))
df2 = df1[
((df_tmp.groupby(level='col4').rank(ascending=False) == 1.).any(axis=1)) &
(df_tmp >= 105).any(axis=1)
]
print(df2)
col2 test1 test2 test3 test4 test5
col4 col5 col6
input1 result2 10 106.2\n16.2 NaN NaN NaN NaN
input2 result2 20 101.2\n21 103.0\n20.6 87.0\n15 120.0\n22.4 NaN
input3 result3 30 102.3\n34 104.0\n31.2 107.0\n32.1 NaN NaN
input4 result4 40 101.4\n41 105.0\n42 NaN NaN NaN
input5 result4 50 NaN NaN 110.2\n61.2 NaN NaN
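As a side note, DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so the call above may emit a warning on recent versions. The following is a minimal sketch of the same idea that avoids both names by going column by column with Series.map. The helper name first_line_as_float and the two-row frame are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd


def first_line_as_float(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a float frame holding the number before the first newline of each cell.

    Hypothetical helper for illustration only.
    """
    # str(NaN) is 'nan' and float('nan') is NaN, so missing cells stay missing.
    return frame.apply(
        lambda col: col.map(lambda x: float(str(x).split('\n')[0]))
    )


# Small made-up frame shaped like the pivoted data: two-line strings plus NaN.
small = pd.DataFrame(
    {'test1': ['100.2\n11', '106.2\n16.2'], 'test2': [np.nan, '87.0\n15']},
    index=['a', 'b'],
)

numeric = first_line_as_float(small)

# Build the mask from the numeric copy, but select from the original frame
# so the full two-line strings survive in the result.
kept = small[(numeric >= 105).any(axis=1)]
print(kept)
```

The design choice is the same as in the answer above: the numeric copy is used only to build the boolean mask, and the mask is then applied to the original frame.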