How to merge categories of a Categorical column in a pandas DataFrame?
I have a dataframe:
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
The data has two categorical columns, 'Struct' and 'Trend'.
I would like to group the data by these two columns.
When I do this:
groups = data.groupby(['Struct', 'Trend'])
pandas produces all 6 possible combinations of 'Struct' and 'Trend': [('ohlc', 'D'), ('ohlc', 'U'), ('ohlc', 'U/D'), ('olhc', 'D'), ('olhc', 'U'), ('olhc', 'U/D')]
How can I merge the groups where the 'Trend' category contains 'D' as a substring of its value?
I expect only 4 groups.
Simply put, each 'D' group must include all rows with 'D' and 'U/D', and each 'U' group must include all rows with 'U' and 'U/D'.
Edit:
Expected result for the sample above:
Date Open High Low Close Struct Trend
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
This is what I am doing, but I only get a dataframe and I want groups:
trend_dtype = pd.api.types.CategoricalDtype(categories=['D', 'U/D'], ordered=False)
data['Trend'] = data['Trend'].astype(trend_dtype)
print(data.dropna())
You can use boolean indexing.
df.loc[['U' in key for key in df['Trend']]]
Date Open High Low Close Struct Trend
3 2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
4 2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
5 2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
6 2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
10 2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
11 2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
12 2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
13 2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
14 2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
15 2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
16 2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
17 2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
18 2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
df.loc[['D' in key for key in df['Trend']]]
Date Open High Low Close Struct Trend
0 2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
1 2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2 2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
8 2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
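The two selections above return plain dataframes, while the question asks for groups. One possible way to get there is to build the same substring masks and then group each selection by 'Struct'. This is only a sketch: the small sample frame and the groups dictionary are made up for illustration, and the column is cast to str so that the .str accessor works whether 'Trend' is object or categorical.

```python
import pandas as pd

# Hypothetical reduced sample with the same 'Struct' and 'Trend' columns.
df = pd.DataFrame({
    "Close": [1320.28, 1111.92, 1468.36, 903.25, 1115.10, 1257.60],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D", "U/D", "U"],
})

# Collect one sub-frame per (Struct, merged Trend label) pair.
groups = {}
for label in ("U", "D"):
    # Rows whose Trend value contains the label, so 'U/D' matches both.
    mask = df["Trend"].astype(str).str.contains(label, regex=False)
    for struct, part in df[mask].groupby("Struct"):
        groups[(struct, label)] = part

for key, part in groups.items():
    print(key, list(part.index))
```

Rows marked 'U/D' end up in both the 'U' and the 'D' group of their 'Struct', which is what the expected result in the question shows.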
You can view your problem as duplicating the rows where Trend is U/D. So here's an approach:
df = (df.iloc[:,:-1]
.join(df.Trend.str.split('/', expand=True))
.melt(id_vars=df.columns[:-1], value_name='Trend')
.dropna()
.drop('variable', axis=1)
)
The resulting df is:
Date Open High Low Close Struct Trend
0 2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
1 2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2 2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
3 2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
4 2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
5 2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
6 2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U
8 2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U
10 2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
11 2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
12 2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
13 2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
14 2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
15 2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
16 2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
17 2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
18 2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
26 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc D
28 2009-12-31 903.25 1130.38 666.79 1115.10 olhc D
Notice the row pairs (7, 26) and (9, 28): each 'U/D' row now appears twice, once as 'U' and once as 'D'.
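Once the 'U/D' rows are duplicated, an ordinary groupby on both columns should give the 4 groups the question asks for. The sketch below uses a different route to the same duplication, DataFrame.explode (available since pandas 0.25), instead of the join/melt chain above; the sample frame is made up for illustration. Unlike melt, explode keeps the original index, so a duplicated row shows up twice under the same label.

```python
import pandas as pd

# Hypothetical reduced sample with the same 'Struct' and 'Trend' columns.
df = pd.DataFrame({
    "Close": [1320.28, 1111.92, 1468.36, 903.25, 1115.10, 1257.60],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D", "U/D", "U"],
})

# Split 'U/D' into ['U', 'D'] and emit one row per list element.
exploded = (
    df.assign(Trend=df["Trend"].astype(str).str.split("/"))
      .explode("Trend")
)

# Group on both columns; each former 'U/D' row now belongs to a U and a D group.
groups = exploded.groupby(["Struct", "Trend"])
for key, part in groups:
    print(key, list(part.index))
```

The cast to str is there because the question's 'Trend' column is categorical; after the split and explode the column is plain object dtype, so the groupby only yields combinations that actually occur.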