How to merge categories of a Categorical column in a pandas DataFrame?

I have a dataframe:

Date        Open     High      Low     Close     Struct  Trend                                           
2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U

The data has two categorical columns, 'Struct' and 'Trend'.

I would like to group the data by these two columns.

When I do it like this:

groups = data.groupby(['Struct', 'Trend'])

pandas produces all 6 possible combinations of 'Struct' and 'Trend': [('ohlc', 'D'), ('ohlc', 'U'), ('ohlc', 'U/D'), ('olhc', 'D'), ('olhc', 'U'), ('olhc', 'U/D')]
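The six combinations are probably a consequence of both columns being categorical: with `observed=False`, `groupby` reports every pairing of categories, including pairings that never occur in the data. A minimal sketch, assuming a four-row sample taken from the table above and the `category` dtype on both columns (the variable names are illustrative):

```python
import pandas as pd

# Four rows taken from the sample table; both key columns are categorical.
data = pd.DataFrame({
    "Struct": ["ohlc", "olhc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "U"],
    "Close": [1320.28, 1111.92, 1468.36, 1257.60],
})
data["Struct"] = data["Struct"].astype("category")
data["Trend"] = data["Trend"].astype("category")

# observed=False: one entry per category pairing, even with zero rows.
all_combos = data.groupby(["Struct", "Trend"], observed=False).size()

# observed=True: only pairings that actually appear in the data.
seen_combos = data.groupby(["Struct", "Trend"], observed=True).size()

print(len(all_combos), len(seen_combos))
```

Passing `observed=True` removes the empty pairings, but it does not merge 'U/D' into 'U' or 'D'; that still needs one of the approaches in the answers.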

How can I merge the groups so that every 'Trend' value containing 'D' as a substring falls into the 'D' group (and likewise for 'U')?

I expect only 4 groups:

  1. ('ohlc', 'D') + ('ohlc', 'U/D') --> ('ohlc', 'D')
  2. ('ohlc', 'U') + ('ohlc', 'U/D') --> ('ohlc', 'U')
  3. ('olhc', 'D') + ('olhc', 'U/D') --> ('olhc', 'D')
  4. ('olhc', 'U') + ('olhc', 'U/D') --> ('olhc', 'U')

Simply put, each 'D' group must include all rows with 'D' or 'U/D', and each 'U' group must include all rows with 'U' or 'U/D'.

Edit:

Expected result for the sample above:

Date        Open     High      Low     Close     Struct  Trend                                           
2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U



Date        Open     High      Low     Close     Struct  Trend                                           
2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D

This is what I am doing now, but it only gives me a filtered dataframe, and I want groups:

trend_dtype = pd.api.types.CategoricalDtype(categories=['D', 'U/D'], ordered=False)
data['Trend'] = data['Trend'].astype(trend_dtype)
print(data.dropna())

You can use boolean indexing.

df.loc[['U' in key for key in df['Trend']]]

          Date     Open     High      Low    Close Struct Trend
3   2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
4   2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
5   2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
6   2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
7   2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
9   2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
10  2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
11  2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
12  2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
13  2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
14  2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
15  2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
16  2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
17  2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
18  2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U

df.loc[['D' in key for key in df['Trend']]]

             Date     Open     High      Low    Close Struct Trend
0  2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
1  2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
2  2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
7  2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
8  2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
9  2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
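The same boolean-indexing idea can be written with a vectorized mask and collected into one object per trend. A minimal sketch, assuming 'Trend' holds plain strings and using a four-row sample from the table; the dictionary name `trend_groups` is illustrative:

```python
import pandas as pd

# Four rows taken from the sample table.
df = pd.DataFrame({
    "Date": ["2002-12-31", "2003-12-31", "2007-12-31", "2008-12-31"],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D"],
})

# One boolean mask per base trend; a "U/D" row matches both masks,
# so it ends up in both sub-frames.
trend_groups = {
    trend: df[df["Trend"].str.contains(trend, regex=False)]
    for trend in ("U", "D")
}

print(trend_groups["U"])
print(trend_groups["D"])
```

Each sub-frame can then be grouped by 'Struct' on its own if the per-structure split is also needed.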

You can view your problem as duplicating the rows where Trend is U/D, once as U and once as D. Here's an approach:

df = (df.iloc[:,:-1]
   .join(df.Trend.str.split('/', expand=True))
   .melt(id_vars=df.columns[:-1], value_name='Trend')
   .dropna()
   .drop('variable', axis=1)
)

And your df is now:

          Date     Open     High      Low    Close Struct Trend
0   2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
1   2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
2   2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
3   2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
4   2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
5   2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
6   2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
7   2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc     U
8   2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
9   2009-12-31   903.25  1130.38   666.79  1115.10   olhc     U
10  2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
11  2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
12  2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
13  2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
14  2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
15  2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
16  2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
17  2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
18  2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U
26  2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc     D
28  2009-12-31   903.25  1130.38   666.79  1115.10   olhc     D

Notice the row pairs (7, 26) and (9, 28): each U/D row now appears twice, once with Trend U and once with Trend D.
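The same row-duplication idea can also be sketched with `str.split` followed by `DataFrame.explode`, then grouped to obtain the four groups the question asks for. This is an alternative to the `melt` chain, not the answer's own method; it assumes pandas 0.25 or later (for `explode`) and uses a four-row sample from the table:

```python
import pandas as pd

# Four rows taken from the sample table.
df = pd.DataFrame({
    "Date": ["2000-12-31", "2003-12-31", "2007-12-31", "2011-12-31"],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "U"],
})

# Split "U/D" into ["U", "D"], then give each list element its own row.
# astype(str) also covers the case where Trend is stored as a categorical.
expanded = (
    df.assign(Trend=df["Trend"].astype(str).str.split("/"))
      .explode("Trend")
)

# Grouping the expanded frame yields only (Struct, "U") and (Struct, "D").
groups = expanded.groupby(["Struct", "Trend"])
group_sizes = groups.size()
print(group_sizes)
```

Because the duplicated rows keep their original index label, they can be traced back to the source row, unlike the renumbered rows produced by `melt`.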
