Group by the same partial string of a pandas DataFrame column

I have several CSV files. Each one contains the prices of one stock for one month and has millions of rows. The raw CSV data looks like this:

AA_Candy.csv

Index   CompanyName      Time       Price
  1      AA Candy    030101090355   1.78
  2      AA Candy    030101091533   1.79
  .......
333498   AA Candy    031231145556   2.18

BB_Cookie.csv

   1     BB Cookie   030101090225   3.20
   2     BB Cookie   030101090845   3.14
  .......
391373   BB Cookie   031231145958   3.88

I use Python and pandas to process the data. After loading and combining some of the data files, I have a DataFrame like this:

frame:

Index   CompanyName      Time       Price
  1      AA Candy    030101090355   1.78
  2      AA Candy    030101091533   1.79
  .......
333498   AA Candy    031231145556   2.18
333499   BB Cookie   030101090225   3.20
333500   BB Cookie   030101090845   3.14
  .......
712871   BB Cookie   031231145958   3.88

The time 031231145958 represents 2013-12-31 14:59:58.

Now I want to get the highest price and the final price of each company for every hour, and produce an output file like this:

range_start   AA Candy/Max    AA Candy/Close    BB Cookie/Max     BB Cookie/Close
0301010900     1.79              1.77            3.20              3.10
........
0312311400     2.24              2.18            3.88              3.88

So I want to group by CompanyName and the first 8 characters of Time to collect each company's data for one hour, then compute the maximum price and the final price for each company, and write the results that share a start hour on one row. The new column names should be companyName/Max and companyName/Close.

Because I am really new to pandas and DataFrames, I have the following questions:

  1. How do I group the data by the first 8 characters of the Time column (dtype object) and then get the values I expect?
  2. How do I build a new output DataFrame/matrix shaped like my expected output?

Thanks!!

Perform a groupby on the company name and the first 8 characters of your string timestamp (i.e. date plus hour). Then use agg on the price to apply one aggregation per statistic (first, max, min and last). Unstack the company names, swap the column levels so that the company name sits above open/high/low/close, and optionally sort your symbols.

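# Group by company and by the first 8 characters of Time (date plus hour).
# Named aggregation replaces the dict-renaming form of agg, which newer
# pandas versions reject for a single column.
gb = (df.groupby(['CompanyName', df.Time.str[:8]])
        .Price
        .agg(open='first', high='max', low='min', close='last')
        .unstack('CompanyName'))
# Put the company name on the outer column level.
gb.columns = gb.columns.swaplevel(0, 1)
# sort_index replaces the removed sortlevel; sort_remaining=False keeps
# the open/high/low/close order within each company.
>>> gb.sort_index(level=0, axis=1, sort_remaining=False)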
CompanyName AA Candy                   BB Cookie                  
                open  high   low close      open  high   low close
Time                                                              
03010109        1.78  1.79  1.78  1.79      3.20  3.20  3.14  3.14
03123114        2.18  2.18  2.18  2.18      3.88  3.88  3.88  3.88
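
To get closer to the layout asked for in the question (one company/Max and one company/Close column per company, keyed by the start of the hour), the same groupby and unstack idea could be sketched end to end as follows. This is only a sketch under assumptions: the six sample rows are taken from the excerpts in the question, the rows are assumed to be in time order so that 'last' means the closing price, the `range_start` label is formed by appending "00" to the first 8 characters, and the output file name is hypothetical. Named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

# Sample rows copied from the question's excerpts; Time is kept as a string.
df = pd.DataFrame({
    "CompanyName": ["AA Candy", "AA Candy", "AA Candy",
                    "BB Cookie", "BB Cookie", "BB Cookie"],
    "Time": ["030101090355", "030101091533", "031231145556",
             "030101090225", "030101090845", "031231145958"],
    "Price": [1.78, 1.79, 2.18, 3.20, 3.14, 3.88],
})

# The first 8 characters are date plus hour; "00" marks the start of that hour.
range_start = (df["Time"].str[:8] + "00").rename("range_start")

# Highest and last price per company per hour (rows assumed to be in time order).
hourly = (df.groupby(["CompanyName", range_start])["Price"]
            .agg(Max="max", Close="last")
            .unstack("CompanyName"))

# After unstack the columns are (statistic, company) pairs; flatten to "company/statistic".
hourly.columns = [f"{company}/{stat}" for stat, company in hourly.columns]

# Order the columns as in the question: Max then Close for each company.
ordered = [f"{company}/{stat}"
           for company in sorted(df["CompanyName"].unique())
           for stat in ("Max", "Close")]
hourly = hourly[ordered]

print(hourly)
# hourly.to_csv("hourly_prices.csv")  # hypothetical output file name
```

Flattening the two column levels into plain strings makes the frame easy to write to CSV; if the result stays inside pandas, keeping the MultiIndex columns is usually more convenient for selecting one company at a time.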

For your first question, you can use:

df.groupby(df.Time.str[0:8])

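For your second question, unstack should be what you want. It is a method of the aggregated result rather than of the groupby object, so aggregate first:

df.groupby(['CompanyName', df.Time.str[0:8]]).Price.max().unstack('CompanyName')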
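
One caveat for the `.str` slicing used above: it only works if Time is a string column. When the files are loaded with `read_csv`, an all-digit column such as Time is normally inferred as an integer, which drops the leading zero. A minimal sketch of loading it as text, assuming the files are comma-separated (the in-memory `raw` buffer below is a hypothetical stand-in for one of them):

```python
import io
import pandas as pd

# Hypothetical comma-separated stand-in for one of the CSV files.
raw = io.StringIO(
    "Index,CompanyName,Time,Price\n"
    "1,AA Candy,030101090355,1.78\n"
    "2,AA Candy,030101091533,1.79\n"
    "333498,AA Candy,031231145556,2.18\n"
)

# dtype=str keeps Time as text; without it the column is read as an
# integer and the leading zero is lost.
df = pd.read_csv(raw, dtype={"Time": str})

# Group by date plus hour and take the highest price in each hour.
hour_key = df["Time"].str[:8].rename("hour")
hourly_max = df.groupby(hour_key)["Price"].max()

print(hourly_max)
```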
