
Pandas: grouping by a slice of a string

I'm working on a large data set with about 6000 rows and a couple hundred columns. I have managed to get most of the information sorted out as I need, but now I'm stuck because I can't manage to group correctly by a slice of a string.

The original data is in the form:

6001  17/11/2019 6:00:00 PM         2019  ...        30.519371    NaN
6002  17/11/2019 6:00:00 PM         2019  ...         0.000000    NaN
6003  17/11/2019 6:00:00 PM         2019  ...         0.000000    NaN
6004  17/11/2019 6:00:00 PM         2019  ...         0.000000    NaN
6005  17/11/2019 6:00:00 PM         2019  ...         0.000000    NaN

[6006 rows x 153 columns]>

First I ran a query to filter the data based on one of the columns. After this I'm left with 1500 lines of data, and I need to group them by two columns and sum the numbers in a third. This code mostly seems to do the job:

grouped_data = (
    data_drill.groupby(['PeriodStartDate', 'Blast'])
    ['Calc_DRILLING_Holes'].sum()
)

and here's what I get as a result:

In[9]: grouped_data
Out[9]: 
PeriodStartDate        Blast 
1/09/2019 6:00:00 AM   6317.0     70.786625
                       7253.0     60.964185
                       8140.0     41.540451
1/09/2019 6:00:00 PM   6317.0     77.692637
                       7253.0     66.911911
                       8140.0     45.593178
1/10/2019 6:00:00 AM   2040.0     50.791661
                       2379.0     90.084856
                       5271.0     66.029160
1/10/2019 6:00:00 PM   2040.0     42.119914
                       2379.0     98.873622
                       5271.0     72.471029
1/11/2019 6:00:00 AM   2376.0     96.204423

This is exactly what I need, except that, because of the format the date is presented in, the information for a single day is separated into 6am and 6pm blocks. I don't need this separation; I need the combined data for the entire 24hr periods.

I tried using str.slice to take only the first 10 characters of the PeriodStartDate column, but I can't seem to get it right.

Finally, as you can see in the output above, the resulting dates are sorted in a weird fashion - 1st of September is followed by the 1st of October, while there is an entire month of dates in between. Is there a way to get them to come out sorted properly?

Thanks in advance!

You can use the str attribute:

grouped_data = (
    data_drill.groupby([data_drill['PeriodStartDate'].str[:9], 'Blast'])
    ['Calc_DRILLING_Holes'].sum()
)

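This assumes that the slice covers the date portion of every value. Here that is not quite true: "1/09/2019" is 9 characters but "17/11/2019" is 10, so a fixed-width slice either truncates dates with a two-digit day or picks up a trailing space on the shorter ones.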
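One way around the fixed-width problem is to take everything before the first space instead of a fixed number of characters. The following is only a sketch: the sample frame is made up to mirror the column names in the question, and it assumes the date and time are always separated by a single space.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question
data_drill = pd.DataFrame({
    "PeriodStartDate": [
        "1/09/2019 6:00:00 AM", "1/09/2019 6:00:00 PM",
        "17/11/2019 6:00:00 AM", "17/11/2019 6:00:00 PM",
    ],
    "Blast": [6317.0, 6317.0, 2040.0, 2040.0],
    "Calc_DRILLING_Holes": [70.0, 77.0, 50.0, 42.0],
})

# Keep only the text before the first space, i.e. the date portion,
# regardless of whether the day has one or two digits
date_part = data_drill["PeriodStartDate"].str.split(" ").str[0]

# Group by the date string and the Blast column, then sum the holes
grouped_data = (
    data_drill.groupby([date_part, "Blast"])["Calc_DRILLING_Holes"].sum()
)
print(grouped_data)
```

The keys are still strings, so this merges the 6am and 6pm blocks but does not fix the sort order.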

Alternatively, convert that column to a datetime and use data_drill['PeriodStartDate'].dt.date

If the column is a datetime type, it might be good to just remove the time component altogether and keep only the date:

df['PeriodStartDate'] = df['PeriodStartDate'].dt.date

then you can go about grouping by the date.

If it's not a datetime column yet, you can convert it first and assign the result back. The sample values are day-first (17/11/2019), so pass dayfirst=True:

df['PeriodStartDate'] = pd.to_datetime(df['PeriodStartDate'], dayfirst=True)

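After that, for sorting, you can just sort on the date after the group by. This gives chronological order only once the column holds dates rather than strings, since strings sort character by character: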

df.groupby(['PeriodStartDate', 'Blast'])['Calc_DRILLING_Holes'].sum().reset_index().sort_values('PeriodStartDate')
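Putting the conversion, grouping, and sorting together might look roughly like the sketch below. The sample frame is invented for illustration and only reuses the column names from the question; it also assumes every value is a day-first string in the format shown there.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of chronological order
df = pd.DataFrame({
    "PeriodStartDate": [
        "1/10/2019 6:00:00 AM", "1/09/2019 6:00:00 AM",
        "1/09/2019 6:00:00 PM", "17/11/2019 6:00:00 PM",
    ],
    "Blast": [2040.0, 6317.0, 6317.0, 2040.0],
    "Calc_DRILLING_Holes": [50.0, 70.0, 77.0, 42.0],
})

# Parse the day-first strings, then drop the time so that the
# 6am and 6pm blocks of the same day share one key
df["PeriodStartDate"] = pd.to_datetime(df["PeriodStartDate"], dayfirst=True).dt.date

# Sum per day and Blast, then sort chronologically
daily = (
    df.groupby(["PeriodStartDate", "Blast"])["Calc_DRILLING_Holes"]
    .sum()
    .reset_index()
    .sort_values("PeriodStartDate")
)
print(daily)
```

Because the keys are now real dates, 1 September comes before 1 October without any extra handling. groupby already sorts its keys by default, so the explicit sort_values mainly makes the intent visible.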
