I have a large data set that I'm working on; it has about 6000 rows and a couple hundred columns. I have managed to get most of the information sorted out as I need, but now I'm stuck because I can't manage to correctly group by a slice of a string.
The original data is in the form:
6001 17/11/2019 6:00:00 PM 2019 ... 30.519371 NaN
6002 17/11/2019 6:00:00 PM 2019 ... 0.000000 NaN
6003 17/11/2019 6:00:00 PM 2019 ... 0.000000 NaN
6004 17/11/2019 6:00:00 PM 2019 ... 0.000000 NaN
6005 17/11/2019 6:00:00 PM 2019 ... 0.000000 NaN
[6006 rows x 153 columns]>
First I ran a query to filter the data on one of the columns. After this I'm left with 1500 lines of data, and I need to group them by two columns and sum up the numbers in a third. This code seems to do the job mostly:
grouped_data = data_drill.groupby(['PeriodStartDate', 'Blast'])['Calc_DRILLING_Holes'].sum()
and here's what I get as a result:
In[9]: grouped_data
Out[9]:
PeriodStartDate Blast
1/09/2019 6:00:00 AM 6317.0 70.786625
7253.0 60.964185
8140.0 41.540451
1/09/2019 6:00:00 PM 6317.0 77.692637
7253.0 66.911911
8140.0 45.593178
1/10/2019 6:00:00 AM 2040.0 50.791661
2379.0 90.084856
5271.0 66.029160
1/10/2019 6:00:00 PM 2040.0 42.119914
2379.0 98.873622
5271.0 72.471029
1/11/2019 6:00:00 AM 2376.0 96.204423
Which is exactly what I need, except that, because of the format the date is presented in, the information for a single day is separated into 6 AM and 6 PM blocks. I don't need this separation; I need the combined data for each entire 24-hour period.
I tried using str.slice to take only the first 10 characters of the PeriodStartDate column, but I can't seem to get it right.
Finally, as you can see in the output above, the resulting dates are sorted in a weird fashion: the 1st of September is followed by the 1st of October, even though there is an entire month of dates in between. Is there a way to get them to come out sorted properly?
Thanks in advance!
You can use the str attribute:
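grouped_data = data_drill.groupby([data_drill['PeriodStartDate'].str[:9], 'Blast'])['Calc_DRILLING_Holes'].sum()
This assumes that the fixed-width slice works for all your dates. str[:9] captures 1/09/2019 in full, but it cuts a two-digit day such as 17/11/2019 down to 17/11/201. Splitting on the space instead, with data_drill['PeriodStartDate'].str.split(' ').str[0], does not depend on the width of the date.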
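To illustrate the caveat about indexing, here is a minimal sketch on made-up rows (only the column names come from the question; the values are hypothetical). It takes everything before the first space rather than a fixed number of characters, so one-digit and two-digit days are handled the same way.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
data_drill = pd.DataFrame({
    "PeriodStartDate": [
        "1/09/2019 6:00:00 AM", "1/09/2019 6:00:00 PM",
        "17/11/2019 6:00:00 AM", "17/11/2019 6:00:00 PM",
    ],
    "Blast": [6317.0, 6317.0, 2040.0, 2040.0],
    "Calc_DRILLING_Holes": [10.0, 5.0, 2.0, 3.0],
})

# Keep the text before the first space, i.e. the date part of the string.
date_only = data_drill["PeriodStartDate"].str.split(" ").str[0]

# Group by the date string and the blast number, then sum the holes.
grouped_by_day = data_drill.groupby([date_only, "Blast"])["Calc_DRILLING_Holes"].sum()
print(grouped_by_day)
```

The keys are still strings here, so the groups combine correctly per day but are ordered alphabetically, not chronologically.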
Alternatively, convert that column to a datetime and use data_drill['PeriodStartDate'].dt.date as the grouping key.
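A minimal sketch of the datetime route, assuming the strings are day-first (dd/mm/yyyy), which the 17/11/2019 rows in the question suggest. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
data_drill = pd.DataFrame({
    "PeriodStartDate": [
        "1/09/2019 6:00:00 AM", "1/09/2019 6:00:00 PM",
        "17/11/2019 6:00:00 AM", "17/11/2019 6:00:00 PM",
    ],
    "Blast": [6317.0, 6317.0, 2040.0, 2040.0],
    "Calc_DRILLING_Holes": [10.0, 5.0, 2.0, 3.0],
})

# dayfirst=True makes pandas read 1/09/2019 as 1 September, not 9 January.
parsed = pd.to_datetime(data_drill["PeriodStartDate"], dayfirst=True)

# .dt.date drops the time of day, so the AM and PM blocks share one key.
grouped_by_day = data_drill.groupby([parsed.dt.date, "Blast"])["Calc_DRILLING_Holes"].sum()
print(grouped_by_day)
```

Because the keys are real dates and not strings, the groups also come out in calendar order.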
If the column is a datetime type, it might be good to just remove the time altogether and keep only the date:
df['PeriodStartDate'] = df['PeriodStartDate'].dt.date
Then you can go about grouping by the date.
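If it's not a datetime object (if you're having problems slicing it, then I would suspect that it is), you can convert it first. Assign the result back to the column, and pass dayfirst=True because your dates are written day/month/year:
df['PeriodStartDate'] = pd.to_datetime(df['PeriodStartDate'], dayfirst=True)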
After that, for sorting, you can just sort on the date after the group by:
df.groupby(['PeriodStartDate', 'Blast'])['Calc_DRILLING_Holes'].sum().reset_index().sort_values('PeriodStartDate')
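Putting the steps together, the following is a sketch under stated assumptions, not a definitive implementation: the sample rows are invented, and the explicit format string is a guess at the layout shown in the question. It uses dt.normalize() where the answer uses dt.date, a swapped-in choice that keeps the column as a datetime type while setting every time to midnight.

```python
import pandas as pd

# Hypothetical sample rows, deliberately out of calendar order.
df = pd.DataFrame({
    "PeriodStartDate": [
        "1/10/2019 6:00:00 AM", "2/09/2019 6:00:00 PM",
        "1/09/2019 6:00:00 AM", "1/09/2019 6:00:00 PM",
    ],
    "Blast": [2040.0, 6317.0, 6317.0, 6317.0],
    "Calc_DRILLING_Holes": [4.0, 1.0, 10.0, 5.0],
})

# Assumed layout: day/month/year, 12-hour clock with AM/PM.
df["PeriodStartDate"] = pd.to_datetime(
    df["PeriodStartDate"], format="%d/%m/%Y %I:%M:%S %p"
)

# normalize() sets the time to midnight, merging the 6 AM and 6 PM blocks.
df["PeriodStartDate"] = df["PeriodStartDate"].dt.normalize()

daily_totals = (
    df.groupby(["PeriodStartDate", "Blast"])["Calc_DRILLING_Holes"]
    .sum()
    .reset_index()
    .sort_values(["PeriodStartDate", "Blast"])
)
print(daily_totals)
```

With datetime keys, 2 September sorts before 1 October, which is the ordering the question asks for.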