data_c["dropoff_district"] = "default value"
data_c["distance"] = "default value" #Formed a new column named distance for geocoder
data_c["time_of_day"] = "default value" #Formed a new column named time of the day for timestamps
I created these columns at the start of the project for plotting and data manipulation. After filling them with actual values, I wanted to perform a groupby operation on data_c.
avg_d = data_c.groupby(by = 'distance').sum().reset_index()
However, when I perform this groupby on data_c, the 'time_of_day' and 'dropoff_district' columns are missing from avg_d. How can I solve this issue?
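The problem is that pandas doesn't know how to add date/time objects together. Thus, when you tell pandas to group and then sum, it throws out the columns it doesn't know what to do with. (That is the behaviour of pandas versions before 2.0; as far as I know, newer versions raise a TypeError instead unless you pass numeric_only=True.) For example: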
import pandas as pd

df = pd.DataFrame([['2019-01-01', 2, 3], ['2019-02-02', 2, 4], ['2019-02-03', 3, 5]],
                  columns=['day', 'distance', 'duration'])
df.day = pd.to_datetime(df.day)  # convert the strings to datetime64
If I just run your query, I'd get,
>>> df.groupby('distance').sum()
          duration
distance
2                7
3                5
You can fix this by telling pandas you want to do something different with those columns, for example, take the first value:
df.groupby('distance').agg({
    'duration': 'sum',
    'day': 'first'
})
which brings them back,
          duration        day
distance
2                7 2019-01-01
3                5 2019-02-03
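Applied to the question's data, the same idea might look like the sketch below. The contents of data_c are not shown in the question, so the frame here is a hypothetical stand-in: the numeric "fare" column and every value are invented for illustration, and 'first' is only one possible choice of aggregation.

```python
import pandas as pd

# Hypothetical stand-in for data_c; the "fare" column and all values are invented.
data_c = pd.DataFrame({
    "distance": [2, 2, 3],
    "fare": [10.0, 12.5, 20.0],
    "time_of_day": ["morning", "evening", "morning"],
    "dropoff_district": ["A", "B", "A"],
})

# Sum the numeric column, and keep the first value seen for the text columns.
avg_d = (
    data_c.groupby("distance")
    .agg({
        "fare": "sum",
        "time_of_day": "first",
        "dropoff_district": "first",
    })
    .reset_index()
)
print(avg_d)
```

Note that 'first' silently discards the other values in each group (here the "evening" row for distance 2), so it only makes sense if one representative value per group is acceptable.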
Groupby does not remove your columns. The sum() call does. If those columns are not numeric, you will not retain them after sum().
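A quick way to check this yourself, sketched on the small example frame from the other answer: iterate over the groups to see that every column is still there, then sum with numeric_only=True so that the restriction to numeric columns is explicit rather than dependent on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    [["2019-01-01", 2, 3], ["2019-02-02", 2, 4], ["2019-02-03", 3, 5]],
    columns=["day", "distance", "duration"],
)
df["day"] = pd.to_datetime(df["day"])

grouped = df.groupby("distance")

# Each group is a sub-DataFrame that still carries every column,
# including the non-numeric "day" column.
group_columns = {key: list(group.columns) for key, group in grouped}
print(group_columns)

# numeric_only=True asks sum() to aggregate only the numeric columns,
# so "day" is left out of the result.
summed = grouped.sum(numeric_only=True)
print(summed)
```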
So how would you like to retain the 'time_of_day' and 'dropoff_district' columns? Assuming you want to keep them as distinct values, put them into the groupby:
data_c.groupby(['distance','time_of_day','dropoff_district']).sum().reset_index()
Otherwise, you will have multiple different 'time_of_day' values for the same 'distance', and you need to massage your data first to decide which one to keep.
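To see what grouping on all three columns produces, here is a minimal sketch on made-up data (the "fare" column and all values are hypothetical, since the real data_c is not shown). The same distance can show up in more than one row of the result, once per distinct combination.

```python
import pandas as pd

# Hypothetical stand-in for data_c; the "fare" column and all values are invented.
data_c = pd.DataFrame({
    "distance": [2, 2, 2, 3],
    "time_of_day": ["morning", "morning", "evening", "morning"],
    "dropoff_district": ["A", "A", "B", "A"],
    "fare": [10.0, 5.0, 12.5, 20.0],
})

# One output row per distinct (distance, time_of_day, dropoff_district) combination.
per_combo = (
    data_c.groupby(["distance", "time_of_day", "dropoff_district"])
    .sum()
    .reset_index()
)
print(per_combo)
```

Here distance 2 appears twice in per_combo because it occurs with two different time_of_day/dropoff_district pairs. If you need exactly one row per distance, you have to decide how to collapse those pairs before or during the aggregation.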