Pandas: Iterating over sorted rows in a df, implementing a counter
I tried this in Stata, and failed. Trying it in Python/pandas now, something I'm less familiar with...
I've got a dataframe of attendance data, with each row being a timestamped entry or exit. It looks like this:
What I want is to calculate how many people are in the office at any given time, on any given day. I'd like to set up a counter which adds 1 for every entry (type=="O") and subtracts 1 for every exit (type=="C").
My Python attempt is this:
df = pd.read_stata("some-data.dta")
sort = df.sort(['date', 'att_time'])
for i, day in enumerate(sort['date']):
    sort['counter'][i] = 0
    if type=="O":
        sort['counter'][i] = sort['counter'][i-1] + 1
    elif type=="C":
        sort['counter'][i] = sort['counter'][i-1] - 1
Which throws this error:
__main__:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
From reading other SO posts, I tried setting the copy flag to False (sort.is_copy==False), but the error message still pops up. Also, worryingly, I noticed that it's possibly not iterating over the sorted list:
for i, day in enumerate(sorted(sort['date'])):
    print i, day, sort['date'][i]
The day and sort['date'][i], which should be the same date, aren't. So my i index seemingly can't be relied on, even if I got around the SettingWithCopyWarning. Halp?
You can use cumsum to simplify the process; it is much faster than manually looping over all rows.
# artificial data
# =========================
import pandas as pd
df = pd.DataFrame('0 0 0 0 C 0 C 0 0 C 0 C'.split(), index=pd.date_range('2015-08-31 08:00:00', periods=12, freq='5min'), columns=['type'])
df
type
2015-08-31 08:00:00 0
2015-08-31 08:05:00 0
2015-08-31 08:10:00 0
2015-08-31 08:15:00 0
2015-08-31 08:20:00 C
2015-08-31 08:25:00 0
2015-08-31 08:30:00 C
2015-08-31 08:35:00 0
2015-08-31 08:40:00 0
2015-08-31 08:45:00 C
2015-08-31 08:50:00 0
2015-08-31 08:55:00 C
# processing
# ===================================
df['counter'] = df['type'].map({'0': 1, 'C': -1}).cumsum()
df
type counter
2015-08-31 08:00:00 0 1
2015-08-31 08:05:00 0 2
2015-08-31 08:10:00 0 3
2015-08-31 08:15:00 0 4
2015-08-31 08:20:00 C 3
2015-08-31 08:25:00 0 4
2015-08-31 08:30:00 C 3
2015-08-31 08:35:00 0 4
2015-08-31 08:40:00 0 5
2015-08-31 08:45:00 C 4
2015-08-31 08:50:00 0 5
2015-08-31 08:55:00 C 4
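The same idea can probably be applied to a frame shaped like the one in the question. The sketch below is only an illustration under assumptions: the column names date, att_time and type are taken from the question, the rows are made up, and entries are coded as the letter "O" as in the question (the artificial data above uses the digit "0"). It also touches the two side issues raised there. sort['date'][i] looks rows up by index label rather than by position, and sorting keeps the original labels, which is why i and the sorted order disagreed; resetting the index after sorting makes the two line up. Grouping by date before cumsum restarts the counter each day, and assigning a whole column at once avoids the chained assignment that triggered SettingWithCopyWarning.

```python
import pandas as pd

# Hypothetical attendance log; column names follow the question,
# the values are invented purely for illustration.
df = pd.DataFrame({
    'date': ['2015-09-01', '2015-08-31', '2015-08-31',
             '2015-08-31', '2015-09-01', '2015-08-31'],
    'att_time': ['09:00', '08:05', '08:00', '12:00', '17:00', '17:30'],
    'type': ['O', 'O', 'O', 'C', 'C', 'C'],
})

# sort_values is the current spelling of the old DataFrame.sort.
# reset_index(drop=True) relabels the rows 0..n-1 in sorted order,
# so label i and position i refer to the same row again.
df_sorted = df.sort_values(['date', 'att_time']).reset_index(drop=True)

# +1 for an entry ("O"), -1 for an exit ("C") ...
step = df_sorted['type'].map({'O': 1, 'C': -1})

# ... then a running total that restarts for every date.
df_sorted['counter'] = step.groupby(df_sorted['date']).cumsum()

print(df_sorted)
```

If the counter should run across days instead of restarting, dropping the groupby and calling step.cumsum() directly would give the behaviour of the answer above.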