
Pandas: Iterating over sorted rows in a df, implementing a counter

I tried this in Stata, and failed. Trying it in Python/pandas now - something I'm less familiar with...

I've got a dataframe on attendance data, with each row being a timestamped entry or exit. It looks like this: [image: baseline data]

And what I want is to calculate how many people are in the office at any given time, on any given day. I'd like to set up a counter which adds 1 for every entry (type=="O"), and subtracts 1 for every exit (type=="C").

My Python attempt is this:

            df = pd.read_stata("some-data.dta")

            sort = df.sort(['date', 'att_time'])

            for i, day in enumerate(sort['date']):
                sort['counter'][i] = 0
                if type=="O":
                    sort['counter'][i] = sort['counter'][i-1] + 1
                elif type=="C":
                    sort['counter'][i] = sort['counter'][i-1] - 1

Which throws this error:

__main__:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

From reading other SO posts, I tried setting the copy flag to False (sort.is_copy==False), but the error message still pops up. Also, worryingly, I noticed that it's possibly not iterating over the sorted list:

                for i, day in enumerate(sorted(sort['date'])):
                    print i, day, sort['date'][i]

The day and sort['date'][i], which should be the same date, aren't. So my i index seemingly can't be relied on, even if I got around the SettingWithCopyWarning. Halp?
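
A likely explanation for the mismatch, sketched on made-up data: sorting keeps each row's original index label, and indexing a column with an integer `i` looks the row up by that label rather than by its position in the sorted order. The frame and its values below are hypothetical, and `sort_values` is used because `DataFrame.sort` has been removed from current pandas.

```python
import pandas as pd

# Hypothetical data, deliberately out of chronological order.
df = pd.DataFrame({"date": ["2015-09-02", "2015-08-31", "2015-09-01"]})

# The original index labels travel with the rows when sorting.
by_date = df.sort_values("date")

labels_in_sorted_order = list(by_date.index)   # original labels, reordered
first_by_label = by_date["date"].loc[0]        # the row whose label is 0
first_by_position = by_date["date"].iloc[0]    # the first row after sorting

# Resetting the index makes labels and positions agree again.
renumbered = by_date.reset_index(drop=True)
```

With `.iloc`, or after `reset_index(drop=True)`, the loop counter and the sorted order line up.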

You can use cumsum to simplify the process, which is much faster than manually looping over all rows.

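import pandas as pd

# artificial data
# =========================
# note: entries are coded as the digit '0' in this sample; the question's data uses the letter 'O'
df = pd.DataFrame('0 0 0 0 C 0 C 0 0 C 0 C'.split(), index=pd.date_range('2015-08-31 08:00:00', periods=12, freq='5min'), columns=['type'])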
df

                    type
2015-08-31 08:00:00    0
2015-08-31 08:05:00    0
2015-08-31 08:10:00    0
2015-08-31 08:15:00    0
2015-08-31 08:20:00    C
2015-08-31 08:25:00    0
2015-08-31 08:30:00    C
2015-08-31 08:35:00    0
2015-08-31 08:40:00    0
2015-08-31 08:45:00    C
2015-08-31 08:50:00    0
2015-08-31 08:55:00    C


# processing
# ===================================
# map each entry ('0') to +1 and each exit ('C') to -1, then take the running total
df['counter'] = df['type'].map({'0': 1, 'C': -1}).cumsum()
df

                    type  counter
2015-08-31 08:00:00    0        1
2015-08-31 08:05:00    0        2
2015-08-31 08:10:00    0        3
2015-08-31 08:15:00    0        4
2015-08-31 08:20:00    C        3
2015-08-31 08:25:00    0        4
2015-08-31 08:30:00    C        3
2015-08-31 08:35:00    0        4
2015-08-31 08:40:00    0        5
2015-08-31 08:45:00    C        4
2015-08-31 08:50:00    0        5
2015-08-31 08:55:00    C        4
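
To apply the same idea to data shaped like the question's, one option is to sort first and restart the running total each day. This is only a sketch: the column names `date`, `att_time` and `type` come from the question, the rows are invented, and the helper column `step` is a hypothetical name.

```python
import pandas as pd

# Hypothetical attendance log; column names follow the question, values are made up.
log = pd.DataFrame({
    "date":     ["2015-09-01", "2015-08-31", "2015-08-31", "2015-09-01", "2015-08-31"],
    "att_time": ["08:10",      "08:00",      "08:05",      "08:00",      "09:00"],
    "type":     ["C",          "O",          "O",          "O",          "C"],
})

# Sort chronologically and renumber so index labels match positions.
log = log.sort_values(["date", "att_time"]).reset_index(drop=True)

# +1 for an entry ("O"), -1 for an exit ("C").
log["step"] = log["type"].map({"O": 1, "C": -1})

# Running total that restarts at the beginning of each day.
log["counter"] = log.groupby("date")["step"].cumsum()
```

Assigning the new column in one step avoids the chained `sort['counter'][i] = ...` writes that trigger the SettingWithCopyWarning. If people can stay past midnight, dropping the `groupby` and calling `cumsum` on `step` directly would keep a single running count instead.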
