How to get a count of a certain entity per minute in a given dataframe in pandas?
I have a Dataframe that looks like below:
data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), u'12.100.90.10', u'100.100.12.1', u'LT_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), u'12.100.90.10', u'100.100.12.1', u'LT_UP'),
(datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), u'12.100.90.10', u'100.100.12.1', u'PL_DOWN'),
(datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), u'10.100.80.10', u'10.55.10.1', u'PL_LOW'),
(datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), u'10.10.80.10', u'10.55.10.1', u'MA_HIGH'),
(datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), u'10.10.80.10', u'10.55.10.1', u'PL_DOWN')]
As you can see, there is data getting logged per minute. I have presented only part of the complete data here.
This is how it looks after loading it into pandas:
date start end type
0 2021-02-10 07:49:07.118658 12.100.90.10 100.100.12.1 LT_DOWN
1 2021-02-10 07:49:14.312273 12.100.90.10 100.100.12.1 LT_UP
2 2021-02-10 07:49:21.535932 12.100.90.10 100.100.12.1 LT_UP
3 2021-02-10 07:50:28.725961 12.100.90.10 100.100.12.1 PL_DOWN
4 2021-02-10 07:50:32.450853 10.100.80.10 10.55.10.1 PL_LOW
5 2021-02-10 07:51:32.450853 10.10.80.10 10.55.10.1 MA_HIGH
6 2021-02-10 07:52:34.264042 10.10.80.10 10.55.10.1 PL_DOWN
Now what I want is two things.
First, get the count of each value in type per minute; furthermore, only the first part of the value split on _ should be considered for the count. So my final output should be something like this:
date LT PL MA
0 2021-02-10 07:49 3 0 0
1 2021-02-10 07:50 0 2 0
2 2021-02-10 07:51 0 0 1
3 2021-02-10 07:52 0 1 0
Second, convert any non-zero value to True and zero to False. So the final output would look like:
date LT PL MA
0 2021-02-10 07:49 True False False
1 2021-02-10 07:50 False True False
2 2021-02-10 07:51 False False True
3 2021-02-10 07:52 False True False
Now regarding the first part, I can find the count of each value (split on the first part of _) over the whole Dataframe by doing something like this:
import datetime
import pandas as pd

df = pd.DataFrame(data)
df.columns = ["date", "start", "end", "type"]
df = df.drop_duplicates()
issue = "LT"
df_check = df[df['type'].str.contains(issue)].groupby(['start', 'end']).size().to_frame('icount').reset_index()
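For reference, the same filter can be bucketed per minute by flooring the timestamp before grouping; this is a sketch reusing the snippet's names on a subset of the data above, with dt.floor('1min') as the added piece:

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

issue = 'LT'
sub = df[df['type'].str.contains(issue)]
# Group on the timestamp floored to the minute instead of on (start, end).
per_minute = sub.groupby(sub['date'].dt.floor('1min')).size().to_frame('icount')
```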
But how do I do it on a minute basis and map it to True and False?
In the data you provided, the last item in type is MA_DOWN, but in the dataframe and the results it is PL_DOWN; I think that was an error on your part. I have considered the dataframe version here.
You can use pandas.DataFrame.resample with pandas.get_dummies:
>>> pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False)
).set_index(df['date']).resample('1T').sum()
LT MA PL
date
2021-02-10 07:49:00 3 0 0
2021-02-10 07:50:00 0 0 2
2021-02-10 07:51:00 0 1 0
2021-02-10 07:52:00 0 0 1
Then astype(bool):
>>> pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False)
).set_index(df['date']).resample('1T').sum().astype(bool)
LT MA PL
date
2021-02-10 07:49:00 True False False
2021-02-10 07:50:00 False False True
2021-02-10 07:51:00 False True False
2021-02-10 07:52:00 False False True
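The same result can also be had without resample, by one-hot encoding and grouping on a floored timestamp. This is a self-contained sketch of that variant using the question's data; note that, unlike resample, it will not emit rows for minutes that have no events:

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 49, 14, 312273), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 49, 21, 535932), '12.100.90.10', '100.100.12.1', 'LT_UP'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 32, 450853), '10.100.80.10', '10.55.10.1', 'PL_LOW'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH'),
        (datetime.datetime(2021, 2, 10, 7, 52, 34, 264042), '10.10.80.10', '10.55.10.1', 'PL_DOWN')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

# One-hot encode the prefix before '_' and sum the indicators per minute.
dummies = pd.get_dummies(df['type'].str.extract(r'(\w+)_', expand=False))
counts = dummies.groupby(df['date'].dt.floor('1min')).sum()
flags = counts.astype(bool)
```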
According to your comments, you can use pd.crosstab:
>>> pd.crosstab(
index=df['date'].dt.floor('1min'),
columns=[
df['start'].add('-').add(df['end']).rename('start-end'),
df['type'].str.extract(r'(\w+)_', expand=False)
],
dropna=False
)
start-end 10.10.80.10-10.55.10.1 10.100.80.10-10.55.10.1 12.100.90.10-100.100.12.1
type LT MA PL LT MA PL LT MA PL
date
2021-02-10 07:49:00 0 0 0 0 0 0 3 0 0
2021-02-10 07:50:00 0 0 0 0 0 1 0 0 1
2021-02-10 07:51:00 0 1 0 0 0 0 0 0 0
2021-02-10 07:52:00 0 0 1 0 0 0 0 0 0
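One advantage of the combined start-end level is that a single pair can then be selected by its label, leaving a per-minute type-count table. A small sketch, assuming a df built from a subset of the question's data:

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 50, 28, 725961), '12.100.90.10', '100.100.12.1', 'PL_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

ct = pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[
        df['start'].add('-').add(df['end']).rename('start-end'),
        df['type'].str.extract(r'(\w+)_', expand=False),
    ],
    dropna=False,
)
# Selecting on the first column level leaves per-minute counts per type.
pair = ct['12.100.90.10-100.100.12.1']
```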
Or,
>>> pd.crosstab(
index=df['date'].dt.floor('1min'),
columns=[
df['start'],
df['end'],
df['type'].str.extract(r'(\w+)_', expand=False)
],
dropna=False
)
start 10.10.80.10 10.100.80.10 12.100.90.10
end 10.55.10.1 100.100.12.1 10.55.10.1 100.100.12.1 10.55.10.1 100.100.12.1
type LT MA PL LT MA PL LT MA PL LT MA PL LT MA PL LT MA PL
date
2021-02-10 07:49:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
2021-02-10 07:50:00 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
2021-02-10 07:51:00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2021-02-10 07:52:00 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
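The astype(bool) step from earlier applies unchanged to either crosstab, if per-pair True/False flags are wanted; a sketch with the three-level columns on a two-row subset of the data:

```python
import datetime
import pandas as pd

data = [(datetime.datetime(2021, 2, 10, 7, 49, 7, 118658), '12.100.90.10', '100.100.12.1', 'LT_DOWN'),
        (datetime.datetime(2021, 2, 10, 7, 51, 32, 450853), '10.10.80.10', '10.55.10.1', 'MA_HIGH')]
df = pd.DataFrame(data, columns=['date', 'start', 'end', 'type'])

ct = pd.crosstab(
    index=df['date'].dt.floor('1min'),
    columns=[df['start'], df['end'], df['type'].str.extract(r'(\w+)_', expand=False)],
    dropna=False,
)
# Any non-zero count becomes True, zero becomes False.
flags = ct.astype(bool)
```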