How to count the occurrences of strings that start with a specific substring in comma-separated values in a pandas DataFrame?
I am new to Python. I am working with a dataframe (360000 rows and 2 columns) that looks something like this:
business_id date
P01 2019-07-6 , 2018-06-05, 2019-07-06...
P02 2016-03-6 , 2019-04-10
P03 2019-01-02
The date column holds comma-separated dates, covering the years 2010-2019. For each business ID, I am trying to count the dates in each month, but only for the year 2019.
Specifically, I am looking for the output:
Can anyone please help me? Thanks.
You can do as follows:
str.split to separate the dates in each cell into a list,
explode to flatten the lists,
pd.to_datetime to convert to datetime, then extract the month,
pd.crosstab to pivot/count the months, and join.
Altogether:
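import pandas as pd

# split each cell on commas, flatten to one date per row, keep the year-month
s = pd.to_datetime(df['date'].str.split(r'\s*,\s*').explode()).dt.to_period('M')
# count dates per original row (index) and month
out = pd.crosstab(s.index, s)
# this gives the expected output
df.join(out)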
Output (out):
date   2016-03  2018-06  2019-01  2019-04  2019-07
row_0
0            0        1        0        0        2
1            1        0        0        1        0
2            0        0        1        0        0
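The output above counts every year-month present in the data, while the question asks for 2019 only. A minimal sketch of one way to restrict the counts to 2019 follows. It assumes zero-padded dates, uses made-up sample rows (P04 is a hypothetical business with no 2019 dates), and swaps to_period('M') for strftime labels so the column names are plain strings; the names dates, months_2019, counts and result are illustrative only.

```python
import pandas as pd

# Hypothetical sample data in the question's layout (zero-padded dates).
# P04 is an invented row with no 2019 dates, to show the zero-fill step.
df = pd.DataFrame({
    "business_id": ["P01", "P02", "P03", "P04"],
    "date": [
        "2019-07-06 , 2018-06-05, 2019-07-06",
        "2016-03-06 , 2019-04-10",
        "2019-01-02",
        "2017-05-01",
    ],
})

# One date per row; explode() repeats the original row index for each date.
dates = pd.to_datetime(df["date"].str.strip().str.split(r"\s*,\s*").explode())

# Keep only 2019 dates and label each with its year-month, e.g. "2019-07".
months_2019 = dates[dates.dt.year == 2019].dt.strftime("%Y-%m")

# Count dates per original row and month; rows without any 2019 date
# are missing from the crosstab, so reindex restores them with zeros.
counts = pd.crosstab(months_2019.index, months_2019).reindex(df.index, fill_value=0)

result = df[["business_id"]].join(counts)
print(result)
```

Filtering before the crosstab keeps the result narrow (at most 12 month columns) instead of building every year-month column and dropping most of them afterwards.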
If they are not datetime objects yet, you may want to start by converting the column (series) to datetime with pd.to_datetime(). Note the format parameter.
Then you can access the datetime attributes through .dt, for example:
df[df.COLUMN_NAME.dt.month == 5]
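As a rough illustration of this approach, the sketch below converts a string column with an explicit format and then filters through .dt. The column name "when" and the sample values are invented for the example. It also assumes each cell holds a single date, so a comma-separated column like the one in the question would presumably need to be split and exploded first.

```python
import pandas as pd

# Hypothetical column with one date string per cell; "when" is a made-up name.
df = pd.DataFrame({"when": ["2019-05-01", "2018-05-17", "2019-07-06"]})

# An explicit format makes the expected layout of the strings unambiguous.
df["when"] = pd.to_datetime(df["when"], format="%Y-%m-%d")

# Rows whose date falls in May of any year.
in_may = df[df["when"].dt.month == 5]

# Rows whose date falls in May 2019 only.
in_may_2019 = df[(df["when"].dt.year == 2019) & (df["when"].dt.month == 5)]

print(in_may)
print(in_may_2019)
```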