Count the number of strings per row in a column with pandas
Inside my dataframe:
no pre_code
1 23, 234, 345
2 234, 345
3 23
4 NaN
I want to count the number of strings in the pre_code column. What I have tried so far:
df['count'] = df['pre_code'].astype('str').str.split(',').str.len().fillna(0)
But with the code above, NaN is counted as 1, so I don't get the desired result.
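A minimal sketch of why the missing row ends up with a count of 1, assuming the missing value has become the text 'nan' by the time it is split (the behavior the question describes). The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# The text form of a missing float value is the three-letter string 'nan'
text = str(np.nan)        # 'nan'

# Splitting that text on ',' gives a one-element list, hence a length of 1
parts = text.split(',')   # ['nan']

# The same effect inside a Series that holds the text 'nan' (illustrative data)
s = pd.Series(['23, 234, 345', 'nan'], dtype=object)
counts = s.str.split(',').str.len().tolist()
print(counts)             # [3, 1]
```

Once the value is text, fillna(0) has nothing left to fill, because 'nan' is an ordinary string rather than a missing value.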
Before that, I also tried:
df['count'] = df['pre_code'].str.count(',').add(1).fillna(0)
Unfortunately, that code did not work on my dataframe either: it gives 0 for rows with a single entry. For reference, my dataframe has 2200 rows, and the code does not work correctly on the full set. When I tried it on only 5 rows, it worked fine.
I expect the result to look like this:
no pre_code count
1 23, 234, 345 3
2 234, 345 2
3 23 1
4 NaN 0
Is there a solution for my case?
Thanks in advance.
I think you need real missing values (np.nan) instead of the string 'nan'; then both of your solutions work correctly.
First check what the values that contain no digits look like, so you know what to replace:
print(df.loc[~df['pre_code'].str.contains(r'\d'), 'pre_code'].unique().tolist())
['nan']
df['count'] = df['pre_code'].replace('nan', np.nan).str.split(',').str.len().fillna(0)
Or:
df['count'] = df['pre_code'].replace('nan', np.nan).str.count(',').add(1).fillna(0)
print (df)
no pre_code count
0 1 23, 234, 345 3.0
1 2 234, 345 2.0
2 3 23 1.0
3 4 NaN 0.0
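To reproduce this end to end, here is a small self-contained sketch. The sample frame is an assumption built from the table in the question, with the last pre_code value stored as the text 'nan'.

```python
import numpy as np
import pandas as pd

# Sample data assumed for illustration; the last value is the text 'nan'
df = pd.DataFrame({'no': [1, 2, 3, 4],
                   'pre_code': ['23, 234, 345', '234, 345', '23', 'nan']})

# isna() does not flag the text 'nan'; an equality check does
real_missing = int(df['pre_code'].isna().sum())
text_nan = int(df['pre_code'].eq('nan').sum())
print(real_missing, text_nan)   # 0 1

# Replace the text with a real missing value, then count commas + 1
cleaned = df['pre_code'].replace('nan', np.nan)
df['count'] = cleaned.str.count(',').add(1).fillna(0).astype(int)
print(df['count'].tolist())     # [3, 2, 1, 0]
```

The final astype(int) is optional; it only turns the float counts (3.0, 2.0, ...) into integers.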
EDIT: A more general solution is to convert values that contain no digits to NaN with Series.where and Series.str.contains:
df['count'] = (df['pre_code'].where(df['pre_code'].str.contains(r'\d', na=False))
.str.count(',')
.add(1)
.fillna(0)
.astype(int))
print (df)
no pre_code count
0 1 23, 234, 345 3
1 2 234, 345 2
2 3 23 1
3 4 NaN 0
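The same idea can be wrapped in a small helper and checked against a few edge cases. This is only a sketch: the function name count_codes and the extra sample values ('nan' as text and an empty string) are assumptions for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

def count_codes(col: pd.Series) -> pd.Series:
    """Count comma-separated entries; values without any digit count as 0."""
    # na=False makes real missing values evaluate to False instead of NaN
    has_digit = col.str.contains(r'\d', na=False)
    # where() keeps rows with a digit and turns every other row into NaN
    return col.where(has_digit).str.count(',').add(1).fillna(0).astype(int)

# Hypothetical sample: real NaN, the text 'nan', and an empty string
s = pd.Series(['23, 234, 345', '234, 345', '23', np.nan, 'nan', ''],
              dtype=object)
result = count_codes(s)
print(result.tolist())   # [3, 2, 1, 0, 0, 0]
```

Because the mask is built from the presence of a digit, there is no need to know in advance which placeholder text stands in for a missing value.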
Try:
df['count'] = df.loc[df['pre_code'].notna(), 'pre_code'] \
.astype(str).str.split(',').str.len() \
.reindex(df.index, fill_value=0)
print(df)
# Output:
no pre_code count
0 1 23, 234, 345 3
1 2 234, 345 2
2 3 23 1
3 4 NaN 0
I'm not sure you need the conversion to str (`astype(str)`).
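One case where the conversion would matter is a column with mixed types. The sketch below assumes, purely for illustration, that the single code 23 was stored as an integer rather than as text; whether that is true of the asker's 2200 rows is not known.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: the single code 23 is an int, not a string
s = pd.Series(['23, 234, 345', '234, 345', 23, np.nan], dtype=object)

# Without astype(str), the .str accessor returns NaN for the non-string 23,
# so the single entry is counted as 0 after fillna(0)
without_cast = s.str.split(',').str.len().fillna(0).astype(int)
print(without_cast.tolist())   # [3, 2, 0, 0]

# Casting only the non-missing rows keeps NaN out of the conversion
with_cast = (s[s.notna()].astype(str).str.split(',').str.len()
             .reindex(s.index, fill_value=0))
print(with_cast.tolist())      # [3, 2, 1, 0]
```

If every non-missing value is already a string, the cast changes nothing and can be dropped.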