
Count the number of strings per row in a column with pandas

Inside my dataframe:

no    pre_code
1     23, 234, 345
2     234, 345
3     23
4     NaN

I want to count the number of strings in the pre_code column. What I have tried so far is:

df['count'] = df['pre_code'].astype('str').str.split(',').str.len().fillna(0)

But the code above counts NaN as 1, so I don't get the desired result.
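The NaN row is counted as 1 because converting the column to text turns the missing value into the literal string "nan", which then splits into a one-element list. The sketch below reproduces this on the sample data. It uses element-wise `str()` via `Series.map` as a stand-in for `astype('str')`, since how `astype('str')` treats missing values may depend on the pandas version; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the last value is a real missing value.
s = pd.Series(["23, 234, 345", "234, 345", "23", np.nan], dtype=object)

# Element-wise str() turns the missing value into the text "nan"
# (assumed here to match what astype('str') did in the question).
as_text = s.map(str)

# "nan".split(",") is the one-element list ["nan"], so its length is 1.
counts_wrong = as_text.str.split(",").str.len()

# Without the text conversion the missing value stays missing,
# so fillna(0) can replace it afterwards.
counts_right = s.str.split(",").str.len().fillna(0).astype(int)

print(counts_wrong.tolist())
print(counts_right.tolist())
```

Skipping the conversion only helps when the column holds real missing values rather than the text "nan".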

Before that, I also tried this:

df['count'] = df['pre_code'].str.count(',').add(1).fillna(0)

Unfortunately, this code also did not work on my dataframe: it gives me 0 for rows that contain a single entry. For your information, my dataframe has 2200 rows, and the code does not work correctly across all of them. When I tried it on only 5 rows, it worked well.
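One possible cause of a 0 for single-entry rows, offered here only as an assumption about the data, is that those cells are stored as integers rather than text (for example 23 instead of "23"). The `.str` accessor returns NaN for elements that are not strings, and `fillna(0)` then turns them into 0. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the single entry is the integer 23, not the text "23".
s = pd.Series(["23, 234, 345", "234, 345", 23, np.nan], dtype=object)

# .str.count returns NaN for the non-string element, so after fillna(0)
# the integer row looks the same as the missing row.
counts = s.str.count(",").add(1).fillna(0).astype(int)

# Flag which elements are actually Python strings.
is_text = s.map(lambda v: isinstance(v, str))

# Convert only the non-missing values to text, then count again.
as_text = s.map(lambda v: v if pd.isna(v) else str(v))
fixed = as_text.str.count(",").add(1).fillna(0).astype(int)

print(counts.tolist())
print(is_text.tolist())
print(fixed.tolist())
```

If `is_text` is True for every non-missing row of the real column, this explanation does not apply and the cause lies elsewhere.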

I expect the result to look like this:

no    pre_code         count
1     23, 234, 345       3
2     234, 345           2
3     23                 1
4     NaN                0

Is there any solution for my case?

Thanks in advance.

I think you need real missing values (np.nan) instead of the string 'nan'; then both of your solutions work correctly.

First check what the values that contain no digits look like, so you know what to replace:

print(df.loc[~df['pre_code'].str.contains(r'\d'), 'pre_code'].unique().tolist())
['nan']

import numpy as np

df['count'] = df['pre_code'].replace('nan', np.nan).str.split(',').str.len().fillna(0)

Or:

df['count'] = df['pre_code'].replace('nan', np.nan).str.count(',').add(1).fillna(0)

print(df)
   no      pre_code  count
0   1  23, 234, 345    3.0
1   2      234, 345    2.0
2   3            23    1.0
3   4           NaN    0.0

EDIT: A more general solution is to convert values that contain no digits to NaN with Series.where and Series.str.contains:

df['count'] = (df['pre_code'].where(df['pre_code'].str.contains(r'\d', na=False))
                             .str.count(',')
                             .add(1)
                             .fillna(0)
                             .astype(int))
print(df)
   no      pre_code  count
0   1  23, 234, 345      3
1   2      234, 345      2
2   3            23      1
3   4           NaN      0
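As a self-contained sketch of the same idea, the example below masks every value that contains no digit before counting commas, so the text "nan", an empty string and a real NaN are all treated alike. The extra placeholder rows are hypothetical additions for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Sample data plus hypothetical placeholder rows ("nan", "" and a real NaN).
s = pd.Series(
    ["23, 234, 345", "234, 345", "23", "nan", "", np.nan],
    dtype=object,
)

# True only where the value is text containing at least one digit;
# na=False makes real missing values count as "no digit".
has_digit = s.str.contains(r"\d", na=False)

# Mask everything else to NaN, count separators, add one, and fill the gaps.
counts = (
    s.where(has_digit)
     .str.count(",")
     .add(1)
     .fillna(0)
     .astype(int)
)

print(counts.tolist())
```

The design choice here is to define "missing" by the absence of digits rather than by a specific placeholder, which avoids listing every placeholder spelling in a `replace` call.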

Try:

df['count'] = df.loc[df['pre_code'].notna(), 'pre_code'] \
                .astype(str).str.split(',').str.len() \
                .reindex(df.index, fill_value=0)

print(df)

# Output:
   no      pre_code  count
0   1  23, 234, 345      3
1   2      234, 345      2
2   3            23      1
3   4           NaN      0

I'm not sure you have to convert to str (`astype(str)`).
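A quick way to check is to run the pipeline with and without the cast on the sample data. This is a minimal sketch assuming the column contains only text and real NaN values; the names `present`, `with_cast` and `without_cast` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "no": [1, 2, 3, 4],
    "pre_code": ["23, 234, 345", "234, 345", "23", np.nan],
})

# Keep only the rows that have a value, as in the answer above.
present = df.loc[df["pre_code"].notna(), "pre_code"]

# Count the pieces with and without the explicit cast to str, then
# restore the dropped rows with a count of 0.
with_cast = present.astype(str).str.split(",").str.len().reindex(df.index, fill_value=0)
without_cast = present.str.split(",").str.len().reindex(df.index, fill_value=0)

print(with_cast.tolist())
print(without_cast.tolist())
```

Under that assumption the two results agree, so the cast is optional. It would start to matter if some cells held non-text values such as integers, because the `.str` methods return NaN for those.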
