How many times is a string value of a cell repeated in another column of a pandas data frame?
I am trying to find out, using pandas, how many times each cell value of column A appears across all the cells of column B. For example, for the value in cell A1, we need to look it up (as with VLOOKUP) in every cell of column B, find out how many times it is repeated there, and then put that count next to it in column C.
I tried the approaches I could find, such as contains, extract, and groupby, but none gave the result I need. Also, the values in column B follow no particular text pattern that I could define in the code.
This is my data frame:
A B C
============ =============================================== ========
T4561 T4561 (KHO ZAD)
E2962 E2962 (Bat - Rouchan),T5362(asw)
DT2172 T2172 (Masd),T2117 (Masd),T4561(fsd)
T6096 T6096 (Mara),H1005 (BAHH), H1049 (QIEH)
T5362 T5362 (SYMI (ABAI)),E0993,E7523(pwd)
E0993 E0993 (Tean),T4561,E0993(ssdc)
E1834 E1834 (Ahaz),T5362,E0993(sdw)
T2844 T2844 (Varmn),T3798 (QASIN), T3596 (Vara),T4561(qw)
E7523 E7523 (Sabk),E0993(bbz)
T9062 T9062 (Shrz),T5362,E7523(fgf)
And this is what I need:
A B C
============ =============================================== ========
T4561 T4561 (KHO ZAD) 4
E2962 E2962 (Bat - Rouchan),T5362(asw) 1
DT2172 T2172 (Masd),T2117 (Masd),T4561(fsd) 0
T6096 T6096 (Mara),H1005 (BAHH), H1049 (QIEH) 1
T5362 T5362 (SYMI (ABAI)),E0993,E7523(pwd) 4
E0993 E0993 (Tean),T4561,E0993(ssdc) 5
E1834 E1834 (Ahaz),T5362,E0993(sdw) 1
T2844 T2844 (Varmn),T3798 (QASIN), T3596 (Vara),T4561(qw) 1
E7523 E7523 (Sabk),E0993(bbz) 3
T9062 T9062 (Shrz),T5362,E7523(fgf) 1
Use Series.str.extractall with a regex pattern built from the values in column A, then use Series.value_counts to compute the frequency of each match, and finally use Series.map to map the values in column A to their corresponding frequencies:
m = df['B'].str.extractall(f"({'|'.join(df['A'])})")[0].value_counts()
df['C'] = df['A'].map(m).fillna(0)
Result:
A B C
0 T4561 T4561 (KHO ZAD) 4.0
1 E2962 E2962 (Bat - Rouchan),T5362(asw) 1.0
2 DT2172 T2172 (Masd),T2117 (Masd),T4561(fsd) 0.0
3 T6096 T6096 (Mara),H1005 (BAHH), H1049 (QIEH) 1.0
4 T5362 T5362 (SYMI (ABAI)),E0993,E7523(pwd) 4.0
5 E0993 E0993 (Tean),T4561,E0993(ssdc) 5.0
6 E1834 E1834 (Ahaz),T5362,E0993(sdw) 1.0
7 T2844 T2844 (Varmn),T3798 (QASIN), T3596 (Vara),T4561(qw) 1.0
8 E7523 E7523 (Sabk),E0993(bbz) 3.0
9 T9062 T9062 (Shrz),T5362,E7523(fgf) 1.0
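For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same approach. The data frame is rebuilt by hand from the sample in the question. The re.escape call and the final astype(int) are additions that are not part of the answer above: the first assumes the codes in A might some day contain regex metacharacters, and the second only turns the float counts into integers.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["T4561", "E2962", "DT2172", "T6096", "T5362",
          "E0993", "E1834", "T2844", "E7523", "T9062"],
    "B": [
        "T4561 (KHO ZAD)",
        "E2962 (Bat - Rouchan),T5362(asw)",
        "T2172 (Masd),T2117 (Masd),T4561(fsd)",
        "T6096 (Mara),H1005 (BAHH), H1049 (QIEH)",
        "T5362 (SYMI (ABAI)),E0993,E7523(pwd)",
        "E0993 (Tean),T4561,E0993(ssdc)",
        "E1834 (Ahaz),T5362,E0993(sdw)",
        "T2844 (Varmn),T3798 (QASIN), T3596 (Vara),T4561(qw)",
        "E7523 (Sabk),E0993(bbz)",
        "T9062 (Shrz),T5362,E7523(fgf)",
    ],
})

# One alternation pattern with a single capture group, e.g. "(T4561|E2962|...)".
# re.escape is an added precaution so each code is matched literally.
pattern = "(" + "|".join(re.escape(code) for code in df["A"]) + ")"

# extractall yields one row per match found anywhere in B;
# column 0 holds the text captured by the group.
counts = df["B"].str.extractall(pattern)[0].value_counts()

# Codes that never occur in B are missing from counts, so they map to NaN;
# fill those with 0 and cast to int (added step) to avoid 4.0-style floats.
df["C"] = df["A"].map(counts).fillna(0).astype(int)

print(df[["A", "C"]])
```

Note that this counts every occurrence, so E0993 gets 5 even though it appears in only 4 cells of B (one cell contains it twice), which matches the expected output in the question. Also, regex alternation tries the alternatives from left to right, so if one code in A were a prefix of another, the order of the codes in the pattern could change which one is counted.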