How to count number of unique strings in two columns?
I have a DataFrame with two columns containing strings, like:
col1 --- col2
Ernst --- Jim
Peter --- Ernst
Bill --- NaN
NaN --- Doug
Jim --- Jake
Now I want to create a new DataFrame with the unique strings in the first column and, in the second column, the number of occurrences of each string across the two original columns, like:
str --- occurrences
Ernst --- 2
Peter --- 1
Bill --- 1
Jim --- 2
Jake --- 1
Doug --- 1
How do I do that in the most efficient way? Thanks!
First combine your original two columns into one:
In [127]: s = pd.concat([df.col1, df.col2], ignore_index=True)
In [128]: s
Out[128]:
0 Ernst
1 Peter
2 Bill
3 NaN
4 Jim
5 Jim
6 Ernst
7 NaN
8 Doug
9 Jake
dtype: object
and then use value_counts (which excludes NaN by default):
In [129]: s.value_counts()
Out[129]:
Ernst 2
Jim 2
Bill 1
Doug 1
Jake 1
Peter 1
dtype: int64
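To get the two-column DataFrame the question asks for, the counts can be wrapped in a new frame. This is a minimal sketch, assuming the sample data from the question; the column names `str` and `occurrences` are taken from the desired output, and `result` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
    "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"],
})

# Stack both columns into one Series; value_counts() skips NaN by default.
counts = pd.concat([df["col1"], df["col2"]], ignore_index=True).value_counts()

# Build the two-column result explicitly so the column names do not
# depend on how a given pandas version names the counts Series.
result = pd.DataFrame({"str": counts.index, "occurrences": counts.to_numpy()})
print(result)
```

Rows come out sorted by count in descending order; the order among names with equal counts may vary.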
I'd do it this way (assuming you are reading the data from a file your_file.txt and you want to print out the result):
from collections import Counter

separator = ' --- '
with open('your_file.txt') as f:
    # strip the trailing newline from each line and skip blank lines
    content = [line.strip() for line in f if line.strip()]
# join the lines and split on the separator to get a flat list of all elements
# (a header row or NaN entries, if present in the file, are counted too)
people = separator.join(content).split(separator)
# dict-like object with key=name, value=count
people_count = Counter(people)
for name, val in people_count.items():
    # print the column the way you want
    print('{name}{separator}{value}'.format(name=name, separator=separator, value=val))
The example uses the Counter object, which lets you efficiently count elements from an iterable. The rest of the code is only string manipulation.
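If the file looks like the table in the question, the header row and the NaN placeholders would end up in the counts as well. A rough sketch of one way to leave them out, assuming the file has exactly one header line and marks missing values with the literal text `NaN`; `count_names` and its parameters are hypothetical names used only for illustration, and the in-memory `sample` list stands in for the lines of the file.

```python
from collections import Counter

SEPARATOR = " --- "

def count_names(lines, missing="NaN", skip_header=True):
    """Count names across both columns of separator-delimited lines."""
    rows = [line.strip() for line in lines if line.strip()]
    if skip_header:
        rows = rows[1:]  # assumption: the first non-blank line is a header
    counts = Counter()
    for row in rows:
        # Split the row into its fields and ignore the missing-value marker.
        counts.update(name for name in row.split(SEPARATOR) if name != missing)
    return counts

# Lines as they would be read from the file in the question.
sample = [
    "col1 --- col2\n",
    "Ernst --- Jim\n",
    "Peter --- Ernst\n",
    "Bill --- NaN\n",
    "NaN --- Doug\n",
    "Jim --- Jake\n",
]

people_count = count_names(sample)
for name, count in people_count.most_common():
    print(f"{name}{SEPARATOR}{count}")
```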
Try this:
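import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
                   "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"]})
print(df)
# count occurrences in each column separately (groupby drops NaN keys)
df1 = df.groupby("col1")["col1"].count()
df2 = df.groupby("col2")["col2"].count()
# add the two count Series, treating names missing from one column as 0
print(df1.add(df2, fill_value=0))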
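One detail worth checking with this approach: because the two count Series are aligned on their index before being added, the totals typically come back as floating-point values (for example 2.0 rather than 2). A small sketch, assuming the same sample data, that casts the result back to integers; `total` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
                   "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"]})

# Per-column counts; NaN keys are dropped by groupby.
df1 = df.groupby("col1")["col1"].count()
df2 = df.groupby("col2")["col2"].count()

# Names missing from one column count as 0 there; cast the sum to integers.
total = df1.add(df2, fill_value=0).astype(int)
print(total)
```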