I have a .tsv data file. I want to print the count of each string in a certain column. The column looks like this:
column1
A aaa
A, C c
C
D
E ee,F
A aaa, B, C cc
F
E ee
I want distinct counts of A, B, C, A aaa, etc. However, the column sometimes has spaces after the ",", so my code counts "B" and " B" as different values. This is the code I am currently using:
import pandas as pd
# Import data from file into a pandas DataFrame
data = pd.read_csv("data.tsv", encoding='utf-8', delimiter="\t")
pd.set_option('display.max_rows', None)
# Split on a bare comma; values keep any space that followed the comma
out = data['column1'].str.split(',', expand=True).stack().value_counts()
print(out)
Any leads are appreciated.
You need to add a space to your split, i.e. split(', ').
Try ',\s*' for a comma followed by optional spaces:
out = df['column1'].str.split(r',\s*', expand=True).stack().value_counts()
Output:
F 2
E ee 2
A aaa 2
C c 1
C 1
A 1
C cc 1
B 1
D 1
dtype: int64
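For anyone who wants to try this without the data file, here is a minimal sketch that rebuilds the sample column in memory (the DataFrame construction is an assumption standing in for data.tsv). It is a variation on the approach above rather than the same call: it splits on a plain comma, uses explode to get one value per row, and then strips the surrounding whitespace with str.strip.

```python
import pandas as pd

# Sample column from the question, rebuilt in memory as a stand-in for data.tsv.
df = pd.DataFrame({
    "column1": ["A aaa", "A, C c", "C", "D", "E ee,F", "A aaa, B, C cc", "F", "E ee"]
})

counts = (
    df["column1"]
    .str.split(",")   # split each cell into a list on the literal comma
    .explode()        # one list element per row
    .str.strip()      # drop spaces left around each value, so " B" becomes "B"
    .value_counts()
)
print(counts)
```

Stripping after the split also handles spaces before a comma or at the end of a cell, which a pattern like ',\s*' would leave in place.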
Also, you can replace ', ' with ',' and use get_dummies:
df['column1'].str.replace(r',\s*', ',', regex=True).str.get_dummies(',').sum()
Output:
A 1
A aaa 2
B 1
C 1
C c 1
C cc 1
D 1
E ee 2
F 2
dtype: int64
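One difference between the two approaches is worth keeping in mind: str.get_dummies produces 0/1 indicator columns, so summing them counts the rows that contain a value, not every occurrence. On the sample data the results agree because no value repeats within a cell. The sketch below uses a small hypothetical Series (not from the question) to show where they diverge; the regex keyword on str.split assumes pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical data: "A" appears twice within the first cell.
s = pd.Series(["A, A", "B"])

# get_dummies approach: one indicator per value per row, then summed.
normalized = s.str.replace(r",\s*", ",", regex=True)
per_row = normalized.str.get_dummies(",").sum()

# split approach: every occurrence becomes its own row before counting.
per_occurrence = s.str.split(r",\s*", regex=True).explode().value_counts()

print(per_row["A"])         # rows containing "A"
print(per_occurrence["A"])  # total occurrences of "A"
```

If a value can repeat inside one cell and each repeat should count, the split-based version is the safer choice.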