
Can you make value_counts on a specific interval of characters with pandas?

So, I have a column "Names". If I do:

df['Names'].value_counts()

I get this:

Mr. Richard Vance       1
Mrs. Angela Bell        1
Mr. Stewart Randall     1
Mr. Andrew Ogden        1
Mrs. Maria Berry        1
                       ..
Mrs. Lillian Wallace    1
Mr. William Bailey      1
Mr. Paul Ball           1
Miss Pippa Bond         1
Miss Caroline Gray      1

That's fine, there are lots of DISTINCT names. But what I want is to run value_counts() only on the leading characters, up to the first space (i.e. the space that divides, for instance, Miss or Mrs. from Lillian Wallace), so that the output would be, for example:

Mrs.    1000
Mr.     2000
Miss    2000

I just want to know how many distinct variants there are in the Names column so that, in a second stage, I can create another variable (namely gender) based on those variants.

If you only want the unique values, and there is always a space after the title, you can do this:

import pandas as pd

df = pd.DataFrame(['Mr. Richard Vance',
                   'Mrs. Angela Bell',
                   'Mr. Stewart Randall',
                   'Mr. Andrew Ogden',
                   'Mrs. Maria Berry',
                   'Mrs. Lillian Wallace',
                   'Mr. William Bailey',
                   'Mr. Paul Ball',
                   'Miss Pippa Bond',
                   'Miss Caroline Gray'], columns=['names'])

df['names'].str.split(' ').str[0].unique().tolist()

Output is a list:

['Mr.', 'Mrs.', 'Miss']
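Once the distinct titles are known, the second stage mentioned in the question (deriving a gender column) could look roughly like the sketch below. The `title_to_gender` dictionary, the `title` and `gender` column names, and the extra 'Dr. Example Name' row are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Mr. Richard Vance',
                             'Mrs. Angela Bell',
                             'Miss Pippa Bond',
                             'Dr. Example Name']})  # hypothetical row with an unmapped title

# Assumed mapping; extend it after inspecting the list returned by unique().
title_to_gender = {'Mr.': 'male', 'Mrs.': 'female', 'Miss': 'female'}

# First token before the first space, as in the answer above.
df['title'] = df['names'].str.split(' ').str[0]

# Series.map leaves titles that are missing from the dict as NaN instead of raising.
df['gender'] = df['title'].map(title_to_gender)
```

Leaving unmapped titles as NaN makes it easy to spot variants that still need a decision, for example with `df[df['gender'].isna()]`.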

You can call value_counts(dropna=False) on str[0] after a str.split():

import pandas as pd

df = pd.DataFrame({'Names': ['Mr. Richard Vance', 'Mrs. Angela Bell', 'Mr. Stewart Randall',
                             'Mr. Andrew Ogden', 'Mrs. Maria Berry', 'Mrs. Lillian Wallace',
                             'Mr. William Bailey', 'Mr. Paul Ball', 'Miss Pippa Bond',
                             'Miss Caroline Gray', '']})

df.Names.str.split().str[0].value_counts(dropna=False)

#  Mr.     5
#  Mrs.    3
#  Miss    2
#  NaN     1
#  Name: Names, dtype: int64
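To see where the NaN row comes from, it may help to look at the intermediate steps. This is a small sketch on a shortened sample; the variable names (`tokens`, `first`, `counts_all`, `counts_valid`) are only illustrative.

```python
import pandas as pd

names = pd.Series(['Mr. Paul Ball', 'Mr. William Bailey', 'Miss Pippa Bond', ''],
                  name='Names')

# Splitting on whitespace turns the empty string into an empty list.
tokens = names.str.split()

# An empty list has no element 0, so .str[0] yields NaN for that row.
first = tokens.str[0]

counts_all = first.value_counts(dropna=False)  # keeps a row for NaN
counts_valid = first.value_counts()            # default dropna=True omits it
```

Whether to pass dropna=False depends on whether empty or malformed names should show up in the tally or be silently ignored.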

Here is a solution. You can use regex:

#Dataset

    Names
0   Mr. Richard Vance
1   Mrs. Angela Bell
2   Mr. Stewart Randall
3   Mr. Andrew Ogden
4   Mrs. Maria Berry
5   Mrs. Lillian Wallace

df['Names'].str.extract(r'(\w+\.\s)').value_counts()

#Output:

Mr.      3
Mrs.     3

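Note: (\w+\.\s) extracts titles that end in a period, such as Mr. and Mrs. (or Dr.), together with the trailing space. Titles without a period, such as Miss, do not match this pattern and come back as NaN, so they are not counted.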
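Since the pattern above requires a period, a title such as Miss would not be picked up. One possible variant, which is not part of the original answer, makes the period optional and keeps the trailing space out of the captured group. The pattern and the `titles` and `counts` names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Mr. Richard Vance',
                             'Mrs. Angela Bell',
                             'Miss Pippa Bond',
                             'Miss Caroline Gray']})

# ^(\w+\.?) captures the leading word plus an optional period.
# The \s after the group requires whitespace but is not part of the capture.
# expand=False returns a Series instead of a one-column DataFrame.
titles = df['Names'].str.extract(r'^(\w+\.?)\s', expand=False)

counts = titles.value_counts()
```

Note that this variant treats any leading word as a title, so a name stored without a title would contribute its first name to the counts.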


 