
How to locate and count the number of words in a column

I want to count the occurrences of contaminants, but some cases have more than one contaminant, so value_counts treats each combination as a single value, for example "Gasoline, Diesel = 8". How can I count them separately without doing it manually?

Would it also be possible to create a function that makes it easier to categorize them into, let's say, 4 types of contaminant? I just need a clue or a simple explanation of what I need to do.

data = pd.read_csv('Data gathered.csv')
data

data['CONTAMINANTS'].value_counts().plot(kind = 'barh').invert_yaxis()

Assuming the contaminants are always separated by a comma followed by a space in your data, you can use pandas.Series.str.split() to turn each entry into a list. Then you can spread those lists over distinct rows with pandas.DataFrame.explode(), after which value_counts() counts each contaminant individually.

For example:

import pandas as pd

data = pd.DataFrame({'File Number': [1, 2, 3, 4],
                     'CONTAMINANTS': ['ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE', 
                                      'CHLORINATED SOLVENTS', 
                                      'DIESEL, GASOLINE, ACENAPHTENE', 
                                      'GASOLINE, ACENAPHTENE']})
data
    File Number     CONTAMINANTS
0   1               ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE
1   2               CHLORINATED SOLVENTS
2   3               DIESEL, GASOLINE, ACENAPHTENE
3   4               GASOLINE, ACENAPHTENE
data['CONTAMINANTS'] = data['CONTAMINANTS'].str.split(pat=', ')
data_long = data.explode('CONTAMINANTS')
data_long['CONTAMINANTS'].value_counts()
ACENAPHTENE             3
GASOLINE                2
DIESEL                  1
ANTHRACENE              1
BENZ-A-ANTHRACENE       1
CHLORINATED SOLVENTS    1
Name: CONTAMINANTS, dtype: int64
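
If the real file is less tidy than the example (missing spaces after commas, mixed letter case), the exact ', ' separator will not match everywhere. The following is a minimal sketch of one way to normalize the names before counting; the sample values in raw are made up for illustration and are not taken from the asker's data:

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing and letter case
raw = pd.DataFrame({'CONTAMINANTS': ['Gasoline,Diesel',
                                     'GASOLINE ,  DIESEL',
                                     ' diesel']})

counts = (raw['CONTAMINANTS']
          .str.split(',')   # split on the bare comma only
          .explode()        # one contaminant per row
          .str.strip()      # remove stray whitespace around each name
          .str.upper()      # treat 'Gasoline' and 'GASOLINE' as the same value
          .value_counts())
print(counts)
```

Whether upper-casing is appropriate depends on how the names were entered, so treat that step as optional.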

To categorize the contaminants, you could define a dictionary that maps them to types. Then you can use that dictionary to add a column of types to the exploded dataframe:

types = {'ACENAPHTENE': 1, 
         'GASOLINE': 2,
         'DIESEL': 2, 
         'ANTHRACENE': 1,
         'BENZ-A-ANTHRACENE': 1,
         'CHLORINATED SOLVENTS': 3}

data_long['contaminant type'] = data_long['CONTAMINANTS'].apply(lambda x: types[x])
data_long
    File Number     CONTAMINANTS            contaminant type
0   1               ACENAPHTENE             1
0   1               ANTHRACENE              1
0   1               BENZ-A-ANTHRACENE       1
1   2               CHLORINATED SOLVENTS    3
2   3               DIESEL                  2
2   3               GASOLINE                2
2   3               ACENAPHTENE             1
3   4               GASOLINE                2
3   4               ACENAPHTENE             1
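
Once every row carries a type, the same value_counts() call gives the number of occurrences per category. The sketch below rebuilds a small exploded frame so that it runs on its own; the category labels ('PAH', 'fuel') and the 'uncategorized' fallback are assumptions chosen for illustration, and TOLUENE is left out of the dictionary on purpose to show what happens with an unmapped name:

```python
import pandas as pd

data = pd.DataFrame({'File Number': [1, 2],
                     'CONTAMINANTS': ['DIESEL, GASOLINE', 'ACENAPHTENE, TOLUENE']})
data_long = (data
             .assign(CONTAMINANTS=data['CONTAMINANTS'].str.split(', '))
             .explode('CONTAMINANTS'))

# Hypothetical mapping; TOLUENE is deliberately missing
types = {'ACENAPHTENE': 'PAH', 'GASOLINE': 'fuel', 'DIESEL': 'fuel'}

# map() returns NaN for names not in the dictionary instead of raising KeyError,
# so unmapped contaminants can be given an explicit fallback label
data_long['contaminant type'] = (data_long['CONTAMINANTS']
                                 .map(types)
                                 .fillna('uncategorized'))

type_counts = data_long['contaminant type'].value_counts()
print(type_counts)
```

The resulting type_counts Series can be plotted the same way as in the question, for example with type_counts.plot(kind='barh').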
