How to locate and count the number of words in a column

Question

I want to count the occurrences of contaminants, but some cases have more than one contaminant, so when I use value_counts it counts each combination as a single value. For example, "Gasoline, Diesel = 8". How would I count them separately without doing it manually?

And would it be possible to create a function that would make it easier to categorize them into, let's say, 4 types of contaminant? I just need a clue or a simple explanation of what I need to do.

import pandas as pd

data = pd.read_csv('Data gathered.csv')
data

data['CONTAMINANTS'].value_counts().plot(kind='barh').invert_yaxis()

Answer 1

Assuming the contaminants are always separated by commas in your data, you can use pandas.Series.str.split() to turn each entry into a list. Then pandas.DataFrame.explode() puts each list element on its own row, which finally allows value_counts() to count each contaminant separately.

For example:

import pandas as pd

data = pd.DataFrame({'File Number': [1, 2, 3, 4],
                     'CONTAMINANTS': ['ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE', 
                                      'CHLORINATED SOLVENTS', 
                                      'DIESEL, GASOLINE, ACENAPHTENE', 
                                      'GASOLINE, ACENAPHTENE']})
data

    File Number     CONTAMINANTS
0   1               ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE
1   2               CHLORINATED SOLVENTS
2   3               DIESEL, GASOLINE, ACENAPHTENE
3   4               GASOLINE, ACENAPHTENE

data['CONTAMINANTS'] = data['CONTAMINANTS'].str.split(pat=', ')
data_long = data.explode('CONTAMINANTS')
data_long['CONTAMINANTS'].value_counts()

ACENAPHTENE             3
GASOLINE                2
DIESEL                  1
ANTHRACENE              1
BENZ-A-ANTHRACENE       1
CHLORINATED SOLVENTS    1
Name: CONTAMINANTS, dtype: int64
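
The split above relies on every separator being exactly a comma followed by one space. If the real file is less consistent, a slightly more tolerant variant may help. The following is only a sketch under that assumption: the sample rows are made up for illustration, and it splits on the bare comma, then strips whitespace and upper-cases each name before counting.

```python
import pandas as pd

# Made-up sample with inconsistent spacing and casing (an assumption
# about what the real file might contain, not taken from the question).
data = pd.DataFrame({
    'File Number': [1, 2, 3],
    'CONTAMINANTS': ['GASOLINE,DIESEL', 'DIESEL ,  GASOLINE', ' gasoline '],
})

cleaned = (
    data['CONTAMINANTS']
    .str.split(',')   # one list of names per row
    .explode()        # one name per row
    .str.strip()      # drop stray whitespace around each name
    .str.upper()      # treat 'gasoline' and 'GASOLINE' as the same name
)

counts = cleaned.value_counts()
print(counts)
```

The resulting counts can be plotted the same way as in the question, with counts.plot(kind='barh').invert_yaxis().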

To categorize the contaminants, you could define a dictionary that maps them to types. Then you can use that dictionary to add a column of types to the exploded dataframe:

types = {'ACENAPHTENE': 1, 
         'GASOLINE': 2,
         'DIESEL': 2, 
         'ANTHRACENE': 1,
         'BENZ-A-ANTHRACENE': 1,
         'CHLORINATED SOLVENTS': 3}

data_long['contaminant type'] = data_long['CONTAMINANTS'].apply(lambda x: types[x])
data_long

    File Number     CONTAMINANTS            contaminant type
0   1               ACENAPHTENE             1
0   1               ANTHRACENE              1
0   1               BENZ-A-ANTHRACENE       1
1   2               CHLORINATED SOLVENTS    3
2   3               DIESEL                  2
2   3               GASOLINE                2
2   3               ACENAPHTENE             1
3   4               GASOLINE                2
3   4               ACENAPHTENE             1
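
Since the question asks for a function, the two steps can be wrapped into one. This is a minimal sketch rather than a definitive implementation: count_by_type, its parameters, the fallback type 0 for unmapped names, and the extra 'LEAD' entry are all hypothetical and only there for illustration. Using dict.get with a default avoids the KeyError that types[x] would raise for a contaminant missing from the dictionary.

```python
import pandas as pd

# Same mapping as above; the type numbers are placeholders.
types = {'ACENAPHTENE': 1,
         'GASOLINE': 2,
         'DIESEL': 2,
         'ANTHRACENE': 1,
         'BENZ-A-ANTHRACENE': 1,
         'CHLORINATED SOLVENTS': 3}


def count_by_type(df, mapping, column='CONTAMINANTS', unknown=0):
    """Count how often each contaminant type occurs in df[column].

    Names that are not in `mapping` are counted under `unknown`
    (a hypothetical fallback type) instead of raising a KeyError.
    """
    names = df[column].str.split(',').explode().str.strip()
    kinds = names.map(lambda name: mapping.get(name, unknown))
    return kinds.value_counts()


data = pd.DataFrame({'File Number': [1, 2, 3, 4],
                     'CONTAMINANTS': ['ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE',
                                      'CHLORINATED SOLVENTS',
                                      'DIESEL, GASOLINE, ACENAPHTENE',
                                      # 'LEAD' is an invented, unmapped name
                                      'GASOLINE, ACENAPHTENE, LEAD']})

type_counts = count_by_type(data, types)
print(type_counts)
```

Checking the count for the fallback type is a quick way to spot contaminants that still need an entry in the dictionary.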
