Python: how to convert text data types in one column to be able to perform some analysis (such as counts)

Question

I'm trying to count the number of times something comes up in one column, and group it by another. For example, I have the following:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("C:\\Users\\user1\\Desktop\\genre_testing.csv")

This gives me the following data example set:

What I would like to be able to do later is count the number of "Adventure" shows/movies, and have both mandalorian and zombieland counted. I believe the first issue is that both columns are stored as objects but I may need them as arrays?

Using something like df.groupby('genre')['show_name'].nunique() groups by the full comma-separated string rather than by its individual elements, and the elements are what I'm looking for. Any advice on where to start? Thanks!

Answer 1

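pandas already has a built-in for this, Series.str.get_dummies, which splits each string on a separator and returns one 0/1 indicator column per genre. It should be pretty easy to use.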

df_coded = df['genre'].str.get_dummies(sep=",")
df_coded['show_name'] = df['show_name']
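
To get from the indicator columns to the per-genre counts the question asks for, one option is to sum each column. This is a minimal sketch, assuming the two-row sample from the question; the frame below is only a stand-in for the CSV, whose real contents aren't shown.

```python
import pandas as pd

# Stand-in for the asker's CSV (assumed contents, based on the question).
df = pd.DataFrame({
    "show_name": ["mandalorian", "zombieland"],
    "genre": ["Adventure,Action,Sci-Fi", "Comedy,Adventure,Action"],
})

# One 0/1 indicator column per genre.
df_coded = df["genre"].str.get_dummies(sep=",")

# Summing each indicator column gives the number of rows tagged with that genre.
genre_counts = df_coded.sum()

# Shows tagged "Adventure": rows where the indicator is 1.
adventure_shows = df.loc[df_coded["Adventure"] == 1, "show_name"].tolist()

print(genre_counts)
print(adventure_shows)
```

Note that this counts rows, so a show that appears on several rows would be counted more than once.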

Answer 2

As already mentioned, you can use .str.split(',') to get the genre as a list. To take that further, once you have split the column you can explode the dataframe into one row per show/genre pair, which is better suited to filtering, counting, and so on.

>>> import pandas
>>> data = pandas.DataFrame(data=[["mandalorian", "Adventure,Action,Sci-Fi"], ["zombieland", "Comedy,Adventure,Action"]], columns=["show_name", "genre"])
>>> data
     show_name                    genre
0  mandalorian  Adventure,Action,Sci-Fi
1   zombieland  Comedy,Adventure,Action
>>> data['genre'] = data['genre'].str.split(',')
>>> data
     show_name                        genre
0  mandalorian  [Adventure, Action, Sci-Fi]
1   zombieland  [Comedy, Adventure, Action]
>>> data = data.explode('genre')
>>> data
     show_name      genre
0  mandalorian  Adventure
0  mandalorian     Action
0  mandalorian     Sci-Fi
1   zombieland     Comedy
1   zombieland  Adventure
1   zombieland     Action
>>> data[data['genre'] == 'Adventure']['show_name']
0    mandalorian
1     zombieland
>>> data.groupby('genre')['show_name'].nunique()
genre
Action       2
Adventure    2
Comedy       1
Sci-Fi       1
Name: show_name, dtype: int64
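
The same split-then-explode steps can also be chained into one expression. The sketch below is a variation under assumptions, not the exact session above: it assumes the real file may have spaces after some commas (the first row is altered to show that), and it strips them so that " Action" and "Action" land in the same group.

```python
import pandas as pd

# Sample data; the spaces after the commas in the first row are an assumption
# added here to show why stripping whitespace can matter.
data = pd.DataFrame(
    [["mandalorian", "Adventure, Action, Sci-Fi"],
     ["zombieland", "Comedy,Adventure,Action"]],
    columns=["show_name", "genre"],
)

# Split into lists, give each genre its own row, and renumber the rows.
exploded = (
    data.assign(genre=data["genre"].str.split(","))
        .explode("genre")
        .reset_index(drop=True)
)

# Remove stray whitespace around each genre label.
exploded["genre"] = exploded["genre"].str.strip()

# Number of distinct shows per genre, and the shows themselves.
counts = exploded.groupby("genre")["show_name"].nunique()
shows_per_genre = exploded.groupby("genre")["show_name"].agg(list)

print(counts)
print(shows_per_genre["Adventure"])
```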

Answer 3

Here is an alternative that might put you on the right track. Assume your df is defined this way:

d = {'Show':["Zombieland","Madalorian","Star Wars","Spiderman"],'genre':["Adventure,SciFi", "Adventure,SciFi,Action","SciFi,Action","Comedy"]}
df = pd.DataFrame(d)

Which gives you

    Show        genre
0   Zombieland  Adventure,SciFi
1   Madalorian  Adventure,SciFi,Action
2   Star Wars   SciFi,Action
3   Spiderman   Comedy

What you want is to subset this df, keeping only the rows whose genre column contains, say, Action. You can do it this way:

df2 = df[df.genre.astype(str).str.contains('Action')]

which gives

    Show        genre
1   Madalorian  Adventure,SciFi,Action
2   Star Wars   SciFi,Action

You can then subset that further, or simply count the rows: count_row = df2.shape[0]
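
One thing to watch with str.contains is that it matches substrings, so a longer label that merely contains "Action" would match too. The sketch below compares it with an exact match on the split genres. The "Action-Comedy" label is made up for illustration and is not in the data above.

```python
import pandas as pd

# Same frame as above, except the last genre is changed to a hypothetical
# "Action-Comedy" label to show the substring pitfall.
df = pd.DataFrame({
    "Show": ["Zombieland", "Madalorian", "Star Wars", "Spiderman"],
    "genre": ["Adventure,SciFi", "Adventure,SciFi,Action", "SciFi,Action", "Action-Comedy"],
})

# Substring match: also catches "Action-Comedy".
substring_mask = df["genre"].str.contains("Action", regex=False)

# Exact match: split on commas and test membership in the resulting list.
exact_mask = df["genre"].str.split(",").apply(lambda genres: "Action" in genres)

# Summing a boolean mask counts the True rows.
substring_count = int(substring_mask.sum())
exact_count = int(exact_mask.sum())

print(substring_count, exact_count)
```

If the genre labels never overlap like this, the plain str.contains version is fine as it is.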

3 answers

solution1
1 2020-11-03 20:45:10

solution2
1 ACCEPTED 2020-11-03 20:50:46

solution3
1 2020-11-03 20:56:14
