
Python: how to convert text data types in one column to be able to perform some analysis (such as counts)

I'm trying to count the number of times something comes up in one column, and group it by another. For example, I have the following:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("C:\\Users\\user1\\Desktop\\genre_testing.csv")

This gives me the following data example set:

[image of the sample data not available]

Later, I would like to count the number of "Adventure" shows/movies and have both mandalorian and zombieland counted. I believe the first issue is that both columns are stored as objects, but I may need them as arrays?

Using something like df.groupby('genre')['show_name'].nunique() groups by the full genre string rather than by its individual elements, which is what I'm looking for. Any advice on where to start? Thanks!

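pandas already has a built-in for this, Series.str.get_dummies, which should be pretty easy to use. It splits each string on the separator and returns one 0/1 indicator column per genre.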

df_coded = df['genre'].str.get_dummies(sep=",")
df_coded['show_name'] = df['show_name']
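
To get from the indicator columns to actual counts, one option is to sum them. The sketch below is only an illustration: the two-row frame is made up to match the shows named in the question, and the indicator frame is kept separate from show_name so that the column sums stay numeric.

```python
import pandas as pd

# Made-up sample data mirroring the shows mentioned in the question
df = pd.DataFrame(
    {
        "show_name": ["mandalorian", "zombieland"],
        "genre": ["Adventure,Action,Sci-Fi", "Comedy,Adventure,Action"],
    }
)

# One 0/1 indicator column per genre
df_coded = df["genre"].str.get_dummies(sep=",")

# Summing each indicator column gives the number of shows per genre
genre_counts = df_coded.sum()

# The indicator column also works as a filter on the original frame
adventure_shows = df.loc[df_coded["Adventure"] == 1, "show_name"].tolist()

print(genre_counts)
print(adventure_shows)
```

If show_name is added to df_coded first, as in the snippet above, select only the genre columns before summing.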

As already mentioned, you can use .str.split(',') to get the genres as a list. To take that further, once you have split the column you can explode the dataframe so that each row holds a single genre, which is better suited to filtering, counting, and so on.

>>> import pandas as pd
>>> data = pd.DataFrame(data=[["mandalorian", "Adventure,Action,Sci-Fi"], ["zombieland", "Comedy,Adventure,Action"]], columns=["show_name", "genre"])
>>> data
     show_name                    genre
0  mandalorian  Adventure,Action,Sci-Fi
1   zombieland  Comedy,Adventure,Action
>>> data['genre'] = data['genre'].str.split(',')
>>> data
     show_name                        genre
0  mandalorian  [Adventure, Action, Sci-Fi]
1   zombieland  [Comedy, Adventure, Action]
>>> data = data.explode('genre')
>>> data
     show_name      genre
0  mandalorian  Adventure
0  mandalorian     Action
0  mandalorian     Sci-Fi
1   zombieland     Comedy
1   zombieland  Adventure
1   zombieland     Action
>>> data[data['genre'] == 'Adventure']['show_name']
0    mandalorian
1     zombieland
>>> data.groupby('genre')['show_name'].nunique()
genre
Action       2
Adventure    2
Comedy       1
Sci-Fi       1
Name: show_name, dtype: int64
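
If the real CSV has stray spaces around the commas (an assumption, since the actual file is not shown), the split pieces will not match exactly and "Action" and " Action" would be counted as different genres. A minimal sketch of the same split/explode approach with a strip step added, on made-up data:

```python
import pandas as pd

# Made-up data with inconsistent spacing around the commas
data = pd.DataFrame(
    {
        "show_name": ["mandalorian", "zombieland"],
        "genre": ["Adventure, Action, Sci-Fi", "Comedy,Adventure ,Action"],
    }
)

# Split into lists, give each genre its own row, then renumber the rows
exploded = (
    data.assign(genre=data["genre"].str.split(","))
    .explode("genre")
    .reset_index(drop=True)
)

# Remove leading/trailing whitespace left over from the split
exploded["genre"] = exploded["genre"].str.strip()

# Number of distinct shows per genre
counts = exploded.groupby("genre")["show_name"].nunique()
print(counts)
```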

Here is an alternative that might put you on the right track. Assume your df is defined this way:

d = {'Show':["Zombieland","Madalorian","Star Wars","Spiderman"],'genre':["Adventure,SciFi", "Adventure,SciFi,Action","SciFi,Action","Comedy"]}
df = pd.DataFrame(d)

Which gives you

    Show        genre
0   Zombieland  Adventure,SciFi
1   Madalorian  Adventure,SciFi,Action
2   Star Wars   SciFi,Action
3   Spiderman   Comedy

What you want is to subset this df, keeping only the rows whose genre column contains, say, Action. You can do it like this:

df2 = df[df.genre.astype(str).str.contains('Action')]

which gives

    Show        genre
1   Madalorian  Adventure,SciFi,Action
2   Star Wars   SciFi,Action

You can then subset that further, or simply count the rows: count_row = df2.shape[0]
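
One caveat with str.contains is that it matches substrings, so a search term that happens to be part of a longer genre name also matches. The sketch below reuses the frame from this answer; the search term "Sci" is a hypothetical example chosen only to show the difference from an exact match on the split genre names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Show": ["Zombieland", "Madalorian", "Star Wars", "Spiderman"],
        "genre": ["Adventure,SciFi", "Adventure,SciFi,Action", "SciFi,Action", "Comedy"],
    }
)

# Boolean mask; summing it counts the matching rows without building df2
action_count = int(df["genre"].str.contains("Action", regex=False).sum())

# Substring match: "Sci" is found inside "SciFi"
substring_hits = int(df["genre"].str.contains("Sci", regex=False).sum())

# Exact match: split into whole genre names and test list membership
genre_lists = df["genre"].str.split(",")
exact_hits = int(genre_lists.apply(lambda genres: "Sci" in genres).sum())

print(action_count, substring_hits, exact_hits)
```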
