Efficient way to modify a column of textual data based on occurrences of substrings for a large dataset?

I'm looking to modify a column in a dataset that contains a comma-separated list of the genders of a group of people. An entry could be 'male, male', 'female, female, female, male', or just 'female'. I want to process the data so that the categories are 'all male', 'all female', 'majority male', and 'majority female', for use with scikit-learn later on.

However, I am new to data science and can't think of a way to do this other than splitting each string into substrings of 'male' and 'female', counting the occurrences, and then updating the entry based on the result. My dataset has about 600k samples, so brute force does not seem like a good idea. Is there a better way to do this using Python and NumPy and/or pandas?

If I understand you correctly, you are trying to create a new categorical feature from your column "genders".

The new column may take four values: all male, all female, majority male, and majority female. (I assume that majority male means the count of males is greater than the count of females.) The function below also returns a fifth value, "equal males and females", for ties.

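def categorical_gender(genders):
    # Strip whitespace around each entry so 'male, male' yields ['male', 'male'];
    # a bare split(",") would leave ' male', which count("male") does not match.
    genders_split = [g.strip() for g in genders.split(",")]
    male_count = genders_split.count("male")
    female_count = genders_split.count("female")
    if male_count == len(genders_split):
        return "all male"
    if female_count == len(genders_split):
        return "all female"
    if male_count > female_count:
        return "majority male"
    if male_count < female_count:
        return "majority female"
    # Neither side has a majority.
    return "equal males and females"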

You would now apply this function to your dataframe on the genders column.

df["categorical_gender"] = df.genders.apply(categorical_gender)
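As a quick sanity check, the function can be tried on a tiny frame before running it on the full data. This is only a minimal sketch: the sample rows are made up, the column name `genders` is taken from the line above, and the function is restated here in a whitespace-tolerant form so that the snippet runs on its own.

```python
import pandas as pd


def categorical_gender(genders):
    # Strip whitespace so 'male, male' yields ['male', 'male'].
    parts = [g.strip() for g in genders.split(",")]
    male_count = parts.count("male")
    female_count = parts.count("female")
    if male_count == len(parts):
        return "all male"
    if female_count == len(parts):
        return "all female"
    if male_count > female_count:
        return "majority male"
    if male_count < female_count:
        return "majority female"
    return "equal males and females"


# Made-up sample rows, for illustration only.
df = pd.DataFrame(
    {"genders": ["male, male", "female, female, female, male", "female", "male, female"]}
)
df["categorical_gender"] = df["genders"].apply(categorical_gender)
print(df)
```

The last row is there to show the tie case, which falls outside the four categories named in the question.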

P.S. Regarding the concern about speed: you should be fine. pandas can handle this kind of string manipulation on 600k rows without trouble. You could use Dask to parallelize the apply operation above, although that would be overkill for this case.
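If the row-wise apply ever does become a bottleneck, one possible alternative is to count the words with the pandas string methods and pick the label with `numpy.select`. The sketch below is not part of the approach above; it assumes the column has no missing values, and the sample rows are made up. Whether it is actually faster is worth timing on your own data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows, for illustration only.
df = pd.DataFrame(
    {"genders": ["male, male", "female, female, female, male", "female", "male, female"]}
)

# \b is a word boundary, so the 'male' pattern does not match inside 'female'.
male = df["genders"].str.count(r"\bmale\b")
female = df["genders"].str.count(r"\bfemale\b")

# Conditions are checked in order; the first one that holds picks the label.
conditions = [
    (male > 0) & (female == 0),
    (female > 0) & (male == 0),
    male > female,
    male < female,
]
labels = ["all male", "all female", "majority male", "majority female"]
df["categorical_gender"] = np.select(
    conditions, labels, default="equal males and females"
)
print(df)
```

Rows where the two counts are equal fall through to the default label, matching the tie case of the function above.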
