
Sorting a pandas DataFrame based on multiple values in a single cell

I have a data frame in which one column holds multiple values per cell:

    ID                                            classes
4   5792.0  Mechanics (any of the two sessions), Analysis ...
6   5915.0  Mechanics (any of the two sessions), Perspecti...
8   5931.0                        CS, Chemistry, Perspecti...
9   5991.0                                    Perspective, CS
10  5841.0  Mechanics (any of the two sessions), Perspecti...

The "classes" column can hold values such as "Perspective, Chemistry, CS" and so on. I want to sort the data frame above based on the individual values inside the classes string.

For example, one output might be:

        ID  classes                                         
4   5792.0       CS
6   5915.0       CS
8   5931.0       CS
9   5991.0       CS                            
10  5841.0       CS

That is, the rows are sorted by the classes attended.

I searched Google, but all the results were about sorting a data frame with a single value per cell.

Thanks in advance.

Data

import pandas as pd
df = pd.DataFrame({'ID': [5792.0, 5915.0, 5931.0, 5991.0, 5841.0], 'classes': ['Mechanics (any of the two sessions), Perspecti,CS, Chemistry', 'Mechanics (any of the two sessions), Perspecti,CS, Chemistry', 'Mechanics (any of the two sessions), Perspecti,CS, Chemistry', 'Mechanics (any of the two sessions), Perspecti,CS, Chemistry', 'Mechanics (any of the two sessions), Perspecti,CS, Chemistry']})
df

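Use Series.str.split to convert each string in classes to a list. Splitting on a comma plus any following whitespace keeps stray leading spaces out of the items, which would otherwise affect the sort order:

df['classes'] = df['classes'].str.split(r',\s*')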

Explode and sort by classes

df.explode('classes').sort_values(by='classes')
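
The split, explode and sort steps above can also be written as one chain. The following is only a minimal sketch: the two-row frame is made-up illustrative data, not the asker's real frame, and sorting by ID as a tie-breaker is an assumption about the desired order.

```python
import pandas as pd

# Illustrative data only; not the asker's real frame.
df = pd.DataFrame({
    "ID": [5931.0, 5991.0],
    "classes": ["CS, Chemistry, Perspective", "Perspective, CS"],
})

exploded = (
    # Split on commas and swallow surrounding whitespace (multi-character
    # patterns are treated as regular expressions by Series.str.split).
    df.assign(classes=df["classes"].str.split(r"\s*,\s*"))
      .explode("classes")                  # one row per (ID, class) pair
      .sort_values(by=["classes", "ID"])   # same class together, then by ID
)
print(exploded)
```

Note that explode repeats the original index label for every class taken from the same row, so the index of the result is no longer unique. Call reset_index(drop=True) on the result if that matters.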


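I'm assuming you want to sort the comma-separated items inside each string of the classes column. If so, you can try this (it expects classes to still contain plain strings, not lists):

# split on commas, strip whitespace, sort the items, and join them back together
df['string'] = df['classes'].apply(lambda x: ', '.join(sorted(c.strip() for c in x.split(','))))
print(df)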
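
To make the effect of sorting inside each cell concrete, here is a small sketch on made-up data. The helper name sort_classes and the column name sorted_classes are hypothetical, chosen only for this illustration.

```python
import pandas as pd


def sort_classes(cell: str) -> str:
    """Return the comma-separated items of `cell` in alphabetical order."""
    items = [item.strip() for item in cell.split(",")]
    return ", ".join(sorted(items))


# Illustrative data only; not the asker's real frame.
df = pd.DataFrame({
    "ID": [5931.0, 5991.0],
    "classes": ["CS, Chemistry, Perspective", "Perspective, CS"],
})

# Normalise the order inside each cell, then sort the rows by the result.
df["sorted_classes"] = df["classes"].apply(sort_classes)
df = df.sort_values(by="sorted_classes")
print(df)
```

Once every cell lists its classes in the same order, rows with the same combination of classes end up next to each other after sort_values.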

I'm sure there's a more "pandas" way of doing this, but here's one possible solution:

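# keep the rows whose classes string contains 'CS' (a substring match)
cs_df = df[df['classes'].apply(lambda x: 'CS' in x)].copy()  # .copy() avoids SettingWithCopyWarning
cs_df['classes'] = 'CS'

In the above example, cs_df contains all of the rows from the original dataframe where 'CS' is in classes, with the classes column overwritten by 'CS'. You can do this manually for each class or in a loop to get all your dataframes.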
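
The loop mentioned above could look roughly like this. It is a sketch under assumptions: the data is made up, the names class_lists, all_classes and frames_by_class are hypothetical, and class names are compared exactly after splitting on commas, so that 'CS' does not accidentally match a longer name that merely contains it.

```python
import pandas as pd

# Illustrative data only; not the asker's real frame.
df = pd.DataFrame({
    "ID": [5931.0, 5991.0, 5792.0],
    "classes": ["CS, Chemistry, Perspective", "Perspective, CS", "Chemistry"],
})

# Parse each cell once into a list of stripped class names.
class_lists = df["classes"].apply(lambda cell: [c.strip() for c in cell.split(",")])

# Every distinct class name that occurs anywhere in the column.
all_classes = sorted({name for names in class_lists for name in names})

frames_by_class = {}
for name in all_classes:
    # Exact membership test on the parsed list, not a substring test.
    mask = class_lists.apply(lambda names: name in names)
    frames_by_class[name] = df.loc[mask, ["ID"]].assign(classes=name)

print(frames_by_class["CS"])
```

If a single combined frame is preferred over a dictionary, pd.concat(frames_by_class.values()) stacks the pieces in class order.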

Explanation:

When we use df['classes'], pandas returns a pandas.Series instance. Series.apply takes a function as an argument and applies it to every item in the series. For example:

import math
import pandas as pd

# square each item in the series
s1 = pd.Series([1, 2, 3]).apply(lambda x: x * x)
# [1, 4, 9]
# take the square root of each item in the series
s2 = s1.apply(math.sqrt)
# [1.0, 2.0, 3.0]

Pandas also has special indexing logic. If you index a DataFrame with a Series of bools, pandas returns a DataFrame containing only the rows where the Series is True. The boolean Series must have the same index as the DataFrame.

So in this example (assuming df has three rows with the default index 0, 1, 2)

s = pd.Series([True, False, True])
df2 = df[s]

df2 will only contain rows 0 and 2 from the original df.
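
In practice the boolean Series is usually computed from the frame itself, so its index matches automatically. A minimal sketch on made-up data, using Series.str.contains with regex=False for a literal substring test (the names mask and cs_rows are hypothetical):

```python
import pandas as pd

# Illustrative data only; not the asker's real frame.
df = pd.DataFrame({
    "ID": [5792.0, 5915.0, 5931.0],
    "classes": ["Chemistry", "CS, Chemistry", "CS"],
})

# True where the classes string contains the literal text "CS".
mask = df["classes"].str.contains("CS", regex=False)

# Boolean indexing keeps only the rows where mask is True.
cs_rows = df[mask]
print(cs_rows)
```

Like the 'CS' in x test, this is a substring match, so it would also select a class whose name merely contains "CS".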

Finally, pandas has special logic for assignment by index as well. You can set an entire column to a single scalar value without needing an explicit loop:

df['best_column'] = "best value"

In the above example, the dataframe will have a new column in which every value is "best value".

