简体   繁体   中英

Sorting rows in python pandas

I have a dataframe (the sample looks like this)

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,M       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots S                              S        330
Variation      2.7      Boots M                              M        330
Variable       3        Helmet           Helmet Sizes      E42,E41
Variation      3.8      Helmet E42                          E42       89
Variation      3.2      Helmet E41                          E41       89

What I want to do is sort the values based on Size so the final data frame should look like this:

  Type          SKU      Description   FullDescription        Size      Price
    Variable       2        Boots          Shoes on sale       S,M,XL        
    Variation      2.6      Boots S                             S       330
    Variation      2.7      Boots M                             M        330
    Variation      2.5      Boots XL                            XL        330
    Variable       3        Boots           Helmet Sizes       E41,E42
    Variation      3.2      Helmet E41                          E41       89
    Variation      3.8      Helmet E42                          E42       89

I am able to successfully get the results using this code

sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'Variable').cumsum()) 
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df

The issue is that the given code dosen't work on dataframe

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,3XL       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots 3XL                            3XL        330
Variation      2.7      Boots S                              S        330
Variable       3        Helmet           Helmet Sizes      S19, S9
Variation      3.8      Helmet E42                          S19       89
Variation      3.2      Helmet E41                          S9       89

it gives the results ' S,3XL,XL ' and ' S19,S9 ' whereas I want the results as

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       S,XL,3XL       
Variation      2.7      Boots S                             S          330
Variation      2.5      Boots XL                            XL        330
Variation      2.6      Boots 3XL                           3XL        330
Variable       3        Helmet           Helmet Sizes      S9,S19
Variation      3.2      Helmet E41                          S9        89
Variation      3.8      Helmet E42                          S19       89

also in case of more sizes, the order should be ' XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL ' and in case of second example, ' S9,S19,M9,M19,L9 and so on '

This is what I have done so far but it's not working and showing the wrong order

sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'variable').cumsum())
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)

Step 1: recreate the data

import pandas as pd

#----------------------#
# Recreate the dataset #
#----------------------#
# raw input data_1 = """ Variable|2|Boots|Shoes on sale|XL,S,M|  
                         Variation|2.5|Boots XL||XL|330 Variation|2.6|Boots S||S|330 
                         Variation|2.7|Boots M||M|330 Variable|3|Helmet|Helmet Sizes|E42,E41| 
                         Variation|3.8|Helmet E42||E42|89 
                         Variation|3.2|Helmet E41||E41|89"""

data_2 = """ Variable|2|Boots|Shoes on sale|XL,S,3XL| 
             Variation|2.5|Boots XL||XL|330 
             Variation|2.6|Boots 3XL||3XL|330 
             Variation|2.7|Boots S||S|330 
             Variable|3|Helmet|Helmet Sizes|S19, S9| 
             Variation|3.8|Helmet E42||S19|89 
             Variation|3.2|Helmet E41||S9|89"""

# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1  or data_1 + data_2

# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]

#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0]) df

Temp Result

Type    SKU     Description     FullDescription          Size   Price
0   Variable    2               Boots   Shoes on sale   XL,S,3XL    
1   Variation   2.5             Boots XL                XL          330
2   Variation   2.6             Boots 3XL               3XL         330
3   Variation   2.7             Boots S                 S           330
4   Variable    3               Helmet  Helmet Sizes    S19, S9     
5   Variation   3.8             Helmet E42              S19         89
6   Variation   3.2             Helmet E41              S9          89

Step 2: Create a priority dict

I'm not really into fashion + I'm also a guy --> (I'm only familiar with SML XL)
But feel free to re order them or add extra sizes into the list

# Prioritize the sizes
# ps, i don't know the order :) 
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict

Temp Result

{'XXS': 0,
 'XS': 1,
 'S': 2,
 'M': 3,
 'L': 4,
 'XL': 5,
 'XXL': 6,
 '3XL': 7,
 '4XL': 8,
 '5XL': 9,
 'E41': 10,
 'E42': 11,
 'S9': 12,
 'S19': 13}

Step 3: Create a list of tuples from the string of sizes

# Split the string  "SIZE" into a list    "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it  --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip())  for size in sizes.split(',')]) for sizes in df.Size]
df

Temp Result

Type    SKU     Description     FullDescription          Size       Price  TMP_SIZE
0   Variable    2               Boots   Shoes on sale   XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]
1   Variation   2.5             Boots XL                XL          330    [(5, XL)]
2   Variation   2.6             Boots 3XL               3XL         330    [(7, 3XL)]
3   Variation   2.7             Boots S                 S           330    [(2, S)]
4   Variable    3               Helmet  Helmet Sizes    S19, S9            [(12, S9), (13, S19)]
5   Variation   3.8             Helmet E42              S19         89     [(13, S19)]
6   Variation   3.2             Helmet E41              S9          89     [(12, S9)]

Step 4: Clean TEMP_SIZE

# Create a new SIZE
# loop over the TMPS_SIZE and create a string from the second value of the tuplelist --> ', '.join( my_list )

df['NEW_SIZE'] = [', '.join([ size[1]for size in sizes ]) for sizes in df["TMP_SIZE"] ]

Temp Result

Type    SKU     Description     ...     Size        Price  TMP_SIZE                       NEW_SIZE
0   Variable    2               ...     XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]  S, XL, 3XL
1   Variation   2.5             ...     XL          330    [(5, XL)]                    XL
2   Variation   2.6             ...     3XL         330    [(7, 3XL)]                   3XL
3   Variation   2.7             ...     S           330    [(2, S)]                     S
4   Variable    3               ...     S19, S9            [(12, S9), (13, S19)]        S9, S19
5   Variation   3.8             ...     S19         89     [(13, S19)]                  S19
6   Variation   3.2             ...     S9          89     [(12, S9)]                   S9

Step 5: grp

add your grp

#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df

Step 6: sort

in the last step, you can sort everything (I think that you need to sort the TMP_SIZE separately)

# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM