Sorting a column containing strings in Pandas

Question

I am new to Pandas and want to sort a column containing strings, then generate a numerical value that uniquely identifies each string. My data frame looks something like this:

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

First I would like to sort the 'year_week' column in ascending order (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.

Answer 1

You can first convert the year_week column with to_datetime, then sort by the result with sort_values, and finally use factorize:

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

# the '-0' suffix adds a weekday (Sunday) so that the %W week number can be
# resolved to a date; see http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
# sort by the date column
df.sort_values('date', inplace=True)
# create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print(df)
   key year_week       date  num
1    1    2015_1 2015-01-11    0
0    0   2015_10 2015-03-15    1
2    2   2015_11 2015-03-22    2
5    5    2016_3 2016-01-24    3
3    3    2016_9 2016-03-06    4
6    6    2016_9 2016-03-06    4
4    4   2016_10 2016-03-13    5
7    7   2016_10 2016-03-13    5
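
The sort has to come before factorize because factorize numbers values in order of first appearance, not in sorted order. A minimal sketch of the difference, using the same data as above (the variable names unsorted_codes, sorted_df, sorted_codes and uniques are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Without sorting, codes follow the order in which each string first appears,
# so they do not reflect chronological order.
unsorted_codes, _ = pd.factorize(df['year_week'])

# Sort chronologically first, then factorize: the codes now increase with time.
df['date'] = pd.to_datetime(df['year_week'] + '-0', format='%Y_%W-%w')
sorted_df = df.sort_values('date')
sorted_codes, uniques = pd.factorize(sorted_df['year_week'])

print(list(unsorted_codes))
print(list(sorted_codes))
print(list(uniques))
```

Repeated strings share one code in both cases; only the numbering order changes.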

Answer 2

## Method 1: suitable for large datasets

## Split the 'year_week' column into two columns
df[['year', 'week']] = df['year_week'].str.split('_', expand=True)

## Convert the new columns to integers
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')

## Sort the dataframe by the new columns
df = df.sort_values(['year', 'week'], ascending=True)

## Drop the helper year and week columns
df.drop(['year', 'week'], axis=1, inplace=True)

## Sorted dataframe
df
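
Method 1 sorts the rows but does not produce the numerical value the question asks for. One way to add it, sketched here as an assumption rather than as part of the original answer, is to factorize after the sort. The names parts, order, df_sorted and num are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Split 'year_week' into integer year and week parts in a separate frame,
# so no helper columns need to be dropped from df afterwards.
parts = df['year_week'].str.split('_', expand=True).astype(int)
parts.columns = ['year', 'week']

# Sort the helper frame numerically and reuse its index order on df.
order = parts.sort_values(['year', 'week']).index
df_sorted = df.loc[order].copy()

# Number each distinct year_week string in sorted order.
df_sorted['num'] = pd.factorize(df_sorted['year_week'])[0]

print(df_sorted)
```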


## Method 2: suitable for small datasets

## Make sure the column holds strings
df['year_week'] = df['year_week'].astype('str')

## List the categories in the order you want
cats = ['2015_1', '2015_10', '2015_11', '2016_3', '2016_9', '2016_10']

## Use pd.Categorical() to make the column an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)

## Sort by the 'year_week' column
df = df.sort_values('year_week')

## Sorted dataframe
df
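
Typing the category list by hand only works while the set of values is small and known. As a possible variation (not part of the original answer), the order could be derived from the data, and the categorical codes could serve as the numerical identifier. The sort key and the num column below are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Build the category order from the data: unique strings sorted by (year, week)
# compared as integers rather than as text.
cats = sorted(df['year_week'].unique(),
              key=lambda s: tuple(int(p) for p in s.split('_')))

df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
df = df.sort_values('year_week')

# Each category code is the integer position of the value in `cats`.
df['num'] = df['year_week'].cat.codes

print(df)
```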

Answer 1: 3, accepted, 2016-08-18 10:44:41

Answer 2: 0, 2021-01-23 14:23:06