
Sort a column containing strings in Pandas

I am new to Pandas and want to sort a column containing strings, then generate a numerical value that uniquely identifies each string. My data frame looks something like this:

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

First I would like to sort the 'year_week' column in ascending order (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.

You can first convert the year_week column with to_datetime, then sort by the result with sort_values, and finally use factorize:

import pandas as pd

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

# format trick from http://stackoverflow.com/a/17087427/2901002:
# append '-0' (Sunday) so the week number %W can be parsed together with a weekday %w
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
# sort by the parsed date column
df.sort_values('date', inplace=True)
# number each unique year_week in order of first appearance
df['num'] = pd.factorize(df.year_week)[0]
print(df)
   key year_week       date  num
1    1    2015_1 2015-01-11    0
0    0   2015_10 2015-03-15    1
2    2   2015_11 2015-03-22    2
5    5    2016_3 2016-01-24    3
3    3    2016_9 2016-03-06    4
6    6    2016_9 2016-03-06    4
4    4   2016_10 2016-03-13    5
7    7   2016_10 2016-03-13    5
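
The sort has to come before factorize because factorize numbers labels in order of first appearance, not by value. A minimal sketch of that behaviour, using made-up toy series (the names sorted_weeks and unsorted_codes are only for illustration):

```python
import pandas as pd

# Hypothetical toy labels, already in chronological order.
sorted_weeks = pd.Series(['2015_1', '2015_10', '2016_3', '2016_9', '2016_9', '2016_10'])
codes, uniques = pd.factorize(sorted_weeks)
# On sorted input the codes increase with time; repeated labels share a code.
print(codes.tolist())           # [0, 1, 2, 3, 3, 4]

# On unsorted input the codes simply follow arrival order.
unsorted_codes, _ = pd.factorize(pd.Series(['2016_9', '2015_1', '2016_9']))
print(unsorted_codes.tolist())  # [0, 1, 0]
```

So the 'num' column is only chronological because the frame was sorted by 'date' first.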

## 1st method: this applies to large datasets

## Split the "year_week" column into two columns
df[['year', 'week']] = df['year_week'].str.split("_", expand=True)

## Change the datatype of the newly created columns to integer
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')

## Sort the dataframe by the newly created columns
df = df.sort_values(['year', 'week'], ascending=True)

## Drop the helper year and week columns
df.drop(['year', 'week'], axis=1, inplace=True)

## Sorted dataframe
df
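
The first method only sorts; the question also asks for a numerical value per unique string. One possible way to get it with the same split columns is groupby(...).ngroup(), called before the helper columns are dropped. This is a sketch of that idea, not part of the original answer; the column name 'num' is borrowed from the other answer:

```python
import pandas as pd

df = pd.DataFrame({'key': range(8),
                   'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                                 '2016_10', '2016_3', '2016_9', '2016_10']})

# Split into integer year and week so the ordering is numeric, not lexical.
df[['year', 'week']] = df['year_week'].str.split('_', expand=True).astype(int)
df = df.sort_values(['year', 'week'])

# groupby sorts its keys by default, so ngroup() numbers the
# (year, week) pairs in ascending order.
df['num'] = df.groupby(['year', 'week']).ngroup()

# The helper columns are no longer needed once 'num' exists.
df = df.drop(columns=['year', 'week'])
print(df)
```

Rows that share a year_week value get the same number, so '2016_9' appears twice with the same 'num'.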


## 2nd method: this applies to small datasets

## Make sure the column holds strings
df['year_week'] = df['year_week'].astype('str')

## List the categories in the order you want
cats = ['2015_1', '2015_10', '2015_11', '2016_3', '2016_9', '2016_10']

## Use pd.Categorical() to turn the column into an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)

## Sort by the 'year_week' column
df = df.sort_values('year_week')

## Sorted dataframe
df
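
Typing the category list by hand gets tedious as the data grows. As a rough sketch (my own variation, not from the answer above), the order can be derived from the data by sorting the unique labels on their integer (year, week) parts, and the categorical codes then double as the numerical identifier:

```python
import pandas as pd

df = pd.DataFrame({'key': range(8),
                   'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                                 '2016_10', '2016_3', '2016_9', '2016_10']})

# Derive the category order: compare labels as (year, week) integer tuples.
cats = sorted(df['year_week'].unique(),
              key=lambda s: tuple(map(int, s.split('_'))))

df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
df = df.sort_values('year_week')

# .cat.codes is the position of each row's label in cats.
df['num'] = df['year_week'].cat.codes
print(df)
```

This assumes every label has the form 'YYYY_W' with integer parts; anything else would make the int() conversion fail.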


 