How to create a new column in a pandas DataFrame by referencing two other columns?

Question

I have a dataframe that looks something like this:

df = pd.DataFrame({'Name':['a','a','a','a','b','b','b'], 'Year':[1999,1999,1999,2000,1999,2000,2000], 'Name_id':[1,1,1,1,2,2,2]})

  Name  Name_id  Year
0    a        1  1999
1    a        1  1999
2    a        1  1999
3    a        1  2000
4    b        2  1999
5    b        2  2000
6    b        2  2000

What I'd like to have is a new column 'yr_name_id' that increases for each unique Name_id-Year combination and then begins anew with each new Name_id.

  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           1
5    b        2  2000           2
6    b        2  2000           2

I've tried a variety of things and looked at several related posts, including a few on groupby and enumerate.

At first I tried creating a unique dictionary after combining Name_id and Year and then using map to assign values, but when I try to combine Name_id and Year as strings via:

df['yr_name_id'] = str(df['Name_id']) + str(df['Year'])

The new column ends up holding a non-unique value that looks like 0 0 1\n1 1\n2 1\n3 1\n4 2\n5 2... which I don't really understand.
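A minimal sketch of what is going on here, using the sample frame from the question: calling str() on a whole Series returns a single string (the printed representation of the Series), whereas astype(str) converts each element, so + then concatenates row by row. The variable names whole and key are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Year': [1999, 1999, 1999, 2000, 1999, 2000, 2000],
    'Name_id': [1, 1, 1, 1, 2, 2, 2],
})

# str() on a Series gives ONE string: the text you would see when printing it
whole = str(df['Name_id'])
print(type(whole))  # <class 'str'>

# astype(str) converts element by element, so + works row-wise
key = df['Name_id'].astype(str) + '-' + df['Year'].astype(str)
print(key.tolist())
# ['1-1999', '1-1999', '1-1999', '1-2000', '2-1999', '2-2000', '2-2000']
```

Assigning the single string from str() to a column broadcasts it to every row, which is why the values are not unique.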

A more promising approach, where I think I just need help with the lambda, is to use groupby:

df['yr_name_id'] =  df.groupby(['Name_id', 'Year'])['Name_id'].transform(lambda x: )#unsure from this point

I am very unfamiliar with lambdas, so any guidance on how I might do this would be greatly appreciated.

Answer 1

IIUC you can do it this way:

In [99]: df['yr_name_id'] = pd.Categorical(pd.factorize(df['Name_id'].astype(str) + '-' + df['Year'].astype(str))[0] + 1)

In [100]: df
Out[100]:
  Name  Name_id  Year yr_name_id
0    a        1  1999          1
1    a        1  1999          1
2    a        1  1999          1
3    a        1  2000          2
4    b        2  1999          3
5    b        2  2000          4
6    b        2  2000          4

In [101]: df.dtypes
Out[101]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object

But looking at your desired DF, it looks like you want to categorize just the Year column, not the combination of Name_id + Year:

In [102]: df['yr_name_id'] = pd.Categorical(pd.factorize(df.Year)[0] + 1)

In [103]: df
Out[103]:
  Name  Name_id  Year yr_name_id
0    a        1  1999          1
1    a        1  1999          1
2    a        1  1999          1
3    a        1  2000          2
4    b        2  1999          1
5    b        2  2000          2
6    b        2  2000          2

In [104]: df.dtypes
Out[104]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object
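One caveat worth sketching: factorizing Year alone matches the desired output only because both names share the same years in the same order. If the numbering should restart for each Name_id regardless of which years appear, one possible variant is to factorize Year inside each Name_id group. The frame below is a hypothetical modification of the question's data in which 'b' has different years, chosen to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    # hypothetical data: 'b' uses years that 'a' never has
    'Year': [1999, 1999, 1999, 2000, 2001, 2002, 2002],
    'Name_id': [1, 1, 1, 1, 2, 2, 2],
})

# Factorizing Year over the whole frame keeps counting across Name_id values
global_codes = pd.factorize(df['Year'])[0] + 1
print(global_codes.tolist())  # [1, 1, 1, 2, 3, 4, 4]

# Factorizing Year inside each Name_id group restarts the numbering at 1
df['yr_name_id'] = (
    df.groupby('Name_id')['Year']
      .transform(lambda s: pd.factorize(s)[0] + 1)
)
print(df['yr_name_id'].tolist())  # [1, 1, 1, 2, 1, 2, 2]
```

Here factorize numbers the years in order of first appearance within each group, so the result depends on row order rather than on the year values themselves.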

Answer 2

Use itertools.count:

from itertools import count

counter = count(1)
df['yr_name_id'] = (df.groupby(['Name_id', 'Year'])['Name_id']
                      .transform(lambda x: next(counter)))

Output:

  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           3
5    b        2  2000           4
6    b        2  2000           4
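Note that this output numbers the groups 1 to 4 across the whole frame rather than restarting at each Name_id as the question asks. As a rough sketch of an alternative that avoids the mutable counter, groupby offers ngroup() for the global numbering (available in more recent pandas versions than this thread), and a dense rank of Year within each Name_id gives the restarting version. The names global_id and restart are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Year': [1999, 1999, 1999, 2000, 1999, 2000, 2000],
    'Name_id': [1, 1, 1, 1, 2, 2, 2],
})

# Global numbering of each (Name_id, Year) group, without a stateful counter
global_id = df.groupby(['Name_id', 'Year']).ngroup() + 1
print(global_id.tolist())  # [1, 1, 1, 2, 3, 4, 4]

# Dense rank of Year within each Name_id: restarts at 1 for every Name_id
restart = df.groupby('Name_id')['Year'].rank(method='dense').astype(int)
print(restart.tolist())  # [1, 1, 1, 2, 1, 2, 2]

df['yr_name_id'] = restart
```

The dense rank orders by the year values, so the earliest year in each Name_id always gets 1, independent of row order.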

2 answers

solution1
1 ACCEPTED 2016-09-22 18:48:41

solution2
0 2016-09-22 19:45:54
