
Iterate through rows of grouped pandas dataframe to create new columns

I'm new to Python and am trying to get to grips with Pandas for data analysis.

I wondered if anyone could help me loop through rows of grouped data in a dataframe to create new variables.

Suppose I have a dataframe called data that looks like this:

+----+-----------+--------+
| ID | YearMonth | Status |
+----+-----------+--------+
|  1 |    201506 |      0 |
|  1 |    201507 |      0 |
|  1 |    201508 |      0 |
|  1 |    201509 |      0 |
|  1 |    201510 |      0 |
|  2 |    201506 |      0 |
|  2 |    201507 |      1 |
|  2 |    201508 |      2 |
|  2 |    201509 |      3 |
|  2 |    201510 |      0 |
|  3 |    201506 |      0 |
|  3 |    201507 |      1 |
|  3 |    201508 |      2 |
|  3 |    201509 |      3 |
|  3 |    201510 |      4 |
+----+-----------+--------+

There are multiple rows for each ID, YearMonth is of the form yyyymm, and Status is the status at each YearMonth (it takes values 0 to 6).

I have managed to create columns showing the cumulative maximum status, and an Ever3 indicator (showing whether an ID has ever had a status of 3 or more, regardless of its current status), like this:

data['Max_Stat'] = data.groupby('ID')['Status'].cummax()

data['Ever3'] = np.where(data['Max_Stat'] >= 3, 1, 0)
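
For reference, here is a minimal self-contained sketch that rebuilds the sample data above and derives both columns (it assumes the rows are already sorted by YearMonth within each ID, as in the table):

import numpy as np
import pandas as pd

# Sample data from the table above
data = pd.DataFrame({
    'ID':        [1]*5 + [2]*5 + [3]*5,
    'YearMonth': [201506, 201507, 201508, 201509, 201510] * 3,
    'Status':    [0, 0, 0, 0, 0,   0, 1, 2, 3, 0,   0, 1, 2, 3, 4],
})

# Running maximum of Status within each ID
data['Max_Stat'] = data.groupby('ID')['Status'].cummax()

# 1 once an ID has ever reached status 3 or more, else 0
data['Ever3'] = np.where(data['Max_Stat'] >= 3, 1, 0)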

What I would also like to do is create other columns for metrics such as the number of times something has happened, or how long it has been since an event. For example:

  • Times3Plus : To show how many times the ID has had a status of 3 or more at that point in time

  • Into3 : Set to Y the first time the ID has a status of 3 or more (not for subsequent times)

+----+-----------+--------+----------+-------+------------+-------+
| ID | YearMonth | Status | Max_Stat | Ever3 | Times3Plus | Into3 |
+----+-----------+--------+----------+-------+------------+-------+
|  1 |    201506 |      0 |        0 |     0 |          0 |       |
|  1 |    201507 |      0 |        0 |     0 |          0 |       |
|  1 |    201508 |      0 |        0 |     0 |          0 |       |
|  1 |    201509 |      0 |        0 |     0 |          0 |       |
|  1 |    201510 |      0 |        0 |     0 |          0 |       |
|  2 |    201506 |      0 |        0 |     0 |          0 |       |
|  2 |    201507 |      1 |        1 |     0 |          0 |       |
|  2 |    201508 |      2 |        2 |     0 |          0 |       |
|  2 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  2 |    201510 |      0 |        3 |     1 |          1 |       |
|  3 |    201506 |      0 |        0 |     0 |          0 |       |
|  3 |    201507 |      1 |        1 |     0 |          0 |       |
|  3 |    201508 |      2 |        2 |     0 |          0 |       |
|  3 |    201509 |      3 |        3 |     1 |          1 | Y     |
|  3 |    201510 |      4 |        4 |     1 |          2 |       |
+----+-----------+--------+----------+-------+------------+-------+

I can do this quite easily in SAS, using BY and RETAIN statements, but I can't work out how to replicate it in Python.

In the end I managed to do this without iterating over each row, as I'm not sure what I was originally trying to do is possible. I had wanted to set up counters or indicators at group level, as is possible in SAS, and modify them row by row, e.g. something like:

Times3Plus = 0
if row['Status'] >= 3:
    Times3Plus += 1
return Times3Plus
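
That RETAIN-style pattern can in fact be replicated by looping over each group explicitly. A minimal sketch (slower than the vectorised approach I ended up with below, and again assuming rows are sorted by YearMonth within each ID):

# Retained counter per ID, rebuilt with an explicit loop over groups
pieces = []
for _, group in data.groupby('ID'):
    times3plus = 0                     # counter resets for each new ID
    vals = []
    for status in group['Status']:
        if status >= 3:
            times3plus += 1
        vals.append(times3plus)
    pieces.append(pd.Series(vals, index=group.index))
# concatenated pieces align back to the original rows by index
data['Times3Plus'] = pd.concat(pieces).sort_index()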

In the end, I created a binary 3Plus indicator:

data['3Plus'] = np.where(data['Status'] >= 3, 1, 0)

Then I used groupby to summarise these and create Times3Plus at group level:

data['Times3Plus'] = data.groupby(['ID'])['3Plus'].cumsum() 
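
As an aside, the intermediate 3Plus column isn't strictly necessary; the same result can be had in one step by cumulatively summing the boolean comparison within each ID:

# Equivalent one-liner: count Status >= 3 cumulatively within each ID
data['Times3Plus'] = (data['Status'] >= 3).groupby(data['ID']).cumsum()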

Into3 could then be populated using a function:

def into3(row):
    if row['3Plus'] == 1 and row['Times3Plus'] == 1:  # i.e. it is the first time
        return 'Y'

data['Into3'] = data.apply(into3, axis=1)
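
The apply can also be avoided: since 3Plus and Times3Plus already exist, a single np.where over the two columns gives the same flag (writing 'Y' on the first 3+ month and an empty string elsewhere, to match the desired output above):

# First 3+ month for each ID: it is a 3+ row and the running count is 1
data['Into3'] = np.where((data['3Plus'] == 1) & (data['Times3Plus'] == 1), 'Y', '')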
