
Fill rows of dataframe based on condition of other row

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({"ID1": ["A", "B", "C", "A", "C", "C", "A"],
                   "ID2": ["a", "b", "c", "a", "e", "c", "b"],
                   "Month": [1, 4, 7, 4, 2, 9, 3],
                   "Value": [10, 20, 40, 60, 20, 30, 10]})
ID1 ID2  Month  Value
A   a      1     10
B   b      4     20
C   c      7     40
A   a      4     60
C   e      2     20
C   c      9     30
A   b      3     10

I want to fill in the values for the missing months with the value of the preceding month for the same "ID1"+"ID2" combination. For example, there is no value for months 2 and 3 of the combination "A"+"a", so those months should take the value of month 1. At month 4 there is a value for "A"+"a", so that value should be carried forward until another month has a value.

For the combination "C"+"c", the values should start at month 7, because that is the first month with a value for that combination; the months before it should be 0.

The end dataframe should look like this:

ID1 ID2  Month  Value
A   a      1     10
A   a      2     10
A   a      3     10
A   a      4     60
A   a      5     60
A   a      6     60
A   a      7     60
A   a      8     60
A   a      9     60
A   a      10    60
A   a      11    60
A   a      12    60
B   b      4     20
C   c      1     0
C   c      2     0
C   c      3     0
C   c      4     0
C   c      5     0
C   c      6     0
C   c      7     40
C   c      8     40
C   c      9     30
C   c      10    30
C   c      11    30
C   c      12    30
... ...    ...   ...

My first approach is probably inefficient:

  1. Loop over the months 1 to 12.

  2. Loop over the unique combinations of "ID1"+"ID2".

  3. If a row for the "ID1"+"ID2" combination and the month exists, go to the next month.

  4. Otherwise, look at the previous month of the "ID1"+"ID2" combination:

    If a value exists, take that value.

    Otherwise, set the value to 0.
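The steps above can be sketched as a plain-Python baseline. This is only an illustration, not the code from the question: it puts the combinations in the outer loop and the months in the inner loop (the reverse of the listed order) so that the carried-forward value fits in one variable, and the names known, rows and filled are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"ID1": ["A", "B", "C", "A", "C", "C", "A"],
                   "ID2": ["a", "b", "c", "a", "e", "c", "b"],
                   "Month": [1, 4, 7, 4, 2, 9, 3],
                   "Value": [10, 20, 40, 60, 20, 30, 10]})

# Lookup table: (ID1, ID2, Month) -> Value for the rows that exist.
known = df.set_index(["ID1", "ID2", "Month"])["Value"].to_dict()

rows = []
for id1, id2 in df[["ID1", "ID2"]].drop_duplicates().itertuples(index=False):
    last = 0  # used for months before the first observed value
    for month in range(1, 13):
        # Take the existing value if there is one, otherwise keep the previous one.
        last = known.get((id1, id2, month), last)
        rows.append((id1, id2, month, last))

filled = pd.DataFrame(rows, columns=["ID1", "ID2", "Month", "Value"])
```

This produces 12 rows per combination, but the Python-level loop does not scale well to large frames.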

Is there a better way to do this or maybe a package that could help me calculate this efficiently?

Define the following function to process each group:

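import numpy as np
import pandas as pd

def proc(grp):
    # Put this group's values on months 1..12, carry the last known value
    # forward, and use 0 for the months before the first observation.
    wrk = (grp.set_index('Month')['Value'].reindex(np.arange(1, 13))
           .ffill().fillna(0).astype(int))
    # Assumes ID1 and ID2 are the first two columns of grp.
    id1, id2 = grp.iloc[0, :2].tolist()
    wrk.index = pd.MultiIndex.from_product([[id1], [id2], wrk.index],
        names=['ID1', 'ID2', 'Month'])
    return wrk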
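To see what the reindex / ffill / fillna chain does for a single group, here is a small sketch on the ('C', 'c') values alone. The Series is built by hand for illustration, and the names s, full, filled and final are not part of the answer.

```python
import pandas as pd

# Values of the ('C', 'c') group, indexed by month: only months 7 and 9 exist.
s = pd.Series([40, 30], index=pd.Index([7, 9], name="Month"), name="Value")

full = s.reindex(range(1, 13))         # months 1..12, NaN where no row existed
filled = full.ffill()                  # carry the last known value forward
final = filled.fillna(0).astype(int)   # months before the first value become 0

print(final.tolist())
# [0, 0, 0, 0, 0, 0, 40, 40, 30, 30, 30, 30]
```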

Then, to get your expected result, group df by ID1 and ID2 and apply the above function:

result = df.groupby(['ID1', 'ID2'], group_keys=False).apply(proc).reset_index()

The last step is reset_index() to convert the resulting (concatenated) Series into a DataFrame.

A fragment of the result for groups ('A', 'a') and ('C', 'c') is:

   ID1 ID2  Month  Value
0    A   a      1     10
1    A   a      2     10
2    A   a      3     10
3    A   a      4     60
4    A   a      5     60
5    A   a      6     60
6    A   a      7     60
7    A   a      8     60
8    A   a      9     60
9    A   a     10     60
10   A   a     11     60
11   A   a     12     60
...
36   C   c      1      0
37   C   c      2      0
38   C   c      3      0
39   C   c      4      0
40   C   c      5      0
41   C   c      6      0
42   C   c      7     40
43   C   c      8     40
44   C   c      9     30
45   C   c     10     30
46   C   c     11     30
47   C   c     12     30
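As a possible alternative that avoids groupby().apply(), the same fill can be sketched by reshaping to one row per combination and one column per month. This is not the method from the answer above; it assumes each (ID1, ID2, Month) triple occurs at most once (otherwise unstack raises an error), and the names months, wide and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ID1": ["A", "B", "C", "A", "C", "C", "A"],
                   "ID2": ["a", "b", "c", "a", "e", "c", "b"],
                   "Month": [1, 4, 7, 4, 2, 9, 3],
                   "Value": [10, 20, 40, 60, 20, 30, 10]})

months = pd.Index(range(1, 13), name="Month")

wide = (df.set_index(["ID1", "ID2", "Month"])["Value"]
          .unstack("Month")          # one row per (ID1, ID2), one column per observed month
          .reindex(columns=months)   # add the missing months as NaN columns
          .ffill(axis=1)             # carry values forward along each row
          .fillna(0)                 # months before the first value become 0
          .astype(int))

# Back to long format: one row per (ID1, ID2, Month).
result = wide.stack().rename("Value").reset_index()
```

The wide intermediate frame has one cell per combination and month, so this fits best when the number of combinations is moderate.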
