
Iterate over a data frame and combine values from different columns

I have two data frames:

Support data:

import pandas as pd

support_data = {
    'index_value': [
        100,
        250,
        500,
        30,
        10
    ]
}
support_df = pd.DataFrame(support_data)

index_value
0   100
1   250
2   500
3   30
4   10

Main data:

data = {
    'link_index': [
        '0', '0',
        '0', '1',
        '2', '3',
        '3', '4',
        '4', '4'
    ],
    'value_1': [
        '1', '2',
        '3', '4',
        '5', '6',
        '7', '8',
        '9', '0'
    ],
    'value_2': [
        '11', '28',
        '33', '40',
        '50', '60',
        '70', '80',
        '90', '100'
    ]
}
df = pd.DataFrame(data)

link_index  value_1 value_2
0   0   1   11
1   0   2   28
2   0   3   33
3   1   4   40
4   2   5   50
5   3   6   60
6   3   7   70
7   4   8   80
8   4   9   90
9   4   0   100

I need to slice the data frame, zip value_1 and value_2 together, and append the value from the support data frame matched by link_index.

I have a working solution, but it is slow. Maybe there is a faster approach.

My solution and result:

This function zips the values and appends the value from the support data frame:

def write(group):
    # interleave value_1 and value_2 of the slice
    value_1 = group.value_1.tolist()
    value_2 = group.value_2.tolist()
    result = [b for a in zip(value_1, value_2) for b in a]
    # look up the support value via the slice's link_index
    index = group.link_index.astype(int).iloc[0]
    result.append(support_df.index_value.iloc[index])
    result = ','.join(str(e) for e in result)
    return result
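
For example (using the data above), calling write on the first two-row slice reproduces the first value of the result shown below:

print(write(df.iloc[0:2]))
# 1,11,2,28,100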

This loop splits the data frame into overlapping slices of length nrows with step nrows - overlap:

result = pd.DataFrame()
overlap = 1
nrows = 2
for i in range(0, len(df) - overlap, nrows - overlap):
    row = write(df.iloc[i : i + nrows])
    result = result.append(pd.DataFrame({'seq': [row]}), ignore_index=True)

Result:

seq
0   1,11,2,28,100
1   2,28,3,33,100
2   3,33,4,40,100
3   4,40,5,50,250
4   5,50,6,60,500
5   6,60,7,70,30
6   7,70,8,80,30
7   8,80,9,90,10
8   9,90,0,100,10

I am looking for a faster solution.

You can try this (I haven't compared speed but this doesn't involve any for loops):

# prepare data type of link_index to merge
support_df = support_df.reset_index().rename(columns={'index':'link_index'})
support_df['link_index'] = support_df['link_index'].astype(str)
merged = pd.merge(df, support_df, on="link_index")

# split data into two halves with an offset
left = merged[['value_1', 'value_2', 'index_value']].iloc[:-1].reset_index(drop=True)
right = merged[['value_1', 'value_2']].iloc[1:].reset_index(drop=True)

# rename duplicate columns before concatenating them
left = left.rename(columns={'value_1':'left_1', 'value_2':'left_2'})
right = right.rename(columns={'value_1':'right_1', 'value_2':'right_2'})

# rejoin data and convert to Series
result = pd.concat([left, right], axis=1)
result = result[['left_1', 'left_2', 'right_1', 'right_2', 'index_value']]
seq = pd.Series(result.values.tolist())
print(seq)

Output:

0    [1, 11, 2, 28, 100]
1    [2, 28, 3, 33, 100]
2    [3, 33, 4, 40, 100]
3    [4, 40, 5, 50, 250]
4    [5, 50, 6, 60, 500]
5     [6, 60, 7, 70, 30]
6     [7, 70, 8, 80, 30]
7     [8, 80, 9, 90, 10]
8    [9, 90, 0, 100, 10]
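
If you need the same comma-separated string format as in the question, one option is to join each list afterwards; a small sketch (seq_str is just an illustrative name):

# optional: join each list into the comma-separated format from the question
seq_str = seq.map(lambda values: ','.join(str(v) for v in values))
print(seq_str)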

