Find probability of sequence conversion for sub sequences and their combinations

I have data that looks like this:

sno     Sequence                                conversion 
1       A-B-C-D-E-B-A                                1
2       A-B-C-D                                      0
3       A-B-C-D                                      1
4       D-E-H-I-A                                    0
5       Z                                            0
6       A-Z                                          0
7       F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D      1
8       H-S-A-T-J-F-E                                0

The data contains sequences that may start and end with any element. Each sequence has a conversion flag: 1 if the sequence converted and 0 if it did not. I want to find out how individual parts of a sequence influence conversion by computing the conditional probability of conversion for each element and for each combination of elements (sub-sequence). For example, if A is encountered in a sequence, the conversion probability of the whole sequence goes up by 2%. If A-B-C is encountered as a combination, the probability of conversion goes up by 13%. If Z-A is encountered, it goes up by 8%.
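To make the quantity concrete: one reading of the conditional probability is the share of converting rows among the rows that contain the sub-sequence. A minimal sketch on the sample rows, assuming a sub-sequence must appear as a contiguous run of elements; `contains_run` and `conversion_rate` are hypothetical helper names.

```python
# Sample rows from the question: (sequence, conversion flag)
rows = [
    ("A-B-C-D-E-B-A", 1),
    ("A-B-C-D", 0),
    ("A-B-C-D", 1),
    ("D-E-H-I-A", 0),
    ("Z", 0),
    ("A-Z", 0),
    ("F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D", 1),
    ("H-S-A-T-J-F-E", 0),
]


def contains_run(tokens, run):
    """True if `run` occurs in `tokens` as a contiguous run (assumed reading)."""
    n = len(run)
    return any(tokens[i:i + n] == run for i in range(len(tokens) - n + 1))


def conversion_rate(rows, sub_sequence):
    """Share of converting rows among rows containing `sub_sequence`.

    Returns None when no row contains the sub-sequence.
    """
    run = sub_sequence.split("-")
    flags = [conv for seq, conv in rows if contains_run(seq.split("-"), run)]
    return sum(flags) / len(flags) if flags else None


# Overall conversion rate, for comparison
baseline = sum(conv for _, conv in rows) / len(rows)

print(baseline)
print(conversion_rate(rows, "A"))
print(conversion_rate(rows, "A-B-C"))
```

Comparing each rate with `baseline` is one way to express how much a sub-sequence moves the probability up or down.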

How do I make a table like this?

Sno   Sub-sequence    probability_of_conversion
1         A                2%
2         B                1%
3         C                4%
......
4         A-B-C            13%
5         Z-A              8%

Something like this:

import pandas as pd


# input data
input_ = [('A-B-C-D-E-B-A', 1), ('A-B-C-D', 0), ('A-B-C-D', 1),
        ('D-E-H-I-A', 0), ('Z', 0), ('A-Z', 0),
        ('F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D', 1),
        ('H-S-A-T-J-F-E', 0)]
# the rest of the script refers to this frame as `data`
data = pd.DataFrame(input_, columns=['sequence', 'conversion'])


# generate sub-sequences
def get_sub_sequences(sequence):
    total = len(sequence)
    for i in range(total):
        for j in range(i+1, total+1):
            yield sequence[i:j]

            
# populate sub-sequences
sub_sequences = []
for sequence in data.sequence:
    for sub_sequence in get_sub_sequences(sequence.split('-')):
        sub_sequence = '-'.join(sub_sequence)
        if sub_sequence not in sub_sequences:
            sub_sequences.append(sub_sequence)
sub_sequences = sorted(sub_sequences, key=len)
            

# populate probabilities
probabilities = []
for sub_sequence in sub_sequences:
    values = []
    for row in data.itertuples():
        # plain substring test; safe here because every element is one character
        if sub_sequence in row.sequence:
            values.append(row.conversion)
    probability = round((sum(values) / len(values) * 100))
    probabilities.append(f'{probability}%')


# output data
output = pd.DataFrame(zip(sub_sequences, probabilities),
                      columns=['sub_sequence', 'probability'])
output

Expected Output: [image not available]
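The question phrases the goal as how much the probability "goes up", which a plain conditional probability does not show directly. As a rough sketch, assuming the uplift is measured against the overall conversion rate and that a sub-sequence must appear as a contiguous run, the variant below counts each sub-sequence once per row and reports support, rate, and uplift. It uses only the standard library; the names `sub_sequences`, `seen`, `converted`, and `table` are illustrative and not part of the code above.

```python
from collections import Counter

# Sample rows from the question: (sequence, conversion flag)
rows = [
    ("A-B-C-D-E-B-A", 1),
    ("A-B-C-D", 0),
    ("A-B-C-D", 1),
    ("D-E-H-I-A", 0),
    ("Z", 0),
    ("A-Z", 0),
    ("F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D", 1),
    ("H-S-A-T-J-F-E", 0),
]


def sub_sequences(tokens):
    """Return the set of distinct contiguous runs in a list of elements."""
    return {
        "-".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }


seen = Counter()       # number of rows containing each sub-sequence
converted = Counter()  # number of converting rows containing it
for sequence, conversion in rows:
    # the set guarantees a row is counted once per sub-sequence
    for sub in sub_sequences(sequence.split("-")):
        seen[sub] += 1
        converted[sub] += conversion

# Overall conversion rate (the assumed reference point for "uplift")
baseline = sum(conversion for _, conversion in rows) / len(rows)

# (sub-sequence, support, rate, uplift), shortest sub-sequences first
table = sorted(
    (
        (sub, seen[sub], converted[sub] / seen[sub],
         converted[sub] / seen[sub] - baseline)
        for sub in seen
    ),
    key=lambda r: (r[0].count("-"), r[0]),
)

for sub, support, rate, uplift in table[:5]:
    print(f"{sub:<8} support={support} rate={rate:.0%} uplift={uplift:+.0%}")
```

With only eight rows, most sub-sequences occur in a single row, so their rate is either 0% or 100%; filtering on the support column is one way to keep only estimates backed by several rows.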
