I have data that looks like this:
sno Sequence conversion
1 A-B-C-D-E-B-A 1
2 A-B-C-D 0
3 A-B-C-D 1
4 D-E-H-I-A 0
5 Z 0
6 A-Z 0
7 F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D 1
8 H-S-A-T-J-F-E 0
The data contains sequences that can start and end with any element. Each sequence has a conversion flag: 1 if the sequence converted and 0 if it did not. I want to find out how the individual parts of a sequence influence conversion by computing the conditional probability of conversion for each element and for each combination of elements (sub-sequence). For example: if A is encountered in the sequence, the conversion probability of the whole sequence goes up by 2%; if A-B-C is encountered as a combination, it goes up by 13%; if Z-A is encountered, it goes up by 8%.
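To make "conditional probability of conversion" concrete, here is a minimal sketch for the single element A, using only the sample rows above. The names `rows`, `conversion_rate`, `with_a` and `uplift` are made up for illustration, and reading "goes up by" as the difference from the overall conversion rate is an assumption about what is wanted.

```python
# Sample rows from the question: (sequence, conversion flag).
rows = [
    ('A-B-C-D-E-B-A', 1), ('A-B-C-D', 0), ('A-B-C-D', 1),
    ('D-E-H-I-A', 0), ('Z', 0), ('A-Z', 0),
    ('F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D', 1),
    ('H-S-A-T-J-F-E', 0),
]

def conversion_rate(rows):
    """Fraction of the given rows whose conversion flag is 1."""
    return sum(flag for _, flag in rows) / len(rows)

# P(conversion): the rate over all rows.
baseline = conversion_rate(rows)

# P(conversion | A in sequence): the rate over rows that contain element A.
with_a = [(seq, flag) for seq, flag in rows if 'A' in seq.split('-')]
p_given_a = conversion_rate(with_a)

# Assumed meaning of "goes up by": difference from the baseline.
uplift = p_given_a - baseline
print(f'baseline={baseline:.1%} P(conv|A)={p_given_a:.1%} uplift={uplift:+.1%}')
```

On this sample, 3 of the 7 rows containing A convert, against 3 of 8 rows overall.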
How do I make a table like this:
Sno Sub-sequence probability_of_conversion
1 A 2%
2 B 1%
3 C 4%
......
4 A-B-C 13%
5 Z-A 8%
Something like this:
import pandas as pd

# input data
data = [('A-B-C-D-E-B-A', 1), ('A-B-C-D', 0), ('A-B-C-D', 1),
        ('D-E-H-I-A', 0), ('Z', 0), ('A-Z', 0),
        ('F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D', 1),
        ('H-S-A-T-J-F-E', 0)]
data = pd.DataFrame(data, columns=['sequence', 'conversion'])

# generate every contiguous sub-sequence of a list of elements
def get_sub_sequences(sequence):
    total = len(sequence)
    for i in range(total):
        for j in range(i+1, total+1):
            yield sequence[i:j]

# populate the distinct sub-sequences found in the data
sub_sequences = []
for sequence in data.sequence:
    for sub_sequence in get_sub_sequences(sequence.split('-')):
        sub_sequence = '-'.join(sub_sequence)
        if sub_sequence not in sub_sequences:
            sub_sequences.append(sub_sequence)
# shortest sub-sequences first (sorted by string length)
sub_sequences = sorted(sub_sequences, key=len)

# populate probabilities: share of rows containing the sub-sequence that converted
probabilities = []
for sub_sequence in sub_sequences:
    values = []
    for row in data.itertuples():
        # substring test on the joined string
        if sub_sequence in row.sequence:
            values.append(row.conversion)
    probability = round(sum(values) / len(values) * 100)
    probabilities.append(f'{probability}%')

# output data
output = pd.DataFrame(zip(sub_sequences, probabilities),
                      columns=['sub_sequence', 'probability'])
output
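One caveat about the check `sub_sequence in row.sequence`: it is a plain substring test on the joined string. That is fine for the single-letter elements in the sample, but with longer element names it could match across element boundaries (for example 'A-B' is a substring of 'AA-B'). A possible variant that compares whole elements and also reports the change relative to the overall conversion rate is sketched below. It uses only the standard library; `contiguous_runs` and `uplift_table` are hypothetical helper names, and defining "uplift" as the rate minus the baseline is an assumption about what the question means by "goes up by".

```python
from collections import defaultdict

# Sample rows from the question: (sequence, conversion flag).
rows = [
    ('A-B-C-D-E-B-A', 1), ('A-B-C-D', 0), ('A-B-C-D', 1),
    ('D-E-H-I-A', 0), ('Z', 0), ('A-Z', 0),
    ('F-E-T-H-S-A-T-J-F-E-D-E-S-X-G-N-N-K-L-D', 1),
    ('H-S-A-T-J-F-E', 0),
]

def contiguous_runs(tokens):
    """Return the set of distinct contiguous sub-sequences, as tuples."""
    n = len(tokens)
    return {tuple(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)}

def uplift_table(rows):
    """Map each sub-sequence to (rows containing it, conversion rate, uplift)."""
    baseline = sum(flag for _, flag in rows) / len(rows)
    counts = defaultdict(lambda: [0, 0])  # run -> [rows seen, conversions]
    for sequence, flag in rows:
        # Using a set means a run repeated inside one row is counted once.
        for run in contiguous_runs(sequence.split('-')):
            counts[run][0] += 1
            counts[run][1] += flag
    return {
        '-'.join(run): (seen, converted / seen, converted / seen - baseline)
        for run, (seen, converted) in counts.items()
    }

table = uplift_table(rows)
for name in ('A', 'A-B-C', 'Z'):
    seen, rate, uplift = table[name]
    print(f'{name}: rows={seen} rate={rate:.1%} uplift={uplift:+.1%}')
```

With only eight rows most sub-sequences appear in one or two rows, so the rates are very noisy; keeping the row count next to each rate makes it easier to filter out sub-sequences with too little support.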
Expected Output: