Let's say I have the two dataframes below.
In reality, both dataframes will be around a million rows each, so I would like to find the most efficient way to compare them.
The overall goal is to count the number of times each FeatureID is found in a gene, and to capture the position information for use downstream.
import pandas as pd

# break fasta_df sequences and mutation seqs up into kmers
data = [{"gene": "pik3ca", "start": "179148724", "stop": "179148949",
         "seq": "TTTGCTTTATCTTTTGTTTTTGCTTTAGCTGAAGTATTTTAAAGTCAGTTACAG"},
        {"gene": "brca1", "start": "179148724", "stop": "179148949",
         "seq": "CAATATCTACCATTTGTTAACTTTGTTCTATTATCATAACTACCAAAATTAACAGA"},
        {"gene": "kras1", "start": "179148724", "stop": "179148949",
         "seq": "AAAACCCAGTAGATTTTCAAATTTTCCCAACTCTTCCACCAATGTCTTTTTACATCT"}]
# test dataframe with input seq
df1 = pd.DataFrame(data)

data2 = [{"FeatureID": "1_1_15", "BaseCall": "TTTGTT"},
         {"FeatureID": "1_1_15", "BaseCall": "AATATC"},
         {"FeatureID": "1_1_16", "BaseCall": "GTTTTT"},
         {"FeatureID": "1_1_16", "BaseCall": "GTTCTA"}]
df2 = pd.DataFrame(data2)
The output should look something like:
| gene   | FeatureID | BaseCall | Position |
|--------|-----------|----------|----------|
| pik3ca | 1_1_15    | TTTGTT   | 12       |
| pik3ca | 1_1_16    | GTTTTT   | 15       |
| brca1  | 1_1_16    | GTTCTA   | 24       |
| brca1  | 1_1_15    | AATATC   | 1        |
| brca1  | 1_1_15    | TTTGTT   | 12       |
| brca1  | 1_1_15    | TTTGTT   | 21       |
This ngrams function seems to work great when I use just one test BaseCall against one seq, but I'm having trouble figuring out the most efficient way to use apply when the arguments come from two different dataframes. Or perhaps there is an even better way to find matching strings and their positions between two dataframes?
def ngrams(string, target):
    # slide a 6-wide window over string and record where it equals target
    grams = zip(*[string[i:] for i in range(6)])
    output = [''.join(g) for g in grams]
    indices = [(i, x) for i, x in enumerate(output) if x == target]
    return indices
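For example, testing it against the pik3ca sequence from df1 (the function is reproduced here so the snippet runs on its own):

```python
def ngrams(string, target):
    # slide a 6-wide window over string and record where it equals target
    grams = zip(*[string[i:] for i in range(6)])
    output = [''.join(g) for g in grams]
    return [(i, x) for i, x in enumerate(output) if x == target]

pik3ca = "TTTGCTTTATCTTTTGTTTTTGCTTTAGCTGAAGTATTTTAAAGTCAGTTACAG"
print(ngrams(pik3ca, "TTTGTT"))  # -> [(12, 'TTTGTT')]
print(ngrams(pik3ca, "GTTTTT"))  # -> [(15, 'GTTTTT')]
```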
To account for possible multiple occurrences of the same BaseCall in a given seq, we can use re.finditer() and some pandas hacking:
import re

def match_basecall(pattern, string):
    # start position of every non-overlapping match, or None if there are none
    match = re.finditer(pattern, string)
    start_pos = [m.start() for m in match]
    if not start_pos:
        return None
    return start_pos
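As a quick sanity check on a single sequence before applying it frame-wide (the helper is duplicated here so the snippet is self-contained):

```python
import re

def match_basecall(pattern, string):
    # start position of every non-overlapping match, or None if there are none
    start_pos = [m.start() for m in re.finditer(pattern, string)]
    return start_pos or None

brca1 = "CAATATCTACCATTTGTTAACTTTGTTCTATTATCATAACTACCAAAATTAACAGA"
print(match_basecall("TTTGTT", brca1))  # -> [12, 21]
print(match_basecall("AATATC", brca1))  # -> [1]
```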
matches = df2.BaseCall.apply(lambda bc: df1.seq.apply(lambda s: match_basecall(bc, s)))
matches.columns = df1.gene

merged = matches.merge(df2, left_index=True, right_index=True)
melted = merged.melt(id_vars=["FeatureID", "BaseCall"],
                     var_name="gene",
                     value_name="Position").dropna()
melted
FeatureID BaseCall gene Position
0 1_1_15 TTTGTT pik3ca [12]
2 1_1_16 GTTTTT pik3ca [15]
4 1_1_15 TTTGTT brca1 [12, 21]
5 1_1_15 AATATC brca1 [1]
7 1_1_16 GTTCTA brca1 [24]
Multiple BaseCall matches are represented as list items in Position, but our desired output puts each match on a separate row. We can use apply(pd.Series) to explode the column of lists into multiple columns, and then stack() to swing those columns into rows:
stacked = (pd.DataFrame(melted.Position.apply(pd.Series).stack())
           .reset_index(level=1, drop=True)
           .rename(columns={0: "Position"}))
final = melted.drop(columns="Position").merge(stacked, left_index=True, right_index=True)
final
FeatureID BaseCall gene Position
0 1_1_15 TTTGTT pik3ca 12.0
2 1_1_16 GTTTTT pik3ca 15.0
4 1_1_15 TTTGTT brca1 12.0
4 1_1_15 TTTGTT brca1 21.0
5 1_1_15 AATATC brca1 1.0
7 1_1_16 GTTCTA brca1 24.0
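As an aside, on pandas 0.25 or newer, DataFrame.explode does the list-to-rows step in one call. A sketch on a miniature, hand-built version of the melted frame:

```python
import pandas as pd

# miniature stand-in for the melted frame, with list-valued Position
melted = pd.DataFrame({
    "FeatureID": ["1_1_15", "1_1_15"],
    "BaseCall": ["TTTGTT", "AATATC"],
    "gene": ["brca1", "brca1"],
    "Position": [[12, 21], [1]],
})

# each list element gets its own row; the other columns are repeated
final = melted.explode("Position")
print(final)
```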
We can groupby FeatureID and gene to get occurrence totals:
final.groupby(["FeatureID", "gene"]).Position.count()
FeatureID gene
1_1_15 brca1 3
pik3ca 1
1_1_16 brca1 1
pik3ca 1
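If a flat table is handier downstream, the same counts can be materialized as a regular DataFrame via Series.reset_index (sketch, rebuilding the final frame above by hand):

```python
import pandas as pd

# hand-built copy of the final frame from above
final = pd.DataFrame({
    "FeatureID": ["1_1_15", "1_1_15", "1_1_15", "1_1_15", "1_1_16", "1_1_16"],
    "gene": ["pik3ca", "brca1", "brca1", "brca1", "pik3ca", "brca1"],
    "Position": [12, 12, 21, 1, 15, 24],
})

# count positions per (FeatureID, gene), then flatten the index into columns
counts = (final.groupby(["FeatureID", "gene"])
               .Position.count()
               .reset_index(name="count"))
print(counts)
```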
Notes: per the OP's expected output, combinations with no matches are excluded. We also assume here that BaseCall is a single column, and that there are not separate Basecall and BaseCall columns.