I have a column in a dataframe that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:
import pandas as pd

d = [['ampC','tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'], ['yifL','acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'], ['glyW','tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg']]
df = pd.DataFrame(d, columns=['gene', 'Sequence'])
gene | Sequence |
---|---|
ampC | tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc |
yifL | acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat |
glyW | tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg |
Using str.extract(r"(.*?)[A-Z]+", expand=True) I can get everything before the capital letter, but I need help figuring out how to get the capital letter as well.
Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA
The second thing I need is the letter 16 positions before the capital letter. Example of what I'm trying to get for the following 3 genes:
gene | letter |
---|---|
ampC | c |
yifL | g |
glyW | t |
[c, g, t]
You may try:
df["SubSequence"] = df["Sequence"].str.extract(r'^(.*?[A-Z])')
df["letter"] = df["Sequence"].str.extract(r'^[acgt]*([acgt])[acgt]{15}[A-Z]')
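To see how these two patterns behave, here is a small self-contained sketch run on the sample data from the question. The `expand=False` argument is my own addition (it makes `str.extract` return a Series instead of a one-column DataFrame), and the second pattern assumes everything before the capital letter consists only of `a`, `c`, `g`, `t`.

```python
import pandas as pd

d = [
    ["ampC", "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc"],
    ["yifL", "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat"],
    ["glyW", "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg"],
]
df = pd.DataFrame(d, columns=["gene", "Sequence"])

# ^(.*?[A-Z]): lazily match from the start of the string up to and
# including the first capital letter, and capture all of it.
df["SubSequence"] = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# ([acgt]) captures a single base; [acgt]{15} then consumes exactly the
# 15 bases that sit between that base and the capital letter, so the
# captured base is 16 positions before the capital.
df["letter"] = df["Sequence"].str.extract(
    r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False
)

print(df[["gene", "letter"]])
```

If your real sequences contain other lowercase symbols before the capital (for example `n`), the character class `[acgt]` would need to be widened, otherwise those rows come back as NaN.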
Your regular expression is almost what you need. Just move the capital letters inside the group. Try with:
df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
df["letter"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0].str[-17]
>>> df[["gene", "letter"]]
   gene letter
0  ampC      c
1  yifL      g
2  glyW      t
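As a variation on the slicing approach, the sketch below runs the regex once and keeps the offset in a named constant. `OFFSET` and the two extra rows (`noCap`, `short`) are hypothetical, added only to show what happens when a sequence has no capital letter or the capital sits too close to the start. In both cases pandas yields NaN rather than raising.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gene": ["ampC", "noCap", "short"],
        "Sequence": [
            "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc",
            "acgtacgtacgt",  # hypothetical row: no capital letter at all
            "acgTacgt",      # hypothetical row: fewer than 16 bases before the capital
        ],
    }
)

OFFSET = 16  # hypothetical name: distance from the capital back to the wanted letter

# Everything up to and including the first capital letter (NaN if there is none).
prefix = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# Index -1 is the capital itself, so the wanted letter sits at -(OFFSET + 1).
# Out-of-range indices and NaN inputs both give NaN.
df["letter"] = prefix.str[-(OFFSET + 1)]

print(df[["gene", "letter"]])
```

Computing the prefix once also avoids running the same `str.extract` call twice, which can matter on a large dataframe.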