
How do I extract the letter n positions before a specific pattern in a DataFrame column in Python?

I have a column in a DataFrame that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:

import pandas as pd
d = [['ampC','tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'], ['yifL','acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'],['glyW','tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg']]
df = pd.DataFrame(d, columns = ['gene','Sequence'])
gene Sequence
ampC tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc
yifL acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat
glyW tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg
  1. Extract the capital letter and everything before it. With str.extract(r"(.*?)[A-Z]+", expand=True) I can get everything before the capital letter, but I need help figuring out how to get the capital letter as well.

Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA

  2. Extract the 16th letter before the capital letter.

Example of what I'm trying to get for the following 3 genes:

gene letter
ampC c
yifL g
glyW t

[c, g, t]

You may try:

df["SubSequence"] = df["Sequence"].str.extract(r'^(.*?[A-Z])')
df["letter"] = df["Sequence"].str.extract(r'^[acgt]*([acgt])[acgt]{15}[A-Z]')
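
To see why the second pattern picks the 16th letter, here is a self-contained sketch that rebuilds the example frame from the question and applies both patterns. It assumes, as in the sample data, that every sequence is lowercase a/c/g/t with a single uppercase letter; expand=False is my addition so that each call returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

d = [
    ["ampC", "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc"],
    ["yifL", "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat"],
    ["glyW", "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg"],
]
df = pd.DataFrame(d, columns=["gene", "Sequence"])

# Lazy prefix plus the first uppercase letter, captured together in one group.
df["SubSequence"] = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# One captured base, then exactly 15 lowercase bases, then the capital:
# the captured base is therefore the 16th letter before the capital.
df["letter"] = df["Sequence"].str.extract(
    r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False
)

print(df[["gene", "letter"]])
```

On the three sample genes this should give c, g and t, matching the expected output in the question.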

Your regular expression is almost what you need. Just move the capital-letter class inside the capture group, so the match includes the capital itself. Try:

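# Capture everything up to and including the first capital letter.
df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
# Index -1 is the capital itself, so -17 is the 16th letter before it.
df["letter"] = df["substring"].str[-17]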

>>> df[["gene", "letter"]]
   gene letter
0  ampC      c
1  yifL      g
2  glyW      t
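
If the offset might change, the same idea can be wrapped in a small function. This is only a sketch: letter_before_capital and the toy sequences are hypothetical names and data for illustration, not part of the question. It extracts the prefix once and indexes from the end; rows with no capital letter, or with fewer than n letters before it, come back as NaN.

```python
import pandas as pd


def letter_before_capital(sequences: pd.Series, n: int = 16) -> pd.Series:
    """Return the n-th character before the first uppercase letter (hypothetical helper)."""
    # Extract the prefix up to and including the first capital, once.
    prefix = sequences.str.extract(r"(.*?[A-Z])", expand=False)
    # Index -1 is the capital itself, so the n-th letter before it is at -(n + 1).
    return prefix.str[-(n + 1)]


# Made-up toy sequences: two valid, one too short, one without a capital.
seqs = pd.Series(["aaccggttA", "acgT", "cgT", "acgt"])
result = letter_before_capital(seqs, n=3)
print(result)
```

With the question's data, the equivalent call would be letter_before_capital(df["Sequence"], n=16).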
