How do I extract the letter n positions before a specific pattern in a data frame in Python?

Question

I have a column in a dataframe that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:

import pandas as pd

d = [['ampC', 'tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'], ['yifL', 'acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'], ['glyW', 'tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg']]
df = pd.DataFrame(d, columns=['gene', 'Sequence'])

gene	Sequence
ampC	tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc
yifL	acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat
glyW	tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg

1. Extract the capital letter and everything before it. With str.extract(r"(.*?)[A-Z]+", expand=True) I can get everything before the capital letter, but I need help figuring out how to get the capital letter as well.

Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA

2. Extract the 16th letter before the capital letter.

Example of what I'm trying to get for the following 3 genes:

gene	letter
ampC	c
yifL	g
glyW	t

[c, g, t]

Answer 1

You may try:

df["SubSequence"] = df["Sequence"].str.extract(r'^(.*?[A-Z])')
df["letter"] = df["Sequence"].str.extract(r'^[acgt]*([acgt])[acgt]{15}[A-Z]')
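
For reference, here is a self-contained sketch of the two lines above, run against the data from the question. In the second pattern, the capture group grabs one base, `[acgt]{15}` then skips the 15 bases that sit between it and the capital, so the captured base is the 16th before the capital. This assumes everything before the capital consists only of lowercase a, c, g and t; `expand=False` is an addition here so that each call returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

d = [
    ["ampC", "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc"],
    ["yifL", "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat"],
    ["glyW", "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg"],
]
df = pd.DataFrame(d, columns=["gene", "Sequence"])

# Everything up to and including the first capital letter.
df["SubSequence"] = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# One captured base, then exactly 15 more lowercase bases, then the capital.
df["letter"] = df["Sequence"].str.extract(
    r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False
)

print(df[["gene", "letter"]])
```

Rows with no capital letter, or with fewer than 16 bases in front of it, should come back as NaN because the pattern does not match.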

Answer 2

Your regular expression is almost what you need. Just move the capital letters inside the group. Try with:

df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
df["letter"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0].str[-17]

>>> df[["gene", "letter"]]
   gene letter
0  ampC      c
1  yifL      g
2  glyW      t
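
The `-17` index works because the extracted substring ends with the capital letter at index `-1`, so the 16th letter before it sits at `-(16 + 1)`. If the offset needs to vary, the same idea can be sketched as a small helper; `letter_before_capital` and its parameter `n` are hypothetical names introduced here for illustration, not part of pandas.

```python
import re


def letter_before_capital(seq, n=16):
    """Return the letter n positions before the first capital in seq, or None."""
    match = re.search(r"[A-Z]", seq)
    if match is None or match.start() < n:
        # No capital letter, or fewer than n letters precede it.
        return None
    return seq[match.start() - n]


seqs = [
    "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc",
    "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat",
    "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg",
]
print([letter_before_capital(s) for s in seqs])  # ['c', 'g', 't']
```

On the dataframe from the question, this could be applied with `df["Sequence"].map(letter_before_capital)`.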

2 answers

solution1
1 2021-10-04 16:31:24

solution2
0 ACCEPTED 2021-10-04 16:29:34
