Pandas: how to extract specific strings from a DataFrame
I have the following dataframe:
d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])
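What I want to do is extract the POS, REF and ALT information (for example POS=4323432) and create another dataframe from it. The original file is much larger, and I am fairly sure the data in its columns are not strings.
I tried the following: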
cols = df1.select_dtypes('object').columns
df1[cols] = df1[cols].apply(lambda x: x.astype(str))
df1 = frame.apply(lambda x: x.str.extract('POS=, REF=, ALT='))
But I can't seem to get it to work.
Desired output:
             POS  REF   ALT
PP25      345432    G     A
PP25      245332    T    GA
PP25    34545432    T     A
PP25       87532    A     C
PP25     4347532    T     G
PP25     4323432    A     T
PP25    44423432    G     T
COX4       23332    A     G
COX4       45332    G  GAGG
COX4  2453344432    C     T
COX4      225332    G     A
P53      3335332    G     A
P53       445332    G     C
P53     23245332    T     A
P53       245332    A     C
Thanks!
You can use stack, split and explode, then str.extract with a short regex:
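out = (df1.stack()                               # one row per (gene, sample) cell
       .str.split(r',\s+(?=REC)').explode()      # one row per REC(...) record
       # [ACGT]+ captures alleles longer than one base, such as GA or GAGG
       .str.extract(r'POS=(\d+).*REF=([ACGT]+).*ALT=([ACGT]+)')
      )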
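To see what each step of the chain contributes, the pipeline can be run one call at a time. The sketch below uses a small hypothetical frame (`demo`, cut down from the question's data) and the intermediate names `stacked`, `as_lists`, `records` and `fields`, which are only illustrative.

```python
import pandas as pd

# Hypothetical two-row frame in the same format as the question's data.
demo = pd.DataFrame(
    {
        "sample1": [
            "REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), "
            "REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)",
            "REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)",
        ],
        "sample2": [
            "REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)",
            "REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)",
        ],
    },
    index=["PP25", "COX4"],
)

# 1. stack: one row per (gene, sample) cell, as a Series with a 2-level index.
stacked = demo.stack()

# 2. split: each cell becomes a list of records; the lookahead (?=REC)
#    splits only on the ", " that precedes a new record.
as_lists = stacked.str.split(r",\s+(?=REC)")

# 3. explode: one row per record; index labels repeat for multi-record cells.
records = as_lists.explode()

# 4. extract: one column per capture group.
fields = records.str.extract(r"POS=(\d+).*REF=([ACGT]+).*ALT=([ACGT]+)")

print(len(stacked), len(records))
print(fields)
```

Because explode repeats the index labels of a cell for each of its records, the gene name stays attached to every extracted row.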
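Alternatively, with named capture groups and dropping the second index level:
out = (df1.stack()
       .str.split(r',\s+(?=REC)').explode()
       .str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]+).*ALT=(?P<ALT>[ACGT]+)')
       .droplevel(1)                             # drop the sample level, keep the gene
      )
Note: I assumed that REF and ALT only contain A/T/G/C; if other characters can occur, add them to the character class. The + quantifier is needed so that alleles longer than one base (GA, GAGG) are captured in full.
Output:
             POS  REF   ALT
PP25      345432    G     A
PP25      245332    T    GA
PP25    34545432    T     A
PP25       87532    A     C
PP25     4347532    T     G
PP25     4323432    A     T
PP25    44423432    G     T
COX4       23332    A     G
COX4       45332    G  GAGG
COX4  2453344432    C     T
COX4      225332    G     A
P53      3335332    G     A
P53       445332    G     C
P53     23245332    T     A
P53       245332    A     C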
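If REF or ALT can contain symbols beyond A/C/G/T, one option is to capture everything up to the next comma or closing parenthesis instead of listing the allowed letters. This is a sketch under that assumption, not part of the original answer; the `N` and lowercase alleles below are invented for illustration. It also converts POS to an integer, since str.extract always returns strings.

```python
import pandas as pd

# Hypothetical records: the N and lowercase alleles are made up for illustration.
records = pd.Series(
    [
        "REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)",
        "REC(CHR=4,,POS=45332,,REF=N,,ALT=gagg,BAND=AA4SST)",
    ],
    index=["PP25", "COX4"],
)

# [^,)]+ captures every character up to the next comma or closing parenthesis.
pattern = r"POS=(?P<POS>\d+).*REF=(?P<REF>[^,)]+).*ALT=(?P<ALT>[^,)]+)"
out = records.str.extract(pattern)

# extract returns strings; convert POS when numeric operations are needed.
# int64 is requested explicitly because positions can exceed the int32 range.
out["POS"] = out["POS"].astype("int64")
print(out)
```

The looser class accepts anything, including malformed values, so the explicit `[ACGT]+` remains the stricter choice when the alphabet is known.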
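If the order of the fields is not always the same (POS -> REF -> ALT), use extractall and groupby with first instead. Each record needs its own index label before the groupby, otherwise several records from the same cell would be merged into one row:
s = df1.stack().str.split(r',\s+(?=REC)').explode()
# number the records so that records from the same cell stay separate
s.index = pd.MultiIndex.from_arrays([s.index.get_level_values(0), pd.RangeIndex(len(s))])
(s
 .str.extractall(r'POS=(?P<POS>\d+)|REF=(?P<REF>[ACGT]+)|ALT=(?P<ALT>[ACGT]+)')
 .groupby(level=[0,1], sort=False).first()   # first non-null value per field and record
 .droplevel(1)
)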
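Another way to handle arbitrary field order, sketched here as an alternative rather than as the answer's method, is to parse each record into key/value pairs with `re.findall` and build the frame from the resulting dictionaries. The names `FIELD` and `parse_record` and the shuffled second record are hypothetical.

```python
import re

import pandas as pd

# Hypothetical records with the fields in different orders.
records = pd.Series(
    [
        "REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG)",
        "REC(ALT=GA,,CHR=2,,REF=T,,POS=245332,BAND=AA4T)",
    ],
    index=["PP25", "PP25"],
)

# Capture a wanted key and its value (everything up to the next "," or ")").
FIELD = re.compile(r"\b(POS|REF|ALT)=([^,)]+)")


def parse_record(text: str) -> dict:
    """Map each wanted key to its value, whatever the field order."""
    return dict(FIELD.findall(text))


out = pd.DataFrame(
    [parse_record(r) for r in records], index=records.index
)[["POS", "REF", "ALT"]]
print(out)
```

Since every record is parsed on its own, no grouping step is needed and repeated index labels cause no ambiguity.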
Reproducible example:
import pandas as pd
d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])
(df1.stack()
 .str.split(r',\s+(?=REC)').explode()
 # [ACGT]+ so that multi-base alleles such as GA and GAGG are captured in full
 .str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]+).*ALT=(?P<ALT>[ACGT]+)')
 .droplevel(1)
)
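The question mentions that the real columns may not hold strings. A minimal sketch of a reusable helper that guards against this is shown below; `extract_variants`, `PATTERN` and the `demo` frame are hypothetical names, and the handling of empty cells is an assumption about the real data.

```python
import pandas as pd

PATTERN = r"POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]+).*ALT=(?P<ALT>[ACGT]+)"


def extract_variants(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per REC(...) record with POS, REF and ALT columns.

    Hypothetical helper, not part of the original answer.
    """
    records = (
        frame.astype(str)              # guard against non-string cells
        .stack()
        .str.split(r",\s+(?=REC)")
        .explode()
    )
    out = records.str.extract(PATTERN).droplevel(1)
    # Cells that held no record (missing values or their string form)
    # yield all-NaN rows, which are dropped here.
    out = out.dropna(how="all")
    return out.astype({"POS": "int64"})


# Hypothetical frame with one empty cell.
demo = pd.DataFrame(
    {
        "sample1": ["REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)", None],
        "sample2": [
            "REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)",
            "REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)",
        ],
    },
    index=["P53", "COX4"],
)
result = extract_variants(demo)
print(result)
```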