
Pandas extracting values from rows based on set of strings

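I am trying to extract specific values (in the form of key:value pairs) from a pandas column that holds multiple pairs separated by semicolons.

The input dataframe looks like this: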

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";"  .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .

I am working with the tenth column, which looks like this:

"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"

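The pairs have the format: identifier ""value""

While I can extract the values by converting the column to another dataframe and selecting the relevant rows, the problem is that the pairs within the column do not appear in a consistent order.

In this case I am only interested in gene_id, gene_name, and gene_biotype, but the set of required terms may change in the future. I could have used a dictionary-based solution, but the values are not unique across rows, and in some rows they are not present at all (rows where column 10 contains only a .).

In the end, I would like the dataframe to look like this: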

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 ENSMUSG00000111734  Gm29825 lincRNA .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .

What is the most efficient way to do this in pandas?

Regex on a pandas column

You can apply a regular expression through the column's .str accessor:

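# Column 10 (position 9) holds the attribute string; rows without a match (e.g. ".") become NaN.
attrs = df.iloc[:, 9]
df['gene_id'] = attrs.str.extract(r'gene_id "([^"]+)";', expand=False)
df['gene_name'] = attrs.str.extract(r'gene_name "([^"]+)";', expand=False)
df['gene_biotype'] = attrs.str.extract(r'gene_biotype "([^"]+)";', expand=False)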
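
Since the question notes that the list of wanted terms may change, the same idea can be wrapped in a small helper that takes the keys as a parameter. This is only a sketch under a few assumptions: the helper name `extract_attributes` and the column names `chrom` and `attributes` are made up for illustration, and the attribute strings are assumed to have been read so that the doubled quotes in the raw file are collapsed to single quotes (the default behaviour of the pandas CSV reader for quoted fields). The `\b` anchor keeps a key such as `gene_version` from matching inside `havana_gene_version`.

```python
import re

import pandas as pd


def extract_attributes(attrs, keys):
    """Return a DataFrame with one column per key, taken from attribute strings.

    `attrs` is a Series of strings such as 'gene_id "X"; gene_name "Y";'.
    Keys that are missing from a row (for example rows that are just ".")
    come back as NaN.
    """
    columns = {}
    for key in keys:
        # \b stops the key from matching as the tail of a longer key;
        # re.escape protects against regex metacharacters in the key.
        pattern = rf'\b{re.escape(key)} "([^"]*)"'
        columns[key] = attrs.str.extract(pattern, expand=False)
    return pd.DataFrame(columns, index=attrs.index)


# Toy data standing in for the real file (hypothetical column names).
df = pd.DataFrame({
    "chrom": ["9", "9"],
    "attributes": [
        'gene_id "ENSMUSG00000111734"; gene_version "1"; '
        'gene_name "Gm29825"; gene_biotype "lincRNA";',
        ".",
    ],
})

keys = ["gene_id", "gene_name", "gene_biotype"]
extracted = extract_attributes(df["attributes"], keys)

# Swap the raw attribute column for the extracted columns.
result = pd.concat([df.drop(columns="attributes"), extracted], axis=1)
```

Changing the set of extracted terms then only means editing `keys`. If the file was read in a way that keeps the doubled quotes, the pattern would need to be adjusted accordingly.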
