Pandas extracting values from rows based on set of strings
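I am trying to extract specific values (in the form of key:value pairs) from a pandas column that holds multiple semicolon-separated pairs.
The input data frame looks like this: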
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
I am working on the tenth column, which looks like this:
"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"
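The pairs have the form: identifier ""value""
While I can extract the values by converting the column into another data frame and selecting the relevant rows, the problem is that the pairs within the column are not in a consistent order.
In this case I am only interested in gene_id, gene_name and gene_biotype, but the set of required keys may change in the future. I could have used a dictionary-based solution, but the values are not unique for every row, and in some rows they are missing altogether (rows where column 10 contains only a .).
Ultimately, I would like the data frame to look like this: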
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 ENSMUSG00000111734 Gm29825 lincRNA .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
What is the most efficient way to do this in pandas?
You can use a regular expression through the column's .str accessor:
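# Raw strings avoid invalid-escape warnings; expand=False returns a Series.
# [^"]+ also matches values containing hyphens or dots, which \w+ would miss.
df['gene_id'] = df.iloc[:, 9].str.extract(r'gene_id "([^"]+)";', expand=False)
df['gene_name'] = df.iloc[:, 9].str.extract(r'gene_name "([^"]+)";', expand=False)
df['gene_biotype'] = df.iloc[:, 9].str.extract(r'gene_biotype "([^"]+)";', expand=False)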
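Since the list of keys may change later, the same idea can be wrapped in a small helper that loops over whatever keys are requested. The following is only a sketch: `extract_attributes` and the sample `attrs` Series are hypothetical names made up for illustration, and it assumes the attribute column has already been read so that each value is wrapped in single double quotes. Rows that contain only `.` simply yield NaN.

```python
import re
import pandas as pd

def extract_attributes(attr_col, keys):
    """Return a DataFrame with one column per requested key (NaN if absent)."""
    out = {}
    for key in keys:
        # \b stops 'gene_version' from matching inside 'havana_gene_version'.
        pattern = rf'\b{re.escape(key)} "([^"]*)"'
        out[key] = attr_col.str.extract(pattern, expand=False)
    return pd.DataFrame(out, index=attr_col.index)

# Hypothetical, shortened attribute strings in the same layout as column 10.
attrs = pd.Series([
    'gene_id "ENSMUSG00000111734"; gene_version "1"; gene_name "Gm29825"; gene_biotype "lincRNA";',
    'havana_gene_version "7"; gene_biotype "processed_transcript"; gene_id "ENSMUSG00000064299"; gene_version "6";',
    '.',
])

wanted = ["gene_id", "gene_name", "gene_biotype"]
result = extract_attributes(attrs, wanted)
print(result)
```

Because each key is searched for independently, the order of the pairs inside a row does not matter. The resulting columns can then be attached to the original frame with something like `df[wanted] = extract_attributes(df.iloc[:, 9], wanted)`.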