
python - How to select specific strings in a text file to be columns in a dataframe in python?

I've got a text file with scores for genes, organised like this:

[{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":367,"humanGeneSymbol":"AR","diseaseId":"ORPHA:481","diseaseTerm":"Kennedy disease","phenotypeIds":["HP:0000029","HP:0000144","HP:0000771","HP:0001252","HP:0001260","HP:0001265","HP:0001288","HP:0001618","HP:0003119","HP:0003202","HP:0005978","HP:0100639"],"id":"ORPHA:481_367"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":2200,"humanGeneSymbol":"FBN1","diseaseId":"ORPHA:2833","diseaseTerm":"Stiff skin syndrome","phenotypeIds":["HP:0000407","HP:0000486","HP:0000501","HP:0000541","HP:0000787","HP:0000822","HP:0001072","HP:0001324","HP:0001376","HP:0001482","HP:0003119","HP:0004322","HP:0005978","HP:0007328","HP:0008065","HP:0009830","HP:0011800","HP:0100578","HP:0100679"],"id":"ORPHA:2833_2200"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":84823,"humanGeneSymbol":"LMNB2","diseaseId":"OMIM:608709","diseaseTerm":"Lipodystrophy, partial, acquired, susceptibility to","phenotypeIds":["HP:0000006","HP:0000093","HP:0000100","HP:0000147","HP:0000790","HP:0000793","HP:0000819","HP:0001007","HP:0002719","HP:0003119","HP:0003621","HP:0003745","HP:0005421","HP:0009002","HP:0009019","HP:0009056"],"id":"OMIM:608709_84823"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]}

I want to pull out the geneSymbol, score, humanScore, mouseScore, fishScore, and ppiScore, to have a dataframe like this:

geneSymbol   score   humanScore   mouseScore   fishScore   ppiScore
   AR         1.0      1.0          0.0           0.0      0.0 
   FBN1       1.0      1.0          0.0           0.0      0.0 
   LMNB2      1.0      1.0          0.0           0.0      0.0 

I'm not experienced with Python, and I've tried searching for similar questions to reuse others' code, but I haven't found anything that I can get working.

My first problem is that when I load my data into Python, it doesn't load each row as I see it in my text editor.

For example, I am just running:

df = pd.read_csv("data.txt", sep="\t", header = None, encoding = "ISO-8859-1")

which reads in to look like:

[image: screenshot of the loaded dataframe, not available]

I've noticed most of my rows end in []}, so I also tried defining a lineterminator, but this doesn't work:

df = pd.read_csv("data.txt", lineterminator='[]},', header = None, encoding = "ISO-8859-1")
ValueError: Only length-1 line terminators supported

I've also tried just separating rows by }, but then the error I get is:

df = pd.read_csv("data.txt", lineterminator='}', header = None, encoding = "ISO-8859-1")

ParserError: Error tokenizing data. C error: Expected 11 fields in line 2, saw 20

How can I read in my data to then select the columns/create the dataframe I'm interested in?

I'm not sure of the best way to share example data. I've tried putting my 3 example rows above through df.to_dict(), which outputs:

{0: {0: '{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":367,""geneSymbol"":""AR"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":367,""humanGeneSymbol"":""AR"",""diseaseId"":""ORPHA:481"",""diseaseTerm"":""Kennedy disease"",""phenotypeIds"":[""HP:0000029"",""HP:0000144"",""HP:0000771"",""HP:0001252"",""HP:0001260"",""HP:0001265"",""HP:0001288"",""HP:0001618"",""HP:0003119"",""HP:0003202"",""HP:0005978"",""HP:0100639""],""id"":""ORPHA:481_367""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":2200,""geneSymbol"":""FBN1"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":2200,""humanGeneSymbol"":""FBN1"",""diseaseId"":""ORPHA:2833"",""diseaseTerm"":""Stiff skin 
syndrome"",""phenotypeIds"":[""HP:0000407"",""HP:0000486"",""HP:0000501"",""HP:0000541"",""HP:0000787"",""HP:0000822"",""HP:0001072"",""HP:0001324"",""HP:0001376"",""HP:0001482"",""HP:0003119"",""HP:0004322"",""HP:0005978"",""HP:0007328"",""HP:0008065"",""HP:0009830"",""HP:0011800"",""HP:0100578"",""HP:0100679""],""id"":""ORPHA:2833_2200""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":84823,""geneSymbol"":""LMNB2"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":84823,""humanGeneSymbol"":""LMNB2"",""diseaseId"":""OMIM:608709"",""diseaseTerm"":""Lipodystrophy, partial, acquired, susceptibility to"",""phenotypeIds"":[""HP:0000006"",""HP:0000093"",""HP:0000100"",""HP:0000147"",""HP:0000790"",""HP:0000793"",""HP:0000819"",""HP:0001007"",""HP:0002719"",""HP:0003119"",""HP:0003621"",""HP:0003745"",""HP:0005421"",""HP:0009002"",""HP:0009019"",""HP:0009056""],""id"":""OMIM:608709_84823""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]'}}

You are looking for pd.DataFrame.from_records(). The file is JSON rather than delimited text, so read_csv is the wrong tool here. Load the file as a list of dictionaries (for example with json.load) and then pass that list to this function.
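As a rough sketch of that approach: the snippet below assumes the complete file is a valid JSON array (the excerpt opens with `[`, so the full file presumably closes with `]`) and that it is UTF-8 encoded. The names `WANTED`, `records_to_frame` and `load_scores` are made up for illustration and are not part of pandas.

```python
import json

import pandas as pd

# Keys to keep from each gene record (taken from the question).
WANTED = ["geneSymbol", "score", "humanScore", "mouseScore", "fishScore", "ppiScore"]


def records_to_frame(records, columns=WANTED):
    """Build a dataframe from a list of dicts and keep only the chosen keys."""
    # from_records makes one column per top-level key; nested lists/dicts
    # simply become object columns, which are dropped by the selection.
    return pd.DataFrame.from_records(records)[list(columns)]


def load_scores(path, columns=WANTED):
    """Hypothetical helper: parse the whole file as JSON, then tabulate it."""
    with open(path, encoding="utf-8") as handle:  # encoding is an assumption
        records = json.load(handle)  # list of dicts, one per gene
    return records_to_frame(records, columns)
```

Calling `load_scores("data.txt")` should then give one row per gene with the six columns in the order listed in `WANTED`. If the real file turns out not to be a single valid JSON array, `json.load` will raise `json.JSONDecodeError`, and the file would need to be repaired or parsed record by record instead.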

