
How to select specific strings in a text file to be columns in a dataframe in python?

I've got a text file with scores for genes, organised like this:

[{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":367,"humanGeneSymbol":"AR","diseaseId":"ORPHA:481","diseaseTerm":"Kennedy disease","phenotypeIds":["HP:0000029","HP:0000144","HP:0000771","HP:0001252","HP:0001260","HP:0001265","HP:0001288","HP:0001618","HP:0003119","HP:0003202","HP:0005978","HP:0100639"],"id":"ORPHA:481_367"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":2200,"humanGeneSymbol":"FBN1","diseaseId":"ORPHA:2833","diseaseTerm":"Stiff skin syndrome","phenotypeIds":["HP:0000407","HP:0000486","HP:0000501","HP:0000541","HP:0000787","HP:0000822","HP:0001072","HP:0001324","HP:0001376","HP:0001482","HP:0003119","HP:0004322","HP:0005978","HP:0007328","HP:0008065","HP:0009830","HP:0011800","HP:0100578","HP:0100679"],"id":"ORPHA:2833_2200"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":84823,"humanGeneSymbol":"LMNB2","diseaseId":"OMIM:608709","diseaseTerm":"Lipodystrophy, partial, acquired, susceptibility to","phenotypeIds":["HP:0000006","HP:0000093","HP:0000100","HP:0000147","HP:0000790","HP:0000793","HP:0000819","HP:0001007","HP:0002719","HP:0003119","HP:0003621","HP:0003745","HP:0005421","HP:0009002","HP:0009019","HP:0009056"],"id":"OMIM:608709_84823"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]}

I want to pull out the geneSymbol, score, humanScore, mouseScore, fishScore, and ppiScore, to have a dataframe like this:

geneSymbol   score   humanScore   mouseScore   fishScore   ppiScore
AR           1.0     1.0          0.0          0.0         0.0
FBN1         1.0     1.0          0.0          0.0         0.0
LMNB2        1.0     1.0          0.0          0.0         0.0

I'm not experienced with Python. I've searched for similar questions hoping to reuse other people's code, but I haven't found anything that I can get working.

My first problem is that when I load my data into Python, the rows don't come out the way I see them in my text editor.

For example, I am just running:

df = pd.read_csv("data.txt", sep="\t", header = None, encoding = "ISO-8859-1")

which reads in to look like:

[image: enter image description here]

I've noticed most of my rows end in []} - so I also tried defining a lineterminator, but this doesn't work:

df = pd.read_csv("data.txt", lineterminator='[]},', header = None, encoding = "ISO-8859-1")
ValueError: Only length-1 line terminators supported

I've also tried just separating rows by } but the error I then get is:

df = pd.read_csv("data.txt", lineterminator='}', header = None, encoding = "ISO-8859-1")

ParserError: Error tokenizing data. C error: Expected 11 fields in line 2, saw 20

How can I read in my data to then select the columns/create the dataframe I'm interested in?

I'm not sure of the best way to share example data. I loaded my 3 example rows above and called df.to_dict(), which outputs:

{0: {0: '{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":367,""geneSymbol"":""AR"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":367,""humanGeneSymbol"":""AR"",""diseaseId"":""ORPHA:481"",""diseaseTerm"":""Kennedy disease"",""phenotypeIds"":[""HP:0000029"",""HP:0000144"",""HP:0000771"",""HP:0001252"",""HP:0001260"",""HP:0001265"",""HP:0001288"",""HP:0001618"",""HP:0003119"",""HP:0003202"",""HP:0005978"",""HP:0100639""],""id"":""ORPHA:481_367""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":2200,""geneSymbol"":""FBN1"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":2200,""humanGeneSymbol"":""FBN1"",""diseaseId"":""ORPHA:2833"",""diseaseTerm"":""Stiff skin 
syndrome"",""phenotypeIds"":[""HP:0000407"",""HP:0000486"",""HP:0000501"",""HP:0000541"",""HP:0000787"",""HP:0000822"",""HP:0001072"",""HP:0001324"",""HP:0001376"",""HP:0001482"",""HP:0003119"",""HP:0004322"",""HP:0005978"",""HP:0007328"",""HP:0008065"",""HP:0009830"",""HP:0011800"",""HP:0100578"",""HP:0100679""],""id"":""ORPHA:2833_2200""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":84823,""geneSymbol"":""LMNB2"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":84823,""humanGeneSymbol"":""LMNB2"",""diseaseId"":""OMIM:608709"",""diseaseTerm"":""Lipodystrophy, partial, acquired, susceptibility to"",""phenotypeIds"":[""HP:0000006"",""HP:0000093"",""HP:0000100"",""HP:0000147"",""HP:0000790"",""HP:0000793"",""HP:0000819"",""HP:0001007"",""HP:0002719"",""HP:0003119"",""HP:0003621"",""HP:0003745"",""HP:0005421"",""HP:0009002"",""HP:0009019"",""HP:0009056""],""id"":""OMIM:608709_84823""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]'}}

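You are looking for pd.DataFrame.from_records(). The file content looks like a JSON array rather than delimited text, which is why read_csv struggles with it. Load the file as a list of dictionaries (for example with the json module) and then pass that list to this function.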
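
A minimal sketch of that approach, assuming the file holds one valid JSON array of gene records. The string raw_text is a shortened stand-in for the contents of data.txt (the nested evidence fields are trimmed), and the names raw_text, records, and wanted are only illustrative:

```python
import json

import pandas as pd

# Shortened stand-in for data.txt: a JSON array with one object per gene.
raw_text = """[
{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]}
]"""

# With the real file, something along these lines should replace json.loads
# (the encoding is an assumption carried over from the question):
#     with open("data.txt", encoding="ISO-8859-1") as handle:
#         records = json.load(handle)
records = json.loads(raw_text)  # list of dicts, one per gene

wanted = ["geneSymbol", "score", "humanScore", "mouseScore", "fishScore", "ppiScore"]

# Build one row per dict, then keep only the columns of interest.
df = pd.DataFrame.from_records(records)[wanted]
print(df)
```

If the real file is missing the closing ] (as the excerpt in the question appears to be), json.load will raise an error and the text would need repairing before parsing.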
