
python - How to select specific strings in a text file to be columns in a dataframe in python?

I have a text file containing gene scores, organized like this:

[{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":367,"humanGeneSymbol":"AR","diseaseId":"ORPHA:481","diseaseTerm":"Kennedy disease","phenotypeIds":["HP:0000029","HP:0000144","HP:0000771","HP:0001252","HP:0001260","HP:0001265","HP:0001288","HP:0001618","HP:0003119","HP:0003202","HP:0005978","HP:0100639"],"id":"ORPHA:481_367"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":2200,"humanGeneSymbol":"FBN1","diseaseId":"ORPHA:2833","diseaseTerm":"Stiff skin syndrome","phenotypeIds":["HP:0000407","HP:0000486","HP:0000501","HP:0000541","HP:0000787","HP:0000822","HP:0001072","HP:0001324","HP:0001376","HP:0001482","HP:0003119","HP:0004322","HP:0005978","HP:0007328","HP:0008065","HP:0009830","HP:0011800","HP:0100578","HP:0100679"],"id":"ORPHA:2833_2200"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":84823,"humanGeneSymbol":"LMNB2","diseaseId":"OMIM:608709","diseaseTerm":"Lipodystrophy, partial, acquired, susceptibility to","phenotypeIds":["HP:0000006","HP:0000093","HP:0000100","HP:0000147","HP:0000790","HP:0000793","HP:0000819","HP:0001007","HP:0002719","HP:0003119","HP:0003621","HP:0003745","HP:0005421","HP:0009002","HP:0009019","HP:0009056"],"id":"OMIM:608709_84823"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]}

I want to extract geneSymbol, score, humanScore, mouseScore, fishScore and ppiScore to get a dataframe like this:

geneSymbol   score   humanScore   mouseScore   fishScore   ppiScore
   AR         1.0      1.0          0.0           0.0      0.0 
   FBN1       1.0      1.0          0.0           0.0      0.0 
   LMNB2      1.0      1.0          0.0           0.0      0.0 

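I have no experience with python. I tried searching for similar questions so I could reuse other code, but I haven't found anything that works.

My first problem is that when I load the data into python, it doesn't load each line the way I see it in a text editor.

For example, I am running: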

df = pd.read_csv("data.txt", sep="\t", header = None, encoding = "ISO-8859-1")

It looks like this (the screenshot is not reproduced here):

I noticed that most of my lines end with []}, so I also tried defining a lineterminator, but this doesn't work:

df = pd.read_csv("data.txt", lineterminator='[]},', header = None, encoding = "ISO-8859-1")
ValueError: Only length-1 line terminators supported

I also tried separating the lines on }, but then I get this error:

df = pd.read_csv("data.txt", lineterminator='}', header = None, encoding = "ISO-8859-1")

ParserError: Error tokenizing data. C error: Expected 11 fields in line 2, saw 20

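How can I read my data and then select the columns / create the dataframe I am interested in?

I am not sure of the best way to share example data. I tried putting the 3 example lines above through df.to_dict(), which outputs: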

{0: {0: '{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":367,""geneSymbol"":""AR"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":367,""humanGeneSymbol"":""AR"",""diseaseId"":""ORPHA:481"",""diseaseTerm"":""Kennedy disease"",""phenotypeIds"":[""HP:0000029"",""HP:0000144"",""HP:0000771"",""HP:0001252"",""HP:0001260"",""HP:0001265"",""HP:0001288"",""HP:0001618"",""HP:0003119"",""HP:0003202"",""HP:0005978"",""HP:0100639""],""id"":""ORPHA:481_367""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":2200,""geneSymbol"":""FBN1"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":2200,""humanGeneSymbol"":""FBN1"",""diseaseId"":""ORPHA:2833"",""diseaseTerm"":""Stiff skin 
syndrome"",""phenotypeIds"":[""HP:0000407"",""HP:0000486"",""HP:0000501"",""HP:0000541"",""HP:0000787"",""HP:0000822"",""HP:0001072"",""HP:0001324"",""HP:0001376"",""HP:0001482"",""HP:0003119"",""HP:0004322"",""HP:0005978"",""HP:0007328"",""HP:0008065"",""HP:0009830"",""HP:0011800"",""HP:0100578"",""HP:0100679""],""id"":""ORPHA:2833_2200""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":84823,""geneSymbol"":""LMNB2"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":84823,""humanGeneSymbol"":""LMNB2"",""diseaseId"":""OMIM:608709"",""diseaseTerm"":""Lipodystrophy, partial, acquired, susceptibility to"",""phenotypeIds"":[""HP:0000006"",""HP:0000093"",""HP:0000100"",""HP:0000147"",""HP:0000790"",""HP:0000793"",""HP:0000819"",""HP:0001007"",""HP:0002719"",""HP:0003119"",""HP:0003621"",""HP:0003745"",""HP:0005421"",""HP:0009002"",""HP:0009019"",""HP:0009056""],""id"":""OMIM:608709_84823""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]'}}

You are looking for pd.DataFrame.from_records(). Load the file as a list of dictionaries, then pass it to this function.
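A minimal sketch of that idea, under a few assumptions that are not confirmed by the question: the complete file is one valid JSON array (it opens with `[` in the sample, so it presumably closes with `]`), it is UTF-8 encoded, and every record carries the six keys of interest. The helper name `load_gene_scores` and the constant `WANTED` are made up for illustration.

```python
import json

import pandas as pd

# Columns requested in the question.
WANTED = ["geneSymbol", "score", "humanScore", "mouseScore", "fishScore", "ppiScore"]


def load_gene_scores(path, columns=WANTED):
    """Read a JSON array of gene records and keep only the chosen columns."""
    # Assumption: the whole file parses as a single JSON array of objects.
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)  # -> list of dicts, one per gene

    # Build a frame with one row per record; nested fields become object columns.
    frame = pd.DataFrame.from_records(records)

    # Keep only the requested columns, in the requested order.
    return frame[list(columns)]
```

Calling `df = load_gene_scores("data.txt")` should then give one row per gene. Parsing with `json` sidesteps the line-terminator problem entirely, since `read_csv` is splitting on commas and newlines that are part of the JSON syntax rather than field separators. If `json.load` raises an error, the file is probably not a complete JSON array (for example, a missing closing `]`) and would need repairing first.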

