python - How to select specific strings in a text file as columns in a Python dataframe?

Question

I've got a text file with scores for genes, organised like this:

[{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":367,"humanGeneSymbol":"AR","diseaseId":"ORPHA:481","diseaseTerm":"Kennedy disease","phenotypeIds":["HP:0000029","HP:0000144","HP:0000771","HP:0001252","HP:0001260","HP:0001265","HP:0001288","HP:0001618","HP:0003119","HP:0003202","HP:0005978","HP:0100639"],"id":"ORPHA:481_367"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":2200,"humanGeneSymbol":"FBN1","diseaseId":"ORPHA:2833","diseaseTerm":"Stiff skin syndrome","phenotypeIds":["HP:0000407","HP:0000486","HP:0000501","HP:0000541","HP:0000787","HP:0000822","HP:0001072","HP:0001324","HP:0001376","HP:0001482","HP:0003119","HP:0004322","HP:0005978","HP:0007328","HP:0008065","HP:0009830","HP:0011800","HP:0100578","HP:0100679"],"id":"ORPHA:2833_2200"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"candidateGeneMatch":false,"queryPhenotypeTerms":[{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"}],"phenotypeEvidence":[{"score":1.0,"model":{"organism":"HUMAN","entrezGeneId":84823,"humanGeneSymbol":"LMNB2","diseaseId":"OMIM:608709","diseaseTerm":"Lipodystrophy, partial, acquired, susceptibility to","phenotypeIds":["HP:0000006","HP:0000093","HP:0000100","HP:0000147","HP:0000790","HP:0000793","HP:0000819","HP:0001007","HP:0002719","HP:0003119","HP:0003621","HP:0003745","HP:0005421","HP:0009002","HP:0009019","HP:0009056"],"id":"OMIM:608709_84823"},"bestModelPhenotypeMatches":[{"query":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"match":{"id":"HP:0003119","label":"Abnormal concentration"},"lcs":{"id":"HP:0003119","label":"Abnormal circulating lipid concentration"},"ic":4.595759223970171,"simj":1.0,"score":2.1437721949801873}]}],"ppiEvidence":[]}

I want to pull out geneSymbol, score, humanScore, mouseScore, fishScore, and ppiScore, to get a dataframe like this:

geneSymbol   score   humanScore   mouseScore   fishScore   ppiScore
   AR         1.0      1.0          0.0           0.0      0.0 
   FBN1       1.0      1.0          0.0           0.0      0.0 
   LMNB2      1.0      1.0          0.0           0.0      0.0

I'm not experienced with Python. I've tried searching for similar questions so I could reuse others' code, but I haven't found anything that I can get working.

My first problem is that when I load my data into Python, it doesn't load each row the way I see them in my text editor.

For example, I am just running:

df = pd.read_csv("data.txt", sep="\t", header = None, encoding = "ISO-8859-1")

which reads in to look like:

I've noticed most of my rows end in []}, so I also tried defining a lineterminator, but this doesn't work:

df = pd.read_csv("data.txt", lineterminator='[]},', header = None, encoding = "ISO-8859-1")
ValueError: Only length-1 line terminators supported

I've also tried just separating rows by }, but the error I then get is:

df = pd.read_csv("data.txt", lineterminator='}', header = None, encoding = "ISO-8859-1")

ParserError: Error tokenizing data. C error: Expected 11 fields in line 2, saw 20

How can I read in my data and then select the columns to create the dataframe I'm interested in?

I'm not sure of the best way to share example data. I tried loading the 3 example rows above and calling df.to_dict(), which outputs:

{0: {0: '{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":367,""geneSymbol"":""AR"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":367,""humanGeneSymbol"":""AR"",""diseaseId"":""ORPHA:481"",""diseaseTerm"":""Kennedy disease"",""phenotypeIds"":[""HP:0000029"",""HP:0000144"",""HP:0000771"",""HP:0001252"",""HP:0001260"",""HP:0001265"",""HP:0001288"",""HP:0001618"",""HP:0003119"",""HP:0003202"",""HP:0005978"",""HP:0100639""],""id"":""ORPHA:481_367""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":2200,""geneSymbol"":""FBN1"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":2200,""humanGeneSymbol"":""FBN1"",""diseaseId"":""ORPHA:2833"",""diseaseTerm"":""Stiff skin 
syndrome"",""phenotypeIds"":[""HP:0000407"",""HP:0000486"",""HP:0000501"",""HP:0000541"",""HP:0000787"",""HP:0000822"",""HP:0001072"",""HP:0001324"",""HP:0001376"",""HP:0001482"",""HP:0003119"",""HP:0004322"",""HP:0005978"",""HP:0007328"",""HP:0008065"",""HP:0009830"",""HP:0011800"",""HP:0100578"",""HP:0100679""],""id"":""ORPHA:2833_2200""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]},{""priorityType"":""HIPHIVE_PRIORITY"",""geneId"":84823,""geneSymbol"":""LMNB2"",""score"":1.0,""humanScore"":1.0,""mouseScore"":0.0,""fishScore"":0.0,""ppiScore"":0.0,""candidateGeneMatch"":false,""queryPhenotypeTerms"":[{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""}],""phenotypeEvidence"":[{""score"":1.0,""model"":{""organism"":""HUMAN"",""entrezGeneId"":84823,""humanGeneSymbol"":""LMNB2"",""diseaseId"":""OMIM:608709"",""diseaseTerm"":""Lipodystrophy, partial, acquired, susceptibility to"",""phenotypeIds"":[""HP:0000006"",""HP:0000093"",""HP:0000100"",""HP:0000147"",""HP:0000790"",""HP:0000793"",""HP:0000819"",""HP:0001007"",""HP:0002719"",""HP:0003119"",""HP:0003621"",""HP:0003745"",""HP:0005421"",""HP:0009002"",""HP:0009019"",""HP:0009056""],""id"":""OMIM:608709_84823""},""bestModelPhenotypeMatches"":[{""query"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""match"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""lcs"":{""id"":""HP:0003119"",""label"":""Abnormal circulating lipid concentration""},""ic"":4.595759223970171,""simj"":1.0,""score"":2.1437721949801873}]}],""ppiEvidence"":[]'}}

Answer 1

You are looking for pd.DataFrame.from_records(). Load the file as a list of dictionaries and then pass it to this function.
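
A minimal sketch of that approach, assuming the file is a valid JSON array (it opens with [ and, presumably, closes with ] after the last record). The helper name gene_scores_from_json and the shortened sample string are illustrative only; the sample stands in for the contents of data.txt with the nested fields trimmed.

```python
import json

import pandas as pd

# Columns requested in the question.
COLUMNS = ["geneSymbol", "score", "humanScore", "mouseScore", "fishScore", "ppiScore"]


def gene_scores_from_json(text):
    """Parse a JSON array of gene records and keep only the score columns.

    Hypothetical helper for illustration; assumes `text` is a valid JSON array.
    """
    records = json.loads(text)  # list of dicts, one per gene
    df = pd.DataFrame.from_records(records)
    return df[COLUMNS]  # select and order the columns of interest


# Shortened stand-in for the file contents (nested evidence fields trimmed).
sample = """[
{"priorityType":"HIPHIVE_PRIORITY","geneId":367,"geneSymbol":"AR","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":2200,"geneSymbol":"FBN1","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]},
{"priorityType":"HIPHIVE_PRIORITY","geneId":84823,"geneSymbol":"LMNB2","score":1.0,"humanScore":1.0,"mouseScore":0.0,"fishScore":0.0,"ppiScore":0.0,"ppiEvidence":[]}
]"""

df = gene_scores_from_json(sample)
print(df)
```

For the real file, the same helper could be fed the file contents, for example by opening data.txt with encoding="ISO-8859-1" as in the question and passing fh.read() to it. pd.read_csv is not a good fit here because the data is JSON rather than delimited text, which is likely why the lineterminator attempts fail.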
