
Python: Convert a text file to multi-level JSON

I am writing a Python script that recursively walks through a set of files and builds a JSON object from each one. The files look like this:

target_id   length  eff_length  est_counts  tpm
ENST00000619216.1   68  33.8839 2.83333 4.64528
ENST00000473358.1   712 428.88  0   0
ENST00000469289.1   535 306.32  0   0
ENST00000607096.1   138 69.943  0   0
ENST00000417324.1   1187    844.464 0   0
ENST00000461467.1   590 342.551 3.44007 0.557892
ENST00000335137.3   918 588.421 0   0
ENST00000466430.5   2748    2405.46 75.1098 1.73463
ENST00000495576.1   1319    976.464 11.1999 0.637186

This is my script:

import glob
import os
import json

# define datasets
# Dataset name
datasets = ['pnoc']

# open file in append mode
f = open('mydict','a')

# define a new object
data={}

# traverse through folders of datasets
for d in datasets:
    samples = glob.glob(d + "/data"  + "/*.tsv")
    for s in samples:
        # get the SampleName without extension and path
        fname = os.path.splitext(os.path.basename(s))[0]

        # split the basename to get sample name and norm method
        sname, keyword, norm = fname.partition('.')

        # determine Normalization method based on filename
        if norm == "abundance":
            norm = "kallisto"
        elif norm == "rsem_genes.results":
            norm = "rsem_genes"
        else:
            norm = "rsem_isoforms"

        # read each file
        with open(s) as samp:
            next(samp)
            for line in samp:
                sp = line.split('\t')
                data.setdefault(sname,[]).append({"ID": sp[0],"Expression": sp[4]})
                json.dump(data, f)
f.close()

I want a JSON object along the following lines:

# 20000 Sample names, 3 Normalization methods and 60000 IDs in each file.
DatasetName1 {
    SampleName1 {
        Type {
            Normalization1 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization2 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization3 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            }
        }   
    },
    SampleName2 {
        Type {
            Normalization1 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization2 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization3 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            }
        }   
    },
    ...
    SampleName20000{
        Type {
            Normalization1 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization2 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            },
            Normalization3 {
                { ID1: value, Expression: value },
                { ID2: value, Expression: value },
                ...
                { ID60000: value, Expression: value }
            }
        }
    }
}

So my question is: when converting a text file to JSON, how do I set the levels in my JSON output?

Thanks!

First, instead of setting the default value over and over, you should make use of defaultdict.
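For example, defaultdict(list) creates the empty list automatically on first access, so you no longer need setdefault on every append (the sample names and records below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical (sample_name, record) pairs standing in for parsed TSV rows
rows = [("s1", {"ID": "ENST0001", "Expression": 4.6}),
        ("s1", {"ID": "ENST0002", "Expression": 0.0}),
        ("s2", {"ID": "ENST0001", "Expression": 1.7})]

# With a plain dict you must call setdefault on every append:
plain = {}
for name, rec in rows:
    plain.setdefault(name, []).append(rec)

# With defaultdict(list) the missing-key case is handled for you:
grouped = defaultdict(list)
for name, rec in rows:
    grouped[name].append(rec)

# Both approaches build the same mapping
assert plain == dict(grouped)
```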

Secondly, I think your proposed structure is off; the innermost level should be arrays (JSON lists) of records, along these lines:

{
    DatasetName1: {
        SampleName1: {
            Type: {
                Normalization1: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization2: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization3: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ]
            }
        },
        SampleName2: {
            Type: {
                Normalization1: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization2: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization3: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ]
            }
        },
        ...
        SampleName20000: {
            Type: {
                Normalization1: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization2: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ],
                Normalization3: [
                    { ID1: value, Expression: value },
                    { ID2: value, Expression: value },
                    ...
                    { ID60000: value, Expression: value }
                ]
            }
        }
    },
    DatasetName2: {
        ...
    },
    ...
}
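In Python terms, that nesting is just dicts keyed by name with a list of records at the innermost level. A minimal sketch with made-up dataset/sample names, showing how you would drill down one level per key:

```python
# A tiny stand-in for the full structure (one dataset, one sample, one method)
results = {
    "DatasetName1": {
        "SampleName1": {
            "Type": {
                "Normalization1": [
                    {"ID": "ENST00000619216.1", "Expression": 4.64528},
                    {"ID": "ENST00000473358.1", "Expression": 0.0},
                ],
            }
        }
    }
}

# Each bracket indexes one level of the hierarchy
records = results["DatasetName1"]["SampleName1"]["Type"]["Normalization1"]
```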

So your resulting code (untested) should look like this (as long as your norm-method logic is correct):

from glob import glob
from os import path
from json import dump
from collections import defaultdict

# define datasets, and result dict
datasets, results = ['pnoc'], defaultdict(dict)

# open file in append mode
with open('mydict','a') as f:
    # traverse through folders of datasets
    for d in datasets:
        for s in glob(d + "/data"  + "/*.tsv"):
            sample = {"Type": defaultdict(list)}

            # get the basename without extension and path
            fname = path.splitext(path.basename(s))[0]

            # split the basename to get sample name and norm method
            sname, keyword, norm = fname.partition('.')

            # determine norm method based on filename
            if norm == "abundance":
                norm = "kallisto"
            elif norm == "rsem_genes.results":
                norm = "rsem_genes"
            else:
                norm = "rsem_isoforms"

            # read each file
            with open(s) as samp:
                next(samp)              # Skip first line of file

                # Loop through each line and extract the ID and TPM
                for (tid, _, __, ___, tpm) in (line.rstrip('\n').split('\t') for line in samp):
                    # Add this line to the list for the respective normalization method
                    sample['Type'][norm].append({"ID": tid, "Expression": float(tpm)})
            # Add sample to dataset
            results[d][sname] = sample
    dump(results, f)

This will save the result in JSON format.
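To sanity-check the nesting, you can round-trip the structure through json.dump and json.load; here is a small stand-in using an in-memory buffer instead of the 'mydict' file:

```python
import json
from io import StringIO

# A small stand-in for `results` to confirm the nesting survives a round trip
results = {"pnoc": {"sample1": {"Type": {"kallisto": [
    {"ID": "ENST00000619216.1", "Expression": 4.64528}]}}}}

buf = StringIO()
json.dump(results, buf)
buf.seek(0)

# Loading back gives the same nested dicts/lists
loaded = json.load(buf)
record = loaded["pnoc"]["sample1"]["Type"]["kallisto"][0]
```

One caveat: the code above opens 'mydict' in append mode, so rerunning the script concatenates multiple JSON documents into one file, which json.load cannot parse; open with mode 'w' if each run should produce a single valid document.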
