Parsing through CSV file to convert to JSON format file

I am given the following CSV file, which I extracted from an Excel spreadsheet. As background that may help: it lists AGI numbers (think of them as protein identifiers), the unmodified peptide sequences for those proteins, the modified peptide sequences derived from the unmodified ones, the index or indices of those modifications, and the combined spectral count for repeated peptides. The text file is called MASP.GlycoModReader.txt and the information is in the following format:

AGI,UnMd Peptide (M) = x,Mod Peptide (oM) = Ox,Index/Indeces of Modification,counts,Combined Spectral count for repeated Peptides

AT1G56070.1,NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,2,17
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2
AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1

The output file produced from the above needs to be in the following format:

AT1G56070.1,{"peptides": [{"sequence": "NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR", "mod_sequence":    
"NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR" , "mod_indeces": 2, "spectral_count": 17}, {"sequence": 
"LYMEARPMEEGLAEAIDDGR" , "mod_sequence": "LYoMEARPoMEEGLAEAIDDGR", "mod_indeces": [3, 9], 
"spectral_count": 3}, {"sequence": "EAMTPLSEFEDKL" , "mod_sequence": "EAoMTPLSEFEDKL", 
"mod_indeces": [3,9], "spectral_count": 7}, {"sequence": "EGPLAEENMR", "mod_sequence": 
"EGPLAEENoMR", "mod_indeces": 9, "spectral_count": 2}, {"sequence": "DLQDDFMGGAEIIK", 
"mod_sequence": "DLQDDFoMGGAEIIK", "mod_indeces": [7], "spectral_count": 1}]}

I have provided my solution below. If anyone has a better solution in another language, or can analyze mine and point out more efficient ways of doing this, please comment below. Thank you.

    #!/usr/bin/env node

    var fs = require('fs');
    var csv = require('csv');
    var data ="proteins.csv";

    /* Uses csv nodejs module to parse the proteins.csv file.
    * Parses the csv file row by row and updates the peptide_arr.
    * For new entries creates a peptide object, for similar entries it updates the
    * counts in the peptide object with the same AGI#.
    * Uses a peptide object to store protein ID AGI#, and the associated data.
    * Writes all formatted peptide objects to a txt file - output.txt.
    */

    // Tracks current row
    var x = 0;
    // An array of peptide objects stores the information from the csv file
    var peptide_arr = [];

    // csv module reads row by row from data 
    csv()
    .from(data)
    .to('debug.csv')
    .transform(function(row, index) {
        // For the first entry push a new peptide object with the AGI# (row[0]) 
        if(x == 0) {
        // cur is the current peptide read into row by csv module
        Peptide cur = new Peptide( row[0] );

        // Add the assoicated data from row (1-5) to cur
        cur.data.peptides.push({
            "sequence" : row[1];
            "mod_sequence" : row[2];
            if(row[5]){
            "mod_indeces" : "[" + row[3] + ", " + row[4] + "]";
            "spectral_count" : row[5];  
            } else {
            "mod_indeces" : row[3];
            "spectral_count" : row[4];  
            }
        });

        // Add the current peptide to the array
        peptide_arr.push(cur);
        }

        // Move to the next row
        x++;
    });

    // Loop through peptide_arr and append output with each peptide's AGI# and its data
    String output = "";
    for(var peptide in peptide_arr) 
    {
        output = output + peptide.toString()
    }
    // Write the output to output.txt
    fs.writeFile("output.txt", output);

    /* Peptide Object :
     *  - id:AGI#
     *  - data: JSON Array associated
     */
    function Peptide(id) // this is the actual function that does the ID retrieving and data 
                        // storage
    {
        this.id = id;
        this.data = {
            peptides: []
        };
    }

    /* Peptide methods :
     *  - toJson : Returns the properly formatted string
     */
    Peptide.prototype = {
        toString: function(){
            return this.id + "," + JSON.stringify(this.data, null, " ") + "/n"
        }
    };

Edit: When I run the solution I posted, I seem to get a memory leak error; it runs indefinitely without producing any substantial, readable output. If anyone is willing to help assess why this is occurring, that would be great.

Does your version work? It looks like you only ever create one Peptide object. Also, what is the "if(row[5])" statement doing? In your example data there are always 5 elements. Also, mod_indeces is always supposed to be a list, correct? Because in your example output file mod_indeces isn't a list in the first peptide. Anyway, here is what I came up with in Python:

import csv
import json

data = {}
# newline='' lets the csv module handle line endings itself
with open('proteins.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        name = row[0]
        sequence = row[1]
        mod_sequence = row[2]
        # "3, 9" -> [3, 9]; a real list (not a lazy map object) so json can serialize it
        mod_indeces = [int(i) for i in row[3].split(',')]
        spectral_count = int(row[4])
        peptide = {'sequence': sequence, 'mod_sequence': mod_sequence,
                   'mod_indeces': mod_indeces, 'spectral_count': spectral_count}
        if name in data:
            data[name]['peptides'].append(peptide)
        else:
            data[name] = {'peptides': [peptide]}

# newline='' writes '\n' exactly as given, with no platform translation
with open('output.txt', 'w', newline='') as f:
    for protein in data:
        f.write(protein)
        f.write(',')
        f.write(json.dumps(data[protein]))
        f.write('\n')

If you are on Windows and want to view the file as plain text, you may want to replace '\n' with '\r\n' or os.linesep.
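The expected output in the question lists the repeated peptide LYMEARPMEEGLAEAIDDGR only once, with a spectral count of 3 (1 + 2), whereas the script above appends every row as its own entry. The following is a minimal sketch of one way to combine repeats, assuming that two rows count as the same peptide when both the AGI and the modified sequence match. The function name group_peptides, the seen lookup table, and the in-memory SAMPLE string are illustrative choices, not part of the original answer.

```python
import csv
import io
import json

# Three sample rows copied from the question (header omitted).
SAMPLE = '''AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
'''

def group_peptides(lines):
    """Group rows by AGI and sum the spectral counts of repeated peptides."""
    data = {}   # AGI -> {'peptides': [...]}
    seen = {}   # (AGI, mod_sequence) -> peptide dict already stored in data
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, sequence, mod_sequence, indices, count = row[:5]
        key = (name, mod_sequence)
        if key in seen:
            # Repeated peptide: add to the existing count instead of appending.
            seen[key]['spectral_count'] += int(count)
            continue
        peptide = {
            'sequence': sequence,
            'mod_sequence': mod_sequence,
            # Always a list, even for a single index.
            'mod_indeces': [int(i) for i in indices.split(',')],
            'spectral_count': int(count),
        }
        seen[key] = peptide
        data.setdefault(name, {'peptides': []})['peptides'].append(peptide)
    return data

merged = group_peptides(io.StringIO(SAMPLE))
for protein, entry in merged.items():
    print(protein + ',' + json.dumps(entry))
```

Because group_peptides accepts any iterable of lines, an open file object can be passed in place of the io.StringIO wrapper.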

If you want to skip some rows (if there is a header or something), you can do something like this:

import csv
import json

data = {}
rows_to_skip = 1
rows_read = 0
# newline='' lets the csv module handle line endings itself
with open('proteins.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # Only process rows after the first rows_to_skip rows
        if rows_read >= rows_to_skip:
            name = row[0]
            sequence = row[1]
            mod_sequence = row[2]
            # "3, 9" -> [3, 9]; a real list so json can serialize it
            mod_indeces = [int(i) for i in row[3].split(',')]
            spectral_count = int(row[4])
            peptide = {'sequence': sequence, 'mod_sequence': mod_sequence,
                       'mod_indeces': mod_indeces, 'spectral_count': spectral_count}
            if name in data:
                data[name]['peptides'].append(peptide)
            else:
                data[name] = {'peptides': [peptide]}
        rows_read += 1

# newline='' writes '\n' exactly as given, with no platform translation
with open('output.txt', 'w', newline='') as f:
    for protein in data:
        f.write(protein)
        f.write(',')
        f.write(json.dumps(data[protein]))
        f.write('\n')
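As an alternative to keeping a manual row counter, the leading rows can be dropped with itertools.islice. This is only a sketch: the helper name read_rows is hypothetical, the header in SAMPLE is a shortened stand-in for the real one, and blank lines are filtered out on the assumption that the file may contain them.

```python
import csv
import io
from itertools import islice

# Shortened, illustrative header followed by two rows from the question.
SAMPLE = '''AGI,UnMd Peptide,Mod Peptide,Index,counts
AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2

AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1
'''

def read_rows(lines, rows_to_skip=1):
    """Return the parsed rows after skipping the first rows_to_skip rows."""
    reader = csv.reader(lines)
    # islice(reader, n, None) yields everything from row n onward;
    # "if row" drops the empty lists produced by blank lines.
    return [row for row in islice(reader, rows_to_skip, None) if row]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

The same call works on a file opened with open('proteins.csv', newline=''), and rows_to_skip can be raised if the header spans more than one line.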

If you want the keys of the dictionary to be in a particular order, you can use an OrderedDict instead of the default dict. Just replace the peptide line with the following:

peptide = OrderedDict([('sequence',sequence),
                       ('mod_sequence',mod_sequence),
                       ('mod_indeces',mod_indeces),
                       ('spectral_count',spectral_count)])

Now the order is preserved. That is, sequence is followed by mod_sequence, followed by mod_indeces, followed by spectral_count. To change the order, just change the order of elements in the OrderedDict.

Note that you will also have to add from collections import OrderedDict in order to be able to use OrderedDict.
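To illustrate the effect on the output, here is a small self-contained sketch that serializes one peptide built as an OrderedDict. The values are taken from one row of the question's sample data; the variable name line is just for illustration.

```python
import json
from collections import OrderedDict

# One peptide from the question's sample data, with keys in the desired order.
peptide = OrderedDict([('sequence', 'EGPLAEENMR'),
                       ('mod_sequence', 'EGPLAEENoMR'),
                       ('mod_indeces', [9]),
                       ('spectral_count', 2)])

# json.dumps emits the keys in the order the OrderedDict holds them.
line = json.dumps(peptide)
print(line)
# {"sequence": "EGPLAEENMR", "mod_sequence": "EGPLAEENoMR", "mod_indeces": [9], "spectral_count": 2}
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters on older interpreters or when the ordering should be explicit.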
