简体   繁体   English

将 JSON 转换为 CSV/TSV

[英]Convert JSON to CSV/TSV

I'm trying to convert this data ( https://rest.kegg.jp/get/br:ko00001/json ) in JSON to CSV/TSV.我正在尝试将 JSON 格式的这些数据( https://rest.kegg.jp/get/br:ko00001/json )转换为 CSV/TSV。 I've been able to do so in awk and sed but I'm learning Perl for bigger projects so it'd be helpful to learn to do it without the JSON module.我已经能够在 awk 和 sed 中做到这一点,但我正在为更大的项目学习 Perl,所以在没有 JSON 模块的情况下学习这样做会很有帮助。

sed -E 's/^\t{2}"name"/\t\t"level 1"/g;s/^\t{3}"name"/\t\t\t"level 2"/g;s/^\t{4}"name"/\t\t\t\t"level 3"/g;s/^\t{5}"name"/\t\t\t\t\t"level 4"/g' json.json | awk 'BEGIN {OFS="\t"} NR > 4 {match($0, /"([^"]+)": *("[^"]*")/, a)} {tag = a[1]; val = gensub(/^"|"$/, "", "g", a[2]); f[tag] = val; if (tag == "level 4") {print f["level 1"], f["level 2"], f["level 3"], f["level 4"]}}' > table.tsv

Above is how I made it by awk and sed.以上是我通过 awk 和 sed 制作的。 json.json is downloaded from the link. json.json 从链接下载。

Here is what I've been trying so far in Perl without the JSON module.这是迄今为止我在没有 JSON 模块的 Perl 中一直在尝试的。 I'd like to do it this way to learn about the data structure and how Perl works.我想通过这种方式了解数据结构以及 Perl 的工作原理。

use strict;

my $brite_hierarchy_filepath = shift @ARGV;

open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};

while (my $line = <$brite_hierarchy>) {
    next if $. == 4;
    chomp $line;

    $line =~ s/\A\t{2}"name"/"level_1"/;           
    $line =~ s/\A\t{3}"name"/"level_2"/;         
    $line =~ s/\A\t{4}"name"/"level_3"/;
    $line =~ s/\A\t{5}"name"/"level_4"/;

    my ($tag) = $line =~ /\A"(.*?)"/; 
    my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
    my %field = ($tag => $value) unless $tag eq "" && $value eq "";

    for (keys %field) {
        print join("\t", $field{"level_1"}, $field{"level_2"}, $field{"level_3"}, $field{"level_4"}, "\n");
    };
    last if eof $brite_hierarchy;
};

This is how the data looks like briefly.这就是数据的简要外观。

    {
        "name":"ko00001",
        "children":[
        {
            "name":"09100 Metabolism",
            "children":[
            {
                "name":"09101 Carbohydrate metabolism",
                "children":[
                {
                    "name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
                    "children":[
                    {
                        "name":"K00844  HK; hexokinase [EC:2.7.1.1]"
                    },
                    {
                        "name":"K12407  GCK; glucokinase [EC:2.7.1.2]"
                    },
                    {
                        "name":"K00845  glk; glucokinase [EC:2.7.1.2]"
...

And the desired output in TSV format.以及 TSV 格式的所需输出。

09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K00844  HK; hexokinase [EC:2.7.1.1]
09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K12407  GCK; glucokinase [EC:2.7.1.2]
09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K00845  glk; glucokinase [EC:2.7.1.2]

I would always suggest a JSON parser, but you can indeed treat this as just a fixed text file if you can guarantee that format never changes.我总是建议使用 JSON 解析器,但如果你能保证格式永远不会改变,你确实可以把它当作一个固定的文本文件。 In production, you usually can't.在生产中,您通常不能。 But if it's a one-off, then it certainly works.但如果它是一次性的,那么它肯定有效。

The example input you pasted into the question has spaces, not tabs, so your code would not work on it.您粘贴到问题中的示例输入有空格,而不是制表符,因此您的代码将无法使用它。 Neither would mine.我的也不会。 My input is copied from your link, and has tabs.我的输入是从您的链接中复制的,并且有标签。

Your regex patterns seem a bit complicated.您的正则表达式模式似乎有点复杂。 You can always have the same trivial pattern, but just need to vary the number of tabs before each name.您始终可以使用相同的琐碎模式,但只需要改变每个名称前的制表符数量即可。 The trick is to skip to the next line whenever you find a name that is not the final column, and to reset the entire structure on the first column.诀窍是每当您找到一个不是最后一列的名称时跳到下一行,并重置第一列的整个结构。 I chose to use an array rather than a hash, as that makes more sense and we can just join later when we output.我选择使用数组而不是哈希,因为这样更有意义,我们可以稍后在输出时join Finally, say is like print but with a built-in newline.最后, sayprint类似,但带有内置换行符。

use strict;
use warnings;
use feature 'say';

my @names;
while (<DATA>) {
    if ( m/^\t"name":"(.+)"/) {
        undef @names;
        $names[0] = $1;
        next;
    }
    if (m/^\t\t"name":"(.+)"/) {
        $names[1] = $1;
        next;
    }
    if (m/^\t\t\t"name":"(.+)"/) {
        $names[2] = $1;
        next;
    }
    if (m/^\t\t\t\t"name":"(.+)"/) {
        $names[3] = $1;
        next;
    }
    if (m/^\t\t\t\t\t"name":"(.+)"/) {
        $names[4] = $1;
        say join "\t", @names;
    }
}

__DATA__
{
    "name":"ko00001",
    "children":[
    {
        "name":"09100 Metabolism",
        "children":[
        {
            "name":"09101 Carbohydrate metabolism",
            "children":[
            {
                "name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
                "children":[
                {
                    "name":"K00844  HK; hexokinase [EC:2.7.1.1]"
                },
                {
                    "name":"K12407  GCK; glucokinase [EC:2.7.1.2]"
                },
use v5.14;
use warnings;
use open ":std", ":encoding(UTF-8)";

my @names;
while ( <> ) {
   my ( $tabs, $name ) = /^\t{2}(\t*)"name": "(.*)"/
      or next;

   my $level = length( $tabs );
   $names[ $level ] = $name;

   say join "\t", @names if $level == 4;
}

It's horrible not to use a JSON parser.不使用 JSON 解析器太可怕了。

Although the code doesn't look very clean, I manage to create the table in TSV format perfectly like the one produced by sed and awk.虽然代码看起来不是很干净,但我设法创建了 TSV 格式的表格,与 sed 和 awk 生成的表格完全一样。

Thanks all for the info on using the module JSON, but with this way, I learn a bit more about using the variable outside of the loop block, we can store it for the next turn in the loop.感谢所有关于使用模块 JSON 的信息,但是通过这种方式,我了解了更多关于在循环块之外使用变量的信息,我们可以将它存储在循环中的下一轮。

use strict;

my $brite_hierarchy_filepath = shift @ARGV;

open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};

my $previous_1;
my $previous_2;
my $previous_3;

while (my $line = <$brite_hierarchy>) {
    next if $. == 4;
    chomp $line;
    
    # change accordingly to the hierarchical levels
    $line =~ s/\A\t{2}"name"/"level_1"/;           
    $line =~ s/\A\t{3}"name"/"level_2"/;         
    $line =~ s/\A\t{4}"name"/"level_3"/;
    $line =~ s/\A\t{5}"name"/"level_4"/;

    # find the categories and put them into a hash
    my ($tag) = $line =~ /\A"(.*?)"/; 
    my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
    my %field = ($tag => $value) unless $tag eq "" && $value eq "";

    for (keys %field) {
        $previous_1 = $field{"level_1"} if $_ eq "level_1" && $field{"level_1"} ne "";
        $previous_2 = $field{"level_2"} if $_ eq "level_2" && $field{"level_2"} ne "";
        $previous_3 = $field{"level_3"} if $_ eq "level_3" && $field{"level_3"} ne "";
        print join("\t", $previous_1, $previous_2, $previous_3, $field{"level_4"}, "\n") unless $field{"level_4"} eq "";
    };
    last if eof $brite_hierarchy;
};

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM