How to merge multiple JSON files in Python

Question

I have had to create multiple JSON files while processing a corpus (using GNRD, http://gnrd.globalnames.org/, for scientific name extraction). I now want to use these JSON files to annotate the corpus as a whole.

I am trying to merge the multiple JSON files in Python. Each file contains a "names" array whose entries hold just a scientificName key and the name found as its value. Below is an example of one of the shorter files:

{  
  "file":"biodiversity_trophic_9.txt",
  "names":[  
    {  
      "scientificName":"Bufo"
    },
    {  
      "scientificName":"Eleutherodactylus talamancae"
    },
    {  
      "scientificName":"E. punctariolus"
    },
    {  
      "scientificName":"Norops lionotus"
    },
    {  
      "scientificName":"Centrolenella prosoblepon"
    },
    {  
      "scientificName":"Sibon annulatus"
    },
    {  
      "scientificName":"Colostethus flotator"
    },
    {  
      "scientificName":"C. inguinalis"
    },
    {  
      "scientificName":"Eleutherodactylus"
    },
    {  
      "scientificName":"Hyla columba"
    },
    {  
      "scientificName":"Bufo haematiticus"
    },
    {  
      "scientificName":"S. annulatus"
    },
    {  
      "scientificName":"Leptodeira septentrionalis"
    },
    {  
      "scientificName":"Imantodes cenchoa"
    },
    {  
      "scientificName":"Oxybelis brevirostris"
    },
    {  
      "scientificName":"Cressa"
    },
    {  
      "scientificName":"Coloma"
    },
    {  
      "scientificName":"Perlidae"
    },
    {  
      "scientificName":"Hydropsychidae"
    },
    {  
      "scientificName":"Hyla"
    },
    {  
      "scientificName":"Norops"
    },
    {  
      "scientificName":"Hyla colymbiphyllum"
    },
    {  
      "scientificName":"Colostethus inguinalis"
    },
    {  
      "scientificName":"Oxybelis"
    },
    {  
      "scientificName":"Rana warszewitschii"
    },
    {  
      "scientificName":"R. warszewitschii"
    },
    {  
      "scientificName":"Rhyacophilidae"
    },
    {  
      "scientificName":"Daphnia magna"
    },
    {  
      "scientificName":"Hyla colymba"
    },
    {  
      "scientificName":"Centrolenella"
    },
    {  
      "scientificName":"Orconectes nais"
    },
    {  
      "scientificName":"Orconectes neglectus"
    },
    {  
      "scientificName":"Campostoma anomalum"
    },
    {  
      "scientificName":"Caridina"
    },
    {  
      "scientificName":"Decapoda"
    },
    {  
      "scientificName":"Atyidae"
    },
    {  
      "scientificName":"Cerastoderma edule"
    },
    {  
      "scientificName":"Rana aurora"
    },
    {  
      "scientificName":"Riffle"
    },
    {  
      "scientificName":"Calopterygidae"
    },
    {  
      "scientificName":"Elmidae"
    },
    {  
      "scientificName":"Gyrinidae"
    },
    {  
      "scientificName":"Gerridae"
    },
    {  
      "scientificName":"Naucoridae"
    },
    {  
      "scientificName":"Oligochaeta"
    },
    {  
      "scientificName":"Veliidae"
    },
    {  
      "scientificName":"Libellulidae"
    },
    {  
      "scientificName":"Philopotamidae"
    },
    {  
      "scientificName":"Ephemeroptera"
    },
    {  
      "scientificName":"Psephenidae"
    },
    {  
      "scientificName":"Baetidae"
    },
    {  
      "scientificName":"Corduliidae"
    },
    {  
      "scientificName":"Zygoptera"
    },
    {  
      "scientificName":"B. buto"
    },
    {  
      "scientificName":"C. euknemos"
    },
    {  
      "scientificName":"C. ilex"
    },
    {  
      "scientificName":"E. padi noblei"
    },
    {  
      "scientificName":"E. padi"
    },
    {  
      "scientificName":"E. bufo"
    },
    {  
      "scientificName":"E. butoni"
    },
    {  
      "scientificName":"E. crassi"
    },
    {  
      "scientificName":"E. cruentus"
    },
    {  
      "scientificName":"H. colymbiphyllum"
    },
    {  
      "scientificName":"N. aterina"
    },
    {  
      "scientificName":"S. ilex"
    },
    {  
      "scientificName":"Anisoptera"
    },
    {  
      "scientificName":"Riffle delta"
    }
  ],
  "total":67,
  "status":200,
  "unique":true,
  "engines":[  
    "TaxonFinder",
    "NetiNeti"
  ],
  "verbatim":false,
  "input_url":null,
  "token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st",
  "parameters":{  
    "engine":0,
    "return_content":false,
    "best_match_only":false,
    "data_source_ids":[  

    ],
    "detect_language":true,
    "all_data_sources":false,
    "preferred_data_sources":[  

    ]
  },
  "execution_time":{  
    "total_duration":3.1727607250213623,
    "find_names_duration":1.9656541347503662,
    "text_preparation_duration":1.000107765197754
  },
  "english_detected":true
}

The issue is that there may be duplicates across the files, which I want to remove (otherwise I guess I could just concatenate the files). The other questions I have seen are about merging extra keys and values to extend the arrays themselves.

Can anyone give me guidance on how to overcome this issue?

Answer 1

If I understand correctly, you want to get all "scientificName" values in the "names" elements of a batch of files. If I'm wrong, please give an expected output to make things easier to understand.

I'd do something like this:

import json

all_names = set()  # use a set to avoid duplicates

# list all your files here
for filename in ('file1.json', 'file2.json'):
    try:
        with open(filename, 'rt') as finput:
            data = json.load(finput)
        # each entry of "names" is a dict with a "scientificName" key
        for name in data.get('names', []):
            all_names.add(name.get('scientificName'))
    except Exception as exc:
        print("Skipped file {} because exception {}".format(filename, exc))

print(all_names)
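
When there are many files, typing each name by hand gets tedious. The sketch below is one possible variant of the loop above, assuming all GNRD outputs sit in a single directory, end in .json, and each hold a top-level object with a "names" array like the example in the question. The function name collect_names and its directory argument are hypothetical, not part of the original answer.

```python
import json
from pathlib import Path


def collect_names(directory):
    """Return the set of scientificName values from every *.json file in directory."""
    all_names = set()  # a set drops duplicates across files
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open("rt", encoding="utf-8") as finput:
                data = json.load(finput)
        except (OSError, json.JSONDecodeError) as exc:
            # unreadable or malformed file: report it and move on
            print("Skipped file {} because of exception: {}".format(path, exc))
            continue
        for entry in data.get("names", []):
            name = entry.get("scientificName")
            if name is not None:  # ignore entries without a name
                all_names.add(name)
    return all_names
```

Calling collect_names("gnrd_output") would then replace the explicit tuple of file names; "gnrd_output" is only a placeholder for wherever the files live.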

And in case you want a format similar to the initial files, add:

from pprint import pprint
# build a list with one {"scientificName": ...} dict per unique name
pprint({"names": [{"scientificName": name} for name in all_names], "total": len(all_names)})
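
Note that pprint shows a Python representation rather than JSON, so its output cannot be fed straight back to a JSON parser. If the merged result needs to be saved as a file, something along these lines might work. It is a sketch only: build_merged, write_merged, and the output path are hypothetical names, and sorting the names is an assumption made to keep the output reproducible.

```python
import json


def build_merged(all_names):
    """Build a dict shaped like the "names"/"total" part of the input files."""
    names = sorted(all_names)  # sorted only so that repeated runs give the same file
    return {
        "names": [{"scientificName": name} for name in names],
        "total": len(names),
    }


def write_merged(all_names, out_path):
    """Write the merged, de-duplicated names to out_path as JSON."""
    with open(out_path, "wt", encoding="utf-8") as foutput:
        json.dump(build_merged(all_names), foutput, indent=2, ensure_ascii=False)
```

With the all_names set from the loop above, write_merged(all_names, "merged.json") would produce a single file that json.load can read back.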

Answer 1: accepted, 2017-10-05 11:45:58
