How can I count occurrences of an item in nested dictionaries in a .json file using Python (and possibly iterate over files)?

I've been trying for some time, but since I'm still quite a beginner I'm having a hard time. I have a file with JSONs, and all of them have this structure:

{
   "cds":{
      "ENSLAFT00000035968.1":{
         "A":407,
         "C":312,
         "G":320,
         "T":320,
         "Y":0,
         "M":0,
         "S":0,
         "R":0,
         "W":0,
         "K":0,
         "N":0,
         "D":0,
         "B":0,
         "H":0,
         "V":0,
         "all":1359
      }
   },
   "cdna":{
      "ENSLAFT00000034174.1":{
         "A":825,
         "C":700,
         "G":663,
         "T":584,
         "Y":0,
         "M":0,
         "S":0,
         "R":0,
         "W":0,
         "K":0,
         "N":0,
         "D":0,
         "B":0,
         "H":0,
         "V":0,
         "all":2772
      }
   }
}

The first keys (cds and cdna) each have over 1000 values (genes, the ENSLAFT + number). I would like to count all of the "N" occurrences (e.g. if some have 50 and some have 10, add them together to get 60). Should I use Counter from collections, or sum(), or len(), or some combination of them? And how do I run a loop like that for each file in my folder of JSONs with the same structure? It sounds easy to me, but I don't have much experience; so far I'm only able to count using a pandas DataFrame or with less complicated data...

I appreciate any help and further study recommendations!

You can use a for-loop:

data = {"cds": {"ENSLAFT00000035968.1": {"A": 407, "C": 312, "G": 320, "T": 320, "Y": 0, "M": 0, "S": 0, "R": 0, "W": 0, "K": 0, "N": 0, "D": 0, "B": 0, "H": 0, "V": 0, "all": 1359}}, "cdna": {"ENSLAFT00000034174.1": {"A": 825, "C": 700, "G": 663, "T": 584, "Y": 0, "M": 0, "S": 0, "R": 0, "W": 0, "K": 0, "N": 0, "D": 0, "B": 0, "H": 0, "V": 0, "all": 2772}}}
counter = 0
for value in data.values():
    # value is the dict of genes stored under "cds" or "cdna"
    for gene in value.values():
        # gene is the dict of base counts for one ID, e.g. ENSLAFT00000035968.1
        if 'N' in gene:
            counter += gene['N']
print(counter)
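
Since the question also asks about Counter and sum(), here is a minimal sketch of both, assuming the same two-level layout (section, then gene, then base counts). The `sample` dict and its gene names and numbers are made up for illustration and are not taken from the real data:

```python
from collections import Counter

# Hypothetical, shortened sample with the same shape as the question's data.
sample = {
    "cds": {
        "gene1": {"A": 4, "N": 50, "all": 54},
        "gene2": {"A": 1, "N": 10, "all": 11},
    },
    "cdna": {
        "gene3": {"A": 7, "N": 5, "all": 12},
    },
}

# sum() with a generator: add up "N" over every gene in every section.
# .get("N", 0) treats a missing "N" key as zero.
total_n = sum(
    gene.get("N", 0)
    for genes in sample.values()
    for gene in genes.values()
)

# Counter.update() with a mapping adds the values key by key,
# so this gives the totals for every base at once.
totals = Counter()
for genes in sample.values():
    for gene in genes.values():
        totals.update(gene)

print(total_n)      # 65
print(totals["N"])  # 65
```

len() is not what you want here: it would only tell you how many entries there are, not add up their values.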

You can check the key to only count some:

counter = 0
for key, value in data.items():
    # key would be cds or cdna, value is the dict of genes
    if key == "cds":
        for gene in value.values():
            # gene is the dict of base counts for one ID, e.g. ENSLAFT00000035968.1
            if 'N' in gene:
                counter += gene['N']
print(counter)
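
For the second part of the question (doing this for every file in a folder), something along these lines might work. It is only a sketch: the function names `sum_n_in_file` and `sum_n_in_folder` are made up here, and it assumes every file ends in .json and has the same section/gene/counts structure as above:

```python
import json
from pathlib import Path


def sum_n_in_file(path, key="N"):
    """Return the total of `key` over all genes in one JSON file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return sum(
        gene.get(key, 0)
        for genes in data.values()      # "cds", "cdna", ...
        for gene in genes.values()      # one dict of base counts per gene
    )


def sum_n_in_folder(folder, pattern="*.json"):
    """Map each matching file name in `folder` to its total."""
    return {
        path.name: sum_n_in_file(path)
        for path in sorted(Path(folder).glob(pattern))
    }
```

You would then call it with your own folder, for example `sum_n_in_folder("my_jsons")` (the folder name is a placeholder), and loop over the returned dict to print one total per file.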

You could go about it by brute force, looking for a regex in a string version of your JSON, e.g.:

import json
import re

s = {"cds": {"ENSLAFT00000035968.1": {"A": 407, "C": 312, "G": 320, "T": 320, "Y": 0, "M": 0, "S": 0, "R": 0, "W": 0, "K": 0, "N": 0, "D": 0, "B": 0, "H": 0, "V": 0, "all": 1359}, "cdna": {"ENSLAFT00000034174.1": {"A": 825, "C": 700, "G": 663, "T": 584, "Y": 0, "M": 0, "S": 0, "R": 0, "W": 0, "K": 0, "N": 0, "D": 0, "B": 0, "H": 0, "V": 0, "all": 2772}}}}


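s_str = json.dumps(s)
# json.dumps writes '"N": <value>' with one space after the colon by default
m = re.findall(r'"N":\s(\d+)', s_str)
print(m)  # prints ['0', '0']
print(len(m))  # prints 2 (how many "N" keys were found)
print(sum(int(n) for n in m))  # prints 0 (the "N" values added together)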

Or go the cleaner, longer route of a recursive function...

def rec_find(s):
  # Count how many dicts, at any depth, contain an "N" key.
  if not isinstance(s, dict):
    return 0
  resp = 0
  if "N" in s:
    resp += 1
  for value in s.values():
    resp += rec_find(value)
  return resp

print(rec_find(s))  # prints 2
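
Note that the recursive function above counts how many "N" keys exist; it does not add up their values. If you want the total the question asks for (50 and 10 giving 60), a variant like the following might do. This is a sketch, `sum_key` is a name made up here, and the `nested` data is invented for the example:

```python
def sum_key(obj, key="N"):
    """Recursively add up every numeric value stored under `key`."""
    if isinstance(obj, dict):
        total = 0
        for k, v in obj.items():
            if k == key and isinstance(v, (int, float)):
                total += v                 # found a value to count
            else:
                total += sum_key(v, key)   # look deeper
        return total
    if isinstance(obj, list):
        # JSON arrays become Python lists, so descend into them too.
        return sum(sum_key(item, key) for item in obj)
    return 0  # strings, numbers, None: nothing to add


# Hypothetical data, nested to different depths on purpose.
nested = {"cds": {"g1": {"N": 50}, "cdna": {"g2": {"N": 10}}}}
print(sum_key(nested))  # 60
```

Because it does not care how deep the "N" keys sit, it works whether cdna is a sibling of cds or nested inside it.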
