How to remove extra statistical mean results from JSON dictionary in python?

Question

I'm working in Python 3 and trying to compute the mean of measurements stored in a JSON dictionary of contaminants in a well. When I run the code, it prints a mean for each line of data. What I want is a single mean over all results for one contaminant. There are multiple results for the same contaminant within each year.

import statistics

for plants in data:
    for year in ["2010", "2011", "2012", "2013", "2014"]:
        arsenic_values = []
        manganese_values = []

        all_year_data = data[plants][year]

        for measurement in all_year_data:
            if measurement['contaminent'] == "arsenic":
                arsenic_values.append(float(measurement["concentration"]))
                arsenic_mean = statistics.mean(arsenic_values)

                print(plants, year, arsenic_mean)

Here's an example of what the JSON looks like for 2 years.

  "well1": {
    "2010": [],
    "2011": [
      {
        "contaminent": "arsenic",
        "concentration": "0.0420000000"
      },
      {
        "contaminent": "arsenic",
        "concentration": "0.0200000000"
      },
      {
        "contaminent": "arsenic",
        "concentration": "0.0150000000"
      },
      {
        "contaminent": "arsenic",
        "concentration": "0.0320000000"
      },
      {
        "contaminent": "manganese",
        "concentration": "0.8700000000"
      },
      {
        "contaminent": "manganese",
        "concentration": "0.8400000000"
      }
    ],

Example of what it returns, with my notes in parentheses:

well1 2011 0.042
well1 2011 0.031   (this is the mean of the first two measurements)
well1 2011 0.025666666666666667    (this is the mean of the first three measurements)
well1 2011 0.02725    (**THIS IS WHAT I WANT**, but I can't write something like a counter function because the result I want is different for each well I am looking at)

In summary:
There are multiple results for the same contaminant in each year, and I want their average. But my code as written returns an almost triangular output that grows with each line. So it's computing a running average after each measurement of the contaminant rather than grouping them all together and taking one average.

Answer 1

We can iterate over the top-level keys and use groupby on the contaminent key to get the desired result.

from statistics import mean
from operator import itemgetter
from itertools import groupby

cnt = itemgetter('concentration')
cmt = itemgetter('contaminent')

d = {'well1': {'2010': [],
  '2011': [{'concentration': '0.0420000000', 'contaminent': 'arsenic'},
   {'concentration': '0.0200000000', 'contaminent': 'arsenic'},
   {'concentration': '0.0150000000', 'contaminent': 'arsenic'},
   {'concentration': '0.0320000000', 'contaminent': 'arsenic'},
   {'concentration': '0.8700000000', 'contaminent': 'manganese'},
   {'concentration': '0.8400000000', 'contaminent': 'manganese'}]}}

top_level = d.keys()
for key in top_level:
    for year, value in d.get(key).items():
        if not value:
            print('The year {} has no values to compute'.format(year))
        else:
            for k, v in groupby(sorted(value, key=cmt), key=cmt):
                mean_ = mean(map(float, map(cnt, v)))
                print('{} {} {} {}'.format(key, year, k, mean_))

The year 2010 has no values to compute
well1 2011 arsenic 0.02725
well1 2011 manganese 0.855
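
For comparison, here is a minimal sketch of a smaller change to the loop from the question: collect all values for the year first, and compute the mean only once, after the inner loop has finished. The helper name yearly_mean is hypothetical, and the sample data is copied from the question.

```python
from statistics import mean

# Sample data copied from the question; the key is spelled 'contaminent' there.
data = {
    "well1": {
        "2010": [],
        "2011": [
            {"contaminent": "arsenic", "concentration": "0.0420000000"},
            {"contaminent": "arsenic", "concentration": "0.0200000000"},
            {"contaminent": "arsenic", "concentration": "0.0150000000"},
            {"contaminent": "arsenic", "concentration": "0.0320000000"},
            {"contaminent": "manganese", "concentration": "0.8700000000"},
            {"contaminent": "manganese", "concentration": "0.8400000000"},
        ],
    }
}


def yearly_mean(data, contaminant):
    """Hypothetical helper: return {(well, year): mean} for one contaminant."""
    results = {}
    for well, years in data.items():
        for year, measurements in years.items():
            # Gather every matching value for this well and year first.
            values = [
                float(m["concentration"])
                for m in measurements
                if m["contaminent"] == contaminant
            ]
            # Compute the mean once per year; skip years with no matching values,
            # because statistics.mean raises StatisticsError on empty input.
            if values:
                results[(well, year)] = mean(values)
    return results


for (well, year), value in yearly_mean(data, "arsenic").items():
    print(well, year, value)
```

With the sample data this prints a single line for well1 in 2011 instead of one line per measurement.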

Some concepts used above that you might not be familiar with:

map

itemgetter

groupby
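
A minimal sketch of how these three pieces fit together, and of why the input is sorted before grouping. The rows below are illustrative sample rows in the shape of the question's data, and the names by_name and by_value are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative rows; deliberately not sorted by contaminant.
rows = [
    {"contaminent": "arsenic", "concentration": "0.042"},
    {"contaminent": "manganese", "concentration": "0.87"},
    {"contaminent": "arsenic", "concentration": "0.02"},
]

by_name = itemgetter("contaminent")     # by_name(row) is row["contaminent"]
by_value = itemgetter("concentration")  # by_value(row) is row["concentration"]

# groupby only merges adjacent items with equal keys, so sort by the key first.
# map(by_value, group) pulls out the strings, map(float, ...) converts them.
grouped = {
    name: list(map(float, map(by_value, group)))
    for name, group in groupby(sorted(rows, key=by_name), key=by_name)
}
print(grouped)

# Without sorting, "arsenic" shows up as two separate groups.
unsorted_keys = [name for name, _ in groupby(rows, key=by_name)]
print(unsorted_keys)
```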

Answer 2

If you have a lot of measurements, you may want to avoid itertools.groupby: it only groups adjacent items, so the list has to be sorted first, and sorting costs O(n log n). A single pass that builds a dictionary of the values grouped by well, year and contaminent is enough, and setdefault makes it easy:

>>> import json
>>> data_by_year_by_well = json.loads(text)
>>> d = {}
>>> for w, data_by_year in data_by_year_by_well.items():
...     for y, data in data_by_year.items():
...         for item in data:
...             d.setdefault(w, {}).setdefault(y, {}).setdefault(item['contaminent'], []).append(float(item['concentration']))
...
>>> d
{'well1': {'2011': {'arsenic': [0.042, 0.02, 0.015, 0.032], 'manganese': [0.87, 0.84]}}}

Now, compute the mean (or the median, or any aggregate value):

>>> from statistics import mean
>>> {w: {y: {c: mean(v) for c, v in v_by_c.items()} for y, v_by_c in d_by_y.items()} for w, d_by_y in d.items()}
{'well1': {'2011': {'arsenic': 0.02725, 'manganese': 0.855}}}
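
The same single-pass grouping can be sketched with collections.defaultdict, with the aggregate passed in as a parameter so that mean can be swapped for median or any other function. This is a variation on the approach above, not the code from the answer: the function name aggregate is hypothetical, and it uses flat (well, year, contaminant) tuple keys instead of nested dictionaries.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate(data_by_year_by_well, func=mean):
    """Hypothetical helper: apply func to the values of each (well, year, contaminant)."""
    grouped = defaultdict(list)
    for well, data_by_year in data_by_year_by_well.items():
        for year, items in data_by_year.items():
            for item in items:
                # One list per (well, year, contaminant), filled in a single pass.
                key = (well, year, item["contaminent"])
                grouped[key].append(float(item["concentration"]))
    return {key: func(values) for key, values in grouped.items()}


# Sample data copied from the question.
sample = {
    "well1": {
        "2010": [],
        "2011": [
            {"contaminent": "arsenic", "concentration": "0.0420000000"},
            {"contaminent": "arsenic", "concentration": "0.0200000000"},
            {"contaminent": "arsenic", "concentration": "0.0150000000"},
            {"contaminent": "arsenic", "concentration": "0.0320000000"},
            {"contaminent": "manganese", "concentration": "0.8700000000"},
            {"contaminent": "manganese", "concentration": "0.8400000000"},
        ],
    }
}

means = aggregate(sample)
medians = aggregate(sample, median)
print(means)
print(medians)
```

Years with no measurements, such as 2010 here, simply produce no key, so func is never called on an empty list.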
