Calculate mean using a NumPy ndarray

The text file looks like this:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170

How can I calculate the mean weight and mean height for david and mark, as follows?

david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)

My incomplete code is:

import numpy as np
import csv

with open('data.txt', 'r') as infile:
    contents = csv.reader(infile, delimiter=' ')
    c1, c2, c3 = zip(*contents)
    data = np.array(c3, dtype=float)

How do I then apply np.mean?

The mean function computes the average of an array of numbers. You will need to come up with a way to select the values of c3 by applying a condition to c2.
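A minimal sketch of that kind of selection, assuming the three columns from the question's zip(*contents) are turned into NumPy arrays. The inline rows stand in for data.txt, and mean_for is a hypothetical helper name, not something from the question:

```python
import numpy as np

# Sample rows standing in for data.txt (same values as in the question).
rows = [
    ("david", "weight_2005", 50), ("david", "weight_2012", 60),
    ("david", "height_2005", 150), ("david", "height_2012", 160),
    ("mark", "weight_2005", 90), ("mark", "weight_2012", 85),
    ("mark", "height_2005", 160), ("mark", "height_2012", 170),
]
c1, c2, c3 = zip(*rows)
names = np.array(c1)
attrs = np.array(c2)
values = np.array(c3, dtype=float)

def mean_for(name, prop):
    """Mean of the values whose name matches and whose attribute starts with prop."""
    mask = (names == name) & np.char.startswith(attrs, prop + "_")
    return values[mask].mean()

david_weight = mean_for("david", "weight")
mark_height = mean_for("mark", "height")
```

The boolean mask plays the role of the "condition on c2": it picks out the matching positions, and mean is then applied only to those entries of the value array.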

What would probably suit your needs better is splitting the data up into a hierarchical structure. I prefer using dictionaries, something like:

import csv

data = {}
with open('data.txt') as f:
    contents = csv.reader(f, delimiter=' ')
    # Iterate inside the with block: the reader needs the file to still be open
    for name, attribute, value in contents:
        data[name] = data.get(name, {})  # Default value is a new dict
        attr_name, attr_year = attribute.split('_')
        attr_year = int(attr_year)
        data[name][attr_name] = data[name].get(attr_name, {})
        data[name][attr_name][attr_year] = float(value)  # Store numbers, not strings

Now data will look like

{
    "david": {
        "weight": {
            2005: 50,
            2012: 60
        },
        "height": {
            2005: 150,
            2012: 160
        }
    },
    "mark": {
        "weight": {
            2005: 90,
            2012: 85
        },
        "height": {
            2005: 160,
            2012: 170
        }
    }
}

Then what you can do is

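# A dict view must be turned into a list before np.mean can average it
david_avg_weight = np.mean(list(data['david']['weight'].values()))
# Python 3 uses .items() instead of .iteritems(); keep only years after 2008
mark_avg_height = np.mean([v for k, v in data['mark']['height'].items() if k > 2008])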

Here I'm still using np.mean, but only calling it on a normal Python list.
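The same idea extends to every name and attribute at once. A minimal sketch, assuming data already has the nested name -> attribute -> year shape shown above (the literal dict here is a stand-in for the parsed file, and means is just an illustrative name):

```python
import numpy as np

# Nested dict standing in for the parsed file: name -> attribute -> year -> value.
data = {
    "david": {"weight": {2005: 50, 2012: 60}, "height": {2005: 150, 2012: 160}},
    "mark": {"weight": {2005: 90, 2012: 85}, "height": {2005: 160, 2012: 170}},
}

# One mean per (name, attribute) pair, averaged over all recorded years.
means = {
    name: {
        attr: float(np.mean(list(by_year.values())))
        for attr, by_year in attrs.items()
    }
    for name, attrs in data.items()
}
```

After this, means["david"]["weight"] holds david's average weight, and so on for each name and attribute.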

I'll make this community wiki, because it's more "here's how I think you should do it instead" than "here's the answer to the question you asked". For something like this I'd probably use pandas instead of numpy, as its grouping tools are much better. It'll also be useful to compare with numpy-based approaches.

import pandas as pd
# A regex separator needs the Python parsing engine
df = pd.read_csv("data.txt", sep="[ _]", header=None, engine="python",
                 names=["name", "property", "year", "value"])
means = df.groupby(["name", "property"])["value"].mean()

.. and, er, that's it.


First, read the data into a DataFrame, letting either a space or _ separate columns:

>>> import pandas as pd
>>> df = pd.read_csv("data.txt", sep="[ _]", header=None, engine="python",
...                  names=["name", "property", "year", "value"])
>>> df
    name property  year  value
0  david   weight  2005     50
1  david   weight  2012     60
2  david   height  2005    150
3  david   height  2012    160
4   mark   weight  2005     90
5   mark   weight  2012     85
6   mark   height  2005    160
7   mark   height  2012    170

Then group by name and property, take the value column, and compute the mean:

>>> means = df.groupby(["name", "property"])["value"].mean()
>>> means
name   property
david  height      155.0
       weight       55.0
mark   height      165.0
       weight       87.5
Name: value, dtype: float64

.. okay, the sep="[ _]" trick is a little too cute for real code, though it works well enough here. In practice I'd use a whitespace separator, read in the second column as property_year and then do

df["property"], df["year"] = zip(*df["property_year"].str.split("_"))
del df["property_year"]

to allow underscores in other columns.
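Put together, that variant might look roughly like this. It's only a sketch: the inline sample stands in for data.txt, and it uses str.rsplit with expand=True rather than the zip idiom above, so that only the last underscore is split on:

```python
import io
import pandas as pd

# Inline sample standing in for data.txt (same rows as in the question).
sample = io.StringIO(
    "david weight_2005 50\ndavid weight_2012 60\n"
    "david height_2005 150\ndavid height_2012 160\n"
    "mark weight_2005 90\nmark weight_2012 85\n"
    "mark height_2005 160\nmark height_2012 170\n"
)

df = pd.read_csv(sample, sep=" ", header=None,
                 names=["name", "property_year", "value"])

# Split on the last underscore only, so property names may contain underscores.
df[["property", "year"]] = df["property_year"].str.rsplit("_", n=1, expand=True)
df["year"] = df["year"].astype(int)
df = df.drop(columns="property_year")

means = df.groupby(["name", "property"])["value"].mean()
```

The grouping step is unchanged; only the way the property and year columns are obtained differs.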

You can read your data directly into a NumPy structured array with:

data = np.genfromtxt("data.txt", delimiter=" ", dtype=None, encoding="utf-8", names=["name", "type", "value"])

Then you can find the appropriate indices with np.where:

indices = np.where((data["name"] == "david") & np.char.startswith(data["type"], "height"))

and take the mean over those indices:

np.mean(data["value"][indices])

If your data is always in the format provided, you could do this using array slicing on the data array from the question:

(data[:-1:2] + data[1::2]) / 2

This results in:

[  55.   155.    87.5  165. ]
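If the values really do come in consecutive pairs, an equivalent way to express this is to reshape the array into rows of two and average each row. A small sketch, assuming data holds the eight values from the question in file order (pair_means is an illustrative name):

```python
import numpy as np

# The value column from the question, in file order.
data = np.array([50, 60, 150, 160, 90, 85, 160, 170], dtype=float)

# Each row of the reshaped array is one (2005, 2012) pair; average across it.
pair_means = data.reshape(-1, 2).mean(axis=1)
```

Like the slicing version, this relies entirely on the row order of the file, so it breaks if a measurement is missing or the rows are shuffled.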

