
Split one column of csv file based on another column

I am trying to split a CSV file of temperature data into smaller dictionaries so I can calculate the mean temperature for each month. The CSV file has the format below:

AirTemperature  AirHumidity SoilTemperature SoilMoisture    LightIntensity  WindSpeed   Year    Month   Day Hour    Minute  Second  TimeStamp   MonthCategorical
12  68  19  65  60  2   2016    1   1   0   1   1   10100   January
18  34  14  42  19  0   2016    1   1   1   1   1   10101   January
19  98  14  41  30  4   2016    1   1   2   1   1   10102   January
16  88  16  68  54  4   2016    1   1   3   1   1   10103   January
16  44  20  41  10  1   2016    1   1   4   1   1   10104   January
22  54  18  65  94  0   2016    1   1   5   1   1   10105   January
18  84  17  41  40  4   2016    1   1   6   1   1   10106   January
20  88  22  92  31  0   2016    1   1   7   1   1   10107   January
23  1   22  59  3   0   2016    1   1   8   1   1   10108   January
23  3   22  72  41  4   2016    1   1   9   1   1   10109   January
24  63  23  83  85  0   2016    1   1   10  1   1   10110   January
29  73  27  50  1   4   2016    1   1   11  1   1   10111   January
28  37  30  46  29  3   2016    1   1   12  1   1   10112   January
30  99  32  78  73  4   2016    1   1   13  1   1   10113   January
32  72  31  80  80  1   2016    1   1   14  1   1   10114   January

There are 24 readings per day over a 6-month period.

I can get halfway there with the following code:

for row in df['AirTemperature']:
    for equivalentRow in df['MonthCategorical']:
        if equivalentRow == "January":
            JanuaryAirTemperatures.append(row)

But the output has every AirTemperature value duplicated once for each row containing the value January, i.e. rather than 12, 18, 19, etc. it goes 12, 12, 12, 12, 12, 18, 18, 18, 18, 18, 19, 19, 19, 19.

I tried the following:

for row in df['AirTemperature']:
    if df['MonthCategorical'] == "January":
        JanuaryAirTemperatures.append(row)

But I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I assume this is because it is looking at the whole column rather than the equivalent row.
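
That assumption matches how pandas behaves: df['MonthCategorical'] == "January" produces one boolean per row, and an if statement cannot reduce a whole Series to a single True/False. A minimal sketch of a row-aligned selection follows, assuming a small stand-in DataFrame built from the first three sample rows plus one made-up February row (the February value and the snake_case variable names are illustrative, not from the original data):

```python
import pandas as pd

# Stand-in data: first three January rows from the sample, plus one
# hypothetical February row so the filter has something to exclude.
df = pd.DataFrame({
    "AirTemperature": [12, 18, 19, 25],
    "MonthCategorical": ["January", "January", "January", "February"],
})

# The comparison yields a boolean Series (one value per row),
# not a single True/False, which is why `if` raises ValueError.
is_january = df["MonthCategorical"] == "January"

# Use that boolean Series as a mask to keep only the matching rows.
january_air_temperatures = df.loc[is_january, "AirTemperature"].tolist()

# Equivalent explicit loop: zip pairs each temperature with the month
# on the same row, instead of looping over every month for every row.
january_loop = [
    temp
    for temp, month in zip(df["AirTemperature"], df["MonthCategorical"])
    if month == "January"
]

january_mean = df.loc[is_january, "AirTemperature"].mean()
```

The nested loop in the first attempt compares every temperature against every month, which is where the duplication comes from; both versions above look at one row at a time.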

If I understand correctly, you can group by month and get the mean air temperature per month with:

g = df.groupby('MonthCategorical')['AirTemperature'].mean().reset_index(name='MeanAirTemperature')

This returns:

  MonthCategorical  MeanAirTemperature
0          January                22.0

You can then choose which columns to group by (e.g. group by Month instead of MonthCategorical).
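
As a rough sketch of that choice, and of the per-month dictionaries the question asks about, the snippet below groups once on the numeric Month column and once on MonthCategorical. The DataFrame is a stand-in: the February rows and the names mean_by_month and temps_by_month are made up for illustration.

```python
import pandas as pd

# Stand-in data: three January rows from the sample plus two
# hypothetical February rows.
df = pd.DataFrame({
    "AirTemperature": [12, 18, 19, 25, 27],
    "Month": [1, 1, 1, 2, 2],
    "MonthCategorical": ["January", "January", "January",
                         "February", "February"],
})

# Grouping on the numeric Month column: the result is indexed by month number.
mean_by_month = df.groupby("Month")["AirTemperature"].mean()

# Iterating over the grouped column yields (group key, Series) pairs,
# which can be collected into a dict of plain lists.
# sort=False keeps the months in order of first appearance.
temps_by_month = {
    month: temps.tolist()
    for month, temps in df.groupby("MonthCategorical", sort=False)["AirTemperature"]
}
```

By default groupby sorts the group keys, so grouping on the month names gives alphabetical order (February before January), while grouping on the numeric Month column keeps calendar order.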

EDIT: You can also use transform to add a new column to the original DataFrame with:

df['MeanAirTemperature'] = df.groupby('MonthCategorical')['AirTemperature'].transform('mean')
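
Unlike mean(), which returns one row per group, transform returns a result aligned with the original rows, so it can be assigned straight back as a column. A small sketch, again on stand-in data (the February rows and the DeviationFromMonthlyMean column are hypothetical additions for illustration):

```python
import pandas as pd

# Stand-in data: three January rows from the sample plus two
# hypothetical February rows.
df = pd.DataFrame({
    "AirTemperature": [12, 18, 19, 25, 27],
    "MonthCategorical": ["January", "January", "January",
                         "February", "February"],
})

# transform("mean") broadcasts each group's mean back onto every row
# of that group, so the result has the same length and index as df.
df["MeanAirTemperature"] = (
    df.groupby("MonthCategorical")["AirTemperature"].transform("mean")
)

# One possible use: how far each reading is from its month's mean.
df["DeviationFromMonthlyMean"] = df["AirTemperature"] - df["MeanAirTemperature"]
```

Every January row carries the January mean and every February row the February mean, which makes row-level comparisons against the monthly average straightforward.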

