Pandas Dataframe: selecting multiple column by value does not work, why?

Question

I build a Dataframe from an sql query (data for may 2014 to june 2015) and try to build 2 different data sets - train uses all data except June 2015 - test uses only data for June 2015 When I try to use : train= df[(df.month!=6) & (df.year!=2015)]

It seems that I am using OR instead of AND, because I do not get any values for month=6 (also not for 2014) and no values for year=2015 at all, so also not for other months in 2015.

I do not understand what is wrong with the code.

mycurs.execute("""SELECT day, month, year, cloudcover, moonphase, precipintensity,
precipaccumulation, preciptype2, humidity, pressure, windspeed,
uvindex, visibility, temperaturehigh, weekend_2, crimecount
FROM open.nyc_weatherhist_2""")
f=mycurs.fetchall() #returns tuple with first row (unordered list)

df=pd.DataFrame(f, columns=feature_names)
print(df)

     day  month  year  ...  temperaturehigh  weekend  crimecount
0     28      4  2015  ...            20.85        0          56
1     14      4  2015  ...            18.25        0         103
2     13      7  2014  ...            27.44        0          89
3      4      1  2015  ...            12.94        0          99
4     21      9  2014  ...            24.15        0          66
..   ...    ...   ...  ...              ...      ...         ...
390    4      7  2014  ...            23.37        1          84
391    8      8  2014  ...            27.98        1          97
392   26      4  2015  ...            15.78        0          82
393    3      8  2014  ...            24.50        0          80
394    5      6  2015  ...            20.65        1          87

[395 rows x 16 columns]

train= df[(df.month!=6) & (df.year!=2015)]
print(train)

     day  month  year  ...  temperaturehigh  weekend  crimecount
2     13      7  2014  ...            27.44        0          89
4     21      9  2014  ...            24.15        0          66
8     10     11  2014  ...            16.27        0          76
9      5     11  2014  ...            17.76        0         101
11    10      7  2014  ...            28.06        0          99
..   ...    ...   ...  ...              ...      ...         ...
382   10      8  2014  ...            30.51        0         119
389   21     11  2014  ...             2.65        1         110
390    4      7  2014  ...            23.37        1          84
391    8      8  2014  ...            27.98        1          97
393    3      8  2014  ...            24.50        0          80

[184 rows x 16 columns]

Answer 1

Just elaborating on Francis answer, you're looking at the condition in the wrong manner.

You want all the data except for a certain month in a certain year. So, when would the value be "right"? If either of them is not in the month or year.

So, this is exactly the OR condition.

Another way to look at it - you have conditions X and Y . You're "wrong" when both X and Y exist, or simply X & Y .
So, when are you "right"? When !(X & Y) == !X | !Y !X | !Y .

It's your choice how you tackle this, but you can do something like:

train = df[(df.month != 6) | (df.year != 2015)]
or
train = df[~((df.month == 6) & (df.year == 2015))]
Which are equivalent

Answer 2

You need to use | for OR as suspected.

    >>> df = pd.DataFrame({'year':[2014,2015,2015],'month':[6,5,6]})
    >>> df
       year  month
    0  2014      6
    1  2015      5
    2  2015      6
    >>> train = df[(df['year']!=2015) | (df['month']!=6)]
    >>> train
       year  month
    0  2014      6
    1  2015      5

Pandas Dataframe: selecting multiple column by value does not work, why?

Question

2 answers

solution1
2 2020-03-31 09:45:34

solution2
1 2020-03-31 09:32:49

Pandas Dataframe: selecting multiple column by value does not work, why?

Question

2 answers

solution1 2 2020-03-31 09:45:34

solution2 1 2020-03-31 09:32:49

solution1
2 2020-03-31 09:45:34

solution2
1 2020-03-31 09:32:49