
pandas: how to select all rows of a DataFrame that meet a condition (ValueError: "Arrays were different lengths")

Python 2.7

I have a DataFrame with two columns, coordinates and locs . coordinates contains 10 lat/long pairs and locs contains 10 strings.

The following code raises a ValueError: arrays were different lengths. It seems like I'm not writing the condition correctly.

lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] != ['37.226582', '-95.70522299999999']] # ValueError

The error message is

File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
    res = na_op(values, other)
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:...\biney\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2

My goal here is actually to check for and eliminate all entries in the coordinates column that are equal to the list ['37.226582, -95.70522299999999'], so I want df['coordinates'] to print out [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['40.289918, -83.036372']]

I was hoping that this documentation would help, particularly the part that shows: "You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):" df[df['A'] > 0]

so it seems like I'm not quite getting the syntax right, but I'm stuck. How do I set a condition on the cell value of a certain column and return a DataFrame containing only the rows whose cells meet that condition?

Can you consider this approach?:

df
    coordinates                 locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [29.7604267, -95.3698028]   [Houston, TX]
4   [47.6062095, -122.3320708]  [Seattle, WA]
5   [34.0232431, -84.3615555]   [Roswell, GA]
6   [31.9685988, -99.9018131]   [Texas]
7   [37.226582, -95.705222999]  [null]
8   [40.289918, -83.036372]     [??, passing by...]
9   [37.226582, -95.7052229999] [null]


df['lat'] = df['coordinates'].map(lambda x: np.float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: np.float(x[0].split(",")[1]))
df[~((np.isclose(df['lat'],37.226582)) & (np.isclose(df['lon'],-95.70522299999999)))]


    coordinates                 locs                 lat        lon
0   [37.09024, -95.712891]      [United States]      37.090240  -95.712891
1   [-37.605, 145.146]          [Doreen, Melbourne] -37.605000  145.146000
2   [43.0481962, -76.0488458]   [Upstate NY]         43.048196  -76.048846
3   [29.7604267, -95.3698028]   [Houston, TX]        29.760427  -95.369803
4   [47.6062095, -122.3320708]  [Seattle, WA]        47.606209  -122.332071
5   [34.0232431, -84.3615555]   [Roswell, GA]        34.023243  -84.361555
6   [31.9685988, -99.9018131]   [Texas]              31.968599  -99.901813
8   [40.289918, -83.036372]     [??, passing by...]  40.289918  -83.036372

One issue: if you look at the objects your DataFrame is storing, you will see that each coordinate pair is held as a single string (inside a one-element list). The error you are getting arises because pandas is comparing the 10-element coordinates Series with a 2-element list, and the lengths obviously don't match. Using .values seemed to get around that:

df2 = pd.DataFrame([row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan] for row in df.values], columns=['coords', 'locs']).dropna()
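An alternative that sidesteps the length mismatch entirely is to build the boolean mask element by element with .apply, so pandas compares each cell (a one-element list) against the target list instead of trying to broadcast (a sketch, written in Python 3 syntax):

```python
import pandas as pd

# same data as in the question
lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'],
                ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'],
                ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'],
                ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'],
                ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'],
               ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'],
               ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame({'coordinates': lst_10_cords, 'locs': lst_10_locs})

# .apply evaluates the comparison cell by cell, so no broadcasting occurs
mask = df['coordinates'].apply(lambda cell: cell != ['37.226582, -95.70522299999999'])
filtered = df[mask]  # drops both rows holding the unwanted pair
```

This keeps the original one-element-list cells intact, at the cost of a plain-Python comparison per row.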

OK, here is an approach to ensure you have clean data to operate on.

Let's assume 4 entries, one of which has a dirty coordinate entry.

lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs


    coordinates                     locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [null]                      [Houston, TX]

Now we make a cleaning method. You would really want to test the values by checking that:

type(value) is list
type(value[0]) is str
value[0].split(",") has exactly two elements
each element can be cast to float
each value is in a valid range for a latitude or a longitude
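A strict version of that checklist might look like the following (a sketch; the name is_valid_coordinate and the ±90/±180 range bounds are my assumptions about what "valid to be a lat or a lon" means):

```python
def is_valid_coordinate(value):
    """Return True only if value passes every check in the list above."""
    if not isinstance(value, list) or len(value) != 1:
        return False                    # must be a one-element list
    if not isinstance(value[0], str):
        return False                    # that element must be a string
    parts = value[0].split(",")
    if len(parts) != 2:
        return False                    # exactly "lat, lon"
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False                    # both halves must parse as floats
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```

You could then drop bad rows with df[df['coordinates'].map(is_valid_coordinate)] before any conversion.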

However, we will do it the dirty way using a try/except.

def scrubber_drainer(value):
    try:
        # we assume value is a list with a single string in position zero,
        # and that the string has a comma so we can split it into two floats
        return tuple([float(value[0].split(",")[0]), float(value[0].split(",")[1])])
    except (TypeError, ValueError, IndexError, AttributeError):
        # return (38.9072, 77.0396)  # swamp
        return (0.0, 0.0)  # some default

So the return is typically a tuple with 2 floats. If the value can't be parsed into that, we return a default (0.0, 0.0).

Now update the coordinates column:

df['coordinates'] = df['coordinates'].map(scrubber_drainer)

then we use this cool technique to split out the tuple

df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)

and now you can use the np.isclose() to filter
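Putting the pieces together, the final filter might look like this (a sketch that rebuilds the 4-row example and drops the row that fell back to the (0.0, 0.0) default; np.isclose guards against float rounding):

```python
import numpy as np
import pandas as pd

lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'],
               ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame({'coordinates': lst_4_cords, 'locs': lst_4_locs})

def scrubber_drainer(value):
    try:
        # split "lat, lon" into a tuple of two floats
        return tuple(float(part) for part in value[0].split(","))
    except (TypeError, ValueError, IndexError, AttributeError):
        return (0.0, 0.0)  # some default

df['coordinates'] = df['coordinates'].map(scrubber_drainer)
df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)

# keep rows whose coordinates are NOT close to the default marker
clean = df[~(np.isclose(df['lat'], 0.0) & np.isclose(df['lon'], 0.0))]
```

The same mask with the unwanted pair's own lat/lon substituted for 0.0 would remove specific coordinates instead of the dirty ones.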
