Python 2.7

I have a DataFrame with two columns, coordinates and locs. coordinates contains 10 lat/long pairs and locs contains 10 strings.

The following code raises ValueError: Arrays were different lengths. It seems I'm not writing the condition correctly.
import pandas as pd

lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] != ['37.226582', '-95.70522299999999']] #ValueError
The error message is:

  File "C:\\Users...\\Miniconda3\\envs\\py2.7\\lib\\site-packages\\pandas\\core\\ops.py", line 1283, in wrapper
    res = na_op(values, other)
  File "C:\\Users...\\Miniconda3\\envs\\py2.7\\lib\\site-packages\\pandas\\core\\ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "C:...\\biney\\Miniconda3\\envs\\py2.7\\lib\\site-packages\\pandas\\core\\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2
My goal is to check for and eliminate every entry in the coordinates column that is equal to the list ['37.226582, -95.70522299999999'], so I want df['coordinates'] to print out:
[['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['40.289918, -83.036372']]
I was hoping this documentation would help, particularly the part that says: "You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):" df[df['A'] > 0]
It seems I'm not quite getting the syntax right, and I'm stuck. How do I set a condition on the cell value of a certain column and return a DataFrame containing only the rows whose cells meet that condition?
Consider this:
df
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [29.7604267, -95.3698028] [Houston, TX]
4 [47.6062095, -122.3320708] [Seattle, WA]
5 [34.0232431, -84.3615555] [Roswell, GA]
6 [31.9685988, -99.9018131] [Texas]
7 [37.226582, -95.705222999] [null]
8 [40.289918, -83.036372] [??, passing by...]
9 [37.226582, -95.7052229999] [null]
import numpy as np

df['lat'] = df['coordinates'].map(lambda x: float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: float(x[0].split(",")[1]))
df[~(np.isclose(df['lat'], 37.226582) & np.isclose(df['lon'], -95.70522299999999))]
coordinates locs lat lon
0 [37.09024, -95.712891] [United States] 37.090240 -95.712891
1 [-37.605, 145.146] [Doreen, Melbourne] -37.605000 145.146000
2 [43.0481962, -76.0488458] [Upstate NY] 43.048196 -76.048846
3 [29.7604267, -95.3698028] [Houston, TX] 29.760427 -95.369803
4 [47.6062095, -122.3320708] [Seattle, WA] 47.606209 -122.332071
5 [34.0232431, -84.3615555] [Roswell, GA] 34.023243 -84.361555
6 [31.9685988, -99.9018131] [Texas] 31.968599 -99.901813
8 [40.289918, -83.036372] [??, passing by...] 40.289918 -83.036372
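The same filter can also be written without keeping the helper columns, by comparing each cell's raw string directly. A minimal sketch with hypothetical two-row data (this is an exact string comparison, so it only catches literal matches):

```python
import pandas as pd

df = pd.DataFrame({'coordinates': [['37.09024, -95.712891'],
                                   ['37.226582, -95.70522299999999']],
                   'locs': [['United States'], ['null']]})

# Each cell is a one-element list holding a "lat, lon" string, so compare
# that string directly instead of comparing the whole Series against a list.
mask = df['coordinates'].map(lambda x: x[0] != '37.226582, -95.70522299999999')
filtered = df[mask]
```

Building the mask with .map keeps the boolean vector the same length as the index, which is exactly what the boolean-indexing documentation quoted in the question requires.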
One issue: if you look at the objects your DataFrame is storing, you see that each coordinates cell holds a single string. The error you are getting arises because pandas compares the 10-element coordinates Series element-wise against a 2-element list, and the lengths don't match. Going through .values gets around that:
df2 = pd.DataFrame([row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan]
                    for row in df.values], columns=['coords', 'locs']).dropna()
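Put together as a runnable sketch, with hypothetical two-row data standing in for the original ten:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'coordinates': [['37.09024, -95.712891'],
                                   ['37.226582, -95.70522299999999']],
                   'locs': [['United States'], ['null']]})

# Blank out matching rows with NaNs, then drop them.
df2 = pd.DataFrame(
    [row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan]
     for row in df.values],
    columns=['coords', 'locs']).dropna()
```

Note that iterating over .values hands you plain row arrays, so row[0] is the one-element list in the cell and the list-to-list comparison is well defined.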
OK, here is an approach to ensure you have clean data to operate on.
Let's assume four entries, one of them with a dirty coordinate entry.
lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [null] [Houston, TX]
Now we make a cleaning method. Ideally you would validate each value:
type(value) is list
type(value[0]) is str
value[0].split(",") has exactly two elements
each element can be cast to float
each float is in the valid range for a lat or a lon
However, we will do it the quick and dirty way using a try/except.
def scrubber_drainer(value):
    try:
        # assume value is a list holding a single string in position zero,
        # and that string splits on a comma into two floats
        return tuple([float(value[0].split(",")[0]),
                      float(value[0].split(",")[1])])
    except Exception:
        # return (38.9072, 77.0396)  # the swamp
        return (0.0, 0.0)  # some default
So the return is typically a tuple of two floats; if the value can't be parsed into that, we return the default (0.0, 0.0).
Now update the coordinates:
df['coordinates'] = df['coordinates'].map(scrubber_drainer)
Then we use this neat technique to split the tuple into two columns:
df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)
and now you can use the np.isclose() to filter
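For instance, a minimal sketch of that last filter, with hypothetical two-row data standing in for the scrubbed frame (one good pair, one row that got the (0.0, 0.0) default from scrubber_drainer):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one good coordinate tuple and one scrubbed default.
df = pd.DataFrame({'coordinates': [(37.09024, -95.712891), (0.0, 0.0)],
                   'locs': [['United States'], ['Houston, TX']]})
df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)

# Drop rows whose coordinates are (approximately) the (0.0, 0.0) default.
clean = df[~(np.isclose(df['lat'], 0.0) & np.isclose(df['lon'], 0.0))]
```

np.isclose is the right tool here because the lat/lon values went through float parsing, so exact equality comparisons on them are fragile.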