
Computing values for a column in pandas using other columns

I have a DataFrame containing 3 columns: 'longitude', 'latitude', and 'country'. For some longitude/latitude pairs, the value in the country column is 'unknown'. Here is an overview of the DataFrame:

 longitude   latitude  country
-76.250000  83.083333  China
-76.166667  83.083333  unknown
-76.083333  83.083333  USA
-76.000000  83.083333  India
-75.916667  83.083333  unknown
-68.166667 -55.500000  unknown
-67.666667 -55.500000  UK
-68.166667 -55.583333  Chile
-68.083333 -55.583333  Canada
-67.500000 -55.666667  unknown

For each row with an unknown country, I want to find the row with a known country at the minimum Euclidean distance (computed from longitude and latitude) and replace 'unknown' with that row's country name. Is there an efficient way to do that?

Your example is not representative. The only country value you have is Chile. However, something like the following should work:

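import pandas as pd
from scipy.spatial import distance

def euclidean(point, others):
    # point: one row as [longitude, latitude, country]
    # others: all rows with a known country, same column order
    # Compare coordinates only (drop the last column), then return the
    # country of the closest known row.
    dists = distance.cdist(point[None, :-1].astype(float), others[:, :-1].astype(float))
    return others[dists.argmin(), 2]

unknown = df[df["country"].eq("unknown")]
known = df[df["country"].ne("unknown")]

# For each unknown row, look up the country of the nearest known row.
matches = unknown.apply(lambda row: euclidean(row.to_numpy(), known.to_numpy()), axis=1)
# Keep known countries; fill the rest from matches (aligned on the index).
df["country"] = df["country"].where(df["country"].ne("unknown"), matches)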

>>> df
   longitude   latitude country
0 -76.250000  83.083333   China
1 -76.166667  83.083333   China
2 -76.083333  83.083333     USA
3 -76.000000  83.083333   India
4 -75.916667  83.083333   India
5 -68.166667 -55.500000   Chile
6 -67.666667 -55.500000      UK
7 -68.166667 -55.583333   Chile
8 -68.083333 -55.583333  Canada
9 -67.500000 -55.666667      UK

Performance:

big_df = pd.concat([df]*1000)
unknown = big_df[big_df["country"].eq("unknown")]
known = big_df[big_df["country"].ne("unknown")]

>>> %timeit unknown.apply(lambda row: euclidean(row.to_numpy(), known.to_numpy()), axis=1)
847 µs ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
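
If the per-row apply becomes the bottleneck, one possible variation is to call cdist once for all unknown rows and take the argmin along each row of the resulting distance matrix. The following is only a sketch, not part of the original answer: the helper name fill_unknown_vectorized and its coords parameter are made up for illustration, the sample frame is retyped from the question, and it assumes at least one row has a known country. It builds an (n_unknown x n_known) matrix in memory, so for very large frames a spatial index such as scipy.spatial.cKDTree may be a better fit.

```python
import pandas as pd
from scipy.spatial import distance

# Sample data retyped from the question.
df = pd.DataFrame({
    "longitude": [-76.250000, -76.166667, -76.083333, -76.000000, -75.916667,
                  -68.166667, -67.666667, -68.166667, -68.083333, -67.500000],
    "latitude": [83.083333] * 5
                + [-55.500000, -55.500000, -55.583333, -55.583333, -55.666667],
    "country": ["China", "unknown", "USA", "India", "unknown",
                "unknown", "UK", "Chile", "Canada", "unknown"],
})

def fill_unknown_vectorized(frame, coords=("longitude", "latitude")):
    """Hypothetical helper: return a copy of frame where each 'unknown'
    country is replaced by the country of the nearest known row."""
    coords = list(coords)
    is_unknown = frame["country"].eq("unknown")
    known = frame.loc[~is_unknown]

    # One distance matrix of shape (n_unknown, n_known) instead of one
    # cdist call per unknown row.
    dists = distance.cdist(
        frame.loc[is_unknown, coords].to_numpy(dtype=float),
        known[coords].to_numpy(dtype=float),
    )
    # Position (within known) of the closest known row for each unknown row.
    nearest = dists.argmin(axis=1)

    result = frame.copy()
    result.loc[is_unknown, "country"] = known["country"].to_numpy()[nearest]
    return result

filled = fill_unknown_vectorized(df)
print(filled)
```

The original df is left untouched; the filled countries should match the output shown in the answer above.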
