简体   繁体   中英

How to calculate the difference of a row's column values against all other rows with multiple values in a dataframe? Iterate for every row

Main objective: Find cities that sell toys the most different from one another (top 10 differentials). For example Los Angeles sells the most Toys 3 and 4 and the city most opposite of that would be Salt Lake City, which sells Toy 9 and 15 the most and Toys 3 and 4 the least.

I have a CSV that I have put in a dataframe.

It has hundreds of rows currently and each row has 15 columns... Example:

City Toy1 Toy2 Toy3 ToyN
Los Angeles 15 20 1 44
Miami 33 2 545 15
Dallas 111 222 545 448
City N 15 555 44 987

So I need Los Angeles to compare Toy1 to all other cities, Toy2, through ToyN. And then so on for each city against the rest of the rows in the dataframe.

I am having trouble structuring this as I need a calculation difference on every column and doing a comparison between each city.

Expected Output: A new column with a difference score for City vs City. Example: |City|Toy1|Toy2|Toy3|ToyN|DiffMiami|Diff Dallas| |----|----|----|----|----|----|----| |Los Angeles|15|20|1|44|-17|15|

I have been trying to use DataFrame.diff() but not sure how to structure to use it in this scenario. Any suggestions would be gladly taken. Thanks.

In my proposed solution, for each pair of cities A,B we calculate sum_i(abs(toy_i(A) - toy_i(B))) where toy_i(A) is the number of toys i sold in city A etc

we report the results as a matrix of cities

This is easiest done in numpy

First we load the data

from io import StringIO
data = StringIO('''
City    Toy1    Toy2    Toy3    ToyN
LosAngeles  15  20  1   44
Miami   33  2   545 15
Dallas  111 222 545 448
CityN   15  555 44  987
''')
df = pd.read_csv(data, sep = '\s+')
df2 = df.set_index('City')
v = df2.values

Then a bit of numpy wizardy, inspired by https://stackoverflow.com/a/46266707/14551426 , to calculate pairwise sum of abs differences, and transforming back into a df

res = np.sum(np.abs(v - v[:, None]),axis=2)
df3 = pd.DataFrame(data = res, index = df2.index, columns = df2.index)
df3

output:

City        LosAngeles  Miami   Dallas  CityN
City                
LosAngeles  0          609      1246    1521
Miami       609        0        731     2044
Dallas      1246       731      0       1469
CityN       1521       2044     1469    0

we see the largest value is for the Miami/CityN combination hence this are the two cities with the largest differences

it would not be too difficult to find the top 10 largest numbers here either:

df3.unstack().sort_values()

produces

City        City      
LosAngeles  LosAngeles       0
Miami       Miami            0
Dallas      Dallas           0
CityN       CityN            0
LosAngeles  Miami          609
Miami       LosAngeles     609
            Dallas         731
Dallas      Miami          731
LosAngeles  Dallas        1246
Dallas      LosAngeles    1246
            CityN         1469
CityN       Dallas        1469
LosAngeles  CityN         1521
CityN       LosAngeles    1521
Miami       CityN         2044
CityN       Miami         2044

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM