How to calculate the difference of a row's column values against all other rows in a dataframe with multiple values? Iterating over each row

Question

Main objective: find the cities whose toy sales differ most from one another (top 10 differentials). For example, Los Angeles sells the most of Toys 3 and 4, and the city most opposite to that would be Salt Lake City, which sells the most of Toys 9 and 15 and the least of Toys 3 and 4.

I have a CSV that I have put in a dataframe.

It currently has hundreds of rows, and each row has 15 columns. Example:

City	Toy1	Toy2	Toy3	ToyN
Los Angeles	15	20	1	44
Miami	33	2	545	15
Dallas	111	222	545	448
City N	15	555	44	987

So I need to compare Los Angeles against every other city on Toy1, Toy2, and so on through ToyN, and then do the same for each city against the rest of the rows in the dataframe.

I am having trouble structuring this, as I need a difference calculation on every column and a comparison between each pair of cities.

Expected output: a new column with a difference score for City vs. City. Example:

|City|Toy1|Toy2|Toy3|ToyN|DiffMiami|Diff Dallas|
|----|----|----|----|----|----|----|
|Los Angeles|15|20|1|44|-17|15|

I have been trying to use DataFrame.diff() but am not sure how to structure it for this scenario. Any suggestions would be gladly taken. Thanks.

Answer 1

In my proposed solution, for each pair of cities A, B we calculate sum_i(abs(toy_i(A) - toy_i(B))), where toy_i(A) is the number of units of toy i sold in city A, and likewise for B.
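To make the score concrete, here is a minimal sketch that evaluates the formula by hand for a single pair of cities, using the sample rows from the question. The helper name `abs_diff_score` is made up for illustration; this quantity is the sum of absolute differences, also known as the Manhattan (L1) distance between the two rows.

```python
# Toy counts copied from the question's sample table (Toy1, Toy2, Toy3, ToyN).
los_angeles = [15, 20, 1, 44]
miami = [33, 2, 545, 15]


def abs_diff_score(a, b):
    """Sum of absolute per-toy differences between two cities (hypothetical helper)."""
    return sum(abs(x - y) for x, y in zip(a, b))


score = abs_diff_score(los_angeles, miami)
print(score)  # 18 + 18 + 544 + 29 = 609
```

The score is symmetric (A vs. B equals B vs. A) and is zero when a city is compared with itself.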

We report the results as a matrix of cities.

This is most easily done in numpy.

First we load the data:

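from io import StringIO

import numpy as np
import pandas as pd

# Sample data from the question, whitespace-separated
data = StringIO('''
City    Toy1    Toy2    Toy3    ToyN
LosAngeles  15  20  1   44
Miami   33  2   545 15
Dallas  111 222 545 448
CityN   15  555 44  987
''')
df = pd.read_csv(data, sep=r'\s+')
df2 = df.set_index('City')  # city names become the row labels
v = df2.values  # 2-D array of toy counts, one row per city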

Then a bit of numpy wizardry, inspired by https://stackoverflow.com/a/46266707/14551426 , to calculate the pairwise sum of absolute differences and transform the result back into a dataframe:

res = np.sum(np.abs(v - v[:, None]),axis=2)
df3 = pd.DataFrame(data = res, index = df2.index, columns = df2.index)
df3

Output:

City        LosAngeles  Miami   Dallas  CityN
City                
LosAngeles  0          609      1246    1521
Miami       609        0        731     2044
Dallas      1246       731      0       1469
CityN       1521       2044     1469    0

We see that the largest value is for the Miami/CityN combination, hence these are the two cities with the largest difference.
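The broadcasting step in the one-liner can be unpacked as follows. This is only an illustrative sketch: it rebuilds `v` from the sample values, inspects the intermediate shape, and checks the vectorized result against an explicit double loop (the `looped` array is added here purely for comparison).

```python
import numpy as np

# Sample values from the question; rows are LosAngeles, Miami, Dallas, CityN.
v = np.array([
    [15, 20, 1, 44],
    [33, 2, 545, 15],
    [111, 222, 545, 448],
    [15, 555, 44, 987],
])

# v[:, None] has shape (4, 1, 4); subtracting it from v (shape (4, 4))
# broadcasts to shape (4, 4, 4), where diff[i, j] == v[j] - v[i].
diff = v - v[:, None]
res = np.abs(diff).sum(axis=2)

# Reference implementation with explicit loops, for comparison only.
n = len(v)
looped = np.zeros((n, n), dtype=v.dtype)
for i in range(n):
    for j in range(n):
        looped[i, j] = np.abs(v[i] - v[j]).sum()

print(diff.shape)                   # (4, 4, 4)
print(np.array_equal(res, looped))  # True
```

The intermediate array holds n × n × (number of toy columns) values, which stays small for a few hundred cities and 15 columns.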

It would not be too difficult to find the top 10 largest numbers here either:

df3.unstack().sort_values()

produces (sorted in ascending order, so the largest differences appear last, and each pair is listed twice):

City        City      
LosAngeles  LosAngeles       0
Miami       Miami            0
Dallas      Dallas           0
CityN       CityN            0
LosAngeles  Miami          609
Miami       LosAngeles     609
            Dallas         731
Dallas      Miami          731
LosAngeles  Dallas        1246
Dallas      LosAngeles    1246
            CityN         1469
CityN       Dallas        1469
LosAngeles  CityN         1521
CityN       LosAngeles    1521
Miami       CityN         2044
CityN       Miami         2044
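Since the matrix is symmetric, each pair shows up twice in the listing above, along with the zero self-comparisons. One possible way to list each pair once, largest first, is to take only the entries above the diagonal. The sketch below rebuilds `df3` from the printed matrix so it runs on its own; the names `pairs`, `top_n`, `city_a` and `city_b` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

cities = ["LosAngeles", "Miami", "Dallas", "CityN"]
# Pairwise score matrix copied from the output above.
df3 = pd.DataFrame(
    [[0, 609, 1246, 1521],
     [609, 0, 731, 2044],
     [1246, 731, 0, 1469],
     [1521, 2044, 1469, 0]],
    index=cities, columns=cities,
)

# Indices strictly above the diagonal: each unordered pair exactly once.
i, j = np.triu_indices(len(df3), k=1)
pairs = pd.Series(
    df3.to_numpy()[i, j],
    index=pd.MultiIndex.from_arrays(
        [df3.index[i], df3.columns[j]], names=["city_a", "city_b"]
    ),
)

top_n = 10  # the question asks for the top 10; the sample has only 6 pairs
top_pairs = pairs.nlargest(top_n)
print(top_pairs)
```

With the sample data, the first entry is the Miami/CityN pair with a score of 2044, matching the largest value in the matrix.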
