How to calculate the difference of a row's column values against all other rows with multiple values in a dataframe? Iterate for every row
Main objective: Find the cities whose toy sales differ the most from one another (top 10 differentials). For example, Los Angeles sells the most of Toys 3 and 4, and the city most opposite to that would be Salt Lake City, which sells Toys 9 and 15 the most and Toys 3 and 4 the least.
I have a CSV that I have put in a dataframe.
It currently has hundreds of rows and each row has 15 columns. Example:
|City|Toy1|Toy2|Toy3|ToyN|
|----|----|----|----|----|
|Los Angeles|15|20|1|44|
|Miami|33|2|545|15|
|Dallas|111|222|545|448|
|City N|15|555|44|987|
So I need to compare Los Angeles against every other city on Toy1, Toy2, through ToyN, and then do the same for each city against the rest of the rows in the dataframe.
I am having trouble structuring this, as I need to calculate a difference on every column and compare every pair of cities.
Expected Output: A new column with a difference score for City vs City. Example:
|City|Toy1|Toy2|Toy3|ToyN|DiffMiami|Diff Dallas|
|----|----|----|----|----|----|----|
|Los Angeles|15|20|1|44|-17|15|
I have been trying to use DataFrame.diff() but I am not sure how to structure it for this scenario. Any suggestions would be gladly taken.
Thanks.
In my proposed solution, for each pair of cities A, B we calculate sum_i(abs(toy_i(A) - toy_i(B))), where toy_i(A) is the number of units of toy i sold in city A, and so on.
We report the results as a matrix of cities.
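As a quick sanity check of the formula, here is a minimal sketch that scores a single pair by hand, using the Los Angeles and Miami rows from the question. The variable names are just for illustration.

```python
import numpy as np

# Toy1, Toy2, Toy3, ToyN counts taken from the example table
los_angeles = np.array([15, 20, 1, 44])
miami = np.array([33, 2, 545, 15])

# Absolute difference per toy: 18, 18, 544, 29
per_toy = np.abs(los_angeles - miami)

# Sum over all toys gives the difference score for this pair
score = per_toy.sum()
print(score)  # 609
```

The score is symmetric, so Miami vs Los Angeles gives the same number, and any city compared with itself scores 0.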
This is most easily done in numpy.
First we load the data:
import numpy as np
import pandas as pd
from io import StringIO
data = StringIO('''
City Toy1 Toy2 Toy3 ToyN
LosAngeles 15 20 1 44
Miami 33 2 545 15
Dallas 111 222 545 448
CityN 15 555 44 987
''')
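# Columns are separated by runs of whitespace
df = pd.read_csv(data, sep=r'\s+')
# Index by city so that only the numeric toy counts remain as values
df2 = df.set_index('City')
v = df2.values  # 2-D array of shape (n_cities, n_toys)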
Then a bit of numpy wizardry, inspired by https://stackoverflow.com/a/46266707/14551426, to calculate the pairwise sum of absolute differences and transform the result back into a DataFrame:
res = np.sum(np.abs(v - v[:, None]),axis=2)
df3 = pd.DataFrame(data = res, index = df2.index, columns = df2.index)
df3
Output:
City LosAngeles Miami Dallas CityN
City
LosAngeles 0 609 1246 1521
Miami 609 0 731 2044
Dallas 1246 731 0 1469
CityN 1521 2044 1469 0
We see the largest value is for the Miami/CityN combination, hence these are the two cities with the largest difference.
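If the broadcasting one-liner looks opaque, the sketch below spells out the shapes involved and cross-checks the result against a plain double loop. It rebuilds the sample values inline so it runs by itself; the names `diff` and `looped` are only illustrative.

```python
import numpy as np

# One row per city, one column per toy (values from the example above)
v = np.array([
    [15, 20, 1, 44],       # LosAngeles
    [33, 2, 545, 15],      # Miami
    [111, 222, 545, 448],  # Dallas
    [15, 555, 44, 987],    # CityN
])

# v has shape (4, 4) and v[:, None] has shape (4, 1, 4).
# Broadcasting the subtraction gives shape (4, 4, 4), where
# diff[i, j, k] = v[j, k] - v[i, k]
diff = v - v[:, None]

# Summing the absolute values over the toy axis leaves a city-by-city matrix
res = np.abs(diff).sum(axis=2)

# The same matrix built with an explicit loop over every pair of cities
n = len(v)
looped = np.zeros((n, n), dtype=v.dtype)
for a in range(n):
    for b in range(n):
        looped[a, b] = np.abs(v[a] - v[b]).sum()

print(diff.shape)
print(np.array_equal(res, looped))
```

The intermediate array holds n_cities * n_cities * n_toys elements, which is small for a few hundred cities and 15 toys.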
It would not be too difficult to find the top 10 largest numbers here either:
df3.unstack().sort_values()
produces
City City
LosAngeles LosAngeles 0
Miami Miami 0
Dallas Dallas 0
CityN CityN 0
LosAngeles Miami 609
Miami LosAngeles 609
Dallas 731
Dallas Miami 731
LosAngeles Dallas 1246
Dallas LosAngeles 1246
CityN 1469
CityN Dallas 1469
LosAngeles CityN 1521
CityN LosAngeles 1521
Miami CityN 2044
CityN Miami 2044
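Note that the sorted series above lists every pair twice (A/B and B/A) and also includes the zero self-comparisons. One possible way to get each pair only once, sketched below, is to take the strict upper triangle of the matrix with np.triu_indices. The names `pairs`, `top_n` and `top_pairs` are my own, and the data is rebuilt inline so the snippet runs by itself.

```python
import numpy as np
import pandas as pd

# Same sample data as above, indexed by city
df2 = pd.DataFrame(
    {"Toy1": [15, 33, 111, 15],
     "Toy2": [20, 2, 222, 555],
     "Toy3": [1, 545, 545, 44],
     "ToyN": [44, 15, 448, 987]},
    index=pd.Index(["LosAngeles", "Miami", "Dallas", "CityN"], name="City"),
)
v = df2.values
res = np.abs(v - v[:, None]).sum(axis=2)

# Row/column indices of the strict upper triangle:
# every unordered pair exactly once, diagonal excluded
i, j = np.triu_indices(len(df2), k=1)
pairs = pd.DataFrame({
    "city_a": df2.index[i],
    "city_b": df2.index[j],
    "diff": res[i, j],
})

top_n = 10  # how many pairs to keep (illustrative parameter)
top_pairs = (pairs.sort_values("diff", ascending=False)
                  .head(top_n)
                  .reset_index(drop=True))
print(top_pairs)
```

With only four sample cities there are six pairs, so all six are returned; on the full dataset `head(top_n)` would keep the ten most different pairs.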