
How to calculate the difference of a row's column values against all other rows with multiple values in a dataframe? Iterate for every row

Main objective: find the cities whose toy sales differ most from one another (the top 10 differentials). For example, Los Angeles sells Toys 3 and 4 the most; the city most opposite to that would be Salt Lake City, which sells Toys 9 and 15 the most and Toys 3 and 4 the least.

I have a CSV that I have loaded into a dataframe.

It currently has hundreds of rows, and each row has 15 columns. Example:

|City|Toy1|Toy2|Toy3|ToyN|
|----|----|----|----|----|
|Los Angeles|15|20|1|44|
|Miami|33|2|545|15|
|Dallas|111|222|545|448|
|City N|15|555|44|987|

So I need to compare Los Angeles against every other city on Toy1, Toy2, and so on through ToyN, and then do the same for each remaining city against the rest of the rows in the dataframe.

I am having trouble structuring this, because I need a difference calculated on every column and a comparison between every pair of cities.

Expected output: a new column with a difference score for each city-versus-city comparison. Example:

|City|Toy1|Toy2|Toy3|ToyN|DiffMiami|Diff Dallas|
|----|----|----|----|----|----|----|
|Los Angeles|15|20|1|44|-17|15|

I have been trying to use DataFrame.diff(), but I am not sure how to structure it for this scenario. Any suggestions would be gladly taken. Thanks.

In my proposed solution, for each pair of cities A, B we calculate sum_i(abs(toy_i(A) - toy_i(B))), where toy_i(A) is the number of units of toy i sold in city A, and so on.
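To make the score concrete, here is a small sketch that works it out by hand for a single pair, using the Los Angeles and Miami rows from the question's sample table. The variable names (`los_angeles`, `miami`, `per_toy`, `score`) are only illustrative and do not appear in the rest of the answer.

```python
import numpy as np

# Toy counts from the question's sample table, in the order Toy1, Toy2, Toy3, ToyN.
los_angeles = np.array([15, 20, 1, 44])
miami = np.array([33, 2, 545, 15])

# Per-toy absolute differences: |15-33|, |20-2|, |1-545|, |44-15|.
per_toy = np.abs(los_angeles - miami)
print(per_toy.tolist())  # [18, 18, 544, 29]

# The difference score for the pair is the sum over all toys.
score = per_toy.sum()
print(score)  # 609
```

A larger score means the two cities' sales profiles are further apart; a city compared with itself always scores 0.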

We report the results as a city-by-city matrix.

This is easiest done in numpy.

First we load the data:

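import numpy as np
import pandas as pd
from io import StringIO

# Sample data from the question; city names are written without spaces
# so that whitespace can serve as the column separator.
data = StringIO('''
City    Toy1    Toy2    Toy3    ToyN
LosAngeles  15  20  1   44
Miami   33  2   545 15
Dallas  111 222 545 448
CityN   15  555 44  987
''')
df = pd.read_csv(data, sep=r'\s+')
df2 = df.set_index('City')  # city names become the row labels
v = df2.values              # 2-D array of shape (cities, toys)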

Then a bit of numpy wizardry, inspired by https://stackoverflow.com/a/46266707/14551426 , to calculate the pairwise sums of absolute differences and transform the result back into a dataframe:

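# v[:, None] adds an axis, so the subtraction broadcasts to (cities, cities, toys)
res = np.sum(np.abs(v - v[:, None]), axis=2)
# Label both rows and columns with the city names
df3 = pd.DataFrame(data=res, index=df2.index, columns=df2.index)
df3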

Output:

City        LosAngeles  Miami   Dallas  CityN
City                
LosAngeles  0          609      1246    1521
Miami       609        0        731     2044
Dallas      1246       731      0       1469
CityN       1521       2044     1469    0
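The broadcasting step above can be unpacked as follows. This is a sketch for checking the idea rather than part of the original answer: it rebuilds `v` from the sample values so that it runs on its own, and `diff` and `loop_res` are illustrative names introduced here.

```python
import numpy as np

# Same sample values as `v` above (rows: LosAngeles, Miami, Dallas, CityN).
v = np.array([
    [15, 20, 1, 44],
    [33, 2, 545, 15],
    [111, 222, 545, 448],
    [15, 555, 44, 987],
])

# v has shape (4, 4): (cities, toys). v[:, None] has shape (4, 1, 4), so the
# subtraction broadcasts to (4, 4, 4): (city, city, toy).
diff = v - v[:, None]
print(diff.shape)  # (4, 4, 4)

# Summing |diff| over the toy axis leaves one score per pair of cities.
res = np.abs(diff).sum(axis=2)

# Explicit double loop computing the same scores, for comparison.
n = len(v)
loop_res = np.zeros((n, n), dtype=v.dtype)
for a in range(n):
    for b in range(n):
        loop_res[a, b] = np.abs(v[a] - v[b]).sum()

print(np.array_equal(res, loop_res))  # True
```

The broadcast version holds a (cities, cities, toys) array in memory, which is fine for a few hundred cities and 15 toys but grows quadratically with the number of rows.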

We see that the largest value is for the Miami/CityN combination, so these are the two cities with the largest difference.

It would not be too difficult to find the top 10 largest numbers here either. Sorting the unstacked matrix lists every pair in ascending order, so the largest differences are at the bottom:

df3.unstack().sort_values()

produces

City        City      
LosAngeles  LosAngeles       0
Miami       Miami            0
Dallas      Dallas           0
CityN       CityN            0
LosAngeles  Miami          609
Miami       LosAngeles     609
            Dallas         731
Dallas      Miami          731
LosAngeles  Dallas        1246
Dallas      LosAngeles    1246
            CityN         1469
CityN       Dallas        1469
LosAngeles  CityN         1521
CityN       LosAngeles    1521
Miami       CityN         2044
CityN       Miami         2044
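Because the matrix is symmetric, the listing above shows each pair twice, plus the zero self-comparisons. One possible way to keep each pair once and rank them from largest to smallest is sketched below. It rebuilds `df3` from the values shown earlier so that it runs standalone; `pairs`, `top_pairs`, `top_n`, and the column names `city_a`, `city_b`, `score` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The pairwise score matrix from above, rebuilt so this snippet is self-contained.
cities = ["LosAngeles", "Miami", "Dallas", "CityN"]
res = np.array([
    [0, 609, 1246, 1521],
    [609, 0, 731, 2044],
    [1246, 731, 0, 1469],
    [1521, 2044, 1469, 0],
])
df3 = pd.DataFrame(res, index=cities, columns=cities)

# Indices of the strict upper triangle: every unordered pair once, no diagonal.
rows, cols = np.triu_indices(len(df3), k=1)

pairs = pd.DataFrame({
    "city_a": df3.index.to_numpy()[rows],
    "city_b": df3.columns.to_numpy()[cols],
    "score": df3.to_numpy()[rows, cols],
})

top_n = 10  # illustrative cutoff; the sample data only has 6 distinct pairs
top_pairs = pairs.nlargest(top_n, "score").reset_index(drop=True)
print(top_pairs)
```

With the sample data the first row of `top_pairs` is the Miami/CityN pair with a score of 2044, matching the largest entry in the matrix.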


