Finding unknown values in Python

Question

I have a dataset with 500 restaurant orders and their totals. I want to identify the outliers in the dataset, decide whether each one is a valid data point or a wrong value, and then remove the invalid ones.

The problem is that I only have the total price of each order, the item names, and the quantities ordered. I was wondering if it is possible to get the price of each item.

Each item in the dictionary represents one order. The key is the total price, and the value is a list of tuples; each tuple holds the item name and the quantity ordered.

Sample of my dataset in dictionary format (I also have it as two columns in a dataframe):

{1215.5: [('Shrimp', 10), ('Fish&Chips', 6), ('Salmon', 8), ('Pasta', 5)],
 1230.0: [('Shrimp', 10), ('Salmon', 10), ('Fish&Chips', 8)],
 1234.0: [('Salmon', 9), ('Fish&Chips', 3), ('Pasta', 8), ('Shrimp', 10)],
 1292.5: [('Pasta', 7), ('Salmon', 9), ('Fish&Chips', 7), ('Shrimp', 9)],
 1301.5: [('Pasta', 5), ('Shrimp', 9), ('Salmon', 8), ('Fish&Chips', 10)],
 1314.5: [('Shrimp', 10), ('Pasta', 5), ('Fish&Chips', 10), ('Salmon', 7)],
 1343.5: [('Shrimp', 8), ('Fish&Chips', 10), ('Salmon', 9), ('Pasta', 7)]}

My desired output is the price of each item. With that, I hope to be able to decide whether a total is a valid data point or an outlier.

I tried taking the third order and storing its items in a list A:

[('Salmon', 9), ('Fish&Chips', 3), ('Pasta', 8), ('Shrimp', 10)]

And the total price of these items, B:

[1234.0]

Then I tried converting the first list to an array:

    A=np.array(lst)

The output:

array([['Salmon', '9'],
       ['Fish&Chips', '3'],
       ['Pasta', '8'],
       ['Shrimp', '10']], dtype='<U10')

The shapes:

A.shape
(4,2)
B.shape
(1,)

Then I applied the function:

X, _, _, _ = np.linalg.lstsq(A, B)

but it returns an error message:

LinAlgError: Incompatible dimensions

I know that m (the number of rows) has to be the same in A and B for the function to work, but I am not sure how to change the shape of A.

Any input is appreciated. Thank you.

Answer 1

A possible solution is to construct a (possibly overdetermined) system of linear equations and solve it. For example, the first order becomes 1215.5 = 10*Shrimp + 6*Fish&Chips + 8*Salmon + 5*Pasta, where each item name stands for its unknown unit price.
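
Written out in matrix form, each order contributes one row of quantities and one entry of the totals vector. The block below is only a sketch of the first three sample orders; the column order (Shrimp, Fish&Chips, Salmon, Pasta) and the symbols p are chosen here for illustration. It also shows why a single order cannot work: one row gives one equation in four unknowns.

```latex
\begin{equation}
\underbrace{\begin{pmatrix}
10 & 6 & 8 & 5 \\
10 & 8 & 10 & 0 \\
10 & 3 & 9 & 8 \\
\vdots & \vdots & \vdots & \vdots
\end{pmatrix}}_{A}
\begin{pmatrix}
p_{\text{Shrimp}} \\ p_{\text{Fish\&Chips}} \\ p_{\text{Salmon}} \\ p_{\text{Pasta}}
\end{pmatrix}
=
\underbrace{\begin{pmatrix}
1215.5 \\ 1230.0 \\ 1234.0 \\ \vdots
\end{pmatrix}}_{B}
\end{equation}
```

The zero in the second row is the missing Pasta in that order, and the quantities must be numbers, not the strings that np.array produces from the raw tuples.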

Assuming that your dictionary is named d, the matrix A of the system is given by:

import numpy as np
import pandas as pd
A = pd.concat([pd.DataFrame(d[x]).set_index(0) for x in d], axis=1)\
                 .fillna(0).T

(I strongly suggest that you do not use a dict as the storage container.) The vector B is the list of keys:

B = list(d.keys())
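
One likely reason for the remark about dict, shown here as a small sketch: with the total as the key, two orders that share a total cannot both be stored. The records below are made up for illustration and are not from the dataset.

```python
# Hypothetical records: (total, [(item, quantity), ...]).
# The first two orders happen to share the same total.
records = [
    (100.0, [('Shrimp', 1), ('Pasta', 1)]),
    (100.0, [('Salmon', 2)]),
    (150.0, [('Fish&Chips', 3)]),
]

# Keyed by total, the second order silently overwrites the first.
as_dict = dict(records)

print(len(records))   # number of orders kept in the list
print(len(as_dict))   # number of orders that survive in the dict

# A list of pairs keeps every order, and B is still easy to extract.
B_from_list = [total for total, _ in records]
```

With 500 orders, repeated totals are plausible, so a list of pairs or the two-column dataframe is a safer starting point.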

The answer is:

X, _, _, _ = np.linalg.lstsq(A, B, rcond=None)
# array([35. , 27.5, 41. , 54. ])

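You only need the first element of the returned tuple. The prices follow the order of A.columns; the values shown above match the alphabetical order Fish&Chips, Pasta, Salmon, Shrimp.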
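
To connect this back to the outlier question, here is a minimal NumPy-only sketch that fits the prices and then compares each recorded total with the total implied by those prices. The helper names (quantities, is_consistent), the tolerance, and the final test order are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

# Sample orders from the question: total -> [(item, quantity), ...]
d = {
    1215.5: [('Shrimp', 10), ('Fish&Chips', 6), ('Salmon', 8), ('Pasta', 5)],
    1230.0: [('Shrimp', 10), ('Salmon', 10), ('Fish&Chips', 8)],
    1234.0: [('Salmon', 9), ('Fish&Chips', 3), ('Pasta', 8), ('Shrimp', 10)],
    1292.5: [('Pasta', 7), ('Salmon', 9), ('Fish&Chips', 7), ('Shrimp', 9)],
    1301.5: [('Pasta', 5), ('Shrimp', 9), ('Salmon', 8), ('Fish&Chips', 10)],
    1314.5: [('Shrimp', 10), ('Pasta', 5), ('Fish&Chips', 10), ('Salmon', 7)],
    1343.5: [('Shrimp', 8), ('Fish&Chips', 10), ('Salmon', 9), ('Pasta', 7)],
}

# Fix a column order for the unknown prices (alphabetical here).
items = sorted({name for order in d.values() for name, _ in order})
col = {name: j for j, name in enumerate(items)}

def quantities(order):
    """Turn [(item, qty), ...] into one numeric row of the system."""
    row = np.zeros(len(items))
    for name, qty in order:
        row[col[name]] += qty
    return row

A = np.array([quantities(order) for order in d.values()])  # one row per order
B = np.array(list(d.keys()))                               # recorded totals

X, *_ = np.linalg.lstsq(A, B, rcond=None)
prices = dict(zip(items, X))

# Residual per order: recorded total minus the total implied by the prices.
residuals = B - A @ X

def is_consistent(total, order, tol=0.01):
    """tol is an assumed tolerance in currency units, not a recommended value."""
    return abs(total - quantities(order) @ X) <= tol

print(prices)
print(residuals)
print(is_consistent(1234.0, d[1234.0]))
print(is_consistent(500.0, [('Shrimp', 2), ('Pasta', 1)]))  # hypothetical order
```

Note the caveat: if wrong totals are part of the data used for the fit, they distort the estimated prices and the error spreads across all residuals, so fitting on orders you trust (or using a robust regression) may be needed on the full 500 orders.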

Solution 1
1 2019-09-29 07:01:42
