How to store MongoDB's nested documents in pandas without duplication
I am reading data from MongoDB and storing it in a pandas DataFrame for further exploratory analysis and machine learning. The MongoDB documents look like this:
{
"user_id" : "user_9",
"order_id" : "order_9",
"meals" : 5,
"order_area" : "London",
"dish" : [
{
"dish_id" : "012" ,
"dish_name" : "ABC",
"dish_type" : "Non-Veg",
"dish_price" : 135,
"dish_quantity" : 2,
"ratings" : 4,
"reviews" : "blah blah blah",
"coupon_type" : "Rs 20 off"
},
{
"dish_id" : "013" ,
"dish_name" : "XYZ",
"dish_type" : "Non-Veg",
"dish_price" : 125,
"dish_quantity" : 3,
"ratings" : 4,
"reviews" : "blah blah blah",
"coupon_type" : "Rs 20 off"
}
]
}
Once I have the data in Python, I use json_normalize to split out the dish-related attributes while loading it into a DataFrame:
df = json_normalize(db.dataset2.find(), 'dish',
                    ['_id', 'user_id', 'order_id', 'order_time', 'meals', 'order_area'])
This gives me the following pandas DataFrame:
coupon_type dish_id dish_name dish_price dish_quantity
0 Rs 20 off 012 ABC 135 2
1 Rs 20 off 013 XYZ 125 3
ratings reviews coupon_type user_id order_id meals order_area
0 4 blah blah blah Rs 20 off 9 9 5 London
1 4 blah blah blah Rs 20 off 9 9 5 London
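The behaviour above can be reproduced without a MongoDB connection. The following is a minimal sketch, assuming pandas 1.0 or later (where json_normalize is available as pd.json_normalize) and an in-memory copy of the sample document trimmed to a few dish fields; the names doc, meta and df are illustrative only.

```python
import pandas as pd

# In-memory stand-in for one document returned by db.dataset2.find()
# (trimmed to a few dish fields for brevity).
doc = {
    "user_id": "user_9",
    "order_id": "order_9",
    "meals": 5,
    "order_area": "London",
    "dish": [
        {"dish_id": "012", "dish_name": "ABC", "dish_price": 135, "dish_quantity": 2},
        {"dish_id": "013", "dish_name": "XYZ", "dish_price": 125, "dish_quantity": 3},
    ],
}

# Order-level fields that json_normalize copies onto every dish row.
meta = ["user_id", "order_id", "meals", "order_area"]

# One output row per element of "dish"; the meta fields are repeated on each row.
df = pd.json_normalize([doc], record_path="dish", meta=meta)
print(df)

# Each meta column holds a single distinct value across both rows.
print(df[meta].nunique())
```

Each element of the "dish" array becomes its own row, so the order-level fields are necessarily copied once per dish.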
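The problem is that the order-level data (user_id, order_id, meals, _id and order_area) is duplicated on every row. Is there another way to store the data in a DataFrame without this duplication?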
You are probably looking for a MultiIndex, which at least visually avoids the duplication (see the docs):
df = json_normalize(data, 'dish', ['user_id', 'order_id', 'meals', 'order_area'])
df = df.set_index(['user_id','order_id', 'meals', 'order_area'])
coupon_type dish_id dish_name dish_price \
user_id order_id meals order_area
user_9 order_9 5 London Rs 20 off 012 ABC 135
Rs 20 off 013 XYZ 125
dish_quantity dish_type ratings \
user_id order_id meals order_area
user_9 order_9 5 London 2 Non-Veg 4
3 Non-Veg 4
reviews
user_id order_id meals order_area
user_9 order_9 5 London blah blah blah
blah blah blah
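As a self-contained illustration of the MultiIndex approach, here is a minimal sketch assuming pandas 1.0 or later and an in-memory copy of the sample document with a reduced set of dish fields; the names doc, keys, flat and indexed are illustrative only. It also shows why the duplication is avoided only visually: the index still has one entry per dish row, and reset_index brings the repeated columns back.

```python
import pandas as pd

# In-memory stand-in for the sample MongoDB document (reduced dish fields).
doc = {
    "user_id": "user_9",
    "order_id": "order_9",
    "meals": 5,
    "order_area": "London",
    "dish": [
        {"dish_id": "012", "dish_name": "ABC", "dish_price": 135},
        {"dish_id": "013", "dish_name": "XYZ", "dish_price": 125},
    ],
}

keys = ["user_id", "order_id", "meals", "order_area"]

# Flatten: one row per dish, order-level fields repeated on each row.
flat = pd.json_normalize([doc], record_path="dish", meta=keys)

# Move the order-level fields into a MultiIndex; when printed, pandas
# leaves repeated index labels blank, which hides the duplication.
indexed = flat.set_index(keys)
print(indexed)

# The index still has one entry per row, but only one distinct order key.
print(len(indexed.index), len(indexed.index.unique()))

# reset_index restores the flat layout with the repeated columns.
restored = indexed.reset_index()
print(restored)
```

If the goal is to reduce storage rather than tidy the display, the MultiIndex alone does not achieve that, since every row still carries its full key.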