Extracting values from existing dataset
I need to gather information from an existing dataset. The dataset looks as follows:
Source Target Label_S Weight Prop_1 Prop_2 Mer_1 Mer_2
car airplane 0.5 0.2 1 0 0 0
car train 0.5 0.5 1 1 0 1
car bike 0.5 0.2 1 1 0 0
bike motorbike 1 0.7 1 1 0 1
bike car 1 0.2 1 1 0 0
airplane car -1 0.2 0 1 0 0
train car 1 0.5 1 1 1 0
motorbike car 1 0.7 1 1 1 0
motorbike toy 1 0.6 1 0 1 1
Label_S, Prop_1 and Mer_1 are Source's properties; Prop_2 and Mer_2 are Target's properties. I am trying to create a list of unique nodes from both Source and Target, including their properties; something like this:
Node       Label_S  Property  Merchandising
car        0.5      1         0
airplane   -1       0         0
train      1        1         1
bike       1        1         0
motorbike  1        1         1
toy                 0         1
I had no problem creating the list including all the nodes:
source = df['Source'].unique().tolist()
target = df['Target'].unique().tolist()
all_nodes=list(source + target)
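Note that concatenating the two lists keeps a node twice when it occurs in both columns, so all_nodes is not yet unique. A minimal sketch of one way to deduplicate, assuming df has the Source and Target columns shown above (the two-row frame here is only illustrative):

```python
import pandas as pd

# Illustrative subset of the dataset: each node appears in both columns.
df = pd.DataFrame({
    "Source": ["car", "bike"],
    "Target": ["bike", "car"],
})

# Stack both columns, then drop repeats while keeping first-seen order.
all_nodes = pd.concat([df["Source"], df["Target"]]).unique().tolist()
print(all_nodes)  # ['car', 'bike']
```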
but I do not understand how to get information from the property columns based on the Source / Target information.
I think I should first split the dataframe into two dataframes: one with Source plus the properties of Source; the other with the Target elements plus the properties of Target. Once I have this information, it might be good to append the two dataframes and remove duplicates under the column Node. But I feel that something is wrong: for example, I have Label_S, which is a property of Source and not of Target...
I'm not sure I understand correctly, but you can first create two distinct dataframes for source and target and aggregate the properties in lists:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source").agg(list)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target").agg(list)
print(df_s)
print(df_t)
Output:
Label_S Prop_1 Mer_1
Source
airplane [-1.0] [0] [0]
bike [1.0, 1.0] [1, 1] [0, 0]
car [0.5, 0.5, 0.5] [1, 1, 1] [0, 0, 0]
motorbike [1.0, 1.0] [1, 1] [1, 1]
train [1.0] [1] [1]
Weight Prop_2 Mer_2
Target
airplane [0.2] [0] [0]
bike [0.2] [1] [0]
car [0.2, 0.2, 0.5, 0.7] [1, 1, 1, 1] [0, 0, 0, 0]
motorbike [0.7] [1] [1]
toy [0.6] [0] [1]
train [0.5] [1] [1]
You can aggregate the properties differently to keep only one value per node (e.g. max), then merge your dataframes:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source", as_index=False).agg("max")
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target", as_index=False).agg("max")
df_s.columns = ["Node", "Label_S", "Property", "Merchandising"]
df_t.columns = ["Node", "Weight", "Property", "Merchandising"]
print(df_s.merge(df_t, how="outer").set_index("Node"))
Output:
Label_S Property Merchandising Weight
Node
airplane -1.0 0 0 0.2
bike 1.0 1 0 0.2
car 0.5 1 0 0.7
motorbike 1.0 1 1 0.7
train 1.0 1 1 0.5
toy NaN 0 1 0.6
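One thing to keep in mind: when merge is called without on, pandas joins on every column name the two frames share, which here is Node, Property and Merchandising. That gives one row per node only because each node has the same property values as a source and as a target. The sketch below uses hypothetical data in which the two sides disagree, to show the difference between the default keys and joining on Node alone:

```python
import pandas as pd

# Hypothetical data: "car" has Property 1 as a source but 0 as a target.
left = pd.DataFrame({"Node": ["car"], "Label_S": [0.5], "Property": [1]})
right = pd.DataFrame({"Node": ["car"], "Weight": [0.2], "Property": [0]})

# Default keys are all shared columns (Node and Property): no match, two rows.
by_shared = left.merge(right, how="outer")

# Joining on Node only keeps one row; the conflicting column gets suffixes.
by_node = left.merge(right, on="Node", how="outer", suffixes=("_src", "_tgt"))

print(len(by_shared), len(by_node))  # 2 1
```

If duplicate nodes show up in the merged result on real data, conflicting property values between the source and target sides are a likely cause.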
If you want to exclude the Weight column:
print(df_s.merge(df_t[["Node", "Property", "Merchandising"]], how="outer").set_index("Node"))
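The split-and-append idea from the question also works, and the Label_S concern goes away because nodes that only appear as a target simply end up with a missing Label_S. The following is a sketch rather than a definitive solution: it assumes a node's properties are the same in every row where it appears, and it takes the first non-missing value per node with groupby(...).first(). The names src, tgt and nodes are chosen for illustration.

```python
import pandas as pd

# The dataset from the question.
df = pd.DataFrame({
    "Source": ["car", "car", "car", "bike", "bike", "airplane",
               "train", "motorbike", "motorbike"],
    "Target": ["airplane", "train", "bike", "motorbike", "car", "car",
               "car", "car", "toy"],
    "Label_S": [0.5, 0.5, 0.5, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0],
    "Weight": [0.2, 0.5, 0.2, 0.7, 0.2, 0.2, 0.5, 0.7, 0.6],
    "Prop_1": [1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Prop_2": [0, 1, 1, 1, 1, 1, 1, 1, 0],
    "Mer_1": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Mer_2": [0, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Source side: node name plus the three source properties.
src = df[["Source", "Label_S", "Prop_1", "Mer_1"]].rename(
    columns={"Source": "Node", "Prop_1": "Property", "Mer_1": "Merchandising"}
)
# Target side: node name plus the two target properties (no Label_S here).
tgt = df[["Target", "Prop_2", "Mer_2"]].rename(
    columns={"Target": "Node", "Prop_2": "Property", "Mer_2": "Merchandising"}
)

# Append both sides, then keep the first non-missing value per node.
# sort=False keeps nodes in order of first appearance.
nodes = (
    pd.concat([src, tgt], ignore_index=True)
    .groupby("Node", sort=False)
    .first()
)
print(nodes)
```

Nodes such as toy, which never occur in Source, get NaN for Label_S, which matches the empty cell in the desired output.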