Extracting values from existing dataset

I need to gather information from an existing dataset. The dataset looks as follows:

Source      Target      Label_S  Weight  Prop_1  Prop_2  Mer_1  Mer_2
car         airplane    0.5      0.2     1       0       0      0
car         train       0.5      0.5     1       1       0      1
car         bike        0.5      0.2     1       1       0      0
bike        motorbike   1        0.7     1       1       0      1
bike        car         1        0.2     1       1       0      0
airplane    car         -1       0.2     0       1       0      0
train       car         1        0.5     1       1       1      0
motorbike   car         1        0.7     1       1       1      0
motorbike   toy         1        0.6     1       0       1      1

Label_S, Prop_1 and Mer_1 are properties of Source; Prop_2 and Mer_2 are properties of Target. I am trying to create a list of unique nodes from both Source and Target, including their properties; something like this:

Node       Label_S  Property   Merchandising
car        0.5        1             0
airplane   -1         0             0
train      1          1             1
bike       1          1             0 
motorbike  1          1             1
toy                   0             1

I had no problem creating the list of all the nodes:

source = df['Source'].unique().tolist()
target = df['Target'].unique().tolist()
# a node can appear in both columns, so drop repeats while keeping order
all_nodes = list(dict.fromkeys(source + target))

but I don't understand how to pull the values from the property columns depending on whether a node appears as Source or as Target.

I think I should first split the dataframe into two: one with Source plus the properties of Source; the other with Target plus the properties of Target. Once I have those, it might work to append the two dataframes and remove duplicates under the column Node. But I feel that something is wrong: for example, Label_S is a property of Source and not of Target...
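
The split-and-append plan described above can be sketched roughly as follows. This is only an illustration: the dataframe is rebuilt from the sample table, the names `src`, `tgt`, `stacked` and `nodes` are made up for the example, and it assumes that a node with no Label_S (one that only ever appears as a Target) should simply get NaN there.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Source": ["car", "car", "car", "bike", "bike",
               "airplane", "train", "motorbike", "motorbike"],
    "Target": ["airplane", "train", "bike", "motorbike", "car",
               "car", "car", "car", "toy"],
    "Label_S": [0.5, 0.5, 0.5, 1, 1, -1, 1, 1, 1],
    "Weight": [0.2, 0.5, 0.2, 0.7, 0.2, 0.2, 0.5, 0.7, 0.6],
    "Prop_1": [1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Prop_2": [0, 1, 1, 1, 1, 1, 1, 1, 0],
    "Mer_1": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Mer_2": [0, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Source-side view: node name plus the columns that describe the source.
src = df[["Source", "Label_S", "Prop_1", "Mer_1"]].rename(
    columns={"Source": "Node", "Prop_1": "Property", "Mer_1": "Merchandising"}
)
# Target-side view: there is no Label_S for targets, so it is left out.
tgt = df[["Target", "Prop_2", "Mer_2"]].rename(
    columns={"Target": "Node", "Prop_2": "Property", "Mer_2": "Merchandising"}
)

# Stack both views; rows coming from tgt get NaN in Label_S.
stacked = pd.concat([src, tgt], ignore_index=True)

# first() keeps the first non-null value per column within each node,
# so a Label_S seen on the source side survives the target-side NaN.
nodes = stacked.groupby("Node", sort=False).first().reset_index()
print(nodes)
```

Using `groupby(...).first()` rather than `drop_duplicates("Node")` is a design choice here: dropping duplicates keeps whole rows, so which row survives would decide whether Label_S is kept, while `first()` picks a non-null value column by column.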

I'm not sure I understand correctly, but you can first create two distinct dataframes for source and target and aggregate the properties into lists:

df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source").agg(list)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target").agg(list)

print(df_s)
print(df_t)

Output:

                   Label_S     Prop_1      Mer_1
Source                                          
airplane            [-1.0]        [0]        [0]
bike            [1.0, 1.0]     [1, 1]     [0, 0]
car        [0.5, 0.5, 0.5]  [1, 1, 1]  [0, 0, 0]
motorbike       [1.0, 1.0]     [1, 1]     [1, 1]
train                [1.0]        [1]        [1]
                         Weight        Prop_2         Mer_2
Target                                                     
airplane                  [0.2]           [0]           [0]
bike                      [0.2]           [1]           [0]
car        [0.2, 0.2, 0.5, 0.7]  [1, 1, 1, 1]  [0, 0, 0, 0]
motorbike                 [0.7]           [1]           [1]
toy                       [0.6]           [0]           [1]
train                     [0.5]           [1]           [1]
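
The lists above suggest that each node carries the same value every time it appears. Before collapsing them to a single value, it may be worth confirming that. A minimal sketch, assuming the sample data from the question (rebuilt here so the snippet runs on its own; `per_source`, `per_target`, `source_ok` and `target_ok` are names chosen for the example):

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Source": ["car", "car", "car", "bike", "bike",
               "airplane", "train", "motorbike", "motorbike"],
    "Target": ["airplane", "train", "bike", "motorbike", "car",
               "car", "car", "car", "toy"],
    "Label_S": [0.5, 0.5, 0.5, 1, 1, -1, 1, 1, 1],
    "Weight": [0.2, 0.5, 0.2, 0.7, 0.2, 0.2, 0.5, 0.7, 0.6],
    "Prop_1": [1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Prop_2": [0, 1, 1, 1, 1, 1, 1, 1, 0],
    "Mer_1": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Mer_2": [0, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Count distinct values of each property per node.
per_source = df.groupby("Source")[["Label_S", "Prop_1", "Mer_1"]].nunique()
per_target = df.groupby("Target")[["Prop_2", "Mer_2"]].nunique()

# True when every node has exactly one distinct value for every property.
source_ok = bool((per_source == 1).all().all())
target_ok = bool((per_target == 1).all().all())
print(source_ok, target_ok)
```

For the sample data both checks pass, so any aggregation that keeps one value per group (max, min, first) gives the same result. Weight is left out of the check on purpose, since it belongs to the Source/Target pair and does vary per node.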

Edit

You can aggregate the properties differently to keep only one value per node (e.g. max), then merge your dataframes:

# pass "max" as a string; recent pandas versions warn when given the builtin max
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source", as_index=False).agg("max")
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target", as_index=False).agg("max")

df_s.columns = ["Node", "Label_S", "Property", "Merchandising"]
df_t.columns = ["Node", "Weight", "Property", "Merchandising"]

print(df_s.merge(df_t, how="outer").set_index("Node"))

Output:

           Label_S  Property  Merchandising  Weight
Node                                               
airplane      -1.0         0              0     0.2
bike           1.0         1              0     0.2
car            0.5         1              0     0.7
motorbike      1.0         1              1     0.7
train          1.0         1              1     0.5
toy            NaN         0              1     0.6

If you want to exclude the Weight column:

print(df_s.merge(df_t[["Node", "Property", "Merchandising"]], how="outer").set_index("Node"))
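
One thing to keep in mind: without an `on=` argument, `merge` joins on every column the two frames share, which here means Node, Property and Merchandising together. That works for the sample data because each node has the same values on both sides, but a node whose source-side and target-side values disagreed would come out as two rows. A possible alternative, sketched under the assumption that source-side values should win and target-side values should only fill the gaps, is to index both frames by node and use `combine_first` (the name `nodes` is just for the example):

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Source": ["car", "car", "car", "bike", "bike",
               "airplane", "train", "motorbike", "motorbike"],
    "Target": ["airplane", "train", "bike", "motorbike", "car",
               "car", "car", "car", "toy"],
    "Label_S": [0.5, 0.5, 0.5, 1, 1, -1, 1, 1, 1],
    "Weight": [0.2, 0.5, 0.2, 0.7, 0.2, 0.2, 0.5, 0.7, 0.6],
    "Prop_1": [1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Prop_2": [0, 1, 1, 1, 1, 1, 1, 1, 0],
    "Mer_1": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Mer_2": [0, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Per-node maxima, indexed by node name so both frames align on it.
df_s = df.groupby("Source")[["Label_S", "Prop_1", "Mer_1"]].max()
df_s.columns = ["Label_S", "Property", "Merchandising"]
df_t = df.groupby("Target")[["Prop_2", "Mer_2"]].max()
df_t.columns = ["Property", "Merchandising"]

# Source-side values take precedence; target-side values only fill cells
# that are missing on the source side (e.g. the whole row for "toy").
nodes = df_s.combine_first(df_t)

# Put the columns back in the order used in the question.
nodes = nodes[["Label_S", "Property", "Merchandising"]].rename_axis("Node")
print(nodes)
```

This always yields exactly one row per node, at the cost of silently preferring the source side when the two sides differ.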
