Exploring and finding patterns/similarity in data with Python

While learning Python, I was working on a dataset like this:

**Col1**                                 **Col2**      **Col3**        
dog                                        Z             st02          
dog,cat                                    Z             st02          
dog,bat,cat                                Z             st02          
bat,cat,elephant                           Y             st02          
dog,bat,cat,elephant                       Y             st02          
tiger                                      Z             st01          
pigeon                                     Z             st01          
pigeon,parrot                              Z             st01          
dove,parrot                                Z             st01          
pigeon,parrot                              Z             st01          
pigeon,parrot,dove                         Z             st01          
lion,leopard,cheetah                       Z             st01          
tiger,lion,leopard,cheetah                 Z             st01          
dog,tiger,cheetah                          Y             st01          
dog,tiger,leopard,cheetah                  Y             st01          
eagle,jaguar,Kangaroo,zebra                Z             st02          
cheetah,eagle,jaguar,Kangaroo,zebra        Z             st02          

The expected output is:

**Col1**                                 **Col2**       **Col3**      
dog,bat,cat                                Z              st02          
dog,bat,cat,elephant                       Y              st02          
tiger,lion,leopard,cheetah                 Z              st01          
dog,tiger,leopard,cheetah                  Y              st01          
cheetah,eagle,jaguar,Kangaroo,zebra        Z              st02          
pigeon,parrot,dove                         Z              st01          

To extract the rows above, I tried tracing the patterns with the following logic:

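import pandas as pd

data = pd.read_excel("data.xlsx")
# Col4 = number of commas in Col1 (animal count minus one)
data['Col4'] = data['Col1'].str.count(',')
# The first row has no previous row to compare with
v1 = [False]
v2 = [False]
for i in range(data.shape[0] - 1):
    # g1: row i and row i+1 share the same Col2 and Col3
    x = data['Col2'][i]
    y = data['Col2'][i+1]
    t1 = data['Col3'][i]
    t2 = data['Col3'][i+1]
    g1 = (x == y) & (t1 == t2)
    d1 = data['Col1'][i]
    d2 = data['Col1'][i+1]
    c1 = data['Col4'][i]
    c2 = data['Col4'][i+1]
    # Note: this tests each character of d1 against the string d2,
    # not each animal name
    flag = all(ch in d2 for ch in d1)
    # g2: row i+1 "contains" row i and lists more animals
    g2 = flag and bool(c2 > c1)
    v1.append(g1)
    v2.append(g2)
# Build the flag columns once, after the loop has filled both lists
data['new_cond1'] = v1
data['new_cond2'] = v2
data['Final_flag'] = data['new_cond1'].astype(bool) & data['new_cond2'].astype(bool)
data_output = data[data['Final_flag']]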

But I didn't get the expected output; a few additional rows are also present in the output. Could someone please help me extract the rows listed in the expected output?


From the dataset, I am trying to extract:

1) Rows that have the maximum number of comma-separated animals (or birds, wherever pigeon/parrot/dove is mentioned).
2) There need not be only one such maximal row per Col2/Col3 combination; there may be more than one. For example, rows no. 1 and no. 5 of the expected output have the same values in Col2 and Col3, because the category of animals differs between them.

Hope it's clear.

Thanks in advance!

My comment is not formatted well, so I'll post an answer to make it more readable. I know it's not your desired output, but it does de-duplicate the rows.

# assuming your data is currently in a frame called df...
# Join all Col1 strings within each (Col2, Col3) group
df_reduce = (
    df.groupby(['**Col2**', '**Col3**'])['**Col1**']
      .apply(','.join)
      .reset_index()
)
# De-duplicate the animal names in each group. Assigning the whole column
# avoids relying on in-place writes to iterrows() rows, which may be copies.
df_reduce['**Col1**'] = df_reduce['**Col1**'].apply(
    lambda s: str(set(s.split(',')))[1:-1]
)
print(df_reduce)

Output (the order of names within a set may vary between runs):

  **Col2** **Col3**                                           **Col1**
0        Y     st01               'tiger', 'leopard', 'cheetah', 'dog'
1        Y     st02                    'bat', 'cat', 'elephant', 'dog'
2        Z     st01  'parrot', 'pigeon', 'tiger', 'lion', 'leopard'...
3        Z     st02  'zebra', 'Kangaroo', 'dog', 'eagle', 'cheetah'...
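
For comparison, here is a minimal sketch that gets closer to the expected output. It rests on one reading of the question: within each Col2/Col3 group, keep a row only if its set of animals is not a strict subset of another row's set in the same group. That interpretation, the helper name `keep_maximal`, and the plain column names `Col1`, `Col2`, `Col3` (taken from the asker's code) are assumptions, and the sample rows are typed in by hand rather than read from `data.xlsx`. Comparing each row only with the row before it, as in the question, also flags intermediate rows such as `dog,cat`, which is one plausible source of the extra rows.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("dog", "Z", "st02"),
    ("dog,cat", "Z", "st02"),
    ("dog,bat,cat", "Z", "st02"),
    ("bat,cat,elephant", "Y", "st02"),
    ("dog,bat,cat,elephant", "Y", "st02"),
    ("tiger", "Z", "st01"),
    ("pigeon", "Z", "st01"),
    ("pigeon,parrot", "Z", "st01"),
    ("dove,parrot", "Z", "st01"),
    ("pigeon,parrot", "Z", "st01"),
    ("pigeon,parrot,dove", "Z", "st01"),
    ("lion,leopard,cheetah", "Z", "st01"),
    ("tiger,lion,leopard,cheetah", "Z", "st01"),
    ("dog,tiger,cheetah", "Y", "st01"),
    ("dog,tiger,leopard,cheetah", "Y", "st01"),
    ("eagle,jaguar,Kangaroo,zebra", "Z", "st02"),
    ("cheetah,eagle,jaguar,Kangaroo,zebra", "Z", "st02"),
]
data = pd.DataFrame(rows, columns=["Col1", "Col2", "Col3"])


def keep_maximal(group):
    """Keep rows whose animal set is not a strict subset of another row's set."""
    sets = [frozenset(s.split(",")) for s in group["Col1"]]
    # On sets, `a < b` is the strict-subset test.
    mask = [not any(a < b for b in sets) for a in sets]
    return group.loc[mask]


# Apply the filter per (Col2, Col3) group, then restore the original row order.
parts = [keep_maximal(g) for _, g in data.groupby(["Col2", "Col3"], sort=False)]
result = pd.concat(parts).sort_index().drop_duplicates()
print(result)
```

On the sample data this keeps the same six rows as the expected output, though in their original order, so `pigeon,parrot,dove` comes before `tiger,lion,leopard,cheetah`. The pairwise comparison is quadratic in the group size, which is fine for small groups but would need rethinking for large ones.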
