Finding sequence using Python

I have a problem statement for which I am looking for some guidance. I have a table like the one below: [image: source dataset]

Every name has a dependency entry. Some names have no dependency, and some depend on 2 or 3 items from the name column. I want a target dataset with an additional column named sequence, whose values are derived as follows:

1. If a name has no dependency, its sequence should be 1.
2. If a name has one dependency, and that dependency has no further dependency of its own, its sequence should be 2.
3. Similarly, if a name has 2 dependencies, for example country depends on city and address, and city in turn depends on pincode, which has no dependency, then its sequence should be 3, and so on.

Here is what I want the target dataset to look like: [image: target dataset]

Input dataset for Boris: [image]

You can use the csv library and count the entries in each row with nested row and column loops:

import csv

with open('testdata1.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # skip the first row
    for row in csvreader:
        i = 0
        for col in row:
            # Ignore empty cells
            if col in (None, ""):
                continue
            if col.find(',') != -1:
                # A comma-separated cell counts as its number of items plus one
                i = 1 + len(col.split(","))
            else:
                i = i + 1
        print(i)
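If the goal is the depth of the dependency chain as described in the question, a csv-based variant could look like the sketch below. It is only an illustration under some assumptions: the column names Name and Dependency are borrowed from the pandas answer, the sample rows are made up from the country/city/address/pincode example in the question, and the helper name sequences_from_csv is hypothetical. It processes names in dependency order using graphlib.TopologicalSorter (Python 3.9+), which also raises CycleError if the dependencies are circular.

```python
import csv
import io
from graphlib import TopologicalSorter

# Hypothetical sample data modelled on the example in the question
SAMPLE = """Name,Dependency
country,"city,address"
city,pincode
address,
pincode,
"""


def sequences_from_csv(csvfile):
    """Return {name: sequence} for a CSV with Name and Dependency columns."""
    graph = {}
    for row in csv.DictReader(csvfile):
        raw = row["Dependency"] or ""
        # Split the comma-separated dependency cell, dropping blanks
        deps = [d.strip() for d in raw.split(",") if d.strip()]
        graph[row["Name"].strip()] = deps

    seq = {}
    # static_order() yields every dependency before the names that need it
    for name in TopologicalSorter(graph).static_order():
        deps = graph.get(name, [])
        # No dependencies -> 1, otherwise 1 + the deepest dependency
        seq[name] = 1 + max((seq[d] for d in deps), default=0)
    return seq


result = sequences_from_csv(io.StringIO(SAMPLE))
for name in sorted(result):
    print(name, result[name])
```

A name that appears only in the Dependency column is treated here as having no dependencies of its own, which is a design choice rather than something the question specifies.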

Using pandas, the solution can look like the following:

import pandas as pd

data = pd.read_excel(r'D:\Desktop\data.xlsx')

sequence = []
for i in range(len(data['Name'])):
    # Here we store heads of 
    # the chains we are currently on
    # [<name>, <length>]
    deps_chains = [[data['Name'][i], 1]]

    # Currently maximal length
    # of the dependency chain
    max_dep_length = 1

    # Whether there are dependencies
    # to proceed in the next iteration
    is_longer = True

    while is_longer:
        # Here are the heads we will
        # consider in the next iteration
        next_deps_chain = []

        for dep, length in deps_chains:
            dep_idx = data[data['Name'] == dep].index[0]

            # Dependencies of the current dependency
            dependencies = data['Dependency'][dep_idx]

            if pd.isnull(dependencies):
                # If the current dependency has
                # no dependencies of its own,
                # check whether this chain is
                # the longest one seen so far
                max_dep_length = max(max_dep_length, length)
            else:
                # Dependencies of the current
                # dependency will be considered
                # in the next iteration
                next_deps_chain += [
                    [d, length + 1] for d in dependencies.split(',')
                ]

        # Change for the next iteration
        deps_chains = next_deps_chain

        # Whether there are dependencies
        # for the next iteration
        is_longer = len(next_deps_chain) > 0

    # We found the longest dependency chain
    sequence.append(max_dep_length)

# Here we set the column 'sequence' 
# to our result
data['sequence'] = sequence

data.to_excel(r'D:\Desktop\data.xlsx', index=False)

Lacking specifics, some of this will have to be pseudo-code.

Unlike the other answers, I believe the OP is asking how to calculate the sequence number given the dependencies and names.

One approach would be to use recursive calls, made more efficient by a dict of previously calculated sequences. The general idea is that if the dependencies are empty, the sequence number is 1; otherwise it is the maximum sequence number of the dependencies plus 1. If you wanted to, you could even implement this in Excel.

class DepSeqTable:
    def __init__(self, datasource):
        # Assumption: datasource is an iterable of (name, dependency) pairs,
        # where dependency is a comma-separated string, or empty/missing
        # when the name has no dependencies.
        self.seqlookup = dict()
        self.deplookup = dict()
        for name, dependency in datasource:
            # Parse the dependency column into a list of names
            if isinstance(dependency, str):
                listOfDeps = [d.strip() for d in dependency.split(',') if d.strip()]
            else:
                listOfDeps = []
            self.deplookup[name] = listOfDeps
        # Evaluate every name up front
        for name in self.deplookup:
            self.SeqOf(name)

    def SeqOf(self, name):
        # Reuse a previously calculated sequence if there is one
        if name in self.seqlookup:
            return self.seqlookup[name]
        deps = self.deplookup.get(name)
        if deps is None:
            # The name was not defined in the table
            raise KeyError(f"{name!r} is not defined in the table")
        if len(deps) == 0:
            self.seqlookup[name] = 1
            return 1
        maxDepSeq = 0
        for dep in deps:
            depseq = self.SeqOf(dep)
            if depseq > maxDepSeq:
                maxDepSeq = depseq
        self.seqlookup[name] = maxDepSeq + 1
        return maxDepSeq + 1

Usage would be:

table = DepSeqTable(datasource)
#draw whatever info you want out of table

You may want to add more 'get' type functions to access data from the DepSeqTable, depending on what you need. You may also want to remove the second for loop in __init__ if you only want the sequences evaluated on demand.
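As a rough illustration of the on-demand idea, the same recursion can be written as a standalone cached function, so that a sequence is only computed the first time it is asked for. This is a sketch, not part of the class above: the DEPS mapping is hypothetical sample data based on the example in the question, and seq_of is an illustrative name. functools.lru_cache plays the role of the dict of previously calculated sequences.

```python
from functools import lru_cache

# Hypothetical sample data: name -> list of names it depends on
DEPS = {
    "country": ["city", "address"],
    "city": ["pincode"],
    "address": [],
    "pincode": [],
}


@lru_cache(maxsize=None)
def seq_of(name):
    """Sequence number of name, computed lazily and cached."""
    deps = DEPS[name]  # raises KeyError if the name is not in the table
    if not deps:
        return 1
    # One more than the deepest dependency
    return 1 + max(seq_of(dep) for dep in deps)


# Nothing is evaluated until a name is actually requested
print(seq_of("city"))
print(seq_of("country"))
```

Note that neither this sketch nor the class guards against circular dependencies; a cycle would recurse until Python raises RecursionError.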
