
Finding sequence using Python

I have a problem for which I am looking for some guidance. I have a table like the one below (source dataset).

Every name has a dependency column. Some names have no dependency, and some depend on 2 or 3 other items from the name column. I want a target dataset with an additional column named sequence, whose values are derived as follows:

If a name has no dependency, its sequence should be 1.

If a name has 1 dependency, and that dependency has no further dependency of its own, its sequence should be 2.

Similarly, if a name has 2 dependencies, for example country depends on city and address, and city in turn depends on pincode, which has no dependency, then the sequence of country should be 3, and so on.

Here is what I want the target dataset to look like (target dataset).

Input dataset for Boris: [image]

You can use the csv library and calculate the count by looping over the rows and columns:

import csv

with open('testdata1.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # skip the header row
    for row in csvreader:
        i = 0
        for col in row:
            if col in (None, ""):
                continue
            if col.find(',') != -1:
                i = 1 + len(col.split(","))
            else:
                i = i + 1
        print(i)
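
To go from per-row counts to chain lengths, the rows first need to be looked up by name. A minimal sketch of that step, assuming the columns are called Name and Dependency (as in the pandas answer below) and that multiple dependencies are comma-separated in one cell; the sample text and the helper name read_dependencies are made up for illustration:

```python
import csv
import io

# Hypothetical sample based on the example in the question; the real
# file layout (column names, separators) is an assumption.
SAMPLE = """Name,Dependency
pincode,
city,pincode
address,
country,"city,address"
"""

def read_dependencies(csvfile):
    """Return {name: [dependency, ...]} from a CSV with Name/Dependency columns."""
    deps = {}
    for row in csv.DictReader(csvfile):
        raw = row.get("Dependency") or ""
        # Split the cell on commas and drop empty or whitespace-only pieces
        deps[row["Name"]] = [d.strip() for d in raw.split(",") if d.strip()]
    return deps

dep_map = read_dependencies(io.StringIO(SAMPLE))
print(dep_map)
```

With a real file, the same function could be called as read_dependencies(open('testdata1.csv', newline='')).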

Using pandas, the solution can look as follows:

import pandas as pd

data = pd.read_excel(r'D:\Desktop\data.xlsx')

sequence = []
for i in range(len(data['Name'])):
    # Here we store heads of 
    # the chains we are currently on
    # [<name>, <length>]
    deps_chains = [[data['Name'][i], 1]]

    # Currently maximal length
    # of the dependency chain
    max_dep_length = 1

    # Whether there are dependencies
    # to proceed in the next iteration
    is_longer = True

    while is_longer:
        # Here are the heads we will
        # consider in the next iteration
        next_deps_chain = []

        for dep, length in deps_chains:
            dep_idx = data[data['Name'] == dep].index[0]

            # Dependencies of the current dependency
            dependencies = data['Dependency'][dep_idx]

            if pd.isnull(dependencies):
                # If the current dependency
                # has no dependencies of
                # its own, check whether
                # the length of this chain
                # is the maximum so far
                max_dep_length = max(max_dep_length, length)
            else:
                # Dependencies of the current
                # dependency will be considered
                # in the next iteration
                # (strip() tolerates "a, b" as well as "a,b")
                next_deps_chain += [
                    [d.strip(), length + 1] for d in dependencies.split(',')
                ]

        # Change for the next iteration
        deps_chains = next_deps_chain

        # Whether there are dependencies
        # for the next iteration
        is_longer = len(next_deps_chain) > 0

    # We found the longest dependency chain
    sequence.append(max_dep_length)

# Here we set the column 'sequence' 
# to our result
data['sequence'] = sequence

data.to_excel(r'D:\Desktop\data.xlsx', index=False)
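
One thing the loop above does not guard against is a circular dependency (a depends on b, and b depends on a), in which case while is_longer would never finish. A minimal sketch of a check that could be run first, working on a plain dict of name to list of dependencies rather than on the DataFrame; the function name find_cycle and the dict layout are assumptions for illustration:

```python
def find_cycle(dep_map):
    """Return a list of names forming a dependency cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in dep_map}
    path = []

    def visit(name):
        state[name] = GREY
        path.append(name)
        for dep in dep_map.get(name, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: we looped back to it
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[name] = BLACK
        return None

    for name in dep_map:
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None

print(find_cycle({"a": ["b"], "b": ["a"]}))
print(find_cycle({"city": ["pincode"], "pincode": []}))
```

If this returns a list, the data needs fixing before any sequence can be computed, since no finite chain length exists for the names involved.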

Lacking specifics, some of this will have to be pseudo-code.

Unlike the other answers, I believe the OP is asking how to calculate the sequence # given the dependencies and names.

One approach would be to use recursion, made more efficient by a dict of previously calculated sequences. The general idea is that if a name's dependencies are empty, its sequence # is 1; otherwise it is the maximum sequence # among its dependencies, plus 1. If you wanted to, you could even implement this in Excel.

class DepSeqTable:
    def __init__(self, datasource):
        # Assumption: datasource yields (name, dependencies) pairs, where
        # dependencies is a comma-separated string, or empty/None/NaN
        # when the name has no dependencies. Adapt this to the real data.
        self.seqlookup = dict()
        self.deplookup = dict()
        for name, deps in datasource:
            # parse the dependency column into a list called listOfDeps
            if isinstance(deps, str):
                listOfDeps = [d.strip() for d in deps.split(',') if d.strip()]
            else:
                listOfDeps = []
            self.deplookup[name] = listOfDeps
        for name in self.deplookup:
            self.SeqOf(name)

    def SeqOf(self, name):
        if name in self.seqlookup:
            return self.seqlookup[name]
        deps = self.deplookup.get(name)
        if deps is None:
            # name was not defined in the table; raising is one option,
            # returning 1 or a negative marker value would be another
            raise KeyError(f"{name!r} is not defined in the table")
        if len(deps) == 0:
            self.seqlookup[name] = 1
            return 1
        maxDepSeq = 0
        for dep in deps:
            depseq = self.SeqOf(dep)
            if depseq > maxDepSeq:
                maxDepSeq = depseq
        self.seqlookup[name] = maxDepSeq + 1
        return maxDepSeq + 1

Usage would be:

table = DepSeqTable(datasource)
#draw whatever info you want out of table

You may want to add more 'get' type functions to access data from the DepSeqTable, depending on what you need. You may also want to remove the second for loop in __init__ if you only want the sequences evaluated on demand.
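
For a quick check of the rule itself (1 for no dependencies, otherwise the largest sequence # among the dependencies plus 1), here is a minimal standalone sketch. The function name sequence_numbers and the dict-of-lists input are assumptions for illustration, and the sample data is reconstructed from the country/city/address/pincode example in the question:

```python
def sequence_numbers(dep_map):
    """Map each name to 1 + the longest chain of dependencies below it."""
    memo = {}  # previously calculated sequence numbers

    def seq_of(name):
        if name in memo:
            return memo[name]
        if name not in dep_map:
            raise KeyError(f"{name!r} is used as a dependency but not defined")
        deps = dep_map[name]
        # default=0 covers the "no dependencies" case, giving 1
        memo[name] = 1 + max((seq_of(d) for d in deps), default=0)
        return memo[name]

    for name in dep_map:
        seq_of(name)
    return memo

example = {
    "pincode": [],
    "address": [],
    "city": ["pincode"],
    "country": ["city", "address"],
}
result = sequence_numbers(example)
print(result)
```

Like the class above, this assumes the dependencies contain no cycles; a circular dependency would recurse until Python's recursion limit is hit.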
