Efficient way to find missing elements in an integer sequence

Suppose two items are missing from a sequence of consecutive integers, and the missing elements lie between the first and last elements. I wrote code that accomplishes the task, but I would like to make it more efficient, using fewer loops if possible. Any help will be appreciated. Also, what about the case where we have to find many more missing items (say, close to n/4) instead of 2? I think my code should be efficient then, because I am breaking out of the loop earlier, right?

def missing_elements(L,start,end,missing_num):
    complete_list = range(start,end+1)
    count = 0
    input_index = 0
    for item in complete_list:
        if item != L[input_index]:
            print(item)
            count += 1
        else:
            input_index += 1
        # stop as soon as all expected missing items have been found
        if count >= missing_num:
            break



def main():
    L = [10,11,13,14,15,16,17,18,20]
    start = 10
    end = 20
    missing_elements(L,start,end,2)



if __name__ == "__main__":
    main()

If the input sequence is sorted, you could use sets here. Take the start and end values from the input list:

def missing_elements(L):
    start, end = L[0], L[-1]
    return sorted(set(range(start, end + 1)).difference(L))

This assumes Python 3; for Python 2, use xrange() to avoid building a list first.

The sorted() call is optional; without it, a set() of the missing values is returned, and with it you get a sorted list.

Demo:

>>> L = [10,11,13,14,15,16,17,18,20]
>>> missing_elements(L)
[12, 19]
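The set difference itself does not depend on the order of the elements; only the `L[0]` / `L[-1]` bounds do. A minimal sketch for unsorted input, assuming the bounds should be the smallest and largest values present (the name `missing_elements_unsorted` is made up for illustration):

```python
def missing_elements_unsorted(values):
    """Return the sorted missing integers between min(values) and max(values)."""
    present = set(values)
    if not present:
        return []
    lo, hi = min(present), max(present)
    # Membership tests against a set are O(1) on average
    return [n for n in range(lo, hi + 1) if n not in present]

print(missing_elements_unsorted([20, 10, 13, 11, 18, 14, 17, 15, 16]))  # [12, 19]
```

Duplicates in the input are harmless here, because they collapse when the set is built.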

Another approach is to detect gaps between subsequent numbers, using an older itertools library sliding window recipe:

from itertools import islice, chain

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)

This is a pure O(n) operation, and if you know the number of missing items, you can make sure it only produces those and then stops:

def missing_elements(L, count):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(islice(missing, 0, count))

This will handle larger gaps too; if you are missing 2 items at 11 and 12, it'll still work:

>>> missing_elements([10, 13, 14, 15], 2)
[11, 12]

and the above sample only had to iterate over [10, 13] to figure this out.
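The same early-stopping idea can be sketched without the window recipe, using `zip(L, L[1:])` to pair neighbours instead. This is an alternative formulation rather than the code above, and the names `iter_missing` and `first_missing` are hypothetical:

```python
from itertools import islice

def iter_missing(L):
    """Lazily yield the integers missing from the sorted list L."""
    for x, y in zip(L, L[1:]):
        # range(x + 1, y) is empty when y == x + 1, so no explicit gap test is needed
        yield from range(x + 1, y)

def first_missing(L, count):
    """Return at most `count` missing values, stopping as soon as they are found."""
    return list(islice(iter_missing(L), count))

print(first_missing([10, 13, 14, 15], 2))                      # [11, 12]
print(first_missing([10, 11, 13, 14, 15, 16, 17, 18, 20], 2))  # [12, 19]
```

Note that `L[1:]` copies the list up front; the pairing itself is still lazy, so the generator stops consuming pairs once `count` values have been produced.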

Assuming that L is a sorted list of integers with no duplicates, you can infer that the part of the list between start and index is completely consecutive if and only if L[index] == L[start] + (index - start), and similarly that the part between index and end is completely consecutive if and only if L[index] == L[end] - (end - index). This, combined with recursively splitting the list in two, gives a solution that is sublinear when only a few elements are missing.
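The consecutiveness test can be tried directly on the sample list. A small sketch, assuming a sorted list without duplicates; the helper names `is_consecutive` and `count_missing` are invented for this illustration:

```python
def is_consecutive(L, start, end):
    """True if L[start..end] has no gaps (L sorted, no duplicates)."""
    return L[end] - L[start] == end - start

def count_missing(L, start, end):
    """Number of integers missing between L[start] and L[end]."""
    return (L[end] - L[start]) - (end - start)

L = [10, 11, 13, 14, 15, 16, 17, 18, 20]
print(is_consecutive(L, 0, 1))           # True: 10, 11
print(is_consecutive(L, 0, 4))           # False: 12 is missing
print(is_consecutive(L, 2, 7))           # True: 13..18
print(count_missing(L, 0, len(L) - 1))   # 2
```

Because each test only reads the two endpoints, a whole gap-free half of the list can be skipped without looking at its elements.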

# python 3.3 and up, in older versions, replace "yield from" with yield loop

def missing_elements(L, start, end):
    if end - start <= 1: 
        if L[end] - L[start] > 1:
            yield from range(L[start] + 1, L[end])
        return

    index = start + (end - start) // 2

    # is the lower half consecutive?
    consecutive_low =  L[index] == L[start] + (index - start)
    if not consecutive_low:
        yield from missing_elements(L, start, index)

    # is the upper part consecutive?
    consecutive_high =  L[index] == L[end] - (end - index)
    if not consecutive_high:
        yield from missing_elements(L, index, end)

def main():
    L = [10,11,13,14,15,16,17,18,20]
    print(list(missing_elements(L,0,len(L)-1)))
    L = range(10, 21)
    print(list(missing_elements(L,0,len(L)-1)))

main()

missingItems = [x for x in complete_list if x not in L]

Using collections.Counter:

from collections import Counter

dic = Counter([10, 11, 13, 14, 15, 16, 17, 18, 20])
print([i for i in range(10, 20) if dic[i] == 0])

Output:

[12, 19]

Using the scipy library:

import math
from scipy.optimize import fsolve

def mullist(a):
    mul = 1
    for i in a:
        mul = mul*i
    return mul

a = [1,2,3,4,5,6,9,10]
s = sum(a)
so = sum(range(1,11))
mulo = mullist(range(1,11))
mul = mullist(a)
over = mulo/mul
delta = so -s
# y = so - s -x
# xy = mulo/mul
def func(x):
    return (so -s -x)*x-over

print int(round(fsolve(func, 0))), int(round(delta - fsolve(func, 0)))

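Timing it (note that -s runs the whole script as setup only, so timeit measures its default empty statement rather than the solver):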

$ python -mtimeit -s "$(cat with_scipy.py)" 

7 8

100000000 loops, best of 3: 0.0181 usec per loop
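Since the sum and product give x + y and x * y, the two missing values are the roots of a quadratic and can be computed exactly without a numerical solver. A sketch under the assumptions that exactly two values are missing and that the range starts at 1 or higher (so the product is never zero); `two_missing` is a hypothetical name, and `math.prod` / `math.isqrt` need Python 3.8 or later:

```python
import math

def two_missing(a, lo, hi):
    """Find the two integers missing from a, which otherwise holds lo..hi (lo >= 1)."""
    full = range(lo, hi + 1)
    s = sum(full) - sum(a)                # x + y
    p = math.prod(full) // math.prod(a)   # x * y (the division is exact)
    root = math.isqrt(s * s - 4 * p)      # |x - y|, from the discriminant
    return (s - root) // 2, (s + root) // 2

print(two_missing([1, 2, 3, 4, 5, 6, 9, 10], 1, 10))  # (7, 8)
```

Everything stays in integer arithmetic, so there is no rounding step, although the product grows very quickly for long ranges.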

Another option is:

>>> from sets import Set
>>> a = Set(range(1,11))
>>> b = Set([1,2,3,4,5,6,9,10])
>>> a-b
Set([8, 7])

And the timing is:

Set([8, 7])
100000000 loops, best of 3: 0.0178 usec per loop

arr = [1, 2, 5, 6, 10, 12]
diff = []

# zip yields the pairs (1, 2), (2, 5), (5, 6), (6, 10), (10, 12)
for a, b in zip(arr , arr[1:]):
    if a + 1 != b:
        diff.extend(range(a+1, b))

print(diff)

[3, 4, 7, 8, 9, 11]

If the list is sorted, we can look for any gap, then generate a range object between the current value (+1) and the next value (not inclusive) and extend the list of differences with it.


a = [1, 2, 3, 7, 5, 11, 20]
b = []

def miss(a, b):
    for x in range(a[0], a[-1]):
        if x not in a:
            b.append(x)
    return b

print(miss(a, b))

ANS: [4, 6, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19]

Works for sorted and unsorted input, and with duplicates too, as long as the first and last elements are the smallest and largest values.

Here's a one-liner:

In [10]: l = [10,11,13,14,15,16,17,18,20]

In [11]: [i for i, (n1, n2) in enumerate(zip(l[:-1], l[1:])) if n1 + 1 != n2]
Out[11]: [1, 7]

I use the list, slicing to offset the copies by one, and use enumerate to get the indices after which an item is missing.

For long lists, this isn't great because it's not O(log(n)), but I think it should be pretty efficient compared with using a set for small inputs. izip from itertools (Python 2) would probably make it quicker still.
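The one-liner returns positions rather than values. A short sketch of how those indices could be turned into the missing numbers themselves; the variable names `gap_indices` and `missing` are chosen here for illustration:

```python
l = [10, 11, 13, 14, 15, 16, 17, 18, 20]

# Indices of elements that are not followed by their successor
gap_indices = [i for i, (n1, n2) in enumerate(zip(l[:-1], l[1:])) if n1 + 1 != n2]

# Expand each gap into the values between the two neighbours
missing = [n for i in gap_indices for n in range(l[i] + 1, l[i + 1])]

print(gap_indices)  # [1, 7]
print(missing)      # [12, 19]
```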

My take was to use no loops, only set operations:

def find_missing(in_list):
    complete_set = set(range(in_list[0], in_list[-1] + 1))
    return complete_set - set(in_list)

def main():
    sample = [10, 11, 13, 14, 15, 16, 17, 18, 20]
    print find_missing(sample)

if __name__ == "__main__":
    main()

# => set([19, 12])

Simply walk the list and look for non-consecutive numbers:

prev = L[0]
for this in L[1:]:
    if this > prev+1:
        for item in range(prev+1, this):    # this handles gaps of 1 or more
            print item
    prev = this

We have found a missing value if the difference between two consecutive numbers is greater than 1:

>>> L = [10,11,13,14,15,16,17,18,20]
>>> [x + 1 for x, y in zip(L[:-1], L[1:]) if y - x > 1]
[12, 19]

Note: this is Python 3. In Python 2, use itertools.izip.

Improved version for more than one value missing in a row:

>>> import itertools as it
>>> L = [10,11,14,15,16,17,18,20] # 12, 13 and 19 missing
>>> [x + diff for x, y in zip(it.islice(L, None, len(L) - 1),
                              it.islice(L, 1, None)) 
     for diff in range(1, y - x) if diff]
[12, 13, 19]

>>> l = [10,11,13,14,15,16,17,18,20]
>>> [l[i]+1 for i, j in enumerate(l) if (l+[0])[i+1] - l[i] > 1]
[12, 19]

def missing_elements(inlist):
    if len(inlist) <= 1:
        return []
    else:
        if inlist[1]-inlist[0] > 1:
            return [inlist[0]+1] + missing_elements([inlist[0]+1] + inlist[1:])
        else:
            return missing_elements(inlist[1:])

First we should sort the list, and then we check for each element except the last one whether the next value (l[i] + 1) is in the list. Be careful not to have duplicates in the list!

l.sort()

[l[i]+1 for i in range(len(l)-1) if l[i]+1 not in l]

I stumbled on this looking for a different kind of efficiency: given a list of unique serial numbers, possibly very sparse, yield the next available serial number without creating the entire set in memory. (Think of an inventory where items come and go frequently, but some are long-lived.)

def get_serial(string_ids, longtail=False):
  int_list = sorted(map(int, string_ids))  # map() returns an iterator on Python 3, so sort via sorted()
  n = len(int_list)
  for i in range(0, n-1):
    nextserial = int_list[i]+1
    while nextserial < int_list[i+1]:
      yield nextserial
      nextserial+=1
  while longtail:
    nextserial+=1
    yield nextserial
[...]
def main():
  [...]
  serialgenerator = get_serial(list1, longtail=True)
  while somecondition:
    newserial = next(serialgenerator)

(Input is a list of string representations of integers and the yielded value is an integer, so this is not completely generic code. longtail provides extrapolation if we run out of range.)
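For reference, a self-contained Python 3 sketch of the same idea. It is not the code above: it always extrapolates past the largest serial (as with longtail=True), and the name `free_serials` and the choice to start at 1 for empty input are assumptions made for illustration:

```python
from itertools import count, islice

def free_serials(string_ids):
    """Yield unused serial numbers in increasing order, then continue past the maximum."""
    used = sorted(set(map(int, string_ids)))
    for low, high in zip(used, used[1:]):
        yield from range(low + 1, high)    # serials inside a gap
    # Assumption: serials start at 1 when nothing is in use yet
    start = used[-1] + 1 if used else 1
    yield from count(start)                # extrapolate past the largest used serial

print(list(islice(free_serials(["3", "1", "7"]), 6)))  # [2, 4, 5, 6, 8, 9]
```

Only the used serials are held in memory; the free ones are produced one at a time as the caller asks for them.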

There's also an answer to a similar question which suggests using a bitarray for efficiently handling a large sequence of integers.

Some versions of my code used functions from itertools, but I ended up abandoning that approach.
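The linked answer is not reproduced here, but the underlying idea can be sketched with the standard library: mark each value as seen in a flat array of flags, then scan for unset positions. This uses a `bytearray` (one byte per candidate) as a stand-in for a true bit array, and `missing_with_bitmap` is a hypothetical name:

```python
def missing_with_bitmap(values, lo, hi):
    """Return the integers in [lo, hi] absent from values, using one flag byte per candidate."""
    seen = bytearray(hi - lo + 1)   # all flags start at zero
    for v in values:
        if lo <= v <= hi:
            seen[v - lo] = 1
    return [lo + i for i, flag in enumerate(seen) if not flag]

print(missing_with_bitmap([10, 11, 13, 14, 15, 16, 17, 18, 20], 10, 20))  # [12, 19]
```

The input does not need to be sorted, and memory use is proportional to the size of the range rather than to the size of a Python set.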

A bit of mathematics gives a simple solution. The solution below works for integers from m to n with a single missing value, and handles both sorted and unsorted input with positive and negative numbers.

#numbers = [-1,-2,0,1,2,3,5]
numbers = [-2,0,1,2,5,-1,3]

sum_of_nums =  0
max = numbers[0]
min = numbers[0]
for i in numbers:
    if i > max:
        max = i
    if i < min:
        min = i
    sum_of_nums += i

# Total: sum of the numbers from m to n
total = ((max - min + 1) * (max + min)) // 2

# Subtracting the sum of the given numbers from the total gives the missing value
print(total - sum_of_nums)
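The arithmetic behind this can be written out as follows, with m and n the smallest and largest values, a_i the given numbers and x the single missing value (symbols chosen here for illustration):

```latex
\[
\sum_{k=m}^{n} k = \frac{(n - m + 1)(m + n)}{2},
\qquad
x = \frac{(n - m + 1)(m + n)}{2} - \sum_i a_i .
\]
% For the sample list: m = -2, n = 5, \sum_i a_i = 8
\[
x = \frac{(5 - (-2) + 1)(-2 + 5)}{2} - 8 = \frac{8 \cdot 3}{2} - 8 = 4 .
\]
```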

With this code you can find any missing values in a sequence, except the last number. You only need to put your data into an Excel file with a column named "numbers".

import pandas as pd
import numpy as np

data = pd.read_excel("numbers.xlsx")

data_sort=data.sort_values('numbers',ascending=True)
index=list(range(len(data_sort)))
data_sort['index']=index
data_sort['index']=data_sort['index']+1
missing=[]

for i in range (len(data_sort)-1):
    if data_sort['numbers'].iloc[i+1]-data_sort['numbers'].iloc[i]>1:
        gap=data_sort['numbers'].iloc[i+1]-data_sort['numbers'].iloc[i]
        numerator=1
        for j in range (1,gap):          
            mis_value=data_sort['numbers'].iloc[i+1]-numerator
            missing.append(mis_value)
            numerator=numerator+1
print(np.sort(missing))
