Optimisation for large dataset

I posted some code for review here. So far it has not received a useful response, which I assume is due to the length of the code, so I will cut to the chase. Suppose we have the following lists:

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

The aim is, for each pair in t1, to go through t0 and, wherever a tuple contains both elements of the pair, replace them with the corresponding t2 element. We can do this using the following:

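from pprint import pprint

result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        # elements of the pair v1 that also occur in i
        common = set(v1).intersection(i)
        if set(v1) == common:
            # the whole pair is present: drop it and append the merged label
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)

pprint(result, width=100)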

which gives:

[[('Albania', 'Germany', 'Angola:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany:UK',),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('Italy', 'UK:France'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('France', 'UK:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France', 'Italy'),
  ('Brazil', 'Chile', 'Austria:Bahamas'),
  ('Germany', 'UK'),
  ('U', 'S')]]

This list has a length of 6, matching the 6 elements in t1 and t2, and each sublist has 5 elements, matching the number of elements in t0. As it stands the code is fast, but in my case t0 has a length of ~48000 and t1 a length of ~30000, and the running time is almost forever. How can I perform the same operations faster?
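A side note on the `('U', 'S')` entries in the output above: in Python, parentheses alone do not create a tuple, so `('US')` is just the string `'US'`, and `tuple(i)` then splits it into characters. A minimal illustration (the variable names here are only for demonstration):

```python
single = ('US')       # parentheses only group; this is the str 'US'
one_tuple = ('US',)   # the trailing comma makes a one-element tuple

print(type(single).__name__, tuple(single))        # str ('U', 'S')
print(type(one_tuple).__name__, tuple(one_tuple))  # tuple ('US',)
```

If the last element of t0 is meant to be a single country, writing it as `('US',)` would keep it intact.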

You could use a double list comprehension. The code runs approximately 3.47 times faster (13.3 µs vs 46.2 µs).

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

# We transform the lists of tuple to lists of sets for easier and faster computations
t0 = [set(x) for x in t0]
t1 = [set(x) for x in t1]

# We define a function that removes a collection of elements from a copy
# of a set and adds one element to it
def add_remove(set_, to_remove, to_add):
    result_temp = set_.copy()
    for element in to_remove:
        result_temp.remove(element)
    result_temp.add(to_add)
    return result_temp

# We do the computation using a double list comprehension
result = [[add_remove(y, x, z) if x.issubset(y) else y for y in t0] 
          for x, z in zip(t1, t2)]
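At the sizes mentioned in the question, the full result would hold roughly 30000 × 48000 ≈ 1.44 billion rows, so any approach that materialises every row is bounded by that. If it is acceptable to store only the rows that actually change (an assumption, and a different output format from the code above), an inverted index from element to row numbers avoids scanning all of t0 for every pair. The following is a rough sketch under that assumption. The names `index`, `changed_rows` and `sparse_result` are made up for illustration, and `('US',)` is written with a trailing comma so that it is a real one-element tuple.

```python
from collections import defaultdict

t0 = [('Albania', 'Angola', 'Germany', 'UK'), ('UK', 'France', 'Italy'),
      ('Austria', 'Bahamas', 'Brazil', 'Chile'), ('Germany', 'UK'), ('US',)]
t1 = [('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'),
      ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2 = ['Angola:UK', 'Germany:UK', 'UK:France', 'UK:Italy',
      'France:Italy', 'Austria:Bahamas']

t0_sets = [set(row) for row in t0]

# Inverted index: element -> indices of the t0 rows that contain it
index = defaultdict(set)
for i, row in enumerate(t0_sets):
    for element in row:
        index[element].add(i)

def changed_rows(pair, label):
    """Map row index -> rewritten row, for the rows containing every element of pair."""
    # Rows containing the whole pair = intersection of the per-element row sets
    hits = set.intersection(*(index.get(e, set()) for e in pair))
    return {i: (t0_sets[i] - set(pair)) | {label} for i in hits}

# One dict per pair, holding only the rows that differ from t0
sparse_result = [changed_rows(pair, label) for pair, label in zip(t1, t2)]

# A full row list for one pair can still be rebuilt on demand
first_full = [sparse_result[0].get(i, row) for i, row in enumerate(t0_sets)]
```

Each pair then costs a set intersection over the rows that contain its elements rather than a pass over all of t0. Whether this helps in practice depends on how many rows each element appears in, so it is worth timing on the real data.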
