Optimisation for large dataset

I have posted code for review here. So far it has not received a useful response, which I assume is because the code is so long. Here I shall cut to the chase. Suppose we have the following lists:

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

The aim is: for each pair in t1, go through t0 and, wherever the pair is found, replace it with the corresponding t2 element. We can do this using the following:

from pprint import pprint

result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)

pprint(result, width=100)  

which gives:

[[('Albania', 'Germany', 'Angola:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany:UK'),
  ('UK', 'France', 'Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany:UK',),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('Italy', 'UK:France'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('France', 'UK:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France:Italy'),
  ('Austria', 'Bahamas', 'Brazil', 'Chile'),
  ('Germany', 'UK'),
  ('U', 'S')],
 [('Albania', 'Angola', 'Germany', 'UK'),
  ('UK', 'France', 'Italy'),
  ('Brazil', 'Chile', 'Austria:Bahamas'),
  ('Germany', 'UK'),
  ('U', 'S')]]

The result has length 6, one entry per pair in t1 (and t2), and each sublist has 5 elements, one per element of t0. The code is fast on this small example, but in my case t0 has ~48,000 elements and t1 has ~30,000, and the run takes almost forever. How can the same operation be performed faster?

You could use a double list comprehension. The code runs approximately 3.47 times faster (13.3 µs vs 46.2 µs).

t0=[('Albania','Angola','Germany','UK'),('UK','France','Italy'),('Austria','Bahamas','Brazil','Chile'),('Germany','UK'),('US')]
t1=[('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'), ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
t2=[('Angola:UK'), ('Germany:UK'), ('UK:France'), ('UK:Italy'), ('France:Italy'), ('Austria:Bahamas')]

# We transform the lists of tuple to lists of sets for easier and faster computations
t0 = [set(x) for x in t0]
t1 = [set(x) for x in t1]

# We define a function that removes a collection of elements from a set
# and adds one element to it, returning a new set
def add_remove(set_, to_remove, to_add):
    result_temp = set_.copy()
    for element in to_remove:
        result_temp.remove(element)
    result_temp.add(to_add)
    return result_temp

# We do the computation using a double list comprehension
result = [[add_remove(y, x, z) if x.issubset(y) else y for y in t0] 
          for x, z in zip(t1, t2)]
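
For the sizes mentioned in the question (~48,000 rows and ~30,000 pairs), any approach that tests every pair against every row performs about 1.44 billion subset checks, and the full result would hold as many rows. One possible alternative, sketched below and not taken from the original answer, is to build an inverted index from each element to the rows containing it, and then report only the rows that actually change for each pair. The names `build_index`, `changed_rows`, `rows`, `pairs` and `labels` are hypothetical, and the sketch assumes every pair is non-empty.

```python
from collections import defaultdict

# Same sample data as above. Note the trailing comma in ('US',):
# ('US') on its own is just the string 'US', not a one-element tuple.
rows = [set(x) for x in [('Albania', 'Angola', 'Germany', 'UK'),
                         ('UK', 'France', 'Italy'),
                         ('Austria', 'Bahamas', 'Brazil', 'Chile'),
                         ('Germany', 'UK'),
                         ('US',)]]
pairs = [('Angola', 'UK'), ('Germany', 'UK'), ('UK', 'France'),
         ('UK', 'Italy'), ('France', 'Italy'), ('Austria', 'Bahamas')]
labels = ['Angola:UK', 'Germany:UK', 'UK:France',
          'UK:Italy', 'France:Italy', 'Austria:Bahamas']


def build_index(rows):
    """Map each element to the set of row positions in which it occurs."""
    index = defaultdict(set)
    for pos, row in enumerate(rows):
        for element in row:
            index[element].add(pos)
    return index


def changed_rows(rows, pairs, labels):
    """For each (pair, label), yield a dict {row position: rewritten row}.

    Rows that do not contain the whole pair are left out, since they
    would be identical to the corresponding row of the input.
    """
    index = build_index(rows)
    for pair, label in zip(pairs, labels):
        # Positions of the rows that contain every element of the pair
        hits = set.intersection(*(index.get(e, set()) for e in pair))
        yield {pos: (rows[pos] - set(pair)) | {label} for pos in hits}


results = list(changed_rows(rows, pairs, labels))
```

Each entry of `results` lists only the modified rows for one pair; an unchanged row can be read directly from `rows`. Whether this helps in practice depends on how many rows each pair matches, so it is worth timing on the real data.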
