I have to merge two text files into one and create a new list from that. The first one contains URLs and the other one contains URL paths/folders, which have to be applied to EVERY URL. I'm working with lists, and it's really slow, because it's roughly about 200,000 items.
Sample:
urls.txt:
http://wwww.google.com
....
paths.txt:
/abc
/bce
....
Later, after the loop is finished, there should be a new list with
http://wwww.google.com/abc
http://wwww.google.com/bce
Python Code:
import re

URLS_TO_CHECK = []  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = open("done.txt", "r").read().splitlines()  # old done urls
    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK.append(url)  ## <<< slow!
I've already read some other threads about the map function and about disabling gc, but I can't use the map function with my program, and disabling gc didn't really help.
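For scale: the slow part is the "url not in URLS_TO_CHECK" test on a growing list, which is a linear scan every time. A rough, hypothetical timing sketch (not from the original post) shows why a set or dict is the usual fix:

import timeit

items = [str(i) for i in range(200000)]
as_list = list(items)
as_set = set(items)

# membership test for an item near the end, the worst case for the list
print(timeit.timeit(lambda: "199999" in as_list, number=100))  # O(n) scan per lookup
print(timeit.timeit(lambda: "199999" in as_set, number=100))   # O(1) hash lookup on average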
The following approach takes advantage of a few things, explained in the comments:
def yield_urls():
    with open("paths.txt") as f:
        paths = [line.rstrip("\n") for line in f]  # needed in every iteration, so keep it as a list
    with open("done.txt") as f:
        done_urls = set(line.rstrip("\n") for line in f)  # looked up in every iteration; set lookup is O(1) vs O(n) for a list
    # resources are cleaned up after each with block
    with open("urls.txt") as f:
        for url in f:  # iterate over the file line by line, no big list of ints generated before iteration, much quicker
            url = url.rstrip("\n")  # take care of trailing newlines in strings read from the file
            for subpath in paths:
                full_url = "".join((url, subpath))  # no regex means faster; string formatting might be quicker than join, you need to check
                if full_url not in done_urls:  # fast lookup in the set
                    yield full_url  # yield instead of appending to a list

# usage
for url in yield_urls():
    pass  # do something with url
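If the nested loop feels noisy, the same streaming idea can be written with itertools.product. This is only a sketch under the same assumptions (plain text files, one entry per line), not the answer's exact code; note that product reads its inputs fully before combining them, so only the combined URLs are produced lazily:

import itertools

def yield_urls_product():
    with open("urls.txt") as f:
        urls = [line.rstrip("\n") for line in f]
    with open("paths.txt") as f:
        paths = [line.rstrip("\n") for line in f]
    with open("done.txt") as f:
        done = set(line.rstrip("\n") for line in f)
    for url, path in itertools.product(urls, paths):  # every (url, path) pair
        full_url = url + path
        if full_url not in done:
            yield full_url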
import re

paths = open("paths.txt", "r").read().splitlines()
URLS_TO_CHECK = set(re.findall("http://.+?....", open("urls.txt", "r").read()))

for url in URLS_TO_CHECK:
    for path in paths:
        check_url(url + path)

will probably be much faster ... and I think it's essentially the same ...
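If the already-done URLs from done.txt still need to be skipped, that check stays cheap with a set; check_url is the same placeholder as above:

done = set(open("done.txt", "r").read().splitlines())
for url in URLS_TO_CHECK:
    for path in paths:
        full_url = url + path
        if full_url not in done:  # O(1) membership test on a set
            check_url(full_url)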
Lookups in dictionaries are faster than in lists, see Python: List vs Dict for look up table:
import re

URLS_TO_CHECK = {}  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = dict((l, True) for l in open("done.txt", "r").read().splitlines())  # old done urls
    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK[url] = True  # result is in URLS_TO_CHECK.keys()
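Since the True values are never used, a plain set gives the same O(1) lookups with less noise; here is a minimal sketch of that variant (the regex is also moved out of the inner loop, because it only depends on the URL line):

import re

URLS_TO_CHECK = set()

def generate_list():
    paths = open("paths.txt", "r").read().splitlines()
    done = set(open("done.txt", "r").read().splitlines())
    for line in open("urls.txt", "r").read().splitlines():
        base = re.search('(http://(.+?)....)', line).group(1)  # extract the base URL once per line
        for path in paths:
            url = base + path
            if url not in done:
                URLS_TO_CHECK.add(url)  # a set ignores duplicates automatically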