
Python List Append Slow?

I have to merge two text files into one and build a new list from them. The first file contains URLs and the other contains URL paths/folders, which have to be applied to EVERY URL. I'm working with lists, and it's really slow because there are roughly 200,000 items.






Later, after the loop is finished, there should be a new list with


Python Code:

import re

URLS_TO_CHECK = [] # defined as global, needed later

def generate_list():
  urls = open("urls.txt", "r").read().splitlines()
  paths = open("paths.txt", "r").read().splitlines()
  done = open("done.txt", "r").read().splitlines() #old done urls

  for i in range(len(urls)):
    for x in range(len(paths)):
        url = re.search('(http://(.+?)....)', urls[i]) #needed
        url = "%s%s" %(url.group(1), paths[x])
        if url not in URLS_TO_CHECK:
            if url not in done:
                URLS_TO_CHECK.append(url) ##<<< slow!

I have already read some other threads about the map function and about disabling the gc, but I can't use map in my program, and disabling the gc didn't really help.

This approach takes advantage of things such as:

  • quick look-up in a set - O(1) instead of O(n)
  • generating values on demand instead of building the whole list at once
  • reading the file line by line instead of loading all of the data at once
  • avoiding unnecessary regular expressions

def yield_urls():
    with open("paths.txt") as f:
        # iterated over for every url, so a list is fine; splitlines() drops trailing newlines
        paths = f.read().splitlines()

    with open("done.txt") as f:
        # looked up for every candidate; a set lookup is O(1) vs O(n) in a list
        done_urls = set(f.read().splitlines())

    # files are closed automatically when each with block ends

    with open("urls.txt", "r") as f:
        for line in f:  # iterate over the file directly, not over a list of indexes built before iteration
            url = line.rstrip("\n")  # strip the trailing newline read from the file
            for subpath in paths:
                # no regex means faster; url[7:] drops the first 7 characters (the "http://" prefix)
                # maybe string formatting is quicker than join, you need to check
                full_url = ''.join((url[7:], subpath))
                if full_url not in done_urls:  # fast lookup in set
                    yield full_url  # yield instead of appending

# usage
for url in yield_urls():
    pass  # do something with url
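
The set-versus-list point from the list above can be checked in isolation. The following is only a rough timing sketch: the URLs are made up, the sizes are arbitrary, and the absolute numbers will differ from machine to machine.

```python
import timeit

# Hypothetical data: 20,000 made-up URLs standing in for the real files.
items = ["http://example.com/page%d" % i for i in range(20000)]
as_list = list(items)
as_set = set(items)

# Worst case for the list: the value is absent, so every element is compared.
probe = "http://example.com/missing"

list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

In the question's code, `url not in URLS_TO_CHECK` runs such a list scan for every candidate URL, so that check, rather than `append` itself, is the likely bottleneck.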
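
# a pattern without capture groups makes findall return the whole match
URLS_TO_CHECK = set(re.findall(r"http://.+?....", open("urls.txt", "r").read()))
for url in URLS_TO_CHECK:
    for path in paths:  # paths loaded as in the question
        pass  # combine url and path here

will probably be much faster, and I think it's essentially the same.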
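
A sketch of the idea in this answer, under stated assumptions: the contents of urls.txt are replaced by a made-up string, and the pattern `http://\S+` is a stand-in chosen for illustration, not the question's regex. It deduplicates the base URLs with a set first and only then combines them with the paths.

```python
import re
from itertools import product

# Hypothetical file contents; the real urls.txt is not shown in the question.
urls_text = "http://a.example/x\nhttp://b.example/y\nhttp://a.example/x\n"
paths = ["/admin", "/login"]

# findall with no capture group returns each whole match;
# wrapping the result in set() removes the duplicated base URL.
bases = set(re.findall(r"http://\S+", urls_text))

# Every base is combined with every path exactly once.
combined = {base + path for base, path in product(bases, paths)}

print(sorted(combined))
```

Because the duplicates are removed before the nested loop, no membership check against the growing result is needed at all.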

Lookups in dictionaries are faster than in lists; see "Python: List vs Dict for look up table".

import re

URLS_TO_CHECK = {} # defined as global, needed later

def generate_list():
  urls = open("urls.txt", "r").read().splitlines()
  paths = open("paths.txt", "r").read().splitlines()
  done = dict.fromkeys(open("done.txt", "r").read().splitlines(), True) # old done urls

  for line in urls:
    # needed; the result is the same for every path, so search once per url
    base = re.search('(http://(.+?)....)', line).group(1)
    for path in paths:
      url = "%s%s" % (base, path)
      if url not in URLS_TO_CHECK and url not in done:
        URLS_TO_CHECK[url] = True # result is in URLS_TO_CHECK.keys()
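
For experimenting without the three text files, the same dictionary idea can be sketched on in-memory data. The function name `build_urls` and the sample URLs are hypothetical, and the sketch assumes Python 3.7+, where dict keys keep insertion order; it also skips the question's regex and uses each URL as given.

```python
def build_urls(urls, paths, done):
    """Return url+path combinations that are not in done, without duplicates."""
    done_set = set(done)  # O(1) membership checks
    seen = {}             # dict keys keep first-seen order (Python 3.7+)
    for url in urls:
        for path in paths:
            full = url + path
            if full not in done_set and full not in seen:
                seen[full] = True
    return list(seen)


# Made-up sample data; note the repeated first URL and the one "done" entry.
result = build_urls(
    ["http://a.example", "http://b.example", "http://a.example"],
    ["/admin", "/login"],
    ["http://b.example/login"],
)
print(result)
```

The repeated URL contributes nothing new and the entry listed in `done` is skipped, so three combinations remain.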

