
For-Loop: How to ignore previously added values in a list? (Python)

I have a loop that constantly adds a variable with an unknown value to a list, and then prints the list. However, I can't find a way to ignore the values that were already found and added to the list the next time I print it.

I'm scraping a constantly updating website for keyword-matching links using requests and bs4 inside a loop. Once the website adds the links I'm looking for, my code adds them to a list and prints the list. When the website adds the next wave of matching links, these are also added to my list; however, my code adds the old links found before to the list again as well, since they still match my keyword. Is it possible to ignore these old links?

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = []                                   # list which saves the links

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]             # the links I want
            results.append(link)               # adds links to list

    print(results)
    time.sleep(5)                              # wait until next scrape


So with every loop the value of 'link' changes, which makes it hard for me to find a way to ignore previously found links.

To make it easier to understand, you could think of a loop that adds an unknown number to a list on every iteration, but each number should only be printed the first time it appears.

Here is a proof of concept using sets, if the challenge is that you only want to keep unique links and print only the links that have not been found previously:

import random

results = set()
for k in range(15):
    new = {random.randint(1, 5)}
    print(f"First Seen: {new - results}")   # only values not seen before
    results = results.union(new)
    print(f"All: {results}")
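Applied to the scraping problem, the same set-difference idea can be wrapped in a small helper that returns only the links not seen on earlier passes (a sketch; the function name `first_seen` and the sample link values are my own, not from the original code):

```python
def first_seen(new_links, seen):
    """Return the links not seen before and record them in `seen`."""
    fresh = set(new_links) - seen   # drop anything already known
    seen |= fresh                   # remember the new ones for next time
    return fresh

seen = set()
print(first_seen(["/news/1", "/news/2"], seen))  # both links are new
print(first_seen(["/news/2", "/news/3"], seen))  # {'/news/3'}
```

Inside the `while True` loop you would collect each pass's matches into a list and print only `first_seen(matches, seen)`.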

If it is more of a streaming issue, where you save all links to a large list but only want to print the latest ones found, you can do something like this:

import random

results = []
for k in range(5):
    n = len(results)                        # remember where this batch starts
    new = []
    for _ in range(random.randint(1, 5)):   # a random-sized batch of new values
        new.append(random.randint(1, 5))

    results.extend(new)
    print(results[n:])                      # print only the latest batch

But then again, you could also just `print(new)` in this case.

This is a good use case for the set data structure. Sets do not maintain any ordering of their items. It is a very simple change to your code above:

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = set()                                # {} would create a dict, not a set

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]             # the links I want
            results.add(link)                  # adds links to the set; duplicates are ignored

    print(results)
    time.sleep(5)                              # wait until next scrape

If you want to maintain order, you can use some variation of an ordered dictionary. Please see here: Does Python have an ordered set?
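One simple variation along those lines, assuming Python 3.7+ (where plain dicts are guaranteed to preserve insertion order): use a dict whose keys are the links and ignore the values, which behaves like an ordered set.

```python
# dict.fromkeys deduplicates while keeping first-seen order,
# so the keys act as an ordered set of links.
results = dict.fromkeys(["/b", "/a", "/b", "/c"])
print(list(results))  # ['/b', '/a', '/c']
```

New links can be added with `results[link] = None`, and membership checks (`link in results`) are just as fast as with a real set.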
