For-Loop: How to ignore previously added values in a list? (Python)

I have a loop that continually appends a variable with an unknown value to a list and then prints the list. However, I can't find a way to ignore the values that were already found and added to the list the next time I print it.

I'm scraping a constantly updating website for keyword-matching links, using requests and bs4 inside a loop. Once the website adds the links I'm looking for, my code appends them to a list and prints the list. When the website adds the next wave of matching links, those are appended as well, but my code also re-appends the old links found before, since they still match my keyword. Is it possible to ignore these old links?

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = []                                       #list which saves the links

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]                 #the links I want
            results.append(link)                   #adds links to list

    print(results)
    time.sleep(5)                                  #wait until next scrape

#so with every loop the value of 'link' is changing, which makes it hard
#for me to find a way to ignore previously found links

To make it easier to understand, you could think of a loop that adds an unknown number to a list on every iteration, but each number should only be printed the first time it appears.

Here is a proof of concept using sets, for the case where you only want to keep unique links and then print the newly found links that have not been seen previously:

import random

results = set()
for k in range(15):
    new = {random.randint(1,5)}
    print(f"First Seen: {new-results}")
    results = results.union(new)
    print(f"All: {results}")

If it is more of a streaming problem, where you save all links to one large list but only want to print the latest ones found, you can do something like this:

import random

results = []
for k in range(5):
    n = len(results)
    new = []
    for _ in range(random.randint(1, 5)):   # '_' avoids shadowing the outer loop variable
        new.append(random.randint(1, 5))

    results.extend(new)
    print(results[n:])                      # only the items added this round

But then again, in this case you could also just print new directly...
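Combining the two ideas above, a set for fast "have I seen this?" checks plus a list for keeping history in order, gives a sketch like the following. As in the snippets above, random integers stand in for scraped links; the names seen and fresh are illustrative, not from the original code:

```python
import random

seen = set()     # fast membership test for values already printed
results = []     # full history, in order of first appearance

for _ in range(15):
    batch = [random.randint(1, 5) for _ in range(3)]     # simulated scrape
    fresh = [x for x in batch if x not in seen]          # first-seen values only
    seen.update(fresh)
    results.extend(fresh)
    print(f"New this round: {fresh}")

print(f"All, in first-seen order: {results}")
```

Note that the list comprehension checks against seen from before the round, so a value repeated within one batch may appear twice in fresh; deduplicating the batch first would close that gap if it matters.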

This is a good use case for the Set data structure. Sets do not maintain any ordering of their items. Very simple change to your code above:

import time

import requests
from bs4 import BeautifulSoup

url = "www.website.com"
keyword = "news"
results = set()                                    #note: {} would create a dict, not a set

while True:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')
    options = soup.find_all("a", class_="name-link")
    for o in options:
        if keyword in o.text:
            link = o.attrs["href"]                 #the links I want
            results.add(link)                      #adds links to the set; duplicates are ignored

    print(results)
    time.sleep(5)                                  #wait until next scrape
If you want to maintain order, you can use some variation of an ordered dictionary. Please see here: Does Python have an ordered set?
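As a minimal sketch of the idea behind that link: since Python 3.7, plain dicts preserve insertion order, so a dict whose values are ignored behaves like a simple ordered set:

```python
# A dict's keys are unique and, in Python 3.7+, kept in insertion order,
# so a dict with throwaway values works as a basic ordered set.
results = dict()

for link in ["a", "b", "a", "c", "b"]:
    results[link] = None    # re-adding an existing key keeps its original position

print(list(results))        # ['a', 'b', 'c']
```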
