
Search for specific word list in a list of urls using Python

I'm trying to determine whether the pages in a list of URLs contain specific words. Below is my code:


url_list = ['website1.com', 'website2.com']

cci_words = ['Risk Management', 'Labor', 'Migrant Workers']

total_words = []
for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and cci_words.lower() in text)
    count = len(words)
    cci_words = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())

    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, cci_words))
    print(cci_words)

#print(total_words)
total_count = len(total_words)

But I keep getting this error: AttributeError: 'list' object has no attribute 'lower'

Any ideas what I should do?

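cci_words is a list, and lists have no lower() method, so the call cci_words.lower() inside the lambda raises the AttributeError. You need to lower-case each word individually instead. In addition, the line below reassigns cci_words inside your for loop, which overwrites your search terms with the matched strings, so any later iteration would no longer be searching for the original words.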

cci_words = [ ele.strip() for ele in words ]
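
As a rough sketch of how the original loop could be restructured, the version below lower-cases each phrase on its own and stores results under a separate name, so the search terms are never overwritten. The function names count_phrases and scan_urls, the 10-second timeout, and the sample sentence are assumptions made for illustration, not part of the original code. It also assumes every URL includes a scheme such as https://, since requests rejects bare host names like 'website1.com'.

```python
from typing import Dict, List


def count_phrases(page_text: str, phrases: List[str]) -> Dict[str, int]:
    """Count case-insensitive occurrences of each phrase in page_text."""
    lowered = page_text.lower()
    # Lower-case each phrase individually; the list itself has no .lower().
    return {phrase: lowered.count(phrase.lower()) for phrase in phrases}


def scan_urls(url_list: List[str], phrases: List[str]) -> Dict[str, Dict[str, int]]:
    """Fetch each URL and count the phrases in its extracted text."""
    # Imported here so count_phrases works without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    results = {}
    for url in url_list:
        r = requests.get(url, timeout=10)  # timeout value is an assumption
        text = BeautifulSoup(r.content, 'lxml').get_text(' ')
        # Keep the counts under a new name so `phrases` is never overwritten.
        results[url] = count_phrases(text, phrases)
    return results


# Offline check on a made-up sentence (no network access needed).
sample = 'Our Risk Management policy covers labor standards and labor rights.'
print(count_phrases(sample, ['Risk Management', 'Labor', 'Migrant Workers']))
# → {'Risk Management': 1, 'Labor': 2, 'Migrant Workers': 0}
```

Note that this counts substring matches, so 'Labor' would also match inside a longer word such as 'laboratory'. If whole-word matching matters, a regular expression with word boundaries may be a better fit.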

You seem to be making this problem more complex than it needs to be. If you want to return True when any word in the list appears in any of the URL strings, you could do something like this:

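from typing import List

def words_in_url_list(words: List[str], url_list: List[str]) -> bool:
    # True if at least one word occurs (case-insensitively) in at least one URL string.
    for word in words:
        word = word.lower()
        if any(word in url.lower() for url in url_list):
            return True
    return False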

If you want to check that every word in the list appears in at least one URL string, you could try this approach:

from typing import List

def all_words_in_url_list(words: List[str], url_list: List[str]) -> bool:
    # True only if every word occurs (case-insensitively) in at least one URL string.
    lowered_urls = [url.lower() for url in url_list]
    return all(
        any(word.lower() in url for url in lowered_urls)
        for word in words
    )
