
How can I filter for a keyword in a list of strings?

I have a list of strings, which are links that I scraped using BeautifulSoup. I cannot figure out how to return only the strings that contain the word 'The'. The solution might use a regex, but my attempt has not worked.

I tried:

for i in links_list:
    if re.match('^The', i) is not None:
        eps_only.append(i)

But I get errors like:

File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object

The list looks like this:

['index.html', 'seinfeld-scripts.html', 'episodes_oveview.html', 'seinfeld-characters.html', 'buy-seinfeld.html', 'http://addthis.com/bookmark.php?v=250&username=doctoroids', None, None, None, None, 'http://community.seinfeldscripts.com', 'buy-seinfeld.html', 'seinfeld-t-shirt.html', 'seinfeld-dvd.html', 'episodes_oveview.html', 'alpha.html', '    http://www.shareasale.com/r.cfm?u=439896&b=119192&m=16934&afftrack=seinfeldScriptsTop&urllink=search%2E80stees%2Ecom%2F%3Fcategory%3D80s%2BTV%26i%3D1%26theme%3DSeinfeld%26u1%3Dcategory%26u2%3Dtheme', ' TheSeinfeldChronicles.htm', ' TheStakeout.htm', ' TheRobbery.htm', ' MaleUnbonding.htm', ' TheStockTip.htm', ' TheExGirlfriend.htm', ' ThePonyRemark.htm', ' TheJacket.htm', ' ThePhoneMessage.htm', ' TheApartment.htm', ' TheStatue.htm', ' TheRevenge.htm', ' TheHeartAttack.htm', ' TheDeal.htm', ' TheBabyShower.htm', ' TheChineseRestaurant.htm', ' TheBusboy.htm', 'TheNote.html', ' TheTruth.htm', 'ThePen.html', ' TheDog.htm', ' TheLibrary.htm', ' TheParkingGarage.htm', 'TheCafe.html', ' TheTape.htm', 'TheNoseJob.html', 'TheStranded.html', ...]

Update: Full Code

import requests
import re
from bs4 import BeautifulSoup

##################
##--user agent--##
##################

user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
    'Safari/537.36'

headers = {'User-Agent': user_agent_desktop}

#########################
##--fetching the page--##
#########################

URL = 'https://www.seinfeldscripts.com/seinfeld-scripts.html'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')


############################################################
##--scraping the links to the scripts from the main page--##
############################################################

links_list = []
eps_only = []

for link in soup.find_all('a'):
    links_list.append(link.get('href'))

### filtering for links that start with 'The' ###

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)

print(eps_only)

Python's re.match raises a TypeError if it is passed None as the string argument, hence the error you're getting.

Some of your list elements are None.

You will have to check for such elements before passing them to re.match.

For example:

for i in links_list:
    if i is not None and re.match('^The', i) is not None:
        eps_only.append(i)

Or you could filter them out beforehand, like this:

links_list = [l for l in links_list if l is not None]
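
The None guard can be checked against a small sample of the list from the question. The sketch below is self-contained; `sample_links` is a hypothetical excerpt, not the full scraped list. Note that several entries in the question begin with a space (for example ' TheStakeout.htm'), and re.match only matches at the very start of the string, so those entries are skipped unless the string is stripped first. The strip() call is an addition that the code above does not have.

```python
import re

# Hypothetical excerpt of the scraped list from the question.
sample_links = ['index.html', None, ' TheStakeout.htm', 'TheNote.html', 'alpha.html']

# Passing None straight to re.match raises TypeError.
try:
    re.match('^The', None)
except TypeError as exc:
    print('TypeError:', exc)

# Guard against None before matching.
eps_only = [s for s in sample_links if s is not None and re.match('^The', s)]
print(eps_only)  # ['TheNote.html']

# Stripping surrounding whitespace first also catches ' TheStakeout.htm'.
eps_stripped = [s.strip() for s in sample_links
                if s is not None and re.match('^The', s.strip())]
print(eps_stripped)  # ['TheStakeout.htm', 'TheNote.html']
```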

You should filter the None elements out of the list returned from BeautifulSoup before matching:

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)
print(eps_only)
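
Note that filter(None, links_list) drops every falsy element, so it removes empty strings as well as None; for this task that is harmless. Because the pattern is a fixed prefix, the regex is also optional. Here is a minimal regex-free sketch, assuming the goal is to collect links whose name begins with 'The' once leading whitespace is ignored. The helper name `episode_links` and the sample list are hypothetical.

```python
def episode_links(links, prefix='The'):
    """Return stripped links that start with `prefix`, skipping None and empty strings."""
    result = []
    for link in filter(None, links):  # drops None and '' (all falsy values)
        cleaned = link.strip()
        if cleaned.startswith(prefix):
            result.append(cleaned)
    return result


links = ['index.html', None, '', ' TheRobbery.htm', 'ThePen.html', ' MaleUnbonding.htm']
print(episode_links(links))  # ['TheRobbery.htm', 'ThePen.html']
```

If the word may appear anywhere in the string rather than only at the start, a check such as `prefix in cleaned` or re.search would be the one to use instead.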
