How do I filter a list of strings for a keyword?

Question

I have a list of strings, which are links that I scraped using BeautifulSoup. I cannot figure out how to return only the strings that contain the word 'The'. The solution might use a regex, but that has not worked for me.

I tried:

for i in links_list:
    if re.match('^The', i) is not None:
        eps_only.append(i)

But I get errors like:

File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object

The list looks like this:

['index.html', 'seinfeld-scripts.html', 'episodes_oveview.html', 'seinfeld-characters.html', 'buy-seinfeld.html', 'http://addthis.com/bookmark.php?v=250&username=doctoroids', None, None, None, None, 'http://community.seinfeldscripts.com', 'buy-seinfeld.html', 'seinfeld-t-shirt.html', 'seinfeld-dvd.html', 'episodes_oveview.html', 'alpha.html', '    http://www.shareasale.com/r.cfm?u=439896&b=119192&m=16934&afftrack=seinfeldScriptsTop&urllink=search%2E80stees%2Ecom%2F%3Fcategory%3D80s%2BTV%26i%3D1%26theme%3DSeinfeld%26u1%3Dcategory%26u2%3Dtheme', ' TheSeinfeldChronicles.htm', ' TheStakeout.htm', ' TheRobbery.htm', ' MaleUnbonding.htm', ' TheStockTip.htm', ' TheExGirlfriend.htm', ' ThePonyRemark.htm', ' TheJacket.htm', ' ThePhoneMessage.htm', ' TheApartment.htm', ' TheStatue.htm', ' TheRevenge.htm', ' TheHeartAttack.htm', ' TheDeal.htm', ' TheBabyShower.htm', ' TheChineseRestaurant.htm', ' TheBusboy.htm', 'TheNote.html', ' TheTruth.htm', 'ThePen.html', ' TheDog.htm', ' TheLibrary.htm', ' TheParkingGarage.htm', 'TheCafe.html', ' TheTape.htm', 'TheNoseJob.html', 'TheStranded.html', ...]

Update: Full Code

import requests
import re
from bs4 import BeautifulSoup

##################
##--user agent--##
##################

user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
    'Safari/537.36'

headers = {'User-Agent': user_agent_desktop}

#########################
##--fetching the page--##
#########################

URL = 'https://www.seinfeldscripts.com/seinfeld-scripts.html'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')


############################################################
##--scraping the links to the scripts from the main page--##
############################################################

links_list = []
eps_only = []

for link in soup.find_all('a'):
    links_list.append(link.get('href'))

### filtering for links that start with 'The' ###

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)

print(eps_only)

Answer 1

Python's re.match raises a TypeError if it is passed None as the string argument, which is the error you're getting.

Some of your list elements are None.

You will have to check for such elements before passing them to re.match.

For example:

for i in links_list:
    if i is not None and re.match('^The', i) is not None:
        eps_only.append(i)

Or you could filter them out beforehand, like this:

links_list = [l for l in links_list if l is not None]
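
Note that many entries in the list shown in the question begin with a space (for example ' TheStakeout.htm'), so a pattern anchored with '^The' skips them even once the None values are gone. A minimal sketch, assuming leading and trailing whitespace should be ignored; the sample list is a shortened copy of the one in the question, and the helper name episode_links is made up for illustration:

```python
def episode_links(links, prefix="The"):
    """Return the non-None links that start with `prefix`, ignoring surrounding whitespace."""
    result = []
    for link in links:
        if link is None:
            # link.get('href') yields None for <a> tags without an href
            continue
        cleaned = link.strip()
        if cleaned.startswith(prefix):
            result.append(cleaned)
    return result


# Shortened sample based on the list in the question
sample = ['index.html', None, ' TheStakeout.htm', 'TheNote.html', ' MaleUnbonding.htm']
eps_only = episode_links(sample)
print(eps_only)  # ['TheStakeout.htm', 'TheNote.html']
```

For a fixed prefix like this, str.startswith is enough and no regex is needed.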

Answer 2

You should filter the None elements out of the list of links returned from BeautifulSoup before matching:

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)
print(eps_only)
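
Two details of this approach are worth knowing: filter(None, ...) drops every falsy element, so empty strings are removed along with None, and re.match only matches at the start of the string, whereas re.search finds the pattern anywhere in it. A small sketch with a made-up sample list, assuming the goal may be "contains 'The'" rather than "starts with 'The'":

```python
import re

# Made-up sample in the spirit of the list from the question
sample = ['index.html', None, '', ' TheStakeout.htm', 'TheNote.html', 'buy-seinfeld.html']

# filter(None, ...) keeps only truthy items: None and '' are both dropped.
non_empty = list(filter(None, sample))

# re.match anchors at position 0, so the entry with a leading space is missed.
starts_with_the = [s for s in non_empty if re.match('The', s)]

# re.search scans the whole string, so it matches 'The' anywhere.
contains_the = [s for s in non_empty if re.search('The', s)]

print(starts_with_the)  # ['TheNote.html']
print(contains_the)     # [' TheStakeout.htm', 'TheNote.html']
```

Both checks are case-sensitive, so a lowercase 'the' inside a URL does not count as a match.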
