How can I manipulate the string items in a list?
I am trying to get familiar with web scraping in Python and cannot figure out how to manipulate strings that are items of a list.
Below is the code I am working on to extract the movies showing at a local theater; I can get most of the names from the HTML. What I want to do is iterate through the list and remove the first two characters from each string, because my regex attempts cannot extract just the name.
The code throws an error because it treats what I am trying to manipulate as a list object, but it is the strings inside the list that I want to change.
from urllib.request import urlopen
import re
url = "http://woodburytheatre.com/showtimes"
page = urlopen(url)
page
html_bytes = page.read()
html = html_bytes.decode("utf-8")
#print(html)
span_index = html.find("<div id=\"showtimes_wrapper\">")
start_index = span_index + len("<div id=\"showtimes_wrapper\">")
end_index = html.find("<div id=\"t_comingsoon\">")
#print(span_index)
movie_info = html[start_index:end_index]
movie_list = list()
movie_list2 = list()
movie_list3 = list()
#print(movie_info)
for item in movie_info.split("\n"):
    if "showtimes_movie" in item:
        movie_list.append(item.strip())
#print(movie_list)
for item in movie_list:
    movie_list2.append(re.findall("[0-9]\/[A-z0-9\-]+",str(movie_list)))
#print(movie_list2)
while movie_list2:
    temp = movie_list2.pop()
    print(type(temp))
    print("temp" + str(temp))
    temp2 = temp.lstrip("/")
    print(temp2)
    movie_list3.append(temp2)
    print(movie_list3)
    print(len(movie_list2))
print(movie_list3)
I know it is very messy and could be much more efficient, but I just want to be able to alter the strings in the list so I can get rid of the number and "/" right before each name.
Thanks in advance!
In the second for loop:
for item in movie_list:
    movie_list2.append(re.findall("[0-9]\/[A-z0-9\-]+",str(movie_list)))
From the Python documentation for re.findall:
def findall(pattern: Pattern[AnyStr],
string: AnyStr,
flags: Union[int, RegexFlag] = ...) -> list
re.findall returns a list of all matching strings. Therefore, you're appending a list to a list, and a list has no lstrip method, which causes the error in the while loop.
while movie_list2:
    temp = movie_list2.pop()
    print(type(temp))
    print("temp" + str(temp))
    temp2 = temp.lstrip("/") # << here
    print(temp2)
    movie_list3.append(temp2)
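The problem and one possible fix can be sketched as follows. This is only an illustration: the two sample lines are made up (the real page markup may differ), and the character class is written as A-Za-z instead of the original A-z. It also passes each item to re.findall rather than str(movie_list), and uses extend so the matched strings land directly in the list.

```python
import re

# Hypothetical sample lines standing in for the scraped HTML.
movie_list = [
    "<a class='showtimes_movie' href='/movie/1/Unhinged'>",
    "<a class='showtimes_movie' href='/movie/2/The-Rental'>",
]

pattern = r"[0-9]/[A-Za-z0-9\-]+"

# re.findall always returns a list, even for a single match.
matches = re.findall(pattern, movie_list[0])
print(type(matches))

# extend() adds the matched strings themselves instead of a nested list.
movie_list2 = []
for item in movie_list:
    movie_list2.extend(re.findall(pattern, item))

# Slice off the leading digit and "/" from each string.
movie_list3 = [name[2:] for name in movie_list2]
print(movie_list3)
```

Slicing with [2:] assumes every match starts with exactly one digit followed by "/", which is what this pattern produces.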
Ultimately, what you're trying to achieve can be done in one line:
from urllib.request import urlopen
page = urlopen("http://woodburytheatre.com/showtimes")
html_bytes = page.read()
html = html_bytes.decode("utf-8")
span_index = html.find("<div id=\"showtimes_wrapper\">")
start_index = span_index + len("<div id=\"showtimes_wrapper\">")
end_index = html.find("<div id=\"t_comingsoon\">")
movie_info = html[start_index:end_index]
# this line below
movie_list = [i.split('/')[-1].rstrip(r"'>") for i in movie_info.split("\n") if 'showtimes_movie' in i]
print(movie_list)
Output:
['Bill-and-Ted-Face-The-Music', 'The-New-Mutants', 'The-Personal-History-of-David-Copperfield', 'Inception-10th-Anniversary', 'Unhinged', 'Made-in-Italy', 'The-Rental', 'Trolls-World-Tour', 'Indiana-Jones-and-the-Temple-of-Doom', 'Indiana-Jones-and-the-Raiders-of-the-Lost-Ark']
With this, you can simply replace the hyphens using the .replace method.
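For instance, a minimal sketch using two of the names from the output above (the variable name titles is only illustrative):

```python
movie_list = ['Bill-and-Ted-Face-The-Music', 'The-New-Mutants']

# str.replace returns a new string; the original strings are left unchanged.
titles = [name.replace('-', ' ') for name in movie_list]
print(titles)
```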
To see in detail how that one-line list comprehension works, let's look at this example:
>>> output = [i for i in range(4) if i != 3]
You can put an if condition in the comprehension to filter out items while creating the list. This does exactly the same thing as the following for loop, and is usually a little faster:
>>> output = []
>>> for i in range(4):
...     if i != 3:
...         output.append(i)
You can also apply an expression to the item you're adding to the list. Instead of storing the loop variable i, you can even add completely unrelated things:
>>> [[i for i in 'Nope.'] for i in range(2)]
[['N', 'o', 'p', 'e', '.'], ['N', 'o', 'p', 'e', '.']]
>>> ['Nope.' for i in range(4)]
['Nope.', 'Nope.', 'Nope.', 'Nope.']
Putting this together, this line:
movie_list = [i.split('/')[-1].rstrip(r"'>") for i in movie_info.split("\n") if 'showtimes_movie' in i]
is equivalent to:
movie_list = []
for i in movie_info.split("\n"):
    if 'showtimes_movie' in i:
        split_strings_list = i.split('/')
        last_item_in_list = split_strings_list[-1]  # negative indices count from the end
        strip_text_from_right = last_item_in_list.rstrip("'>")  # remove trailing ' and > characters
        movie_list.append(strip_text_from_right)
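To make the intermediate values concrete, here is a small sketch that runs the same steps on a single made-up line (the markup is an assumption, not copied from the real page):

```python
# Hypothetical line; the real HTML may look different.
line = "<a class='showtimes_movie' href='/movie/1/Unhinged'>"

parts = line.split('/')       # split on every "/"
last = parts[-1]              # the piece after the final "/"
name = last.rstrip("'>")      # strip trailing ' and > characters

print(parts)
print(last)
print(name)
```

Note that rstrip treats its argument as a set of characters rather than a suffix, so it keeps removing trailing ' and > in any order until it hits a different character.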
By using enumerate you can get both the index and the item of a list:
for index, item in enumerate(lst):
    # Strip the first two characters
    lst[index] = item[2:]
Do note, however, that using regex on HTML is a bad idea. I highly suggest using an HTML parsing library, the most famous of which is beautifulsoup4.
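As a rough sketch of the parser-based approach, the example below uses the standard library's html.parser instead of beautifulsoup4, so that it runs without installing anything. The class name MovieLinkParser and the sample markup are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class MovieLinkParser(HTMLParser):
    """Collects the last path segment of each movie link (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Assumes movie links carry a class containing "showtimes_movie".
        if "showtimes_movie" in (attr_map.get("class") or ""):
            href = attr_map.get("href") or ""
            self.names.append(href.rsplit("/", 1)[-1])


# Made-up markup standing in for the downloaded page.
sample_html = """
<div id="showtimes_wrapper">
  <a class='showtimes_movie' href='/movie/1/Unhinged'>Unhinged</a>
  <a class='showtimes_movie' href='/movie/2/The-Rental'>The Rental</a>
</div>
"""

parser = MovieLinkParser()
parser.feed(sample_html)
print(parser.names)
```

Because the parser hands over attribute values already unquoted, there is no need to strip quotes or angle brackets by hand.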