
How can I manipulate the string items in a list?

I am trying to get familiar with web scraping in Python and cannot figure out how to manipulate the strings that are items of a list.

Below is the code I am working on to extract the movies showing at a local theater. I can get most of the names from the HTML. What I want to do is iterate through the list and remove the first two characters from each string, because my regex attempts cannot extract just the name.

It throws errors because it treats what I am trying to manipulate as a list object, but it is the strings in the list that I want to change.

from urllib.request import urlopen
import re
url = "http://woodburytheatre.com/showtimes"
page = urlopen(url)
page
html_bytes = page.read()
html = html_bytes.decode("utf-8")
#print(html)
span_index = html.find("<div id=\"showtimes_wrapper\">")
start_index = span_index + len("<div id=\"showtimes_wrapper\">")
end_index = html.find("<div id=\"t_comingsoon\">")
#print(span_index)
movie_info = html[start_index:end_index]
movie_list = list()
movie_list2 = list()
movie_list3 = list()
#print(movie_info)
for item in movie_info.split("\n"):
   if "showtimes_movie" in item:
       movie_list.append(item.strip())
#print(movie_list)
for item in movie_list:
   movie_list2.append(re.findall("[0-9]\/[A-z0-9\-]+",str(movie_list)))
#print(movie_list2)
while movie_list2:
   temp = movie_list2.pop()
   print(type(temp))
   print("temp" + str(temp))
   temp2 = temp.lstrip("/")
   print(temp2)
   movie_list3.append(temp2)
print(movie_list3)
print(len(movie_list2))
print(movie_list3)

I know it is very messy and could be much more efficient, but I just want to be able to alter the strings in the list so I can get rid of the number and the "/" right before each name.

Thanks in advance!

In the second for loop:

for item in movie_list:
   movie_list2.append(re.findall("[0-9]\/[A-z0-9\-]+",str(movie_list)))

From the Python documentation for re.findall:

def findall(pattern: Pattern[AnyStr],
            string: AnyStr,
            flags: Union[int, RegexFlag] = ...) -> list

re.findall returns a list of all matches. Therefore you are appending a list to a list, and a list has no lstrip method, which causes the error in the while loop.

while movie_list2:
   temp = movie_list2.pop()
   print(type(temp))
   print("temp" + str(temp))
   temp2 = temp.lstrip("/")  # << here
   print(temp2)
   movie_list3.append(temp2)
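
If you would rather keep the regex approach, here is a minimal sketch of one way to avoid the nested lists. The two sample lines are hypothetical stand-ins for the scraped movie_list, since the real markup is not shown in the question. It passes item (not str(movie_list), which would scan the whole list on every iteration) to findall, uses extend so each match is added as its own string, and uses a capture group so the number and "/" are never part of the result. The class [A-Za-z0-9-] replaces [A-z0-9\-], because the range A-z also matches a few punctuation characters that sit between Z and a.

```python
import re

# Hypothetical sample lines standing in for the scraped movie_list.
movie_list = [
    "<a class='showtimes_movie' href='/movie/1/The-New-Mutants'>",
    "<a class='showtimes_movie' href='/movie/2/Unhinged'>",
]

# The parentheses form a capture group: findall returns only the captured
# name, so the leading number and "/" never need stripping afterwards.
pattern = re.compile(r"[0-9]+/([A-Za-z0-9-]+)")

movie_names = []
for item in movie_list:
    # findall returns a list, so extend() adds each match as its own string
    # instead of nesting a list inside the list.
    movie_names.extend(pattern.findall(item))

print(movie_names)
```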

Ultimately, what you're trying to achieve can be done in one line:

from urllib.request import urlopen

page = urlopen("http://woodburytheatre.com/showtimes")
html_bytes = page.read()
html = html_bytes.decode("utf-8")

span_index = html.find("<div id=\"showtimes_wrapper\">")
start_index = span_index + len("<div id=\"showtimes_wrapper\">")
end_index = html.find("<div id=\"t_comingsoon\">")

movie_info = html[start_index:end_index]

# this line below
movie_list = [i.split('/')[-1].rstrip(r"'>") for i in movie_info.split("\n") if 'showtimes_movie' in i]
print(movie_list)

Output:

['Bill-and-Ted-Face-The-Music', 'The-New-Mutants', 'The-Personal-History-of-David-Copperfield', 'Inception-10th-Anniversary', 'Unhinged', 'Made-in-Italy', 'The-Rental', 'Trolls-World-Tour', 'Indiana-Jones-and-the-Temple-of-Doom', 'Indiana-Jones-and-the-Raiders-of-the-Lost-Ark']

From here you can replace the hyphens with the str.replace method.
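
As a small sketch of that last step, using a few of the names from the output above (the variable name titles is just for illustration):

```python
# A few entries copied from the output above.
movie_list = ['Bill-and-Ted-Face-The-Music', 'The-New-Mutants', 'Unhinged']

# str.replace returns a new string; the original strings are left unchanged.
titles = [name.replace('-', ' ') for name in movie_list]
print(titles)
```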


To see in detail how that one-line list comprehension works (it is the bracketed cousin of a generator expression, or genexp), let's look at this example:

>>> output = [i for i in range(4) if i != 3]

You can put an if condition in the comprehension to filter out items while creating the list. This does exactly the same thing as the following for loop, and is usually somewhat faster:

>>> output = []
>>> for i in range(4):
...     if i != 3:
...         output.append(i)

The item being added can itself be any expression. Instead of storing the loop variable i, you can even add something completely unrelated:

>>> [[i for i in 'Nope.'] for i in range(2)]
[['N', 'o', 'p', 'e', '.'], ['N', 'o', 'p', 'e', '.']]
>>> ['Nope.' for i in range(4)]
['Nope.', 'Nope.', 'Nope.', 'Nope.']

Putting this together, the one-liner

movie_list = [i.split('/')[-1].rstrip(r"'>") for i in movie_info.split("\n") if 'showtimes_movie' in i]

is equivalent to this loop:

movie_list = []
for i in movie_info.split("\n"):
    if 'showtimes_movie' in i:
        split_strings_list = i.split('/')
        last_item_in_list = split_strings_list[-1]  # a negative index counts from the end
        strip_text_from_right = last_item_in_list.rstrip("'>")  # strip trailing ' and > characters
        
        movie_list.append(strip_text_from_right)

By using enumerate you can get both the index and the item while looping over a list:

for index, item in enumerate(lst):
    # Strip the first two characters
    lst[index] = item[2:]
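
Here is a runnable sketch of that loop. The contents of lst are hypothetical, assuming the regex matches look like a single digit, a "/", and then the name:

```python
# Hypothetical regex matches of the form "<digit>/<name>".
lst = ["1/The-New-Mutants", "2/Unhinged"]

for index, item in enumerate(lst):
    # Assigning through the index changes the list in place;
    # rebinding item alone would leave the list untouched.
    lst[index] = item[2:]

print(lst)
```

If the number could have more than one digit, slicing off exactly two characters would no longer be right; item.split("/", 1)[-1] keeps everything after the first "/" instead.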

Do note, however, that using regex on HTML is a bad idea. I highly suggest using an HTML parsing library, the best known being beautifulsoup4.
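
To give a rough idea of what parser-based extraction looks like without installing anything, here is a sketch that uses the standard library's html.parser instead of beautifulsoup4. The sample markup, the showtimes_movie class on the a tags, and the class name MovieLinkParser are all assumptions made for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's structure is not shown in the question.
SAMPLE_HTML = """
<div id="showtimes_wrapper">
  <a class="showtimes_movie" href="/movie/1/The-New-Mutants">The New Mutants</a>
  <a class="showtimes_movie" href="/movie/2/Unhinged">Unhinged</a>
</div>
"""


class MovieLinkParser(HTMLParser):
    """Collects the last path segment of each matching link's href."""

    def __init__(self):
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        if tag == "a" and "showtimes_movie" in (attr_map.get("class") or ""):
            href = attr_map.get("href") or ""
            # Keep only the text after the final "/".
            self.slugs.append(href.rsplit("/", 1)[-1])


parser = MovieLinkParser()
parser.feed(SAMPLE_HTML)
movie_slugs = parser.slugs
print(movie_slugs)
```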
