How to properly extract URLs from HTML code?
I have saved a website's HTML code in a .txt file on my computer. I would like to extract all URLs from this text file using the following code:
    def get_net_target(page):
        start_link = page.find("href=")
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        url = page[start_quote + 1:end_quote]
        return url

    my_file = open("test12.txt")
    page = my_file.read()
    print(get_net_target(page))
However, the script prints only the first URL and none of the other links. Why is this?
You need to implement a loop to go through all URLs. print(get_net_target(page)) only prints the first URL found in page, so you will need to call this function again and again, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.
To get you started, next_index stores the position where the last URL found ended, and the loop then retrieves the URLs that follow:
    next_index = 0  # position in page from which the next URL search starts

    def get_net_target(page):
        global next_index
        start_link = page.find("href=")
        if start_link == -1:  # no more URLs
            return ""
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        next_index = end_quote  # remember where this URL ended
        url = page[start_quote + 1:end_quote]
        return url

    with open("test12.txt") as my_file:
        page = my_file.read()

    while True:
        url = get_net_target(page)
        if url == "":  # no more URLs
            break
        print(url)
        page = page[next_index:]  # continue with the rest of the page
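The same loop can probably be written without a global variable by passing a start position to str.find. The following is only a minimal sketch: the name iter_hrefs and the sample string are made up for illustration, and it searches for href=" rather than href=, so it still handles only double-quoted values. Unlike the loop above, an empty href="" does not end the search early.

```python
def iter_hrefs(page):
    """Yield every double-quoted href value found in page, in order."""
    pos = 0  # position from which the next search starts
    while True:
        start_link = page.find('href="', pos)
        if start_link == -1:  # no more URLs
            return
        start_quote = start_link + len("href=")  # index of the opening quote
        end_quote = page.find('"', start_quote + 1)
        if end_quote == -1:  # opening quote is never closed
            return
        yield page[start_quote + 1:end_quote]
        pos = end_quote + 1  # continue after the closing quote


# Hypothetical sample input, not taken from the original file.
sample = '<a href="a.html">A</a> <a href="">empty</a> <a href="b.html">B</a>'
urls = list(iter_hrefs(sample))
print(urls)
```

Because the function is a generator, the URLs can also be consumed one at a time with a for loop instead of being collected into a list.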
Also be careful: this only retrieves links enclosed in double quotes ("), but attribute values can also be enclosed in single quotes (') or not quoted at all.
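One way to cope with all three quoting styles is to let an HTML parser read the attributes instead of searching for quote characters by hand. The sketch below uses html.parser.HTMLParser from the Python standard library; the class name HrefCollector and the sample markup are assumptions made for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed any surrounding quotes from the value.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.urls.append(value)


# Hypothetical sample covering double-quoted, single-quoted and unquoted values.
sample = """<a href="a.html">A</a> <a href='b.html'>B</a> <a href=c.html>C</a>"""
collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.urls)
```

To apply this to the saved page, the text read from test12.txt would be passed to feed() in place of sample.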