How do I correctly extract URLs from HTML code?

Question

I have saved a website's HTML code in a .txt file on my computer. I would like to extract all URLs from this text file using the following code:

def get_net_target(page):
    start_link=page.find("href=")
    start_quote=page.find('"',start_link)
    end_quote=page.find('"',start_quote+1)
    url=page[start_quote+1:end_quote]
    return url
my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))

However, the script prints only the first URL and none of the other links. Why is this?

Answer 1

You need to implement a loop to go through all URLs.

print(get_net_target(page)) only prints the first URL found in page, so you need to call this function repeatedly, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.

To get you started: next_index stores the position where the last URL ended, and the loop then retrieves the URLs that follow:

next_index = 0  # position in page from which the next URL search starts

def get_net_target(page):
    global next_index

    start_link = page.find("href=")
    if start_link == -1:  # no more URLs
        return None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:  # unterminated attribute
        return None
    next_index = end_quote + 1  # resume just after the closing quote
    url = page[start_quote + 1:end_quote]
    return url


with open("test12.txt") as my_file:
    page = my_file.read()

while True:
    url = get_net_target(page)
    if url is None:  # no more URLs (None, so an empty href="" does not stop the loop)
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page
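As a variant of the loop above, the same search can be written without a global variable by passing a start offset to str.find instead of re-slicing page. This is only a sketch: the name iter_urls and the sample string are made up for illustration, and it still handles only double-quoted href values.

```python
def iter_urls(page):
    """Yield every double-quoted href value in page, in document order."""
    pos = 0
    while True:
        start_link = page.find("href=", pos)
        if start_link == -1:  # no more href attributes
            return
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        if start_quote == -1 or end_quote == -1:  # unterminated attribute
            return
        yield page[start_quote + 1:end_quote]
        pos = end_quote + 1  # resume just after the closing quote


# Hypothetical sample input, standing in for the contents of the saved file
sample = '<a href="a.html">A</a> <a href="">empty</a> <a href="b.html">B</a>'
print(list(iter_urls(sample)))  # ['a.html', '', 'b.html']
```

Because the generator stops when find returns -1 rather than when it sees an empty string, an empty href="" does not end the search early.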

Also be careful: this only retrieves links enclosed in " , but href values can also be enclosed in ' or not be quoted at all.
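One way to cope with all three quoting styles is to let an HTML parser read the attributes instead of searching for quote characters by hand. The sketch below uses html.parser from Python's standard library; the class name HrefCollector, the helper extract_urls, and the sample string are assumptions made for illustration, not part of the original code.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every start tag that has one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value is not None:
                self.urls.append(value)


def extract_urls(page):
    """Return all href values found in the HTML text page."""
    parser = HrefCollector()
    parser.feed(page)
    parser.close()
    return parser.urls


# Hypothetical sample covering double-quoted, single-quoted and unquoted values
sample = """<a href="a.html">A</a> <a href='b.html'>B</a> <a href=c.html>C</a>"""
print(extract_urls(sample))  # ['a.html', 'b.html', 'c.html']
```

To run it on the saved file, read the text as before (for example with open("test12.txt")) and pass it to extract_urls.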


Answer 1: accepted (2), 2017-03-06 23:43:48
