
How to find a particular URL in an HTML file with Python?

My HTML file contains a URL that points to a .bin attachment.
My goal is to extract the full link with a Python script. I run this script across many HTML files, and the position of the .bin URL can change from file to file.
If I could get the start and end index of the URL, I could extract it by slicing.

I tried a plain text search through the HTML files, but each one contains several .bin URLs and I only want the first one. Any ideas, or other methods, would be appreciated.

import urllib.request, urllib.error, urllib.parse

# urlopen needs a full URL, including the scheme
html_link = "http://www.mywebsitelink.com"
response = urllib.request.urlopen(html_link)
webContent = response.read()  # raw bytes, not str

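I suggest using a regular expression (regex).

For your case, a pattern like this should work:

https?://[^\s"'<>]+\.bin

The pattern is deliberately not anchored with ^ and $, because the URL sits in the middle of the HTML rather than filling the whole string. The character class [^\s"'<>] stops the match at whitespace, quotes, or angle brackets, so it cannot run past the end of the attribute value.

You can test this and explore what each part of the pattern means with this helpful tool: regex101

Since re.search returns only the first match, your code would look something like this:

import re

# response.read() returns bytes, so decode before matching a str pattern
html = webContent.decode("utf-8", errors="replace")
match = re.search(r"https?://[^\s\"'<>]+\.bin", html)
bin_url = match.group(0) if match else None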
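
To check the idea without fetching a page, here is a minimal self-contained sketch. The helper name find_first_bin_url and the sample HTML are made up for illustration, and the pattern assumes the links are absolute http or https URLs. It also returns the start and end index of the match, which is what the question asked about.

```python
import re

# Assumed pattern: an absolute http(s) URL ending in ".bin". It stops at
# whitespace, quotes, or angle brackets so it stays inside one attribute.
BIN_URL_RE = re.compile(r"https?://[^\s\"'<>]+\.bin")


def find_first_bin_url(html):
    """Return (url, start, end) for the first .bin URL in html, or None."""
    match = BIN_URL_RE.search(html)
    if match is None:
        return None
    return match.group(0), match.start(), match.end()


# Hypothetical sample page with two .bin links; only the first is returned.
sample_html = (
    '<p><a href="http://example.com/files/first.bin">one</a></p>\n'
    '<p><a href="http://example.com/files/second.bin">two</a></p>\n'
)

result = find_first_bin_url(sample_html)
if result is not None:
    url, start, end = result
    print(url)
    print(sample_html[start:end] == url)
```

Slicing the text with the returned indices gives back the same URL, so either the string or the index pair can be used downstream.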

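If a regex feels too loose for real-world HTML, another option is the standard library's html.parser. This is only a sketch: the class name FirstBinLinkParser and the sample HTML are hypothetical, and it assumes the .bin link lives in an href or src attribute. Unlike the regex, it also picks up relative links, which would then need urllib.parse.urljoin to become full URLs.

```python
from html.parser import HTMLParser


class FirstBinLinkParser(HTMLParser):
    """Remember the first href or src value that ends in .bin."""

    def __init__(self):
        super().__init__()
        self.first_bin_url = None

    def handle_starttag(self, tag, attrs):
        if self.first_bin_url is not None:
            return  # the first match is already recorded
        for name, value in attrs:
            # value is None for attributes written without a value
            if name in ("href", "src") and value and value.lower().endswith(".bin"):
                self.first_bin_url = value
                return


def first_bin_link(html):
    """Return the first .bin link found in html, or None."""
    parser = FirstBinLinkParser()
    parser.feed(html)
    return parser.first_bin_url


# Hypothetical sample page: a non-matching link followed by two .bin links.
sample_html = (
    '<a href="http://example.com/readme.txt">readme</a>\n'
    '<a href="http://example.com/files/first.bin">one</a>\n'
    '<a href="http://example.com/files/second.bin">two</a>\n'
)

first = first_bin_link(sample_html)
print(first)
```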
