My HTML file contains a URL that points to a .bin attachment.
My goal is to extract the full link with a Python script. I am running this script across many HTML files, and the location of the .bin URL may change from file to file.
If I could get the indexes of the beginning and end of the URL, I could extract it that way.
I tried a word search through the HTML files, but there are several .bin URLs and I only want the first one. Any ideas or other methods would be appreciated.
import urllib.request, urllib.error, urllib.parse
# urlopen needs a full URL, including the scheme
html_link = "http://www.mywebsitelink.com"
response = urllib.request.urlopen(html_link)
# read() returns bytes, not str
webContent = response.read()
I suggest you look at using a regular expression (regex).
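In your case, you will probably want a pattern like:
https?://[^\s"'<>]+\.bin
The pattern is deliberately not anchored with ^ and $, because those anchors would only match if the entire document were a single URL. You can test it and explore what each part of the regular expression means with this helpful tool: regex101
Your code would probably look something like this:
import re
# re.search returns only the first match; decode first because response.read() returns bytes
match = re.search(r"""https?://[^\s"'<>]+\.bin""", webContent.decode("utf-8", errors="replace"))
bin_url = match.group(0) if match else None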
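To try the idea without fetching a page, here is a minimal sketch that wraps the search in a function and runs it on an inline HTML string. The function name first_bin_url, the sample markup, and the example.com links are all made up for illustration, and the pattern assumes the URL starts with http:// or https:// and contains no whitespace, quotes, or angle brackets.

```python
import re

# Assumed pattern: an http(s) URL ending in ".bin". The character class stops
# at whitespace, quotes, and angle brackets so a match stays inside one
# HTML attribute value.
BIN_URL_RE = re.compile(r"""https?://[^\s"'<>]+\.bin\b""")


def first_bin_url(html):
    """Return the first .bin URL found in html, or None if there is none."""
    # response.read() returns bytes, so accept both bytes and str.
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    # search() scans left to right and stops at the first match.
    match = BIN_URL_RE.search(html)
    return match.group(0) if match else None


# Hypothetical sample document containing two .bin links.
sample = (
    '<a href="http://example.com/page.html">page</a>'
    '<a href="http://example.com/files/first.bin">first</a>'
    '<a href="http://example.com/files/second.bin">second</a>'
)
print(first_bin_url(sample))
```

If the start and end positions are needed as well, the match object returned by search() provides them through match.start() and match.end().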