This is the URL:
url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c"
I need to remove everything after the .html, so that it becomes:
"www.face.com/me/4000517004580.html"
You can use Python's urllib to parse the URL into its parts and then remove the query string:
from urllib.parse import urlparse
url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c"
parse_result = urlparse(url)
url = parse_result._replace(query="").geturl() # Remove query from url
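The same idea can be wrapped in a small reusable function. This is only a sketch: the name strip_query is hypothetical, and it uses urlsplit/urlunsplit instead of urlparse, which behaves the same for this purpose. It also drops any #fragment, which the question did not ask about.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query="", fragment=""))


print(strip_query("www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0"))
```

Because the URL is parsed rather than searched, this does not depend on the path ending in .html.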
Try:
url.split('.html')[0]+'.html'
Result:
'www.face.com/me/4000517004580.html'
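One caveat with the split approach: if the URL contains no '.html' at all, split returns the whole string and '.html' is appended anyway. A minimal sketch of a guarded variant, using str.partition (the function name keep_through_suffix is made up for illustration):

```python
def keep_through_suffix(url: str, suffix: str = ".html") -> str:
    """Keep everything up to and including the first occurrence of suffix."""
    head, sep, _tail = url.partition(suffix)
    # sep is empty when suffix does not occur; return the URL unchanged then.
    return head + sep if sep else url


print(keep_through_suffix("www.face.com/me/4000517004580.html?gps-id=5547572"))
print(keep_through_suffix("www.face.com/me/page"))
```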
When you are not sure how to approach a problem, I suggest starting with the documentation. For example, you can check out the string methods and common string operations.
Scrolling through this list, you will read about the find() function:
Return the lowest index in the string where substring sub is found within the slice s[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 if sub is not found.
So to find the "?", you can do this:
i = url.find("?")
Rather than thinking about how to remove part of the string, let's figure out how to keep the part we want. We can do this with a slice:
url = url[:i]
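One thing to watch for: as the documentation quoted above says, find() returns -1 when the substring is missing, and url[:-1] would then silently drop the last character. A small sketch that guards against this (the function name is hypothetical):

```python
def drop_after_question_mark(url: str) -> str:
    """Return the part of url before the first "?", or url itself if there is none."""
    i = url.find("?")
    # find() returns -1 when "?" is absent; slicing with -1 would cut the last character.
    return url if i == -1 else url[:i]


print(drop_after_question_mark("www.face.com/me/4000517004580.html?gps-id=5547572"))
print(drop_after_question_mark("www.face.com/me/4000517004580.html"))
```

An alternative that needs no guard is url.partition("?")[0], since partition returns the whole string as its first element when the separator is not found.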
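The built-in urllib library can be used here. Note that urljoin resolves the path against the base URL, so this approach needs a URL that includes a scheme (https:// is assumed below). Without one, urlparse treats the whole address as a path and the join does not give the desired result.
from urllib.parse import urljoin, urlparse
# Scheme added as an assumption; the string in the question has none
url = 'https://www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c'
output = urljoin(url, urlparse(url).path)  # 'https://www.face.com/me/4000517004580.html'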