
Remove part of a URL using regex

This is the URL:

url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c"

I need to remove everything after the .html, so that it becomes:

"www.face.com/me/4000517004580.html"

You can use Python's urllib to parse the URL into its parts and then remove the query string:

from urllib.parse import urlparse
url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c"

parse_result = urlparse(url)
url = parse_result._replace(query="").geturl()  # Remove query from url
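As a minimal sketch, the same idea can be wrapped in a small helper. The name strip_query is hypothetical, and dropping the fragment as well as the query is an assumption that goes slightly beyond what the question asks for.

```python
from urllib.parse import urlparse


def strip_query(url: str) -> str:
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlparse(url)
    # ParseResult is a namedtuple, so _replace returns a modified copy
    return parts._replace(query="", fragment="").geturl()


url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0"
print(strip_query(url))
```

Because the URL is parsed rather than cut at a fixed substring, this should also behave sensibly for pages that do not end in .html.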

Try:

url.split('.html')[0]+'.html'

result:

'www.face.com/me/4000517004580.html'
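Since the question asks for a regex, here is one possible sketch with the re module. Note that the split approach above appends '.html' even when the URL never contained it. The helper name drop_query_re is hypothetical, and treating "#" as a cut-off point too is an assumption.

```python
import re


def drop_query_re(url: str) -> str:
    """Remove everything from the first '?' or '#' onward (hypothetical helper)."""
    # [?#] matches the start of a query or fragment; .* consumes the rest
    return re.sub(r"[?#].*", "", url, flags=re.DOTALL)


url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0"
print(drop_query_re(url))
```

If the URL has no query string, the pattern simply does not match and the string is returned unchanged.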

When you are not sure how to approach a problem, I suggest starting with the documentation. For example, you can check out the Python documentation on string methods and common string operations.

Scrolling through that list, you will read about the find() method:

Return the lowest index in the string where substring sub is found within the slice s[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 if sub is not found.

So to find the "?" you can do this:

i = url.find("?")

Rather than thinking about how to remove part of the string, let's figure out how to keep the part we want. We can do this with a slice:

url = url[:i]
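One caveat: find() returns -1 when there is no "?", and slicing with url[:-1] would then silently drop the last character. A minimal sketch that guards against this follows; the helper name before_query is hypothetical.

```python
def before_query(url: str) -> str:
    """Return the part of url before the first '?' (hypothetical helper)."""
    i = url.find("?")
    # find() returns -1 when "?" is absent; in that case keep the whole string
    return url if i == -1 else url[:i]


url = "www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0"
print(before_query(url))

# str.partition gives the same result without the explicit -1 check
print(url.partition("?")[0])
```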

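The built-in urllib library can be used here, provided the URL includes a scheme. Without one, urlparse treats the host as part of the path, and urljoin then duplicates it.

from urllib.parse import urljoin, urlparse

# Assumption: "https://" is prepended here; the URL in the question has no scheme
url = 'https://www.face.com/me/4000517004580.html?gps-id=5547572&scm=1007.19201.130907.0&scm_id=1007.19201.130907.0&scm-url=1007.19201.130907.0&pvid=56aacc48-cc78-4cb9-b176-c9acb7a0662c'
# Resolve the bare path against the full URL, which drops the query string
output = urljoin(url, urlparse(url).path)  # 'https://www.face.com/me/4000517004580.html'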

