I have a client who gave me a list of URLs that I need to check, but the list contains URLs with white spaces, e.g. " https://exdomain.com/dir/this is just%20a%20text.html".
I need to iterate over this list and replace all the spaces with %20. I know the best practice is to use - instead of %20, but that's a problem to solve in the future.
What I've done so far is:
import pandas as pd

df = pd.DataFrame(columns=['urls_with_spaces', 'urls_with_%20'])
df['urls_with_spaces'] = [
    'https://exdomain.com/dir/this is just%20a%20text.jpg',
    'https://exdomain.com/dir/this is just%20a%20text2.jpg',
    'https://subdomain.exdomain.com/dir/this is just%20a%20text3.jpg',
]
df['urls_with_%20'] = [x.replace(' ', '%20') for x in df['urls_with_spaces']]
Now, the problem is that some URLs contain line breaks, so I can replace the spaces with %20, but because of these line breaks I cannot access the URLs after doing this.
An example of what I'm getting:
"https://subdomain.exdomain.com/content/x/ex/region/subregion/something/this
Is%20an%20example/x2/w-program/get-out.jpg"
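To illustrate the issue with a made-up URL (not one of the client's actual URLs): str.replace(' ', '%20') only touches the literal space character, so an embedded newline survives untouched:

```python
# Hypothetical URL containing both a space and an embedded line break
url = 'https://exdomain.com/dir/this\nis a test.jpg'

# Only the literal space characters are replaced; the '\n' stays behind
cleaned = url.replace(' ', '%20')
print('\n' in cleaned)  # True
```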
Any thoughts?
Use re.sub with \s to match all whitespace characters (spaces, tabs, and line breaks), not only the literal space:
import re
...
df['urls_with_%20'] = [re.sub(r'\s+', '%20', x) for x in df['urls_with_spaces']]
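For example, with a hypothetical URL containing both a space and a line break (a shortened stand-in for the ones above), \s+ replaces each run of whitespace with a single %20:

```python
import re

# Hypothetical URL with a space and an embedded line break
url = 'https://subdomain.exdomain.com/content/this\nIs an example.jpg'

cleaned = re.sub(r'\s+', '%20', url)
print(cleaned)  # https://subdomain.exdomain.com/content/this%20Is%20an%20example.jpg
```

Note that the + quantifier collapses consecutive whitespace (e.g. a space directly followed by a newline) into one %20 instead of two.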
Alternatively, you can try urllib.parse.quote, but I'm not sure how it will handle line breaks in your case:

from urllib.parse import quote
...
df['urls_with_%20'] = [quote(x) for x in df['urls_with_spaces']]
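One caveat with this approach: quote percent-encodes every character outside its safe set, so for URLs that already contain %20 (as in the examples above) the existing % is double-encoded to %25, and the : after the scheme is encoded too unless you widen the safe parameter. A quick sketch of that behavior:

```python
from urllib.parse import quote

assert quote(' ') == '%20'      # spaces become %20
assert quote('\n') == '%0A'     # line breaks are encoded too, just differently
assert quote('%20') == '%2520'  # an existing %20 gets double-encoded
assert quote(':') == '%3A'      # ':' is encoded under the default safe='/'
```

So for this data set, where the only goal is to replace whitespace and leave the rest of the URL alone, the re.sub approach is probably the safer choice.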