
Iterate over list of urls and replace space with %20

I have a client who gave me a list of URLs that I need to check, but the list contains URLs with whitespace, for example: " https://exdomain.com/dir/this is just%20a%20text.html"

I need to iterate over this list and replace all the spaces with %20. I know it is best practice to use - instead of %20, but that's a problem to solve in the future.

What I've done so far is:

import pandas as pd
df = pd.DataFrame(columns=['urls_with_spaces', 'urls_with_%20'])

# Original URLs as received from the client
df['urls_with_spaces'] = [
    'https://exdomain.com/dir/this is just%20a%20text.jpg',
    'https://exdomain.com/dir/this is just%20a%20text2.jpg',
    'https://subdomain.exdomain.com/dir/this is just%20a%20text3.jpg',
]

# Replace each literal space with %20
df['urls_with_%20'] = [x.replace(' ', '%20') for x in df['urls_with_spaces']]

Now, the problem is that some URLs contain line breaks. I can replace the spaces with %20, but because of these line breaks I still cannot access the URLs afterwards.

An example of what I'm getting:

"https://subdomain.exdomain.com/content/x/ex/region/subregion/something/this
Is%20an%20example/x2/w-program/get-out.jpg

Any thoughts?

Use re.sub with \s to match all whitespace (spaces, tabs, and line breaks), not only the space character:

import re
...
df['urls_with_%20'] = [re.sub(r'\s+', '%20', x) for x in df['urls_with_spaces']]
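As a quick check outside pandas, here is a minimal sketch of the same substitution applied to the two kinds of URL from the question. The helper name encode_whitespace is hypothetical, and the strip() call is an added assumption: it drops leading and trailing whitespace so that a URL that starts with a space does not end up starting with %20.

```python
import re

def encode_whitespace(url: str) -> str:
    """Trim the ends, then replace each run of inner whitespace with one %20."""
    return re.sub(r"\s+", "%20", url.strip())

urls = [
    # leading space, as in the first example from the question
    " https://exdomain.com/dir/this is just%20a%20text.html",
    # line break in the middle of the path
    "https://subdomain.exdomain.com/content/x/ex/region/subregion/something/this\n"
    "Is%20an%20example/x2/w-program/get-out.jpg",
]

cleaned = [encode_whitespace(u) for u in urls]
for u in cleaned:
    print(u)
```

Because the pattern is \s+, a space followed by a line break collapses into a single %20. Whether the line break stood for a real space in the original file name is something only the source data can tell you; if it did not, substituting an empty string for line breaks first may be the better choice.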

Alternatively, you can try urllib.parse.quote, but I'm not sure how it will handle the line breaks in your case. Pass safe=':/%' so that the colon in the scheme and the existing %20 escapes are left untouched:

from urllib.parse import quote
...
df['urls_with_%20'] = [quote(x, safe=':/%') for x in df['urls_with_spaces']]
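To see how quote behaves on this kind of input, a short sketch using one of the question's URLs may help. The variable names are illustrative only.

```python
from urllib.parse import quote

url = "https://exdomain.com/dir/this is just%20a%20text.jpg"

# With the default safe="/", the ":" and "%" characters are escaped too,
# which mangles a URL that is already partly percent-encoded.
default_quoted = quote(url)

# Treating ":", "/" and "%" as safe leaves the scheme and the existing
# %20 escapes alone and only encodes the literal spaces.
safe_quoted = quote(url, safe=":/%")

# A line break is percent-encoded as %0A, not turned into %20.
newline_quoted = quote("dir/this\nIs", safe=":/%")

print(default_quoted)
print(safe_quoted)
print(newline_quoted)
```

So quote on its own does not fix the line-break problem: the break survives as %0A. If quote is used at all, the whitespace probably needs to be normalized first, for example with the re.sub approach.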

