How to replace all URLs in a string with their hostname and tld (e.g. google.com)

I have a string with multiple URLs and some text in between.

How can I replace each URL with its hostname and top-level domain?

Example Input: www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask

Desired Output: google.com some text google.com some text google.com some text stackoverflow.com

I've found the Python module tldextract, but it only extracts the hostname and TLD from a single URL; it does not find and replace all the URLs in a string.

Thanks in advance!

You can also use a regex with the logic below:

  1. (http[s]?://) --> Capture http:// or https://
  2. (www\.) --> Capture www.
  3. (?<=\.[a-z][a-z][a-z])(/[^ ]*) --> Capture everything from the first slash after the suffix, without consuming the suffix itself. This form handles any three-letter suffix (com, org, net); the code below uses the narrower (?<=\.com), which only handles .com.

import re

yourString = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

re.sub(r'(http[s]?://)|(?<=\.com)(/[^ ]*)|(www\.)', '', yourString)

Out[1]: 'google.com some text google.com some text google.com some text stackoverflow.com'
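As a rough sketch of step 3 in its general form (a variant, not the exact pattern used above), the lookbehind can accept any three lowercase letters after a dot. The names URL_NOISE and strip_url_noise and the python.org and example.co.uk inputs are made up for illustration. The last call shows the limit of the approach: a two-letter suffix such as .uk keeps its path.

```python
import re

# Variant of the pattern above: the lookbehind accepts any three lowercase
# letters after a dot (com, org, net, ...) instead of only ".com".
URL_NOISE = re.compile(r'https?://|www\.|(?<=\.[a-z]{3})/[^ ]*')

def strip_url_noise(text):
    """Remove the scheme, "www." and any path that follows a 3-letter suffix."""
    return URL_NOISE.sub('', text)

print(strip_url_noise('http://www.google.com some text https://python.org/downloads/'))
# google.com some text python.org

print(strip_url_noise('https://example.co.uk/path'))
# example.co.uk/path  (two-letter suffix, so the path is not removed)
```

If suffixes of other lengths matter, a suffix-aware library such as tldextract is probably the safer route.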

You could just replace 'www.' and the scheme with '' to handle the part before the domain, but that leaves everything after the suffix (paths, query strings) untouched, and that part can't be predicted.

Try this:

import tldextract

somestr = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

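words = []

for word in somestr.split(' '):
    extracted = tldextract.extract(word)
    # A word counts as a URL only if tldextract finds both a domain and a known suffix
    if extracted.domain != '' and extracted.suffix != '':
        words.append(extracted.domain + '.' + extracted.suffix)
    else:
        words.append(word)

# Joining avoids the trailing space that repeated concatenation would leave
newstr = ' '.join(words)

print(newstr)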
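The loop above splits on single spaces only, so a URL separated from its neighbours by a tab or newline is handed to tldextract glued to the neighbouring word. A minimal sketch of one way around that, assuming re.sub with a callback is acceptable: each run of non-whitespace is passed to a function, and the whitespace in between is left alone. replace_tokens and naive_shorten are hypothetical names. naive_shorten is only a standard-library stand-in that is not suffix-aware.

```python
import re

def replace_tokens(text, shorten):
    """Apply `shorten` to each whitespace-delimited token; whitespace is kept as-is."""
    return re.sub(r'\S+', lambda m: shorten(m.group(0)), text)

# Hypothetical stand-in for the tldextract logic: strips the scheme, "www." and path.
# It is not suffix-aware, so sub.example.co.uk stays sub.example.co.uk.
_TOKEN = re.compile(
    r'(?:https?://)?(?:www\.)?'             # optional scheme and www.
    r'([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)'  # hostname with at least one dot
    r'(?:[/?#]\S*)?'                        # optional path, query or fragment
)

def naive_shorten(token):
    match = _TOKEN.fullmatch(token)
    # Tokens that do not look like a URL are returned unchanged
    return match.group(1) if match else token

print(replace_tokens('www.google.com some\ttext\nhttps://stackoverflow.com/questions/ask',
                     naive_shorten))
```

With tldextract installed, the function passed in could instead return extracted.domain + '.' + extracted.suffix when both are non-empty, as in the loop above.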

Here is another version that works on a pandas column, using "re" and "tldextract":

import re
import pandas as pd
import tldextract

#define the regex pattern to catch any url (try it on regex101.com)
ANY_URL_REGEX = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")

# you may want to lowercase the strings in your column
data['column'] = data['column'].str.lower()
# to simplify the process, create 2 more columns:
# 1- 'url' holds the first full URL found in each row, e.g. xyz.co.uk or asd.am.edu or google.com
# 2- 'domain' holds the domain part of that URL

data['url'] = data['column'].str.extract(ANY_URL_REGEX, expand=False)
data['domain'] = data['url'].apply(lambda url: tldextract.extract(url).domain if pd.notnull(url) else '')

# now replace the URL in each row with its domain; rows without a URL are left unchanged
data['column'] = data.apply(lambda x: x['column'].replace(x['url'], x['domain']) if pd.notnull(x['url']) else x['column'], axis=1)

Note: this sample code extracts sample_site from (http(s)://www.)sample_site.com/whatever. To get sample_site.com instead, join the domain and suffix fields of the tldextract result, as in the previous answer.
