
Extract URL information from a pandas column

I need to keep some parts of a link:

Link             
www.xxx.co.uk/path1
www.asx_win.com/path2
www.asdfe.aer.com
...

Desired output:

Link2
xxx.co.uk
asx_win.com
asdfe.aer.com
...

I used urlparse and tldextract but I get either

Netloc
www.xxx.co.uk
www.asx_win.com
www.asdfe.aer.com
...

or

TLDEXTRACT

xxx
asx_win
asdfe.aer
...

If I work with the strings directly, issues can come from links like the following:

9     https://www.facebook.com/login/?next=https%3A%...
10    https://pt-br.facebook.com/114546123419/pos...
11    https://www.facebook.com/login/?next=https%3A%...
20    http://fsareq.media/?pg=article&id=s...
22    https://www.wq-wq.com/lrq-rqwrq-...
24    https://faseqrq.it/2020/05/28/...

My attempt would be to consider the difference between what I get from urlparse (Netloc) and from tldextract (i.e., the ending part). For example, from Netloc I get www.xxx.co.uk and from tldextract I get xxx. This means that if I subtract the tldextract result from Netloc, I am left with www and co.uk. I would use the part in common as a cut-off point and keep the part after it (i.e., .co.uk), which is what I am looking for.

The difference could be computed with something like df['Link2'] = [a.replace(b, '').strip() for a, b in zip(df['Netloc'], df['TLDEXTRACT'])]. This only partly works, because of the ending part (suffix) that I still need to take into account. Now I need to understand how to keep only that ending part to get the expected output. You can use the columns Netloc and TLDEXTRACT from the sample above.
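A minimal sketch of what I mean (the Netloc and TLDEXTRACT columns are rebuilt here with urlparse and tldextract; I prefix the links with https:// because urlparse only fills netloc when a scheme is present, and TLDEXTRACT here is just the domain field, so it may differ slightly from my table for multi-level domains):

import pandas as pd
import tldextract
from urllib.parse import urlparse

df = pd.DataFrame({'Link': ['https://www.xxx.co.uk/path1',
                            'https://www.asx_win.com/path2',
                            'https://www.asdfe.aer.com']})

df['Netloc'] = df['Link'].apply(lambda u: urlparse(u).netloc)
df['TLDEXTRACT'] = df['Link'].apply(lambda u: tldextract.extract(u).domain)

# keep everything from the extracted domain onwards, dropping whatever comes before it (the "www.")
df['Link2'] = [netloc[netloc.find(dom):] for netloc, dom in zip(df['Netloc'], df['TLDEXTRACT'])]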

First remove http / https:

from urllib.parse import urlparse

def remove(row):
    # urlparse only fills netloc when the URL carries a scheme (http/https)
    if 'https' in row['urls'] or 'http' in row['urls']:
        return urlparse(row['urls']).netloc

withouthttp = df.apply(remove, axis=1)

then:

cut the first 4 characters ("www.")

cut everything after the first "/"


import pandas as pd

df = pd.DataFrame({'urls': ['www.xxx.co.uk/path1', 'www.asx_win.com/path2', 'www.asdfe.aer.com']})
df['urls'] = df['urls'].str[4:]                 # drop the leading "www."
df['urls'] = df['urls'].str.split('/').str[0]   # keep everything before the first "/"

You can also split out the records with https and http and handle them separately:

onlyHttps = df.loc[df['urls'].str.contains("https", case=False)]
allWithoutHttps = df[~df["urls"].str.contains("https", case=False)]

and after all operations (removing www and removing http/https), concatenate the proper records:

# https, http and www stand for the frames cleaned in the previous steps
pd.concat([https, http, www])
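Putting the steps together, a rough end-to-end sketch of this approach could look like the following (assuming the column is called urls, as above):

import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({'urls': ['www.xxx.co.uk/path1',
                            'www.asx_win.com/path2',
                            'https://www.facebook.com/login/']})

def clean(url):
    # strip the scheme via urlparse when present, otherwise keep the string as-is
    if url.startswith(('http://', 'https://')):
        url = urlparse(url).netloc
    # drop a leading "www." and keep everything before the first "/"
    if url.startswith('www.'):
        url = url[4:]
    return url.split('/')[0]

df['Link2'] = df['urls'].apply(clean)
# 0       xxx.co.uk
# 1     asx_win.com
# 2    facebook.com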

tldextract.extract() returns a named tuple of (subdomain, domain, suffix) :

tldextract.extract('www.xxx.co.uk')

# ExtractResult(subdomain='www', domain='xxx', suffix='co.uk')

So you can just join indexes [1:] :

import tldextract
df['Extracted'] = df.Link.apply(lambda x: '.'.join(tldextract.extract(x)[1:]))

#                                                 Link     Extracted
# 0                                www.xxx.co.uk/path1     xxx.co.uk
# 1                              www.asx_win.com/path2   asx_win.com
# 2                                  www.asdfe.aer.com       aer.com
# 3  https://www.facebook.com/login/?next=https%3A%...  facebook.com
# 4     https://pt-br.facebook.com/114546123419/pos...  facebook.com
# 5  https://www.facebook.com/login/?next=https%3A%...  facebook.com
# 6            http://fsareq.media/?pg=article&id=s...  fsareq.media
# 7                https://www.wq-wq.com/lrq-rqwrq-...     wq-wq.com
# 8                  https://faseqrq.it/2020/05/28/...    faseqrq.it
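ExtractResult also exposes named fields, so depending on your tldextract version you can get the same result via the registered_domain property (domain + suffix) instead of joining by index:

import tldextract

ext = tldextract.extract('www.xxx.co.uk')
ext.registered_domain   # 'xxx.co.uk'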
