How to differentiate domain addresses from URLs using a regex?

Question

Problem:

I am working on a feed generator that collects feeds from various online sources, and I need to divide the items into domains, URLs and IP addresses. With my current logic I can separate the IP addresses from the rest. I then tried applying a regex to the remaining items to separate domains from URLs, but every item ends up in the domain list and none in the URL list.

Code

#!/usr/bin/python
import csv
import MySQLdb
import time
from shutil import copyfile
import socket
import re

def __init__(self):
                self.data = ''
                self.ip = []
                self.url = []
                self.domain = []
def parse(self):
                with open('something.csv') as csv_file:
                        self.count = 0
                        self.reader = csv.reader( csv_file , delimiter = ',')
                        print self.reader.next()
                        self.data = self.reader

                        for trail in self.data:
                                self.address = trail[0]
                                try:
                                        socket.inet_aton( self.address )
                                        self.ip.append( self.address )
                                except:
                                        try:
                                                m = re.search(r"[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*" , trail[0] )
                                                print m.group()

                                                self.domain.append( self.address)
                                        except:
                                                self.url.append( self.address )
                                self.count +=1

                self.today_date = time.strftime( "%Y-%m-%d" )
                with open('/home//ip/{}'.format(self.today_date) , 'w') as g:
                        for i in self.ip:
                                g.write( i )
                                g.write( '\n')

What I tried

To identify IP addresses I used the socket library: socket.inet_aton checks whether the item is a valid IP address, and if it is, I append it to the IP list.
I took some help from a regex tutorial and wrote the regex to pick out the domains.
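
One likely reason every non-IP item lands in the domain list: each quantifier in that pattern allows zero characters, so re.search finds a match (possibly an empty one) in any string, and m.group() never raises. The sketch below is an illustration, not the original code. It shows the effect and a stricter variant; the names DOMAIN_RE and looks_like_domain are made up here, and the stricter pattern is only an assumption about what "domain" should mean (it would also accept dotted IPs, so it relies on IPs being filtered out first, as in the code above).

```python
import re

# The pattern from the question: every part may match zero characters.
LOOSE = r"[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*"

# re.search always succeeds here, so an except branch is never reached.
for item in ["abc.net", "http://example.com/a/b", "!!!"]:
    m = re.search(LOOSE, item)
    print(repr(item), "->", repr(m.group()))

# Hypothetical stricter variant: 1-63 characters per label, at least one
# dot, and the whole string has to match (re.fullmatch).
DOMAIN_RE = re.compile(r"[a-zA-Z\d-]{1,63}(\.[a-zA-Z\d-]{1,63})+")

def looks_like_domain(item):
    """Return True if the entire item looks like a dotted host name."""
    return DOMAIN_RE.fullmatch(item) is not None
```

With this variant, a None result (rather than an exception) signals "not a bare domain", so the calling code would test the return value instead of relying on try/except.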

Solution I wanted (Edited): I asked this wrongly. What I want is to extract the domain name from a URL such as http://www.pcwebopedia.com/index.html, get pcwebopedia.com as the domain name and send it to the domain list, and also send the full URL to url_list.

Suppose the item is www.google.com: it should send google.com to the domain list and www.google.com to the URL list.

Suppose the item is abc.net: it should go to the domain list and not to the URL list.

Any suggestions on how to solve this?

Answer 1

The simplest solution would be to check if '/' in url:
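
A minimal sketch of that check, assuming the items may or may not carry a scheme (the helper name has_path is made up for illustration). The scheme's own "//" has to be skipped first, otherwise every http://... item would count as containing a slash.

```python
def has_path(item):
    """Return True if the item has a '/' after any 'scheme://' prefix."""
    # Drop a leading scheme such as 'http://' so its slashes are not counted.
    rest = item.split("://", 1)[-1]
    return "/" in rest

print(has_path("abc.net"))                                # False
print(has_path("http://example.com"))                     # False
print(has_path("http://www.pcwebopedia.com/index.html"))  # True
```

Note that an item with only a trailing slash, such as http://example.com/, would also be treated as a URL by this check.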

If you want a more generic and robust solution, use:

from urllib.parse import urlparse
urls = []
domains = []

def url_categorize(url):
    o = urlparse(url)
    if o.path:
        urls.append(url)
    else:
        domains.append(url)

url_categorize("https://www.google.co.in")
url_categorize("http://example.com")
url_categorize("https://stackoverflow.com/questions/ask")
url_categorize("https://twitter.com/MarceloRivero")


In [36]: urls
Out[36]: ['https://stackoverflow.com/questions/ask', 'https://twitter.com/MarceloRivero']

In [37]: domains
Out[37]: ['https://www.google.co.in', 'http://example.com']
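
For the edited requirement in the question (the domain always goes to the domain list, and anything that is more than a bare domain also goes to the URL list), the same urlparse call can be used to pull out the host. This is only a sketch under some assumptions: split_item is a hypothetical helper, items without a scheme get a "//" prefix so that urlparse reads the leading part as the host, and "domain" here simply means the host without a leading "www.". It does not find the registrable domain, which would need a public-suffix list.

```python
from urllib.parse import urlparse

def split_item(item):
    """Return (domain, url_or_None) for one feed item.

    Hypothetical helper: the domain is the host without a leading 'www.';
    the second value is the original item if it is more than a bare domain.
    """
    # Without a scheme, urlparse puts everything in .path, so add '//'
    # to make it parse the leading part as the network location.
    parsed = urlparse(item if "://" in item else "//" + item)
    host = parsed.hostname or ""  # hostname is lower-cased by urlparse
    domain = host[4:] if host.startswith("www.") else host
    # A bare domain such as 'abc.net' is not added to the URL list.
    is_bare_domain = item.lower() == domain
    return domain, (None if is_bare_domain else item)

domains, urls = [], []
for item in ["http://www.pcwebopedia.com/index.html", "www.google.com", "abc.net"]:
    domain, url = split_item(item)
    domains.append(domain)
    if url is not None:
        urls.append(url)

print(domains)  # ['pcwebopedia.com', 'google.com', 'abc.net']
print(urls)     # ['http://www.pcwebopedia.com/index.html', 'www.google.com']
```

IP addresses are assumed to have been filtered out beforehand, for example with the socket.inet_aton check from the question.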

solution1
0 2017-08-02 06:56:51