How to extract an IP address in web scanning

When I run the simple task of IP address extraction on its own, the program works well. But inside the complete web-crawling program it fails and gives inconsistent results.

This is my code snippet for IP address extraction:

    #!/usr/bin/python3

    import os
    import re 

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
        n = (results[marker:].splitlines()[0])
        m = re.search('\w+ \w+: \d\([A-Z]+\)', n)
        if m is not None:
            url_new = url[8:]
            command = "host " + url_new
            process = os.popen(command)
            results = str(process.read())
            marker = results.find("has address") + 12
            return results[marker:].splitlines()[0]

    print(get_ip_address("https://www.yahoo.com"))

The complete program for web crawling looks like this:

    #!/usr/bin/python3

    from general import *
    from domain_name import *
    from ip_address import *
    from nmap import * 
    from robots_txt import *
    from whois import *

    ROOT_DIR = "companies"
    create_dir(ROOT_DIR)

    def gather_info(name, url):
        domain_name = get_domain_name(url)
        ip_address = get_ip_address(url)
        nmap = get_nmap('-F', ip_address)
        robots_txt = get_robots_txt(url)
        whois = get_whois(domain_name)
        create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

    def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
        project_dir = ROOT_DIR + '/' + name
        create_dir(project_dir)
        write_file(project_dir + '/full_url.txt', full_url)
        write_file(project_dir + '/domain_name.txt', domain_name)
        write_file(project_dir + '/nmap.txt', nmap)
        write_file(project_dir + '/robots_txt.txt', robots_txt)
        write_file(project_dir + '/whois.txt', whois)
        write_file(project_dir + '/ip_address.txt', ip_address)

    x = input("Enter the Company Name: ")
    y = input("Enter the complete url of the company: ")    
    gather_info( x , y )

The input entered looks like this:

    root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py 
    106.10.138.240
    Enter the Company Name: Yahoo
    Enter the complete url of the company: https://www.yahoo.com/
    /bin/sh: 1: Syntax error: "(" unexpected

And the output in ip_address.txt is:

    hoo.com/ not found: 3(NXDOMAIN)

As seen above, the program prints the IP as 106.10.138.240 at runtime, yet it saves something different in ip_address.txt. I also could not figure out where the /bin/sh syntax error comes from. Please help.

Sorry I don't have enough reputation to add comments, so I'll post my suggestions here.

I think the problem comes from process = os.popen(command) in get_ip_address(url). You can print command to check whether it is valid.

Besides the problem, just a few suggestions:

  1. Try not to use * in imports, since it makes the code harder for readers to trace.

  2. Learn pdb, the Python debugger. It is simple but powerful enough for small or even medium-sized projects. The simplest way to use it is to add import pdb; pdb.set_trace() before the line where you want the program to stop, so that you can then step through your code line by line.

I second Joe Lin's suggestion to not use wildcards in your import statements. It pollutes your namespace greatly and may yield bizarre behavior.

Python is "batteries included" and has a rich third-party ecosystem, so you should probably leverage the requests and urllib3 packages for HTTP requests, use subprocess cautiously for executing commands, and check out the scrapy package for web scraping. The objects and methods these packages provide may already return the data you are trying to extract.

Be as lazy as possible and rely on "prior art."
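As a minimal sketch of using subprocess cautiously, the snippet below runs host with an argument list instead of a shell string, so characters such as "(" in the argument are never interpreted by /bin/sh. The function names parse_host_output and lookup_with_host are hypothetical, and the parser assumes host prints lines of the form "<name> has address <IPv4>", as the question's own code does.

```python
import re
import subprocess

# Matches lines such as "www.yahoo.com has address 106.10.138.240".
# The line format is an assumption based on the question's code.
ADDRESS_LINE = re.compile(r"has address (\d{1,3}(?:\.\d{1,3}){3})")


def parse_host_output(output):
    """Return every IPv4 address found in the text printed by `host`."""
    return ADDRESS_LINE.findall(output)


def lookup_with_host(hostname):
    """Run `host <hostname>` without a shell and return the IPv4 addresses."""
    # Passing a list means no shell is involved, so the argument is handed
    # to `host` verbatim instead of being parsed as shell syntax.
    completed = subprocess.run(
        ["host", hostname],
        capture_output=True,  # requires Python 3.7+
        text=True,
        check=False,
    )
    return parse_host_output(completed.stdout)
```

A failed lookup then yields an empty list rather than a slice of the error message, which the caller can check before passing anything on to nmap.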

In the first few lines of get_ip_address I notice the following:

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        ....

If I executed this command via a shell, it would literally mirror this:

    host http://www.foo.com

Doing a man host and reading the man page:

   host is a simple utility for performing DNS lookups. It is normally
   used to convert names to IP addresses and vice versa. When no arguments
   or options are given, host prints a short summary of its command line
   arguments and options.

   name is the domain name that is to be looked up. It can also be a
   dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
   case host will by default perform a reverse lookup for that address.
   server is an optional argument which is either the name or IP address
   of the name server that host should query instead of the server or
   servers listed in /etc/resolv.conf.

You are giving host a URL, when it only wants either an IP address or a hostname. URLs include the scheme, hostname, and path. You will have to extract the hostname explicitly to make host work the way you have chosen to interact with it. Given that URLs may or may not include path information, you have to unravel it:

    url = "http://www.yahoo.com/some_random/path"

    # Split on "//" to drop the scheme
    _, host_and_path = url.split("//", 1)

    # Use .partition() so this also works when the URL has no path
    hostname, _, path = host_and_path.partition("/")

    # Use 'hostname' as input to the command
    command = "host " + hostname
    ...
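In the spirit of relying on prior art, the same unravelling can be delegated to the standard library. This is only a sketch: hostname_from_url and resolve_url are hypothetical names, and socket.gethostbyname returns a single IPv4 address, which may or may not be what the rest of the crawler needs.

```python
import socket
from urllib.parse import urlsplit


def hostname_from_url(url):
    """Return the hostname part of a URL, or None if there is none."""
    # urlsplit handles the scheme, port, path and trailing slashes for us.
    return urlsplit(url).hostname


def resolve_url(url):
    """Resolve the URL's hostname to one IPv4 address (as a string)."""
    hostname = hostname_from_url(url)
    if hostname is None:
        raise ValueError("no hostname in URL: " + repr(url))
    # No external `host` command and no shell are involved here.
    return socket.gethostbyname(hostname)
```

Calling resolve_url("https://www.yahoo.com/") would perform a real DNS lookup, so the result depends on the network and resolver in use. Note that a string without a scheme, such as "www.yahoo.com", has no hostname as far as urlsplit is concerned, which is why the None check is there.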

I do not believe the question shows all of the code related to this problem. The error output appears to come from the shell rather than being a traditional Python stack trace. Perhaps one of the get_something functions uses Popen to run the shell commands you want.
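One reading that fits the contents of ip_address.txt can be traced by hand. The sketch below assumes that host prints "Host <name> not found: 3(NXDOMAIN)" when a lookup fails, and that the retry in get_ip_address ran host on "www.yahoo.com/" (the entered URL with its first 8 characters removed but the trailing slash kept). The helper name slice_after_marker is hypothetical and only mimics the slicing in the question.

```python
def slice_after_marker(results):
    """Mimic the slicing done in the question's get_ip_address."""
    # str.find() returns -1 when "has address" is missing, so marker
    # becomes 11 and the slice starts in the middle of the error message.
    marker = results.find("has address") + 12
    return results[marker:].splitlines()[0]


# Assumed output of `host www.yahoo.com/` (note the trailing slash).
failed_lookup = "Host www.yahoo.com/ not found: 3(NXDOMAIN)\n"

garbage = slice_after_marker(failed_lookup)
print(garbage)  # hoo.com/ not found: 3(NXDOMAIN)
```

If that string is then handed to get_nmap, and get_nmap builds a shell command by concatenation (its code is not shown, so this is a guess), the unquoted "(" would plausibly produce the /bin/sh syntax error. The 106.10.138.240 printed before the prompts would then come from the print call at the bottom of the IP address snippet, which runs on import with a URL that has no trailing slash.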
