Running a simple Python code

I tried to run a simple crawler in Python:

import sys
import csv
import socket
import sqlite3
import logging
from optparse import OptionParser
from urlparse import urlparse
#pip install requests
import requests

# FUNCTION process_row_to_db.
#  handle one row and push to the DB

def process_row_to_db(conn, data_row, comment, hostname):
    exchange_host     = ''
    seller_account_id = ''
    account_type      = ''
    tag_id            = ''

    if len(data_row) >= 3:
        exchange_host     = data_row[0].lower()
        seller_account_id = data_row[1].lower()
        account_type      = data_row[2].lower()

    if len(data_row) == 4:
        tag_id            = data_row[3].lower()

    #data validation heuristics
    data_valid = 1

    # Minimum length of a domain name is 1 character, not including extensions.
    # Domain Name Rules - Nic AG
    # www.nic.ag/rules.htm
    if(len(hostname) < 3):
        data_valid = 0

    if(len(exchange_host) < 3):
        data_valid = 0

    # could be single digit integers
    if(len(seller_account_id) < 1):
        data_valid = 0

    ## ads.txt supports 'DIRECT' and 'RESELLER'
    if(len(account_type) < 6):
        data_valid = 0

    if(data_valid > 0):
        logging.debug( "%s | %s | %s | %s | %s | %s" % (hostname, exchange_host, seller_account_id, account_type, tag_id, comment))

        # Insert a row of data using bind variables (protect against sql injection)
        c = conn.cursor()
        c.execute(insert_stmt, (hostname, exchange_host, seller_account_id, account_type, tag_id, comment))

        # Save (commit) the changes
        conn.commit()
        return 1

    return 0

# end process_row_to_db  #####

# FUNCTION crawl_to_db.
#  crawl the URLs, parse the data, validate and dump to a DB

def crawl_to_db(conn, crawl_url_queue):

    rowcnt = 0

    myheaders = {
            'User-Agent': 'AdsTxtCrawler/1.0; +https://github.com/InteractiveAdvertisingBureau/adstxtcrawler',
            'Accept': 'text/plain',
        }

    for aurl in crawl_url_queue:
        ahost = crawl_url_queue[aurl]
        logging.info(" Crawling  %s : %s " % (aurl, ahost))
        r = requests.get(aurl, headers=myheaders)
        logging.info("  %d" % r.status_code)

        if(r.status_code == 200):
            logging.debug("%s" % r.text)

            tmpfile = 'tmpads.txt'
            with open(tmpfile, 'wb') as tmp_csv_file:
                # write the raw response body so it can be re-read as CSV below
                tmp_csv_file.write(r.content)

            with open(tmpfile, 'rb') as tmp_csv_file:
                #read the line, split on first comment and keep what is to the left (if any found)
                line_reader = csv.reader(tmp_csv_file, delimiter='#', quotechar='|')
                comment = ''

                for line in line_reader:
                    logging.debug("DATA:  %s" % line)

                    try:
                        data_line = line[0]
                    except IndexError:
                        data_line = ""

                    #determine delimiter, conservative = do it per row
                    if data_line.find(",") != -1:
                        data_delimiter = ','
                    elif data_line.find("\t") != -1:
                        data_delimiter = '\t'
                    else:
                        data_delimiter = ' '

                    data_reader = csv.reader([data_line], delimiter=data_delimiter, quotechar='|')
                    for row in data_reader:

                        if len(row) > 0 and row[0].startswith( '#' ):
                            continue

                        if (len(line) > 1) and (len(line[1]) > 0):
                             comment = line[1]

                        rowcnt = rowcnt + process_row_to_db(conn, row, comment, ahost)

    return rowcnt

# end crawl_to_db  #####

# FUNCTION load_url_queue
#  Load the target set of URLs and reduce to an ads.txt domains queue

def load_url_queue(csvfilename, url_queue):
    cnt = 0

    with open(csvfilename, 'rb') as csvfile:
        targets_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in targets_reader:

            if len(row) < 1 or row[0].startswith( '#' ):
                continue

            for item in row:
                host = "localhost"

                if  "http:" in item or "https:" in item :
                    logging.info( "URL: %s" % item)
                    parsed_uri = urlparse(row[0])
                    host = parsed_uri.netloc
                else:
                    host = item
                    logging.info( "HOST: %s" % item)

            skip = 0

            try:
                #print "Checking DNS: %s" % host
                ip = socket.gethostbyname(host)

                if "127.0.0" in ip:
                    skip = 0 #swap to 1 to skip localhost testing
                elif "0.0.0.0" in ip:
                    skip = 1
                else:
                    logging.info("  Validated Host IP: %s" % ip)
            except socket.error:
                # DNS lookup failed, so do not queue this host
                skip = 1

            if(skip < 1):
                ads_txt_url = 'http://{thehost}/ads.txt'.format(thehost=host)
                logging.info("  pushing %s" % ads_txt_url)
                url_queue[ads_txt_url] = host
                cnt = cnt + 1

    return cnt

# end load_url_queue  #####

#### MAIN ####

arg_parser = OptionParser()
arg_parser.add_option("-t", "--targets", dest="target_filename",
                  help="list of domains to crawl ads.txt from", metavar="FILE")
arg_parser.add_option("-d", "--database", dest="target_database",
                  help="Database to dump crawled data into", metavar="FILE")
arg_parser.add_option("-v", "--verbose", dest="verbose", action='count',
                  help="Increase verbosity (specify multiple times for more)")

(options, args) = arg_parser.parse_args()

if len(sys.argv)==1:
    arg_parser.print_help()
    sys.exit(1)

log_level = logging.WARNING # default
if options.verbose == 1:
    log_level = logging.INFO
elif options.verbose >= 2:
    log_level = logging.DEBUG
logging.basicConfig(filename='adstxt_crawler.log',level=log_level,format='%(asctime)s %(filename)s:%(lineno)d:%(levelname)s  %(message)s')

crawl_url_queue = {}
conn = None
cnt_urls = 0
cnt_records = 0

cnt_urls = load_url_queue(options.target_filename, crawl_url_queue)

if (cnt_urls > 0) and options.target_database and (len(options.target_database) > 1):
    conn = sqlite3.connect(options.target_database)

with conn:
    cnt_records = crawl_to_db(conn, crawl_url_queue)
    if(cnt_records > 0):
        conn.commit()

print "Wrote %d records from %d URLs to %s" % (cnt_records, cnt_urls, options.target_database)

logging.warning("Wrote %d records from %d URLs to %s" % (cnt_records, cnt_urls, options.target_database))

I'm using Python 2.7.9. I tried to install sqlite with this command:

python -m pip install sqlite

I got this back:

Downloading/unpacking sqlite3
  Could not find any downloads that satisfy the requirement sqlite3
Cleaning up...
No distributions at all found for sqlite3
Storing debug log for failure in ...\\pip.log

The first step would be this command:

$ sqlite3 adstxt.db < adstxt_crawler.sql

I got this:

'sqlite3' is not recognized as an internal or external command, operable program or batch file.

I know it's very basic, but I haven't found any relevant help. If you could help me, I would really appreciate it. Thanks.

Adam

The first error:

'sqlite3' is not recognized as an internal or external command, operable program or batch file.

This is because you are trying to run the sqlite3 command-line tool, which is not installed on your system (or is not on your PATH). Python ships with the sqlite3 module in its standard library, which is also why pip finds nothing to install, but it does not provide the standalone sqlite3 command.
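If installing the command-line tool is not an option, the schema script can probably be loaded from Python itself, since the standard-library sqlite3 module offers executescript(). The following is only a minimal sketch: run_sql_script is a hypothetical helper name, and it assumes the .sql file contains plain SQL statements with no sqlite3 shell dot-commands.

```python
import sqlite3


def run_sql_script(db_path, script_path):
    """Run every SQL statement in script_path against the database at db_path.

    Hypothetical helper: stands in for `sqlite3 db_path < script_path`
    when the sqlite3 command-line tool is not available.
    """
    with open(script_path, "r") as script_file:
        sql_text = script_file.read()

    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    try:
        conn.executescript(sql_text)  # executes all statements in the script
        conn.commit()
    finally:
        conn.close()
```

With the file names from the question, the call would be run_sql_script("adstxt.db", "adstxt_crawler.sql").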

The second error is a syntax error. In Python 3, print is a built-in function, so it must be called with parentheses:

print('hello world')

You probably tried to run Python 2 code with a Python 3 interpreter.
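As a rough illustration of the two spots in the script that differ between the interpreters, the sketch below should run under either Python 2 or Python 3. The helper summary_line is a hypothetical name introduced here; the message format is the one the script prints at the end.

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse      # Python 2 location


def summary_line(cnt_records, cnt_urls, database):
    """Build the final status message (hypothetical helper for this sketch)."""
    return "Wrote %d records from %d URLs to %s" % (cnt_records, cnt_urls, database)


# A single parenthesized argument is accepted by print on both versions.
print(summary_line(3, 1, "adstxt.db"))
print(urlparse("http://example.com/ads.txt").netloc)
```

For print calls with several arguments, placing from __future__ import print_function at the very top of the file is the usual way to get the Python 3 behaviour on Python 2.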
