
Creating a database of links using sqlite on python

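I wrote the following code, which follows an initial page to two new pages and repeats this process over 4 levels, recording all the URLs found on those pages. I want to create a database of every link I encounter. If I have visited a page (i.e. followed it to get access to more links), I want to record a 1 for that link; if I have not visited the page, I want to record a 0.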

def getlinks(xurl):
    # given a Wikipedia article url,
    # return all links on that page to Wikipedia articles
    # (really should add error checking)
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    xlinks = [] # initialize list of links
    hpage = urlopen(xurl) # read/open page
    bs = BeautifulSoup(hpage, 'html.parser') # parse page

    # find all links in div named 'bodyContent'
    # such that they start with '/wiki/' and contain no colon
    for link in bs.find('div', {'id':'bodyContent'}).find_all('a',
            href=re.compile('^(/wiki/)((?!:).)*$')):
        if 'href' in link.attrs:
            # make the url complete and add to list
            xlinks.append('https://en.wikipedia.org{}'.format(link.attrs['href']))
    return xlinks # return list of urls

maxlevel = 4 # levels deep to follow

# branches to follow from each page on each level before the last
numbranches = 2
tasks = [] # initialize task list
mastlinks = set() # initialize master set of urls
iurl = 'https://en.wikipedia.org/wiki/Kevin_Bacon' # first page
ilevel = 1 # first (top) level
mastlinks.add(iurl) # add first page to master set

# add current level and page to tasks
tasks.append((ilevel, iurl))

import sqlite3
import csv
import os

visited = 0

db_connection = sqlite3.connect('dd5.db')
cursor = db_connection.cursor()
cretab = '''CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)'''
cursor.execute(cretab)


for ix in range(40): # do no more than 40 pages
    if not tasks: # if no more tasks, we're done
        break

    # remove next task level, url from end of task list
    level, url = tasks.pop()
    print('\n', ix, 'level', level, url)
    visited = 1
    links = getlinks(url) # get links from current page
    cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (links, visited))
    print(len(links), 'article links')
    ulinks = set(links)
    print(len(ulinks), 'unique article links')
    newlinks = ulinks.difference(mastlinks)
    mastlinks = mastlinks.union(newlinks)
    print(len(newlinks), 'new unique article links')
    linklist = list(newlinks)
    cursor.execute("UPDATE links SET visited=? WHERE link=?", (visited, links))
    print('sample links:')
    for link in linklist[:10]:
        print(link)
    if level < maxlevel:
        for link in linklist[:numbranches]:
            print('following', link)

            # add next level link to tasks
            tasks.append((level + 1, link))

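I keep getting an "InterfaceError: Error binding parameter 0 - probably unsupported type." error. I am also not sure whether I have placed the sqlite-related code correctly, since I am a beginner at this. Can you help? Thanks!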

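Well, links is a list. You can't INSERT or UPDATE a list with .execute(): each ? placeholder is bound to a single value, such as one URL string.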
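To see the failure in isolation, here is a minimal sketch that binds a whole list to one placeholder. It assumes an in-memory database and two made-up example URLs instead of the scraper's real output. The exact exception class and message depend on the Python version, so the sketch catches the common base class sqlite3.Error rather than relying on one of them.

```python
import sqlite3

# Assumption: a throwaway in-memory database stands in for dd5.db.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)")

# Assumption: made-up URLs standing in for the result of getlinks(url).
links = [
    "https://en.wikipedia.org/wiki/A",
    "https://en.wikipedia.org/wiki/B",
]

try:
    # The first placeholder receives the entire list, which sqlite3 cannot bind.
    cur.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (links, 1))
    failed = False
except sqlite3.Error as exc:
    failed = True
    print(type(exc).__name__, exc)

# Nothing was inserted by the failed statement.
row_count = cur.execute("SELECT COUNT(*) FROM links").fetchone()[0]
print(row_count)
```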

You can iterate over the list:

for link in links:
    cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (link, visited))

Another potential solution is to use .executemany(), which you could use like this:

to_insert = []
for link in links:
    to_insert.append((link,visited))
cursor.executemany("INSERT OR IGNORE INTO links VALUES (?, ?)", to_insert)
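The .executemany() approach can be tried end to end with a small self-contained sketch. It assumes an in-memory database and made-up URLs (one of them duplicated) in place of the scraped links, and it records new links as unvisited. Because link is the PRIMARY KEY, INSERT OR IGNORE silently skips the duplicate.

```python
import sqlite3

# Assumption: a throwaway in-memory database stands in for dd5.db.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)")

# Assumption: made-up URLs; the first one appears twice on purpose.
links = [
    "https://en.wikipedia.org/wiki/A",
    "https://en.wikipedia.org/wiki/B",
    "https://en.wikipedia.org/wiki/A",
]
visited = 0  # newly discovered links have not been followed yet

# One (link, visited) tuple per row; each ? is bound to a single value.
to_insert = [(link, visited) for link in links]
cur.executemany("INSERT OR IGNORE INTO links VALUES (?, ?)", to_insert)
conn.commit()  # without a commit, changes to a file database are not saved

rows = cur.execute("SELECT link, visited FROM links ORDER BY link").fetchall()
print(rows)
```

With a file-backed database such as dd5.db, the commit() call matters: the sqlite3 module opens a transaction before INSERT and UPDATE statements, and uncommitted changes are lost when the connection closes.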

Your UPDATE code has a similar problem, and the information above applies to that query as well.
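For the UPDATE, one possible arrangement is sketched below. It rests on a reading of the question, not on anything the answer states: the visited flag is taken to belong to the page that was actually followed, while the links found on it are stored as unvisited. The helper name record_page and the example URLs are hypothetical.

```python
import sqlite3

# Assumption: a throwaway in-memory database stands in for dd5.db.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)")


def record_page(cur, url, links):
    """Hypothetical helper: store the links found on url, then mark url as visited."""
    # Newly discovered links go in as unvisited; existing rows are left alone.
    cur.executemany(
        "INSERT OR IGNORE INTO links VALUES (?, 0)",
        [(link,) for link in links],
    )
    # Make sure the followed page has a row, then flag it.
    cur.execute("INSERT OR IGNORE INTO links VALUES (?, 0)", (url,))
    # A single string is bound to the placeholder, not a list.
    cur.execute("UPDATE links SET visited = 1 WHERE link = ?", (url,))


base = "https://en.wikipedia.org/wiki/"
record_page(cur, base + "Kevin_Bacon", [base + "A", base + "B"])
record_page(cur, base + "A", [base + "Kevin_Bacon", base + "C"])
conn.commit()

status = dict(cur.execute("SELECT link, visited FROM links").fetchall())
print(status)
```

In the crawl loop from the question, a call like record_page(cursor, url, links) would take the place of the two failing cursor.execute() calls, with db_connection.commit() after the loop.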
