Creating a database of links using sqlite on python
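I wrote the following code, which follows an initial page to two new pages and repeats this process over 4 levels, recording all the URLs found on those pages. I want to create a database of every link I encounter. If I visited a page (that is, followed it to get access to more links), I want to record a 1 for that link; if I did not visit the page, I want to record a 0.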
def getlinks(xurl):
    # given a Wikipedia article url,
    # return all links on that page to Wikipedia articles
    # (really should add error checking)
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    xlinks = []  # initialize list of links
    hpage = urlopen(xurl)  # read/open page
    bs = BeautifulSoup(hpage, 'html.parser')  # parse page
    # find all links in div named 'bodyContent'
    # such that they start with '/wiki/' and contain no colon
    for link in bs.find('div', {'id':'bodyContent'}).find_all('a',
            href=re.compile('^(/wiki/)((?!:).)*$')):
        if 'href' in link.attrs:
            # make the url complete and add to list
            xlinks.append('https://en.wikipedia.org{}'.format(link.attrs['href']))
    return xlinks  # return list of urls
maxlevel = 4 # levels deep to follow
# branches to follow from each page on each level before the last
numbranches = 2
tasks = [] # initialize task list
mastlinks = set() # initialize master set of urls
iurl = 'https://en.wikipedia.org/wiki/Kevin_Bacon' # first page
ilevel = 1 # first (top) level
mastlinks.add(iurl) # add first page to master set
# add current level and page to tasks
tasks.append((ilevel, iurl))
import sqlite3
import csv
import os
visited = 0
db_connection = sqlite3.connect('dd5.db')
cursor = db_connection.cursor()
cretab = '''CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)'''
cursor.execute(cretab)
for ix in range(40):  # do no more than 40 pages
    if not tasks:  # if no more tasks, we're done
        break
    # remove next task level, url from end of task list
    level, url = tasks.pop()
    print('\n', ix, 'level', level, url)
    visited = 1
    links = getlinks(url)  # get links from current page
    cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (links, visited))
    print(len(links), 'article links')
    ulinks = set(links)
    print(len(ulinks), 'unique article links')
    newlinks = ulinks.difference(mastlinks)
    mastlinks = mastlinks.union(newlinks)
    print(len(newlinks), 'new unique article links')
    linklist = list(newlinks)
    cursor.execute("UPDATE links SET visited=? WHERE link=?", (visited, links))
    print('sample links:')
    for link in linklist[:10]:
        print(link)
    if level < maxlevel:
        for link in linklist[:numbranches]:
            print('following', link)
            # add next level link to tasks
            tasks.append((level + 1, link))
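I keep getting the error "InterfaceError: Error binding parameter 0 - probably unsupported type." I'm also not sure whether I have placed the sqlite-related code correctly, since I'm a beginner at this. Can you help? Thanks!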
Well, links is a list. You can't INSERT or UPDATE a list with .execute(), because each ? placeholder accepts a single value.
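The failure can be reproduced in isolation. The sketch below is only an illustration: the in-memory database and the sample URLs are assumptions standing in for dd5.db and the result of getlinks(). It catches the base class sqlite3.Error because the exact exception subclass raised for an unsupported parameter type differs between Python versions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database (assumption)
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)')

# hypothetical sample data standing in for the result of getlinks()
links = ['https://en.wikipedia.org/wiki/A', 'https://en.wikipedia.org/wiki/B']

error = None
try:
    # a whole list bound to a single ? placeholder is rejected
    cur.execute('INSERT OR IGNORE INTO links VALUES (?, ?)', (links, 1))
except sqlite3.Error as exc:
    error = exc

print(type(error).__name__, error)
conn.close()
```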
You can iterate over the list:
for link in links:
    cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (link, visited))
Another potential solution is to use .executemany(), which you can use like this:
to_insert = []
for link in links:
    to_insert.append((link, visited))
cursor.executemany("INSERT OR IGNORE INTO links VALUES (?, ?)", to_insert)
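For a version that runs on its own, here is a minimal sketch of the .executemany() approach. It assumes an in-memory database and made-up URLs (neither comes from the question), and builds the parameter tuples with a list comprehension instead of an explicit loop.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database (assumption)
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)')

# hypothetical links; the third entry deliberately repeats the first
links = ['https://en.wikipedia.org/wiki/A',
         'https://en.wikipedia.org/wiki/B',
         'https://en.wikipedia.org/wiki/A']
visited = 0

# one (link, visited) tuple per row; each ? now receives a single value
cur.executemany('INSERT OR IGNORE INTO links VALUES (?, ?)',
                [(link, visited) for link in links])
conn.commit()  # without a commit, pending rows are lost when the connection closes

# the repeated URL is skipped by OR IGNORE because link is the primary key
row_count = cur.execute('SELECT COUNT(*) FROM links').fetchone()[0]
print(row_count)
conn.close()
```

Note that the code in the question never calls commit() on db_connection, so it would likely need one before the script exits for the rows to be saved to the database file.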
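Your UPDATE code has a similar problem: it binds the whole links list to one placeholder, so the information above applies to that query as well.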
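One way the two queries could fit together is sketched below. This is a guess at the intended behaviour described in the question (1 for pages that were followed, 0 for links that were only seen), not the asker's or answerer's code. The helper record_page and the sample URLs other than the Kevin_Bacon start page are hypothetical.

```python
import sqlite3

def record_page(cursor, url, links):
    """Store every discovered link as unvisited, then flag the crawled page."""
    # hypothetical helper: one row per discovered link, visited = 0
    cursor.executemany('INSERT OR IGNORE INTO links VALUES (?, ?)',
                       [(link, 0) for link in links])
    # make sure the crawled page itself has a row, then mark it as visited
    cursor.execute('INSERT OR IGNORE INTO links VALUES (?, ?)', (url, 0))
    cursor.execute('UPDATE links SET visited=? WHERE link=?', (1, url))

conn = sqlite3.connect(':memory:')  # in-memory database (assumption)
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS links (link TEXT PRIMARY KEY, visited BIT)')

page = 'https://en.wikipedia.org/wiki/Kevin_Bacon'
# made-up links standing in for getlinks(page)
found = ['https://en.wikipedia.org/wiki/A', 'https://en.wikipedia.org/wiki/B']
record_page(cur, page, found)
conn.commit()

rows = dict(cur.execute('SELECT link, visited FROM links').fetchall())
conn.close()
```

In the crawl loop, a call such as record_page(cursor, url, links) would replace both the INSERT and the UPDATE, since the UPDATE now receives the single url string rather than the list.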