No such File or Directory in Python Web Crawling?
I am a newbie to Python. I want to extract the names of the categories and pages of a Wikipedia page by crawling it.
While doing this I run into the following error.
Downloading
Traceback (most recent call last):
  File "C:\Users\SIBA\Desktop\PDF\Code\trialcode.py", line 100, in <module>
    printTree(name, 0)
  File "C:\Users\SIBA\Desktop\PDF\Code\trialcode.py", line 80, in printTree
    content = open("categories/Category:"+catName+".html").readlines()
FileNotFoundError: [Errno 2] No such file or directory: 'categories/Category:Cricket.html'
The code I have tried is below. I am using Python 3.6.
#Imports
import httplib2
from bs4 import BeautifulSoup
import subprocess
import time
import os,sys
os.path.dirname(sys.argv[0])
#declarations
catRoot = "http://en.wikipedia.org/wiki/Category:"
MAX_DEPTH = 100
done = []
ignore = []
# Removes all newline characters and replaces with spaces
def removeNewLines(in_text):
    return in_text.replace('\n', ' ')
# Downloads a link into the destination
def download(link, dest):
    # print link
    if not os.path.exists(dest) or os.path.getsize(dest) == 0:
        subprocess.getoutput('wget "' + link + '" -O "' + dest+ '"')
        print ("Downloading")
def ensureDir(f):
    if not os.path.exists(f):
        os.makedirs(f)
# Cleans a text by removing tags
def clean(in_text):
    s_list = list(in_text)
    i,j = 0,0
    while i < len(s_list):
        # iterate until a left-angle bracket is found
        if s_list[i] == '<':
            if s_list[i+1] == 'b' and s_list[i+2] == 'r' and s_list[i+3] == '>':
                i=i+1
                print (hello)
                continue
            while s_list[i] != '>':
                # pop everything from the left-angle bracket until the right-angle bracket
                s_list.pop(i)
            # pops the right-angle bracket, too
            s_list.pop(i)
        elif s_list[i] == '\n':
            s_list.pop(i)
        else:
            i=i+1
    # convert the list back into text
    join_char=''
    return (join_char.join(s_list))#.replace("<br>","\n")
# Gets bullets
def getBullets(content):
    mainSoup = BeautifulSoup(contents)
# Gets empty bullets
def getAllBullets(content):
    mainSoup = BeautifulSoup(str(content))
    subcategories = mainSoup.findAll('div',attrs={"class" : "CategoryTreeItem"})
    empty = []
    full = []
    for x in subcategories:
        subSoup = BeautifulSoup(str(x))
        link = str(subSoup.findAll('a')[0])
        if (str(x)).count("CategoryTreeEmptyBullet") > 0:
            empty.append(clean(link).replace(" ","_"))
        elif (str(x)).count("CategoryTreeBullet") > 0:
            full.append(clean(link).replace(" ","_"))
    return((empty,full))
def printTree(catName, count):
    catName = catName.replace("\\'","'")
    if count == MAX_DEPTH: return
    path='trivial'
    download(catRoot+catName, path)
    content = open("categories/Category:"+catName+".html").readlines()
    (emptyBullets,fullBullets) = getAllBullets(content)
    f.close()
    for x in emptyBullets:
        for i in range(count): print (" "),
        download(catRoot+x, "categories/Category:"+x+".html")
        print (x)
    for x in fullBullets:
        for i in range(count): print (" "),
        print (x)
        if x in done:
            print ("Done... "+x)
            continue
        done.append(x)
        try: printTree(x, count + 1)
        except: print ("ERROR: " + x)
name = "Cricket"
printTree(name, 0)
As @AS Mackay pointed out, you are using
download(catRoot+x, "categories/Category:"+x+".html")
You should use
download(catRoot+x, "categories/Category/"+x+".html")
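A minimal sketch of that fix, under a few assumptions. The helper names `ensure_dir`, `category_path` and `read_category` are hypothetical and do not appear in the original post, and the `categories/Category` directory layout is taken from the suggested path. The idea is to build the local path in one place, so the download target and the `open` call cannot drift apart (in the posted `printTree`, the page is downloaded to `'trivial'` but read from `categories/Category:Cricket.html`), and to keep the colon out of the file name, since Windows does not allow `:` in file names.

```python
import os

# Assumed local layout, following the path suggested in the answer.
CATEGORY_DIR = os.path.join("categories", "Category")


def ensure_dir(path):
    """Create the directory (and any parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)


def category_path(cat_name, base_dir=CATEGORY_DIR):
    """Return the local file path for a category page (hypothetical helper).

    The file name contains no ':' because Windows rejects it.
    """
    return os.path.join(base_dir, cat_name + ".html")


def read_category(cat_name, base_dir=CATEGORY_DIR):
    """Return the lines of a saved category page, or None if it is missing."""
    path = category_path(cat_name, base_dir)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return f.readlines()
```

In `printTree`, both calls would then share one path: for example `path = category_path(catName)`, followed by `ensure_dir(CATEGORY_DIR)`, `download(catRoot + catName, path)` and `read_category(catName)`. A `None` result means the download did not produce a file, which is worth checking before parsing.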