I'm attempting to extract some links from craigslist using beautifulsoup but it's pulling the links 100 times rather than once
So I'm trying to pull the links for the newest TV listings from craigslist. I've gotten to the point where I get the information I need, but for some reason it pulls that information about 100 times before moving on to the next link. I'm not sure why it's doing this?
import urllib2
from bs4 import BeautifulSoup
import re
import time
import csv

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# id url
url = ('http://omaha.craigslist.org/sya/')
# this opens the url
ourUrl = opener.open(url).read()
# now we are passing the url to beautiful soup
soup = BeautifulSoup(ourUrl)

for link in soup.findAll('a', attrs={'class': re.compile("hdrlnk")}):
    find = re.compile('/sys/(.*?)"')
    #time.sleep(1)
    timeset = time.strftime("%m-%d %H:%M")  # current date and time
    for linka in soup.findAll('a', attrs={'href': re.compile("^/sys/")}):
        find = re.compile('/sys/(.*?)"')
        searchTv = re.search(find, str(link))
        Tv = searchTv.group(1)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        url = ('http://omaha.craigslist.org/sys/' + Tv)
        ourUrl = opener.open(url).read()
        soup = BeautifulSoup(ourUrl)
        print "http://omaha.craigslist.org/sys/" + Tv
        try:
            outfile = open('C:/Python27/Folder/Folder/Folder/craigstvs.txt', 'a')
            outfile.write(timeset + "; " + link.text + "; " + "http://omaha.craigslist.org/sys/" + Tv + '\n')
            timeset = time.strftime("%m-%d %H:%M")  # current date and time
        except:
            print "No go--->" + str(link.text)
Here's an example of the output:

08-10 15:19; MAC mini intel core wifi dvdrw great cond; http://omaha.craigslist.org/sys/4612480593.html

That is exactly what I'm trying to accomplish, except it extracts that information 100 or so times before moving on to the next listing. I'm stuck and can't figure it out. Any help would be appreciated, thanks in advance!
Scrapy attempt for @alexce:
import scrapy
import csv
from tutorial.items import DmozItem
import re
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://omaha.craigslist.org"]
    start_urls = [
        "http://omaha.craigslist.org/sya/",
    ]

    def parse(self, response):
        for sel in response.xpath('//html'):
            #title = sel.xpath('a/text()').extract()
            link = sel.xpath('/html/body/article/section/div/div[2]/p/span/span[2]/a').extract()[0:4]
            #at this point it doesn't repeat itself, which is good!
            #desc = sel.xpath('text()').extract()
            print link
You don't need the nested loop here. Other notes/improvements:

- the result of opener.open() can be passed directly to the BeautifulSoup constructor; no read() needed
- the opener can be defined once and reused inside the loop to follow links
- use find_all() instead of findAll()
- use urljoin() to concatenate URL parts
- use the csv module to write the delimited data
- use the with context manager when working with the file

The complete fixed version:
import csv
import re
import time
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL))

with open(FILENAME, 'a') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")
        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url))
        # do smth with the item_soup? or why did you need to follow this link?
        writer.writerow([timeset, link.text, item_url])
Here's what the code produces:
08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;computer????;http://omaha.craigslist.org/sys/4612637389.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw ;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond ;http://omaha.craigslist.org/sys/4612480593.html
...
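As an aside on the root cause: the original snippet nests one findAll() loop inside another, so the inner body (the printing and file writing) runs once for every *pair* of links, roughly N × M times instead of N. A minimal sketch of that behavior, using hypothetical stand-in lists in place of the findAll() results:

```python
# Stand-ins for the two findAll() result lists (hypothetical sizes:
# 3 "hdrlnk" anchors outside, 100 "/sys/" anchors inside).
outer_links = ['link%d' % i for i in range(3)]
inner_links = ['link%d' % i for i in range(100)]

writes = 0
for link in outer_links:
    for linka in inner_links:   # nested loop: body runs for EVERY inner link
        writes += 1             # the original code printed/wrote the file here

print(writes)  # 300 writes for only 3 listings; a single loop would do 3
```

Dropping the inner loop, as in the fixed version above, makes the body run exactly once per listing.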
Just as a side note: since you need to follow links, get the data, and output it to a csv file, it sounds like Scrapy would be a great fit here. It has Rules and Link Extractors, and it can serialize crawled items to csv out of the box.