I'm attempting to extract some links from craigslist using beautifulsoup but it's pulling the links 100 times rather than once
So I'm trying to pull the links for the newest TV listings from craigslist. I've gotten to the point where I get the information I need, but for some reason it pulls that information about 100 times before moving on to the next link. I'm not sure why it's doing this?
import urllib2
from bs4 import BeautifulSoup
import re
import time
import csv

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# id url
url = ('http://omaha.craigslist.org/sya/')
# this opens the url
ourUrl = opener.open(url).read()
# now we are passing the url to beautiful soup
soup = BeautifulSoup(ourUrl)

for link in soup.findAll('a', attrs={'class': re.compile("hdrlnk")}):
    find = re.compile('/sys/(.*?)"')
    #time.sleep(1)
    timeset = time.strftime("%m-%d %H:%M")  # current date and time
    for linka in soup.findAll('a', attrs={'href': re.compile("^/sys/")}):
        find = re.compile('/sys/(.*?)"')
        searchTv = re.search(find, str(link))
        Tv = searchTv.group(1)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        url = ('http://omaha.craigslist.org/sys/' + Tv)
        ourUrl = opener.open(url).read()
        soup = BeautifulSoup(ourUrl)
        print "http://omaha.craigslist.org/sys/" + Tv
        try:
            outfile = open('C:/Python27/Folder/Folder/Folder/craigstvs.txt', 'a')
            outfile.write(timeset + "; " + link.text + "; " + "http://omaha.craigslist.org/sys/" + Tv + '\n')
            timeset = time.strftime("%m-%d %H:%M")  # current date and time
        except:
            print "No go--->" + str(link.text)
Here's an example of the output:

08-10 15:19; MAC mini intel core wifi dvdrw great cond; http://omaha.craigslist.org/sys/4612480593.html

That is exactly what I'm trying to accomplish, except it extracts that information 100 or so times before moving on to the next listing. I'm stuck and can't figure it out. Any help would be appreciated, thanks in advance!
Scrapy attempt for @alexce:
import scrapy
import csv
from tutorial.items import DmozItem
import re
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://omaha.craigslist.org"]
    start_urls = [
        "http://omaha.craigslist.org/sya/",
    ]

    def parse(self, response):
        for sel in response.xpath('//html'):
            #title = sel.xpath('a/text()').extract()
            link = sel.xpath('/html/body/article/section/div/div[2]/p/span/span[2]/a').extract()[0:4]
            #at this point it doesn't repeat itself, which is good!
            #desc = sel.xpath('text()').extract()
            print link
You don't need the nested loop here. Other notes/improvements:

- the result of opener.open() can be passed directly to the BeautifulSoup constructor; no read() needed
- the opener can be defined once and reused inside the loop to follow links
- use find_all() instead of findAll()
- use urljoin() to concatenate URL parts
- use the csv module to write the delimited data
- use the with context manager when working with the file

The complete fixed version:
import csv
import re
import time
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL))

with open(FILENAME, 'a') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")
        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url))
        # do smth with the item_soup? or why did you need to follow this link?
        writer.writerow([timeset, link.text, item_url])
Here's what the code produces:
08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;computer????;http://omaha.craigslist.org/sys/4612637389.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw ;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond ;http://omaha.craigslist.org/sys/4612480593.html
...
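As an aside on the root cause: the original snippet nests one findAll() loop inside another, so the inner body (the printing and file writing) runs once for every *pair* of links, roughly N × M times instead of N. A minimal sketch of that behavior, using hypothetical stand-in lists in place of the findAll() results:

```python
# Stand-ins for the two findAll() result lists (hypothetical sizes:
# 3 "hdrlnk" anchors outside, 100 "/sys/" anchors inside).
outer_links = ['link%d' % i for i in range(3)]
inner_links = ['link%d' % i for i in range(100)]

writes = 0
for link in outer_links:
    for linka in inner_links:   # nested loop: body runs for EVERY inner link
        writes += 1             # the original code printed/wrote the file here

print(writes)  # 300 writes for only 3 listings; a single loop would do 3
```

Dropping the inner loop, as in the fixed version above, makes the body run exactly once per listing.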
Just as a side note: since you need to follow links, get the data, and output it to a csv file, it sounds like Scrapy would be a great fit here. It has Rules and Link Extractors, and it can serialize crawled items to csv out of the box.