
I'm attempting to extract some links from craigslist using beautifulsoup but it's pulling the links 100 times rather than once

So I'm trying to extract the links for the newest TV listings from craigslist. I've gotten to the point where I'm getting the information I want, but for some reason it pulls that information 100 times before moving on to the next link. I'm not sure why it's doing that?

import urllib2
from bs4 import BeautifulSoup
import re
import time
import csv
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# id url
url = ('http://omaha.craigslist.org/sya/')
# this opens the url
ourUrl = opener.open(url).read()
# now we are passing the url to beautiful soup
soup = BeautifulSoup(ourUrl)

for link in soup.findAll('a', attrs={'class': re.compile("hdrlnk")}):
    find = re.compile('/sys/(.*?)"')
    #time.sleep(1)
    timeset = time.strftime("%m-%d %H:%M") # current date and time
    for linka in soup.findAll('a', attrs={'href': re.compile("^/sys/")}):
        find = re.compile('/sys/(.*?)"')
        searchTv = re.search(find, str(link))
        Tv = searchTv.group(1)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        url = ('http://omaha.craigslist.org/sys/' + Tv)
        ourUrl = opener.open(url).read()
        soup = BeautifulSoup(ourUrl)
        print "http://omaha.craigslist.org/sys/" + Tv
        try:
            outfile = open('C:/Python27/Folder/Folder/Folder/craigstvs.txt', 'a')
            outfile.write(timeset + "; " + link.text + "; " + "http://omaha.craigslist.org/sys/" + Tv + '\n')
            timeset = time.strftime("%m-%d %H:%M") # current date and time
        except:
            print "No go--->" + str(link.text)

Here's an example of the output: 08-10 15:19; MAC mini intel core wifi dvdrw great cond; http://omaha.craigslist.org/sys/4612480593.html which is exactly what I'm trying to accomplish, except that it extracts that information 100+ times before moving on to the next listing... I'm stuck and can't figure it out. Any help would be greatly appreciated, thanks in advance!

Scrapy attempt, per @alexce:

import scrapy
import csv
from tutorial.items import DmozItem
import re
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://omaha.craigslist.org"]
    start_urls = [
        "http://omaha.craigslist.org/sya/",

    ]

    def parse(self, response):
        for sel in response.xpath('//html'):
            #title = sel.xpath('a/text()').extract()
            link = sel.xpath('/html/body/article/section/div/div[2]/p/span/span[2]/a').extract()[0:4]
            #at this point it doesn't repeat itself, which is good!
            #desc = sel.xpath('text()').extract()
            print link

You don't need a nested loop here: the outer loop already visits every hdrlnk anchor, while the inner loop runs over every /sys/ link on the same page but keeps writing the same outer link, so each listing ends up fetched and written roughly 100 times. Other notes/improvements:

  • the result of opener.open() can be passed directly to the BeautifulSoup constructor, no need to call read()
  • the url opener can be defined once and reused in the loop to follow links
  • use find_all() instead of findAll()
  • use urljoin() to join url parts
  • use the csv module for writing delimited data
  • use the with context manager when working with files

The complete fixed version:

import csv
import re
import time
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://omaha.craigslist.org/sys/'
URL = 'http://omaha.craigslist.org/sya/'
FILENAME = 'C:/Python27/Folder/Folder/Folder/craigstvs.txt'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
soup = BeautifulSoup(opener.open(URL))

with open(FILENAME, 'a') as f:
    writer = csv.writer(f, delimiter=';')
    for link in soup.find_all('a', class_=re.compile("hdrlnk")):
        timeset = time.strftime("%m-%d %H:%M")

        item_url = urljoin(BASE_URL, link['href'])
        item_soup = BeautifulSoup(opener.open(item_url))

        # do smth with the item_soup? or why did you need to follow this link?

        writer.writerow([timeset, link.text, item_url])

Here is what the code produces:

08-10 16:56;Dell Inspiron-15 Laptop;http://omaha.craigslist.org/sys/4612666460.html
08-10 16:56;computer????;http://omaha.craigslist.org/sys/4612637389.html
08-10 16:56;macbook 13 inch 160 gig wifi dvdrw ;http://omaha.craigslist.org/sys/4612480237.html
08-10 16:56;MAC mini intel core wifi dvdrw great cond ;http://omaha.craigslist.org/sys/4612480593.html
...

Just as a side note: since you need to follow links, extract the data and output it to a csv file, Scrapy sounds like a very good fit here. It has Rules and Link Extractors, and it can serialize crawled items to csv out of the box; a rough sketch follows below.
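To make that concrete, here is a minimal CrawlSpider sketch of the idea, assuming a reasonably recent Scrapy. It has not been run against the live site: the spider name, the /sys/ link pattern, and the span#titletextonly selector for the listing title are guesses that may need adjusting.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CraigslistTvSpider(CrawlSpider):
    # hypothetical spider name
    name = "craigslist_tv"
    allowed_domains = ["omaha.craigslist.org"]
    start_urls = ["http://omaha.craigslist.org/sya/"]

    # follow every listing link under /sys/ and pass each page to parse_item()
    rules = (
        Rule(LinkExtractor(allow=r"/sys/.+\.html"), callback="parse_item"),
    )

    def parse_item(self, response):
        # 'span#titletextonly' is a guess at the element holding the listing title
        yield {
            "title": response.css("span#titletextonly::text").get(),
            "url": response.url,
        }

Running it with scrapy runspider craigslist_tv.py -o listings.csv would then write the scraped items to a csv file without any extra code.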
