Improving a Python snippet

Question

I'm working on a Python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:

<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
    ...
</div>

So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination')
     for l in link.find_all('a') if not re.search('pageSub', l['href'])]

s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])

The problem is that generating this list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't pull it off. Could you help me make this code more concise?

Answer 1

What about this:

from bs4 import BeautifulSoup

html = """ <div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

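soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# First <a> inside the pagination <div>
link = soup.find('div', {'class': 'pagination'}).find('a')['href']

# Drop the last path segment (the page number)
print('/'.join(link.split('/')[:-1]))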

prints:

webpage-category/page
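Note that this output has no trailing slash, while the question asked for 'webpage-category/page/'. If the slash matters, something like the following might do. It is only a minimal sketch: base_url is a hypothetical helper name, and the 'top10-category' href is made up to show where stripping every digit would go wrong.

```python
def base_url(href):
    """Return everything up to and including the last '/' of href."""
    head, sep, _tail = href.rpartition('/')
    return head + sep


print(base_url('webpage-category/page/1'))  # webpage-category/page/

# Hypothetical href that contains digits outside the page number
href = 'top10-category/page/2'
print(''.join(c for c in href if not c.isdigit()))  # top-category/page/
print(base_url(href))                               # top10-category/page/
```

Splitting on the last '/' only touches the final path segment, so digits elsewhere in the path survive.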

Just FYI, regarding the code you've provided: you can use next() with a generator expression instead of a list comprehension, so iteration stops at the first match:

s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
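To see that next() really stops at the first match, here is a small sketch that needs no HTML at all. The hrefs generator and its sample values are hypothetical stand-ins for the links on the page. Passing a second argument to next() also avoids a StopIteration when nothing matches.

```python
import re


def hrefs(seen):
    """Yield hypothetical hrefs, recording each one that is actually produced."""
    for href in ['pageSub/1', 'webpage-category/page/1', 'webpage-category/page/2']:
        seen.append(href)
        yield href


seen = []
first = next((h for h in hrefs(seen) if not re.search('pageSub', h)), None)
print(first)  # webpage-category/page/1
print(seen)   # ['pageSub/1', 'webpage-category/page/1']

# With no match, the default is returned instead of raising StopIteration.
missing = next((h for h in [] if not re.search('pageSub', h)), None)
print(missing)  # None
```

The third href is never produced. Keep in mind that find_all() still builds its full result list, so in the snippet above only the filtering becomes lazy.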

UPD (using the website link provided):

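from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# The second pagination <div> on the page holds the numbered links
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

# Take the first numbered link other than "1" and drop its page number
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))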

Answer 1: 2, accepted, 2014-03-13 17:13:36
