Improving a Python snippet

I'm working on a Python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:

<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
    ...
</div>

So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination')
     for l in link.find_all('a') if not re.search('pageSub', l['href'])]

s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])

The problem is that building this list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't pull it off. Could you help me make this code more concise?

What about this:

from bs4 import BeautifulSoup

html = """ <div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

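# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first match, so no list is built
link = soup.find('div', {'class': 'pagination'}).find('a')['href']

# Drop the last path segment (the page number)
print('/'.join(link.split('/')[:-1]))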

prints:

webpage-category/page
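
If you want to keep the trailing slash, as in the 'webpage-category/page/' from the question, here is a minimal sketch. It assumes the page number is always the last path segment, and base_url is a hypothetical helper name, not something from bs4. It also avoids a pitfall of stripping every digit, which removes digits that appear earlier in the path too.

```python
def base_url(href):
    """Return href up to and including its last '/'."""
    # rpartition splits at the last '/' only; sep is '' when there is no slash
    head, sep, _ = href.rpartition('/')
    return head + sep


print(base_url('webpage-category/page/1'))   # webpage-category/page/

# Stripping all digits also damages digits that belong to the path itself
href = 'cars-2019/page/12'                   # made-up example href
print(''.join(c for c in href if not c.isdigit()))  # cars-/page/
print(base_url(href))                        # cars-2019/page/
```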

Just FYI, regarding the code you provided: you can use next() with a generator expression instead of a list comprehension:

s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
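
One caveat: next() raises StopIteration when nothing matches. A small sketch of passing a default instead, using a plain list of made-up hrefs in place of a parsed page so it runs without bs4; first_href is a hypothetical helper name.

```python
import re


def first_href(hrefs, exclude='pageSub'):
    """Return the first href not matching `exclude`, or None if there is none."""
    # The generator stops at the first match; the second argument to next()
    # is returned instead of raising StopIteration when nothing matches.
    return next((h for h in hrefs if not re.search(exclude, h)), None)


hrefs = [
    'webpage-category/pageSub/1',   # skipped by the filter
    'webpage-category/page/1',
    'webpage-category/page/2',
]
print(first_href(hrefs))  # webpage-category/page/1
print(first_href([]))     # None
```

Note the extra parentheses around the generator expression: they are required once next() receives a second argument.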

UPD (using the website link provided):

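from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Use the second pagination block on the page (index 1)
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

# Take the first numbered link other than "1" and drop its page number
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))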
