
How to refine and limit BeautifulSoup results

So I'm stuck here. I'm a doctor, so my programming background and skills are close to none, and most likely that's the problem. I'm trying to learn some basics about Python, and for me the best way is by doing stuff.

The project:

  • scrape the cover images from several books

Some of the links used:

http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html
http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html
http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html
http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html
http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html

That website structure is messed up. The links are located inside a div with class "post-title entry-title", which in turn has two or more "separator" class divs that can have content or be empty. What I can tell so far is that 95% of the time what I want is the last two links in the first two "separator" class divs. And for this stage that's good enough.
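
A simplified sketch of the markup as I understand it (the HTML below is made up to match this description, not copied from the site):

from bs4 import BeautifulSoup

#made-up markup: a title in an h3 plus "separator" divs, where the
#cover links live in the first two separators and some divs are empty
html = '''
<h3 class="post-title entry-title">2-O Estranho Mundo de Kilsona.jpg</h3>
<div class="separator"><a href="cover1.jpg">x</a></div>
<div class="separator"><a href="cover2.jpg">x</a><a href="cover3.jpg">x</a></div>
<div class="separator"></div>
'''
soup = BeautifulSoup(html, 'lxml')
for div in soup.select("div.separator")[:2]:
    for a in div.find_all('a'):
        print(a['href'])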

My code so far is as follows:

import requests
from bs4 import BeautifulSoup

#intro - url is one of the post links listed above
url = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

#the find all links loop
for div in separador:
  imagens = div.find_all('a')
  for link in imagens:
    print (link['href'], '\n')

What I can do right now:

  • I can print the right URLs, and I can then use wget to download and rename the files. However, I only want the last two links from the results, and that is the only thing missing from my google-fu. I think the problem is in the way BeautifulSoup exports results (ResultSet) and my lack of knowledge of things such as lists. If the first "separator" has one link and the second has two, I get a list with two items (and the second item is two links), hence not sliceable. See the sketch below.
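
A minimal sketch of the shape I'm getting, with placeholder strings instead of tags:

#shape of the problem: one link in the first div, two in the second
nested = [['link1'], ['link2', 'link3']]
#slicing the outer list selects whole sublists, not the last two links
print(nested[-2:])   #[['link1'], ['link2', 'link3']]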

Example output

2-O Estranho Mundo de Kilsona.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg 

http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg 

http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg 

But I wanted it to be:

2-O Estranho Mundo de Kilsona.jpg

http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg 

http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg 

Can anyone shed some light on this?

The issue is due to the line imagens = div.find_all('a') being called within a loop. This creates a list of lists, so we need a way to flatten it into a single list. I do this below by creating an empty merged_list and extending it with each sublist.

From here I then create a new list with just the links and then dedupe the list using set (a set is a useful data structure to use when you don't want duplicated data). I then turn it back into a list, and it's back to your code.

import requests
from bs4 import BeautifulSoup

link1 = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
link2 = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
link3 = "http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html"
link4 = "http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html"
link5 = "http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html"


#intro
r=requests.get(link2)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

imagens = [div.find_all('a') for div in separador]
#flatten the list of lists into a single list of <a> tags
merged_list = []
for sublist in imagens:
    merged_list.extend(sublist)
link_list = [link['href'] for link in merged_list]
deduped_list = list(set(link_list))
for link in deduped_list:
    print(link, '\n')
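
One caveat: set does not preserve the order in which the links were found on the page. If that order matters, dict.fromkeys is a minimal alternative that drops duplicates while keeping insertion order (a sketch, assuming Python 3.7+):

#dedupe while preserving the order the links were found in
deduped_in_order = list(dict.fromkeys(link_list))
for link in deduped_in_order:
    print(link, '\n')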

You can use CSS selectors to extract the image links directly from the divs with class separator (link to docs).

I also use a list comprehension instead of a for loop.

Below is a working example for a url from your list.


import requests
from bs4 import BeautifulSoup

#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
r=requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')


#we need a title for each page - for debugging and later used to rename images      
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)

#find all links
links = [link['href'] for link in soup.select('.separator a')]
print(links)
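
To go from the printed URLs to files on disk (the question mentions using wget to download and rename), here is a minimal sketch; the filename scheme built from the post title m plus an index is an assumption, not something from the original post:

def download_image(url, filename):
    #stream the image bytes to disk, raising on HTTP errors
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    with open(filename, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)

#name each file after the post title plus an index (assumed scheme)
for i, link in enumerate(links):
    download_image(link, f"{m}_{i}.jpg")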
