How to refine and limit BeautifulSoup results
So I'm stuck here. I'm a doctor, so my programming background and skills are close to none, and most likely that's the problem. I'm trying to learn some basics about Python, and for me the best way is by doing stuff.
The project:
Some of the links used:
http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html
http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html
http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html
http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html
http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html
That website's structure is messed up. The links are located inside a div with class "post-title entry-title", which in turn has two or more "separator" class divs that can have content or be empty. What I can tell so far is that 95% of the time what I want is the last two links in the first two "separator" class divs. And for this stage that's good enough.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup

#intro - the example output below comes from the n-2 (Kilsona) page
url = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print(m)
#the find-all-links loop
for div in separador:
    imagens = div.find_all('a')
    for link in imagens:
        print(link['href'], '\n')
What I can do right now:
Example output:
2-O Estranho Mundo de Kilsona.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
But I wanted it to be:
2-O Estranho Mundo de Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
Can anyone shed some light on this?
The issue is due to the line imagens = div.find_all('a') being called within a loop. This creates a list of lists, so we need a way to flatten them into a single list. I do this below with the lines merged_list = [] and [merged_list.extend(list) for list in imagens].
From there I create a new list with just the links, and then dedupe the list using set (a set is a useful data structure when you don't want duplicated data). I then turn it back into a list, and it's back to your code.
import requests
from bs4 import BeautifulSoup
link1 = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
link2 = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
link3 = "http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html"
link4 = "http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html"
link5 = "http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html"
#intro
r=requests.get(link2)
soup = BeautifulSoup(r.content, 'lxml')
#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)
imagens = [div.find_all('a') for div in separador]
merged_list = []
[merged_list.extend(list) for list in imagens]
link_list = [link['href'] for link in merged_list]
deduped_list = list(set(link_list))
for link in deduped_list:
    print(link, '\n')
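One caveat with list(set(...)) is that a set does not preserve insertion order, so the deduped links may print in any order. If order matters, dict.fromkeys is a common order-preserving alternative (a sketch with made-up URLs, not the blog's actual links):

```python
# Hypothetical list of hrefs containing a duplicate
link_list = ["http://example.com/a.jpg",
             "http://example.com/b.jpg",
             "http://example.com/a.jpg"]

# dict keys are unique and (since Python 3.7) keep insertion order,
# so this drops duplicates while preserving first-seen order
deduped_list = list(dict.fromkeys(link_list))
print(deduped_list)  # first occurrence of each link, in original order
```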
You can use CSS selectors to extract the images directly from the div with class separator (link to docs). I also use a list comprehension instead of a for loop. Below is a working example for a url from your list.
import requests
from bs4 import BeautifulSoup
#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
r=requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)
#find all links
links = [link['href'] for link in soup.select('.separator a')]
print(links)
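The same selector logic can be tried offline against a static snippet, which makes it easy to experiment without hitting the site (a minimal sketch: the HTML here is invented to mimic the blog's structure, and the built-in html.parser is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup

# Invented HTML mimicking the blog post structure described in the question
html = """
<h3 class="post-title entry-title">2-O Estranho Mundo de Kilsona.jpg</h3>
<div class="separator"><a href="http://example.com/cover.jpg">cover</a></div>
<div class="separator"><a href="http://example.com/back.jpg">back</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# '.separator a' matches every <a> nested under an element with class "separator",
# in document order
links = [a['href'] for a in soup.select('.separator a')]
print(links)  # ['http://example.com/cover.jpg', 'http://example.com/back.jpg']
```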