
Python 3.5 | Scraping data from a website

I want to scrape a specific part of the website Kickstarter.com.

I need the strings of the Project-title elements. The website is structured consistently, and every project has this line:

 <div class="Project-title"> 

My code looks like this:

# Loading libraries
import urllib.request
from bs4 import BeautifulSoup

# Define the URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

# Cooking the soup
soup = BeautifulSoup(thepage, "html.parser")

# Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print(title)

If I use soup.find_all, or index project_title with any value other than zero, Python raises an error.

I need a list with all the project titles on this website. E.g.:

  • The Superbook: Turn your smartphone into a laptop for $99
  • Weights: Weigh Smarter
  • Mine Kafon Drone
  • World's First And Only Complete Weather Camera System
  • Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux

find() only returns one element. To get them all, you must use findAll.
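
To illustrate the difference, here is a minimal sketch, assuming the same soup as in the question:

# find() returns the first matching tag (or None), so there is only ever
# one h6 to work with; indexing its single <a> child past 0 raises IndexError.
first_match = soup.find('h6', {'class': 'project-title'})

# findAll() returns a ResultSet (a list of tags); it must be iterated,
# not treated like a single tag.
all_matches = soup.findAll('h6', {'class': 'project-title'})
print(len(all_matches))  # number of project titles found on the page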

Here's the code you need:

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

We look at all the elements with tag h6 and class project-title. We then take the title from each of these elements and build a list from them.

Hope it helped, and don't hesitate to ask if you have any questions.

Edit: the problem with the above code is that it will fail if any element in the list returned by findAll does not have at least one a child.

How to prevent this:

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

This only adds an entry to the list if project.findChildren('a') has at least one element (an empty list [] evaluates to False).
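
Equivalently, written as an explicit loop (a sketch with the same behavior as the comprehension above):

project_titles = []
for project in project_elements:
    anchors = project.findChildren('a')
    if anchors:  # skip any h6 without an <a> child; [] is falsy
        project_titles.append(anchors[0].text)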

Edit: to get the description of the elements (class project-blurb), let's look at the HTML code:

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

This is just a paragraph with the class project-blurb. To get them, we could do the same as we did for project_elements, or, more condensed:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
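
If you want to pair each title with its blurb, a zip over the two lists works. This is a sketch that assumes the page lists titles and blurbs in the same order, with one blurb per title; that assumption is worth checking against the actual page:

for title, blurb in zip(project_titles, project_desc):
    print(title, '-', blurb.strip())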

With respect to the title of this post, I would recommend two different tutorials about scraping particular data from a website. They both have a detailed explanation of how the task is achieved.

First, I would recommend checking out the pyimagesearch tutorial on scraping images using Scrapy.

Then, if you need something more specific, a dedicated web scraping tutorial will help you.
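
For reference, a minimal Scrapy spider for the same page might look like this. It is only a sketch: the spider name and the output field are illustrative, not taken from those tutorials.

import scrapy

class KickstarterSpider(scrapy.Spider):
    # Hypothetical spider; run with: scrapy runspider thisfile.py -o titles.json
    name = "kickstarter_titles"
    start_urls = [
        "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
    ]

    def parse(self, response):
        # Same CSS path as the BeautifulSoup answers on this page
        for title in response.css("h6.project-title a::text").extract():
            yield {"title": title.strip()}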

All the data you want is in the section with the CSS class staff-picks; just find the h6 elements with the project-title class and extract the text from the anchor tag inside:

soup = BeautifulSoup(thepage, "html.parser")

print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Output:

['The Superbook: Turn your smartphone into a laptop for $99', 'Weighitz: Weigh Smarter', 'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', "Bagel: The World's Smartest Tape Measure", 'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', 'ISOLATE® - Switch off your ears!']

Or, using find with find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])

There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one whichever approach you take.
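
If you would rather verify that assumption than rely on it, a quick sanity check (a sketch, reusing the soup from above) is:

# Confirm each h6.project-title holds exactly one <a> before trusting proj.a
for h6 in soup.find_all("h6", "project-title"):
    assert len(h6.find_all("a")) == 1, "unexpected anchor count"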
