
Python 3.5 | Scraping data from a website

I want to scrape a specific part of the website Kickstarter.com.

I need the strings of the Project-title elements. The website is structured consistently, and every project has this line:

 <div class="Project-title"> 

My code looks like this:

# Loading libraries
import urllib.request
from bs4 import BeautifulSoup

# Define the URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

# Cooking the soup
soup = BeautifulSoup(thepage, "html.parser")

# Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print(title)

If I use soup.find_all, or index project_title with any value other than zero, Python raises an error.

I need a list with all the project titles on this website. E.g.:

  • The Superbook: Turn your smartphone into a laptop for $99
  • Weights: Weigh Smarter
  • Mine Kafon Drone
  • World's First And Only Complete Weather Camera System
  • Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux

find() only returns one element. To get them all, you must use findAll.
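
To illustrate the difference, here is a minimal sketch, assuming the same soup as in the question:

# find() returns the first matching tag (or None), so there is only ever
# one h6 to work with; indexing its single <a> child past 0 raises IndexError.
first_match = soup.find('h6', {'class': 'project-title'})

# findAll() returns a ResultSet (a list of tags); it must be iterated,
# not treated like a single tag.
all_matches = soup.findAll('h6', {'class': 'project-title'})
print(len(all_matches))  # number of project titles found on the page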

Here's the code you need:

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

We look at all the elements with tag h6 and class project-title. We then take the title from each of these elements and build a list from them.

Hope it helped, and don't hesitate to ask if you have any questions.

Edit: the problem with the above code is that it will fail if any element in the list returned by findAll does not have at least one a child.

How to prevent this:

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

This only adds an entry to the list if project.findChildren('a') has at least one element (an empty list [] evaluates to False).
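
Equivalently, written as an explicit loop (a sketch with the same behavior as the comprehension above):

project_titles = []
for project in project_elements:
    anchors = project.findChildren('a')
    if anchors:  # skip any h6 without an <a> child; [] is falsy
        project_titles.append(anchors[0].text)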

Edit: to get the description of the elements (class project-blurb), let's look at the HTML code:

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

This is just a paragraph with the class project-blurb. To get them, we could do the same as we did for project_elements, or, more condensed:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
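
If you want to pair each title with its blurb, a zip over the two lists works. This is a sketch that assumes the page lists titles and blurbs in the same order, with one blurb per title; that assumption is worth checking against the actual page:

for title, blurb in zip(project_titles, project_desc):
    print(title, '-', blurb.strip())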

With respect to the title of this post, I would recommend two different tutorials about scraping particular data from a website. They both have a detailed explanation of how the task is achieved.

First, I would recommend checking out the pyimagesearch tutorial on scraping images using Scrapy.

Then, if you need something more specific, a dedicated web scraping tutorial will help you.
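
For reference, a minimal Scrapy spider for the same page might look like this. It is only a sketch: the spider name and the output field are illustrative, not taken from those tutorials.

import scrapy

class KickstarterSpider(scrapy.Spider):
    # Hypothetical spider; run with: scrapy runspider thisfile.py -o titles.json
    name = "kickstarter_titles"
    start_urls = [
        "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
    ]

    def parse(self, response):
        # Same CSS path as the BeautifulSoup answers on this page
        for title in response.css("h6.project-title a::text").extract():
            yield {"title": title.strip()}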

All the data you want is in the section with the CSS class staff-picks; just find the h6 elements with the project-title class and extract the text from the anchor tag inside:

soup = BeautifulSoup(thepage, "html.parser")

print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Output:

['The Superbook: Turn your smartphone into a laptop for $99', 'Weighitz: Weigh Smarter', 'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', "Bagel: The World's Smartest Tape Measure", 'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', 'ISOLATE® - Switch off your ears!']

Or, using find with find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])

There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one whichever approach you take.
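
If you would rather verify that assumption than rely on it, a quick sanity check (a sketch, reusing the soup from above) is:

# Confirm each h6.project-title holds exactly one <a> before trusting proj.a
for h6 in soup.find_all("h6", "project-title"):
    assert len(h6.find_all("a")) == 1, "unexpected anchor count"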
