
Python 3.5 | Scraping data from website

I want to scrape a specific part of the website Kickstarter.com

I need the strings of the project titles. The website is structured so that every project has this line:

 <div class="Project-title"> 

My code looks like:

#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")

#Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print (title)

If I use soup.find_all, or put any index other than zero in project_title[0], Python shows an error.

I need a list with all the project titles of this website, e.g.:

  • The Superbook: Turn your smartphone into a laptop for $99
  • Weights: Weigh Smarter
  • Mine Kafon Drone
  • World's First And Only Complete Weather Camera System
  • Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux

find() only returns one element. To get all of them, you must use findAll.

Here's the code you need:

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

We look at all the elements with tag h6 and class project-title. We then take the title from each of these elements and build a list from them.

Hope it helped, and don't hesitate to ask if you have any questions.

Edit: the problem with the above code is that it will fail if there is not at least one child a tag for every element in the list returned by findAll.

How to prevent this:

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

This only adds a title to the list if project.findChildren('a') has at least one element (an empty list [] evaluates to False).
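The same filtering logic written as an explicit loop can make the truthiness check easier to follow (a sketch equivalent to the one-liner above, reusing the project_elements variable from earlier):

# Equivalent explicit loop: skip any h6 that has no <a> child.
project_titles = []
for project in project_elements:
    links = project.findChildren('a')
    if links:  # an empty result list is falsy, so such elements are skipped
        project_titles.append(links[0].text)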

Edit: to get the description of the elements (class project-blurb), let's look at the HTML code.

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

This is just a paragraph with class project-blurb. To get them, we could use the same approach as for project_elements, or, more condensed:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
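If you want to keep the descriptions together with the titles, one option is to zip the two lists (a minimal sketch that assumes project_titles and project_desc line up one-to-one, which only holds when every project card has both a title and a blurb):

# Pair each title with its blurb; assumes both lists have the same order and length.
for title, desc in zip(project_titles, project_desc):
    print(title, '->', desc.strip())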

With respect to the title of this post, I would recommend two different tutorials on scraping particular data from a website. They have detailed explanations of how the task is achieved.

First, I would recommend checking out the pyimagesearch tutorial on scraping images using Scrapy.

Then, if you need something more specific, a general web scraping tutorial will help you.
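For reference, a Scrapy spider for the same page would look roughly like this (a minimal sketch based on Scrapy's standard Spider API, not code from the tutorial above; the spider name and output fields are assumptions):

import scrapy

class KickstarterSpider(scrapy.Spider):
    # Hypothetical spider name; the start URL is taken from the question.
    name = 'kickstarter_titles'
    start_urls = [
        'https://www.kickstarter.com/discover/advanced?category_id=16'
        '&woe_id=23424829&sort=popularity&seed=2448324&page=1'
    ]

    def parse(self, response):
        # Same h6.project-title selector used in the BeautifulSoup answers.
        for title in response.css('h6.project-title a::text').getall():
            yield {'title': title}

You could run it with something like scrapy runspider kickstarter_spider.py -o titles.json (the filename is just an example).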

All the data you want is in the section with the CSS class staff-picks; just find the h6's with the project-title class and extract the text from the anchor tag inside:

soup = BeautifulSoup(thepage,"html.parser")


print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Output:

['The Superbook: Turn your smartphone into a laptop for $99', 'Weighitz: Weigh Smarter', 'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', "Bagel: The World's Smartest Tape Measure", 'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', 'ISOLATE® - Switch off your ears!']

Or using find with find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])

There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one title per element, whichever approach you take.
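Putting this answer together into a single runnable script (a sketch that assumes the staff-picks markup described above is still what the server returns):

import urllib.request
from bs4 import BeautifulSoup

theurl = ("https://www.kickstarter.com/discover/advanced?category_id=16"
          "&woe_id=23424829&sort=popularity&seed=2448324&page=1")

# Fetch and parse the page with the built-in HTML parser.
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

# Select every project-title anchor inside the staff-picks section.
titles = [a.text for a in soup.select("section.staff-picks h6.project-title a")]

for title in titles:
    print(title)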
