Python Scraping/Parsing with BeautifulSoup
I'm trying to scrape a URL with BeautifulSoup/Requests and then clean the result by pulling out just the sections I need. After I switched to a different target URL, the HTML is output correctly, but my code for cleaning it is not working. Here is my code:
import requests
from bs4 import BeautifulSoup
import bs4.element
import pprint

def connection(url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5)'}
    r = requests.get(url,headers=headers)
    soup = BeautifulSoup(r.text)
    return soup

def scrape_metacritic(soup,movie_list=[]):
    for mlist in get_modules(soup).items():
        for movie in mlist:
            try:
                m = parse_movie_li(movie)
            except:
                continue
            #m['release_type']=release_type
            movie_list.append(m)
    return movie_list

def just_tags(templist):
    tags = [t for t in templist if isinstance(t,bs4.element.Tag)]
    return tags

def get_modules(soup):
    module = soup.find(class_='body_wrap') #body_wrap
    module_dict = {}
    for mod in module.find_all('li', class_='product'):
        movie_lis = just_tags(mod.find('ul').contents)
        module_dict[mod]=movie_lis
    return module_dict

get_modules(soup)
That part works. Here is the URL:
url = 'http://www.metacritic.com/browse/movies/title/dvd/a?view=detailed'
soup = connection(url)
This is some of what I get after the scrape:
{<li class="product limited_release_product has_small_image"><div class="wrap product_wrap"><div class="product_basics stats"><div class="basic_stats has_score"><div class="main_stats"><div class="basic_stat product_title"><h3 class="product_title"><a href="/movie/a-birders-guide-to-everything">A Birder's Guide to Everything</a></h3></div><a class="basic_stat product_score" href="/movie/a-birders-guide-to-everything">
<span class="metascore_w medium movie positive">61</span>
</a></div> <div class="more_stats extended_stats">
<ul class="more_stats">
<li class="stat release_date">
<span class="label">Release Date:</span>
<span class="data">March 21, 2014</span>
</li>
<li class="stat rating">
<span class="label">Rated:</span>
<span class="data">
..
<span class="data">136 min</span>
</li>]}
Now I try to clean it with this:
from dateutil import parser

def parse_movie_li(li):
    title_div = li.find(class_='product_title')
    movie = {
        'title':title_div.text.strip(),
        'rel_url':title_div.find('a')['href'],
        'release_date':get_release_date(li.find(class_='release_date').find(class_='data')),
        'metascore_w':get_metascore_w(li.find(class_='metascore_w')),
        'user_score':get_user_score(li.find(class_='product avg_userscore').find(class_='data')), #add func
        'genre':get_genre(li.find(class_='genre').find(class_='data')), #add func
        'star_cast':get_star_cast(li.find(class_='cast').find(class_='data')), #add func
        'runtime':get_runtime(li.find(class_='runtime').find(class_='data')) #add func
    }
    #print movie,'\n'
    return movie

def get_metascore_w(div):
    try:
        score = div.text
    except:
        print 'no text in metascore div'
        return None
    try:
        score = int(score)
    except:
        pass
    return score

def get_release_date(div):
    try:
        datestr = div.text
    except:
        return None
    try:
        date = parser.parse(datestr)
    except:
        return datestr
    return date

def get_user_score(div):
    try:
        uscore = div.text
    except:
        print 'no text in userscore div'
        return None
    try:
        uscore = int(uscore)
    except:
        pass
    return uscore

def get_genre(div):
    try:
        genre = div.text
    except:
        print 'no text in genre div'
        return None
    try:
        genre
    except:
        pass
    return score

def get_star_cast(div):
    try:
        cast = div.text
    except:
        print 'no text in cast div'
        return None
    try:
        cast
    except:
        pass
    return cast

def get_runtime(div):
    try:
        runtime = div.text.strip(' min')
    except:
        print 'no text in runtime div'
        return None
    try:
        runtime = int(runtime)
    except:
        pass
    return runtime
The output should have this form:
[{'metascore_w': 28,
'rel_url': '/movie/mortdecai',
'release_date': datetime.datetime(2015, 1, 23, 0, 0),
'release_type': u'Wide releases now in theaters',
'title': u'Mortdecai'},
{'metascore_w': 24,
'rel_url': '/movie/strange-magic',
'release_date': datetime.datetime(2015, 1, 23, 0, 0),
'release_type': u'Wide releases now in theaters',
'title': u'Strange Magic'},
..
{'metascore_w': u'tbd',
'rel_url': '/movie/20-once-again',
'release_date': datetime.datetime(2015, 1, 16, 0, 0),
'release_type': u'Limited releases now in theaters',
'title': u'20 Once Again'}]
However, I'm getting this:
{<li class="product limited_release_product has_small_image alt"><div class="wrap product_wrap"><div class="product_basics stats"><div class="basic_stats has_score"><div class="main_stats"><div class="basic_stat product_title"><h3 class="product_title"><a href="/movie/a-family-thing">A Family Thing</a></h3></div><a class="basic_stat product_score" href="/movie/a-family-thing">
<span class="metascore_w medium movie positive">71</span>
</a></div> <div class="more_stats extended_stats">
<ul class="more_stats">
<li class="stat release_date">
<span class="label">Release Date:</span>
<span class="data">March 29, 1996</span>
</li>..
It is unparsed. Any guidance on what I'm doing incorrectly with the parse_movie_li function?
The error is actually quite simple. In the parse_movie_li() function you are calling a find method on "li" in a place where it isn't valid. I'm not exactly sure where you're calling the function or what variable you're passing into it, but wherever you get "li", I would chain .find(class_='product_title') onto that part of the code. You can, however, target its children with attribute access, for example li.div.b, which returns the first b tag inside the first div tag inside li.
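One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: in scrape_metacritic, iterating over get_modules(soup).items() yields (key, value) tuples, so parse_movie_li receives the outer li tag and then a plain list rather than individual movie entries. In addition, a chained call such as li.find(...).find(...) raises AttributeError whenever the first find returns None, and the bare except skips the entry silently, which would leave the list empty. The example below works on a hypothetical, trimmed-down copy of the markup from the question (SAMPLE_HTML) and uses hypothetical helpers (text_of, parse_product); it iterates over the product li elements directly and treats missing fields as None.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample modeled on the markup in the question.
SAMPLE_HTML = """
<div class="body_wrap">
  <ol>
    <li class="product has_small_image">
      <h3 class="product_title"><a href="/movie/a-family-thing">A Family Thing</a></h3>
      <span class="metascore_w medium movie positive">71</span>
      <ul class="more_stats">
        <li class="stat release_date">
          <span class="label">Release Date:</span>
          <span class="data">March 29, 1996</span>
        </li>
        <li class="stat runtime">
          <span class="label">Runtime:</span>
          <span class="data">109 min</span>
        </li>
      </ul>
    </li>
  </ol>
</div>
"""


def text_of(parent, outer_class, inner_class="data"):
    """Return the stripped text of .outer_class > .inner_class, or None if missing."""
    outer = parent.find(class_=outer_class)
    if outer is None:
        return None
    inner = outer.find(class_=inner_class)
    return inner.get_text(strip=True) if inner is not None else None


def parse_product(li):
    """Build a dict from one <li class="product"> element."""
    link = li.find(class_="product_title").find("a")
    score = li.find(class_="metascore_w")
    return {
        "title": link.get_text(strip=True),
        "rel_url": link["href"],
        "metascore_w": score.get_text(strip=True) if score is not None else None,
        "release_date": text_of(li, "release_date"),
        "runtime": text_of(li, "runtime"),
        "genre": text_of(li, "genre"),  # not present in the sample, so None
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Iterate over the product elements themselves, not over dict.items() tuples.
movies = [parse_product(li) for li in soup.find_all("li", class_="product")]
print(movies)
```

The values are left as strings here; the question's get_release_date and get_metascore_w style conversions could be applied to them afterwards.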
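The question's code also contains a few smaller traps that make failures hard to see: get_genre returns score, a name that is not defined in that function, the bare except clauses hide the resulting NameError, and the movie_list=[] default is shared between calls. The sketch below shows one way to handle these with narrow exception handling; the names to_int, parse_runtime, and collect are hypothetical and do not come from the original post.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scrape")


def to_int(text):
    """Return int(text) when possible, otherwise the original value (e.g. 'tbd')."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return text


def parse_runtime(text):
    """Turn a string like '136 min' into 136; return None when there is no text."""
    if text is None:
        return None
    return to_int(text.replace("min", "").strip())


def collect(items, parse, results=None):
    """Apply parse to each item, logging failures instead of hiding them."""
    if results is None:  # avoid a mutable default argument shared across calls
        results = []
    for item in items:
        try:
            results.append(parse(item))
        except (AttributeError, KeyError, ValueError) as exc:
            # Only expected parsing errors are caught; a NameError from a typo
            # would still propagate and be noticed immediately.
            log.warning("skipping %r: %s", item, exc)
    return results


print(to_int("71"), to_int("tbd"), parse_runtime("136 min"))
print(collect(["1", "x", "3"], int))
```

Catching only the expected exception types means a genuine programming error surfaces as a traceback instead of an empty result list.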