
How can I make my Python scraping function work with a certain range of posts?

I am a novice. I have created a function that scrapes a certain number of posts. It works, but it seems far too long and amateurish. I want to condense the code and make it reduce the number of posts it scrapes by 1 whenever the requested number is too large. So if it tries to scrape 15 and there are only 14, it should drop to 14 instead of halting. Here's my code:

def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = []

    if len(titles) > 15:
        titles = soup.find_all('section', 'box')[:15]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 14:
        titles = soup.find_all('section', 'box')[:14]
        # random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 13:
        titles = soup.find_all('section', 'box')[:13]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 12:
        titles = soup.find_all('section', 'box')[:12]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 11:
        titles = soup.find_all('section', 'box')[:11]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 10:
        titles = soup.find_all('section', 'box')[:10]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 9:
        titles = soup.find_all('section', 'box')[:9]
        random.shuffle(titles)
        print(len(titles))

    else:
        titles = soup.find_all('section', 'box')[:8]
        random.shuffle(titles)
        print(len(titles))

    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in titles]

    # random.shuffle(entries)

    return entries

I tried something like

if len(titles) > 15 || < 9:

but that did not work right

UPDATE: print(titles) output

[<section class="box">
<a class="video-box" href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">
<img alt="" height="125" src="http://i.ytimg.com/vi/clPaWvb6lWk/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">Spodee - All I Want</a></strong>
<div>
<span class="views">18,781</span> 
<span class="comments"><a data-disqus-identifier="95018" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh2Nw4BKk0vav380lx#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/t9OWyXfcdYQm.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">Sheesh: Dude Grill Is On Another Level!</a></strong>
<div>
<span class="views">182,832</span> 
<span class="comments"><a data-disqus-identifier="95013" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh058e7C1B1Ey8qwNT#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/M1itOMKyh7zj.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">Back At It: Brock Lesnar To Return At UFC 200, WWE Approved!</a></strong>
<div>
<span class="views">124,237</span> 
<span class="comments"><a data-disqus-identifier="95016" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrXYCnHFIj4h2GQjE#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">
<img alt="" height="125" src="http://i.ytimg.com/vi/YRlsJtuZ09s/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">Jose Guapo - Off Top</a></strong>
<div>
<span class="views">16,462</span> 
<span class="comments"><a data-disqus-identifier="95017" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhj7V8H8GXx08iH2V9#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhfOnhy45f780tHqQG">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/wn03kuXW3v2a.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhfOnhy45f780tHqQG">Tulsa Candidate Angry About Not Being Involved In The Mayoral Debate, Runs Up There Anyway!</a></strong>
<div>
<span class="views">115,333</span> 
<span class="comments"><a data-disqus-identifier="95014" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhfOnhy45f780tHqQG#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrYcD83QWN1n0665g">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/14H17jc8ZTIw.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrYcD83QWN1n0665g">This Motel Has An Interesting Key Policy!</a></strong>
<div>
<span class="views">16,015</span> 
<span class="comments"><a data-disqus-identifier="95019" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrYcD83QWN1n0665g#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/e2VMzdzmKwFe.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">Yonio &amp; AG - Holy (Freestyle) [Houston Unsigned Artist] </a></strong>
<div>
<span class="views">4,076</span> 
<span class="comments"><a data-disqus-identifier="95012" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhs2kTRq49K0gXYbuu#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/dVjLEzVRc1YQ.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">Messed Up: 6-Year Old Polish Boy Beats His Mother And Pulls Her Hair!</a></strong>
<div>
<span class="views">201,996</span> 
<span class="comments"><a data-disqus-identifier="95015" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL#disqus_thread"></a></span>
</div>
</section>]

It's always better in your example to actually include the example of what you're trying to do so that it's easier for folks to repro your issue.

As the comments say, your code goes straight to the else branch (the [:8] slice) because titles = [] is set before the if chain, so len(titles) is 0 and none of the earlier conditions can be true. You also don't need to choose the length yourself: soup.find_all returns however many matches it finds, so there is no need to specify it. Based on your print(titles) output, I assumed you're pointing your code at url = 'http://www.worldstarhiphop.com', so the code below uses that. When scraping this specific URL, there's a "SUBMIT YOUR VIDEO" result in titles[11] that throws an error when you build your entries list. roganjosh's answer is the right basic approach, but in this case it won't filter out titles[11], which is not None but simply has a different format. If you update cleaned_titles to the following, it should work for you.

cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']

giving you:

def scrape_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')

    cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']

    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries
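If you still want an upper limit such as 15 posts, a single slice is enough, because slicing a Python list past its end returns whatever is there instead of raising an error. The sketch below uses a plain list of strings in place of the find_all result; pick_titles and its parameters are hypothetical names chosen for illustration, not part of the original code.

```python
import random


def pick_titles(titles, limit=15, shuffle=True):
    """Return at most `limit` items from `titles`, optionally shuffled.

    `pick_titles`, `limit` and `shuffle` are illustrative names only.
    """
    # Slicing never fails when the list is shorter than `limit`;
    # it just returns every item that exists.
    selected = list(titles[:limit])
    if shuffle:
        random.shuffle(selected)
    return selected


# Stand-in for soup.find_all('section', 'box') on a page with 14 posts.
fourteen_posts = ['post %d' % i for i in range(14)]

print(len(pick_titles(fourteen_posts)))      # asks for 15, only 14 exist
print(len(pick_titles(fourteen_posts, 8)))   # asks for 8
```

This replaces the whole if/elif chain from the question. For reference, the condition attempted there would be written in Python as len(titles) > 15 or len(titles) < 9, but with slicing no such check is needed.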

OK, BeautifulSoup returns a different kind of structure than I was expecting. Still, I asked for clarification with a view to posting an answer, so I will post this and retract it if there's a problem.

def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')

    cleaned_titles = [title for title in titles if title is not None]

    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries
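Since find_all does not put None into its result, a filter that checks for the sub-elements the dictionary actually needs may be more robust than title is not None. The following is only a sketch: the sample HTML is made up (the real markup of the "SUBMIT YOUR VIDEO" box is an assumption), is_complete and extract_entries are hypothetical helper names, and it parses with the built-in 'html.parser' so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Made-up sample: one normal box and one box without <img> or <strong>.
SAMPLE_HTML = """
<section class="box">
  <a class="video-box" href="/videos/video.php?v=abc">
    <img src="http://example.org/a.jpg"/>
  </a>
  <strong class="title"><a href="/videos/video.php?v=abc">First video</a></strong>
</section>
<section class="box">
  <a href="vsubmit.php">SUBMIT YOUR VIDEO</a>
</section>
"""


def is_complete(box):
    """True if the box has every tag the entry dictionary reads from."""
    return (box.a is not None
            and box.img is not None
            and box.strong is not None
            and box.strong.a is not None)


def extract_entries(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    boxes = [box for box in soup.find_all('section', 'box') if is_complete(box)]
    # Same dictionary layout as in the code above.
    return [{'href': base_url + box.a.get('href'),
             'src': box.img.get('src'),
             'text': box.strong.a.text,
             } for box in boxes]


entries = extract_entries(SAMPLE_HTML, 'http://www.example.org')
print(len(entries))
print(entries[0]['text'])
```

The advantage of this style of check is that it does not depend on knowing the href of the odd box in advance; any box missing an image or title is skipped.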
