How can I make my Python scraping function execute between a certain range of posts?

I am a novice. I have created a function that scrapes a certain number of posts. It works, but it seems overly long and amateurish. I want to condense the code and make it reduce the number of posts it scrapes by 1 if the initial number is too large. So if it tries to scrape 15 and there are only 14, it will drop to 14 instead of halting. Here's my code:

def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = []

    if len(titles) > 15:
        titles = soup.find_all('section', 'box')[:15]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 14:
        titles = soup.find_all('section', 'box')[:14]
        # random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 13:
        titles = soup.find_all('section', 'box')[:13]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 12:
        titles = soup.find_all('section', 'box')[:12]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 11:
        titles = soup.find_all('section', 'box')[:11]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 10:
        titles = soup.find_all('section', 'box')[:10]
        random.shuffle(titles)
        print(len(titles))

    elif len(titles) > 9:
        titles = soup.find_all('section', 'box')[:9]
        random.shuffle(titles)
        print(len(titles))

    else:
        titles = soup.find_all('section', 'box')[:8]
        random.shuffle(titles)
        print(len(titles))

    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in titles]

    # random.shuffle(entries)

    return entries

I tried something like:

if len(titles) > 15 || < 9:

but that did not work right.
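
A note on the syntax: Python has no || operator, and each side of a boolean expression needs its own left operand, so the line above is a SyntaxError. A minimal sketch of the two usual spellings follows. The helper names outside_range and inside_range and the bounds 9 and 15 are illustrative assumptions, not part of the original code.

```python
def outside_range(n, low=9, high=15):
    """Return True when n is below low or above high."""
    return n < low or n > high


def inside_range(n, low=9, high=15):
    """Return True when low <= n <= high, using a chained comparison."""
    return low <= n <= high


if __name__ == "__main__":
    for count in (8, 9, 14, 15, 16):
        print(count, outside_range(count), inside_range(count))
```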

UPDATE: print(titles) output

[<section class="box">
<a class="video-box" href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">
<img alt="" height="125" src="http://i.ytimg.com/vi/clPaWvb6lWk/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">Spodee - All I Want</a></strong>
<div>
<span class="views">18,781</span> 
<span class="comments"><a data-disqus-identifier="95018" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh2Nw4BKk0vav380lx#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/t9OWyXfcdYQm.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">Sheesh: Dude Grill Is On Another Level!</a></strong>
<div>
<span class="views">182,832</span> 
<span class="comments"><a data-disqus-identifier="95013" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh058e7C1B1Ey8qwNT#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/M1itOMKyh7zj.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">Back At It: Brock Lesnar To Return At UFC 200, WWE Approved!</a></strong>
<div>
<span class="views">124,237</span> 
<span class="comments"><a data-disqus-identifier="95016" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrXYCnHFIj4h2GQjE#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">
<img alt="" height="125" src="http://i.ytimg.com/vi/YRlsJtuZ09s/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">Jose Guapo - Off Top</a></strong>
<div>
<span class="views">16,462</span> 
<span class="comments"><a data-disqus-identifier="95017" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhj7V8H8GXx08iH2V9#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhfOnhy45f780tHqQG">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/wn03kuXW3v2a.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhfOnhy45f780tHqQG">Tulsa Candidate Angry About Not Being Involved In The Mayoral Debate, Runs Up There Anyway!</a></strong>
<div>
<span class="views">115,333</span> 
<span class="comments"><a data-disqus-identifier="95014" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhfOnhy45f780tHqQG#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrYcD83QWN1n0665g">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/14H17jc8ZTIw.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrYcD83QWN1n0665g">This Motel Has An Interesting Key Policy!</a></strong>
<div>
<span class="views">16,015</span> 
<span class="comments"><a data-disqus-identifier="95019" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrYcD83QWN1n0665g#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/e2VMzdzmKwFe.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">Yonio &amp; AG - Holy (Freestyle) [Houston Unsigned Artist] </a></strong>
<div>
<span class="views">4,076</span> 
<span class="comments"><a data-disqus-identifier="95012" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhs2kTRq49K0gXYbuu#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/dVjLEzVRc1YQ.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">Messed Up: 6-Year Old Polish Boy Beats His Mother And Pulls Her Hair!</a></strong>
<div>
<span class="views">201,996</span> 
<span class="comments"><a data-disqus-identifier="95015" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL#disqus_thread"></a></span>
</div>
</section>]

It's always better to include in your question an actual example of what you're trying to do, so that it's easier for folks to reproduce your issue.

Like the comments say, your code goes straight to the else branch (the [:8] slice) because titles = [] is assigned just before the if chain, so len(titles) is always 0. There is no need to specify the length up front: soup.find_all returns every matching element, however many there are. Based on your print(titles) output, I assumed you're pointing your code at url = 'http://www.worldstarhiphop.com', so the code below uses that. When scraping this specific URL, there's a "SUBMIT YOUR VIDEO" result at titles[11] that throws an error when you build your entries list of dictionaries. roganjosh's answer is the right basic approach, but in this case it won't filter out titles[11], which is not None but unfortunately just has a different format. If you update cleaned_titles to the following, it should work for you.

cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']

giving you:

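import requests
from bs4 import BeautifulSoup


def scrape_world():
    url = 'http://www.worldstarhiphop.com'
    # headers is assumed to be defined elsewhere, as in the question
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    # find_all returns every matching <section class="box">, however many there are
    titles = soup.find_all('section', 'box')

    # skip the "SUBMIT YOUR VIDEO" box, which has a different format
    cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']

    # build one dictionary per remaining box
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries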
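
The "drop to 14 instead of halting" behaviour from the question comes for free with slicing, because a slice that runs past the end of a list returns whatever is there rather than raising. A minimal sketch on plain lists (no network access) follows. pick_posts and its limit and shuffle parameters are hypothetical names introduced here for illustration.

```python
import random


def pick_posts(posts, limit=15, shuffle=True):
    """Return at most `limit` items from `posts`, optionally shuffled.

    Slicing never raises when `limit` exceeds len(posts); it simply
    returns every item available.
    """
    selected = list(posts[:limit])  # copy so the caller's list is not mutated
    if shuffle:
        random.shuffle(selected)  # shuffles the copy in place
    return selected


if __name__ == "__main__":
    fourteen = [f"post {i}" for i in range(14)]
    print(len(pick_posts(fourteen)))           # 14 available, 15 requested
    print(len(pick_posts(fourteen, limit=8)))  # capped at 8
```

In the function above, this would amount to calling pick_posts(cleaned_titles) before building entries. BeautifulSoup's find_all also accepts a limit argument, for example soup.find_all('section', 'box', limit=15), which stops the search early, though filtering afterwards can then leave fewer than 15 entries.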

OK, BeautifulSoup returns a different type of structure than I was expecting. However, I did push for clarification on the premise of an answer, so I will post this and retract it if there's an issue with it.

def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')

    cleaned_titles = [title for title in titles if title is not None]

    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries
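
One caveat, sketched under assumptions: find_all returns only matched tags, so it never yields None and the "is not None" filter keeps every box. What can be None is a child lookup such as box.img or box.strong on a box with a different layout. The sketch below filters on those instead. The inline HTML, the has_expected_parts helper and the build_entries function are made up for illustration, and it uses the built-in html.parser so it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one regular box and one oddly shaped box.
SAMPLE_HTML = """
<section class="box">
  <a class="video-box" href="/videos/video.php?v=abc"><img src="http://example.org/a.jpg"/></a>
  <strong class="title"><a href="/videos/video.php?v=abc">First video</a></strong>
</section>
<section class="box">
  <a href="vsubmit.php">SUBMIT YOUR VIDEO</a>
</section>
"""


def has_expected_parts(box):
    """True when the box has the link, image and title the entry needs."""
    return (
        box.a is not None
        and box.img is not None
        and box.strong is not None
        and box.strong.a is not None
    )


def build_entries(html, base_url="http://www.example.org"):
    """Build entry dictionaries, skipping boxes that lack expected children."""
    soup = BeautifulSoup(html, "html.parser")
    boxes = soup.find_all("section", "box")
    return [
        {
            "href": base_url + box.a.get("href"),
            "src": box.img.get("src"),
            "text": box.strong.a.text,
        }
        for box in boxes
        if has_expected_parts(box)
    ]


if __name__ == "__main__":
    print(build_entries(SAMPLE_HTML))
```

Checking for the parts an entry actually needs avoids hard-coding a specific href, at the cost of silently dropping any box whose layout changes.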

