Web Scraping using Beautiful Soup and executing multiple functions to add to a list

I'm fairly new to Python and I'm trying to web-scrape Facebook.

I have created a function for each section to extract, i.e. the poster name, captions, etc.

Here is the main part of the code:

import re
from bs4 import BeautifulSoup as bs

# driver is an already-initialised Selenium WebDriver with the Facebook page loaded
FacebookPosts = []

source_data = driver.page_source
bs_data = bs(source_data, 'html.parser')

NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})

def _extract_post_name(bs_data):
    postername = ""
    actualPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})
    for posts in actualPosts:
        postername = posts.find('strong').text
        #postername.append(paragraphs)
    return postername


def _extract_post_caption(bs_data):
    captionblocks = bs_data.find_all('div', {"class": re.compile('^ii04i59q')})
    captions = ""
    for captiondivs in captionblocks:
        caption = captiondivs.find('div', attrs={'style': 'text-align: start;'}).text
        #captions.append(caption)
    return caption


for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
    FacebookPosts.append(post)

print(FacebookPosts)

I have other functions for further extraction, but I'll keep it small for simplicity.

The issue at the moment is that, with this method, only one row in the dictionary is being shown, and it's always the same one. When I run the code from inside the function without the function, it prints multiple times. I know I can append to the list, but then there would be another issue.
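For illustration, the helpers above reassign a plain string on every loop pass, so a single call can only ever hand back one value. A minimal standalone sketch of that pattern, independent of the scraping code:

names = ['Steve', 'Bob']
postername = ""
for n in names:
    postername = n  # reassigns the same variable on every pass
print(postername)   # prints only 'Bob', the last value seen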

Ultimately, what I would like to extract is:

FacebookPosts{
    Post1{
        Poster Name : Steve
        Caption : Text inside Caption
    }
    Post2{
        Poster Name : Bob
        Caption : Please Help me
    }
}

What's being extracted now is:

FacebookPosts{
    Poster Name : Steve
    Caption : Text inside Caption

    Poster Name : Steve
    Caption : Text inside Caption
}

For every element found in NumberofPosts.

Any help is greatly appreciated; I've been stuck on this problem for days.

I believe that my problem is a lack of knowledge about functions and dictionaries/lists.

For example: how do you add entries to a dictionary from multiple sources, such as functions, and keep them in the same set?
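In general terms, the pattern being asked about is: call each extractor once per item, build one dictionary per item from their return values, and append that dictionary inside the loop. A minimal sketch, using hypothetical get_name/get_caption helpers and dummy data:

def get_name(item):
    # hypothetical extractor: pull the name out of one item
    return item['name']

def get_caption(item):
    # hypothetical extractor: pull the caption out of one item
    return item['caption']

items = [{'name': 'Steve', 'caption': 'Text inside Caption'},
         {'name': 'Bob', 'caption': 'Please Help me'}]

results = []
for item in items:
    # one dictionary per item, built from several function results,
    # appended inside the loop
    results.append({
        'Poster Name': get_name(item),
        'Caption': get_caption(item),
    })

print(results)  # two separate dictionaries in one list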

Oh, I think this might be a simple fix, brother.

for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
FacebookPosts.append(post)  # note: this append sits outside the for block

print(FacebookPosts)

There is an issue here: you need to put FacebookPosts.append(post) inside the for block, otherwise you're only appending the last post.

for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
    FacebookPosts.append(post)

print(FacebookPosts)

^That should fix it if I'm not mistaken.

I solved the issue. Basically, I had to change NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')}): that selector was grabbing the h2 headers, which only contained the name of the poster. It has now been changed to bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'}), which gets the wrapper of each post. I'll leave the post here just in case someone needs the code. Thanks for the help.
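A minimal sketch of that final structure, assuming driver is the Selenium WebDriver from the question and that the auto-generated class strings above still match Facebook's markup (they change often); the key change is passing each post's wrapper into the extractors instead of the whole page:

import re
from bs4 import BeautifulSoup as bs

def _extract_post_name(post_wrapper):
    # search only inside this post's wrapper, not the whole page
    header = post_wrapper.find('h2', {"id": re.compile('^jsc_c')})
    strong = header.find('strong') if header else None
    return strong.text if strong else ""

def _extract_post_caption(post_wrapper):
    block = post_wrapper.find('div', {"class": re.compile('^ii04i59q')})
    caption_div = block.find('div', attrs={'style': 'text-align: start;'}) if block else None
    return caption_div.text if caption_div else ""

bs_data = bs(driver.page_source, 'html.parser')
# one wrapper div per post; the class string is copied from the fix above
post_wrappers = bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'})

FacebookPosts = []
for wrapper in post_wrappers:
    FacebookPosts.append({
        'Original Poster:': _extract_post_name(wrapper),
        'Caption:': _extract_post_caption(wrapper),
    })

print(FacebookPosts)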
