
Can't figure out how to properly output my data

I'm a relative novice at Python, but I somehow managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.

Current output:

I've managed to isolate the 5 most commonly used hashtags, but I get this result in my CSV:

[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]

[screenshot: CSV output]

Desired output:

What I'm looking to end up with is 5 columns at the end of my CSV, each outputting the X-th most commonly used value.

So something along the lines of this:

[screenshot: desired output]
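In other words, roughly this layout (a mock-up, since the screenshot above did not survive; the username and column order are illustrative only):

USER, ..., Top 1, Top 2, Top 3, Top 4, Top 5
someuser, ..., #striveforgreatness, #jamesgang, #thekidfromakron, #togetherwecanchangetheworld, #halloweenchronicles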

I've Googled for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as an output. I seem to be missing some part of the puzzle :(.

Here is what I'm working with at the moment:

import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter
ts = time.gmtime()


def get_csv_header(top_numb):
        fieldnames = ['USER','MEDIA COUNT','FOLLOWERCOUNT','TOTAL LIKES','TOTAL COMMENTS','ER','ER IN %', 'BIO', 'ALL CAPTION TEXT','HASHTAGS COUNTED','MOST COMMON HASHTAGS']
        return fieldnames


def write_csv_header(filename, headers):
        with open(filename, 'w', newline='') as f_out:
            writer = csv.DictWriter(f_out, fieldnames=headers)
            writer.writeheader()
        return

def read_user_name(t_file):
        with open(t_file) as f:
            user_list = f.read().splitlines()
        return user_list
if __name__ == '__main__':

    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # Which contains a list of usernames's BY DEFAULT <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'

    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv)', Where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST

    print('Starting the engagement calculations... Please wait until it finishes!')


    users = read_user_name(USER_FILE)
    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)

    for user in users:

        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'
        # define the timestamp up front so the error branches below can also use it
        timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)

        #for troubleshooting, un-comment the next two lines:
        #print(user)
        #print(url)

        try: 
            r = requests.get(url)
            if r.status_code != 200: 
                print(timestamp,' user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]

            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        except ValueError:
            print(timestamp,'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
        # Output expected IndexErrors.
            print(timestamp, error)
            continue
        if j['graphql']['user']['edge_followed_by']['count'] <=0:
            print(timestamp,'user {0} has no followers! Skipping...'.format(user))
            continue
        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] <12:
            print(timestamp,'user {0} has less than 12 posts! Skipping...'.format(user))
            continue
        if j['graphql']['user']['is_private'] is True:
            print(timestamp,'user {0} has a private profile! Skipping...'.format(user))
            continue
        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']
        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11:
                node = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']
                total_likes += node['edge_liked_by']['count']
                total_comments += node['edge_media_to_comment']['count']
                # posts without a caption have an empty 'edges' list, so guard against IndexError
                caption_edges = node['edge_media_to_caption']['edges']
                if caption_edges:
                    all_captiontext += caption_edges[0]['node']['text']
                i += 1
        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount)/12)*100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount)/12*100)

        # isolate and count hashtags; \w+ (rather than \w*) so a bare '#' is not counted
        hashtags = re.findall(r'#\w+', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)

        with open(DATA_FILE, 'a', newline='', encoding='utf-8') as data_out:

            print(timestamp,'Writing Data for user {0}...'.format(user))            
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted
            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)

""" Done with the script """
print('ALL DONE !!!! ')

The code that goes before this simply scrapes the webpage and compiles all the captions from the last 12 posts into "all_captiontext".

Any help solving this (probably simple) issue would be greatly appreciated, as I've been struggling with it for days (again, I'm a noob :') ).

Replace the line

post_info["MOST COMMON HASHTAGS"] = most_common

with:

for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#', '')
    label = "Top %d" % (i + 1)
    post_info[label] = tag_name

There's also a bit of code missing. For example, your code doesn't include the csv_headers variable, which I suppose would be

csv_headers = post_info.keys()
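Note that csv.DictWriter raises a ValueError when a row dictionary contains keys that are not in its fieldnames, so the header row written at the top of the file has to include the new 'Top 1' through 'Top 5' labels as well. An alternative sketch that reuses the helpers from the question (the assumption that exactly five columns are always wanted is mine):

csv_headers = get_csv_header(MAX_POST) + ['Top %d' % n for n in range(1, 6)]
write_csv_header(DATA_FILE, csv_headers)

The old 'MOST COMMON HASHTAGS' column can either be dropped from get_csv_header or left in place; DictWriter fills any missing key with an empty string by default.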

It also seems that you're opening a file to write just one row. I don't think that's intended, so what you would like to do is collect the results into a list of dictionaries. A cleaner solution would be to use a pandas DataFrame, which you can output straight into a CSV file.
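A minimal sketch of that pandas approach (the rows list and where it is filled are assumptions, not part of the original script):

import pandas as pd

rows = []                        # collect one post_info dict per user
# inside the for-loop, instead of writing each row immediately:
# rows.append(post_info)

df = pd.DataFrame(rows)          # one column per dictionary key
df.to_csv('users_with_er.csv', index=False, encoding='utf-8')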

most_common being the output of the call to hashtags_counted.most_common, I had a look at the doc here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common

Output is formatted as follows: [(key, value), (key, value), ...], ordered by decreasing number of occurrences.
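To illustrate with a few of the counts from the question:

from collections import Counter

counts = Counter({'#striveforgreatness': 3, '#thekidfromakron': 2, '#halloweenchronicles': 1})
print(counts.most_common(2))   # [('#striveforgreatness', 3), ('#thekidfromakron', 2)]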

Hence, to get only the name and not the number of occurrences, you should replace:

post_info["MOST COMMON HASHTAGS"] = most_common

by:

post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]

You have a list of tuples. This statement builds, on the fly, the list of the first element of each tuple, keeping the sort order.
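For example, with the values from the question (a quick check you can run on its own):

most_common = [('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2)]
print([x[0] for x in most_common])   # ['#striveforgreatness', '#jamesgang', '#thekidfromakron']

If you would rather have one flat cell in the CSV than a Python list literal, joining the names first, e.g. ' '.join(x[0] for x in most_common), is one option.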
