
Can't figure out how to properly output my data

I'm a relative newbie to Python, but somehow managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.

Current output:

I managed to isolate the 5 most commonly used hashtags, but I get this result in my csv:

[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]

[screenshot: csv output]

Desired output:

What I would like to end up with is 5 columns at the end of my .CSV, outputting the Xth most-used value.

So something along these lines:

[screenshot: desired output]

I've been googling for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as the output. I seem to be missing some part of the puzzle :(.

Here is what I'm currently working with:

import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter
ts = time.gmtime()


def get_csv_header(top_numb):
        fieldnames = ['USER','MEDIA COUNT','FOLLOWERCOUNT','TOTAL LIKES','TOTAL COMMENTS','ER','ER IN %', 'BIO', 'ALL CAPTION TEXT','HASHTAGS COUNTED','MOST COMMON HASHTAGS']
        return fieldnames


def write_csv_header(filename, headers):
        with open(filename, 'w', newline='') as f_out:
            writer = csv.DictWriter(f_out, fieldnames=headers)
            writer.writeheader()
        return

def read_user_name(t_file):
        with open(t_file) as f:
            user_list = f.read().splitlines()
        return user_list
if __name__ == '__main__':

    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # which contains a list of usernames. BY DEFAULT <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'

    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv)', Where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST

    print('Starting the engagement calculations... Please wait until it finishes!')


    users = read_user_name(USER_FILE)
    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)

    for user in users:

        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'

        #for troubleshooting, un-comment the next two lines:
        #print(user)
        #print(url)

        # build the timestamp up front so it is available in the error messages below
        timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)
        try:
            r = requests.get(url)
            if r.status_code != 200:
                print(timestamp, 'user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]

            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        except ValueError:
            print(timestamp,'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
        # Output expected IndexErrors.
            print(timestamp, error)
            continue
        if j['graphql']['user']['edge_followed_by']['count'] <=0:
            print(timestamp,'user {0} has no followers! Skipping...'.format(user))
            continue
        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] <12:
            print(timestamp,'user {0} has less than 12 posts! Skipping...'.format(user))
            continue
        if j['graphql']['user']['is_private'] is True:
            print(timestamp,'user {0} has a private profile! Skipping...'.format(user))
            continue
        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']
        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11: 
                total_likes += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_liked_by']['count']
                total_comments += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_comment']['count']
                captions = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_caption']
                caption_detail = captions['edges'][0]['node']['text']
                all_captiontext += caption_detail
                i += 1
        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount)/12)*100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount)/12*100)

        #isolate and count hashtags
        hashtags = re.findall(r'#\w*', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)

        with open('users_with_er.csv', 'a', newline='',  encoding='utf-8') as data_out:

            print(timestamp,'Writing Data for user {0}...'.format(user))            
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted
            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)

""" Done with the script """
print('ALL DONE !!!! ')

The code before this point just scrapes the web page and compiles all captions from the last 12 posts into 'all_captiontext'.

Any help solving this (probably very simple) problem would be greatly appreciated, as I've been struggling with it for days (again, I'm a noob :')).

Replace the line

post_info["MOST COMMON HASHTAGS"] = most_common

with:

for i, counter_tuple in enumerate(most_common):
  tag_name = counter_tuple[0].replace('#','')
  label = "Top %d" % (i + 1)
  post_info[label] = tag_name
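
For illustration, running that loop over the sample most_common list from your question would fill post_info with something like this (the 'Top N' labels are just the ones the loop builds, and they would also need to appear in your CSV header):

most_common = [('#striveforgreatness', 3), ('#jamesgang', 3),
               ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1),
               ('#halloweenchronicles', 1)]

post_info = {}
for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#', '')  # strip the leading '#'
    label = "Top %d" % (i + 1)                    # 'Top 1' ... 'Top 5'
    post_info[label] = tag_name

print(post_info)
# {'Top 1': 'striveforgreatness', 'Top 2': 'jamesgang', 'Top 3': 'thekidfromakron',
#  'Top 4': 'togetherwecanchangetheworld', 'Top 5': 'halloweenchronicles'}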

There is also some code missing. For example, your code does not show the csv_headers variable, which I suppose is

csv_headers = post_info.keys()

It also seems like you are opening a file just to write a single row. I don't think that's intentional, so what you want to do is collect your results into a list of dictionaries. A neat solution is to use pandas' DataFrame, which you can output directly to a csv file.
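
A minimal sketch of that idea, assuming you append one dictionary per user to a list inside your loop (the field names and values below are made up for illustration):

import pandas as pd

# Hypothetical per-user results, collected in a list instead of being written row by row.
rows = [
    {'USER': 'user_a', 'FOLLOWERCOUNT': 1200, 'Top 1': 'striveforgreatness'},
    {'USER': 'user_b', 'FOLLOWERCOUNT': 5400, 'Top 1': 'jamesgang'},
]

df = pd.DataFrame(rows)                      # one column per dictionary key
df.to_csv('users_with_er.csv', index=False)  # write the whole table to csv in one go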

most_common is the output of the call to hashtags_counted.most_common; I had a look at the documentation here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common

The output is formatted as follows: [(key, value), (key, value), ...], sorted in decreasing order of number of occurrences.

So, to get only the names and not the number of occurrences, you should replace:

post_info["MOST COMMON HASHTAGS"] = most_common

with:

post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]

You have a list of tuples there. This statement builds, on the fly, the list of the first element of each tuple, keeping the sort order.
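
For example, applied to the sample most_common list from your question:

most_common = [('#striveforgreatness', 3), ('#jamesgang', 3),
               ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1),
               ('#halloweenchronicles', 1)]

print([x[0] for x in most_common])
# ['#striveforgreatness', '#jamesgang', '#thekidfromakron',
#  '#togetherwecanchangetheworld', '#halloweenchronicles']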
