
How to convert a folder of pickle files into a single CSV file

I have a directory containing about 1,700 pickle files. Each file holds all the Twitter posts of one user. I want to convert it into a folder of CSV files, where each CSV file is named after its pickle file and each row contains one of the user's tweets. After that, I want to keep only the 20 CSV files with the most samples. How can I do that?

# khabarlist = open_file_linebyline(pkl_path)
def open_dir_in_dict(input_path):
    files = os.scandir(input_path)
    my_dict = {}
    for file in files:
        # if len(file.name.split()) > 1:
        #     continue
        # if file.split('.')[-1] != "pkl":

        with open(file, 'r', encoding='utf8') as f:
            items = [i.strip() for i in f.read().split(",")]
        my_dict[file.replace(".pkl", "")] = items
        df = pd.DataFrame(my_dict)
        df.to_excel(file.replace(".pkl", "") + "xlsx")


open_dir_in_dict("Raw/")

I wrote some sample code for this, but it did not work:

def open_dir_in_dict(input_path):
    files = os.scandir(input_path)
    my_dict = {}
    for file in files:
        if len(file.name.split()) > 1:
            continue
        if file.split('.')[-1] != "pkl":

            with open(file, 'r', encoding='utf-8', errors='replace') as f:
                print(f.readlines())
                items = [i.strip() for i in f.read().split(",")]  # encode('utf-8').strip()
        my_dict[file.replace(".pkl", "")] = items
        df = pd.DataFrame(my_dict)
        df.to_excel(file.replace(".pkl", "") + "xlsx")


# open_dir_in_dict("Raw/")

And a better answer:

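import os
import re

import pandas as pd

data_path = "/content/drive/My Drive/twint/Data/pkl/Data/"
out_path = "/content/drive/My Drive/twint/final.csv"

for name in sorted(os.listdir(data_path)):
    if not name.endswith(".pkl"):
        continue  # skip anything that is not a pickle file
    # Each pickle is expected to hold a DataFrame with a "tweet" column.
    df = pd.read_pickle(os.path.join(data_path, name))
    # Keep only the tweets that contain no URL.
    my_tweets = [tweet for tweet in df.tweet if not re.search(r"http\S+", tweet)]
    new_df = pd.DataFrame({"tweets": my_tweets, "author": name[: -len(".pkl")]})
    # Append to one CSV; write the header only if the file does not exist yet,
    # so the column names are not repeated for every user.
    new_df.to_csv(out_path, index=False, mode="a", header=not os.path.exists(out_path))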
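
The question also asks for one CSV per pickle file and for the 20 users with the most tweets. A minimal sketch of that part, assuming each pickle holds a pandas DataFrame with a `tweet` column (as in the answer's code); the function names `pickles_to_csvs` and `top_authors` and the directory arguments are made up for illustration, and the URL filter is kept only to mirror the answer:

```python
import os
import re

import pandas as pd

URL_RE = re.compile(r"http\S+")


def pickles_to_csvs(in_dir, out_dir):
    """Write one CSV per .pkl file and return {author: number_of_rows}."""
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    for name in sorted(os.listdir(in_dir)):
        if not name.endswith(".pkl"):
            continue  # ignore non-pickle files
        author = name[: -len(".pkl")]
        # Assumption: the pickle is a DataFrame with a "tweet" column.
        df = pd.read_pickle(os.path.join(in_dir, name))
        # Drop tweets that contain a URL, as the answer does.
        tweets = [t for t in df["tweet"] if not URL_RE.search(t)]
        pd.DataFrame({"tweets": tweets}).to_csv(
            os.path.join(out_dir, author + ".csv"), index=False
        )
        counts[author] = len(tweets)
    return counts


def top_authors(counts, n=20):
    """Return the n authors with the most rows, largest first."""
    return sorted(counts, key=counts.get, reverse=True)[:n]
```

Calling `top_authors(pickles_to_csvs(src, dst))` would then give the names of the 20 largest CSV files, which could be copied elsewhere or used to delete the rest.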
