Merging a list of dataframes with different lengths and columns in Python

Question

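I have a list of 100 dataframes that I am trying to merge into a single dataframe, but I have been unable to do so. All the dataframes have different columns and different lengths. For context, each dataframe contains 4 sentiment scores (computed with VaderSentiment). The dataframes look like this:

User 1 dataframe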

created_at       | positive score of user 1 tweets  |  negative score of user 1   tweets|    neutral score of user 1 tweets  | compound score of user 1 tweets |
23/2/2011 10:00  |           1.12                   |            1.3                    |                1.0                 |                  3.3            |
24/2/2011 11:00  |           1.20                   |            1.1                    |                0.9                 |                  2.5            |

User 2 dataframe

created_at       | positive score of user 2 tweets  |  negative score of user 2   tweets|    neutral score of user 2 tweets  | compound score of user 2 tweets |
25/3/2011 23:00  |           0.12                   |            1.1                    |                0.1                 |                  1.1            |
26/3/2011 08:00  |           1.40                   |            1.5                    |                0.4                 |                  1.5            |
01/4/2011 19:00  |           1.80                   |            0.1                    |                1.9                 |                  3.9            |

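All the dataframes have one column in common, created_at. What I want to achieve is to merge all the dataframes on the created_at column, so that I end up with a single created_at column plus all the other columns from every dataframe. The result should have 400 columns of sentiment scores along with the created_at column.

My code is as follows: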

import pandas as pd
import glob
import numpy as np
import os
from functools import reduce


path = r'C:\Users\Desktop\Tweets'
allFiles = glob.glob(path + "/*.csv")
list = []
frame = pd.DataFrame()

count=0

for f in allFiles:
    file = open(f, 'r')
    count=count+1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    #print(df)
    list.append(df)

frame = pd.concat(list)
print(frame)

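The problem is that when I run the code above, I get the desired column arrangement, but instead of the actual values I get NaN everywhere. I essentially end up with a dataframe of 401 columns in which only the created_at column contains values.

Any and all help is appreciated.

Thanks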

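Edit

I have tried various solutions posted for other questions here, but none of them seem to work, so as a last resort I started this thread.

Edit 2

I may have figured out a solution to my problem. Using the code below, I can append all the columns to frame. However, this creates duplicates of the created_at column, which happens to be of type object. If I could merge all the dates into a single column, I would be much closer to solving my problem.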

for f in allFiles:
    file = open(f, 'r')
    count=count+1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)

    dates = df.iloc[:,0]
    neut = df.iloc[:,1]
    pos = df.iloc[:,2]
    neg = df.iloc[:,3]
    comp = df.iloc[:,4]

    all_frames.append(dates)
    all_frames.append(neut)
    all_frames.append(pos)
    all_frames.append(neg)
    all_frames.append(comp)

frame = pd.concat(all_frames,axis=1)

Any help would be greatly appreciated.

Answer 1

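I strongly suggest you revise your data model. Having that many columns usually indicates a problem. That said, here is one way to do it. Also, list is a built-in data type; don't overwrite it with a variable name.

I assume that, apart from created_at, the columns in each file are unique.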
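
The answer does not spell out what a revised data model would look like. One common option, shown here only as an assumption, is a long ("tidy") layout: keep the same four score columns for everyone and add a user column, rather than creating four new columns per user. The to_long and make_cols helpers, the user column and the short column names are hypothetical; the numbers are the sample rows from the question.

```python
import pandas as pd

KINDS = ("positive", "negative", "neutral", "compound")

def make_cols(n: int) -> list:
    """Build the per-user column names used in the question's sample tables."""
    return ["created_at"] + [f"{kind} score of user {n} tweets" for kind in KINDS]

def to_long(df: pd.DataFrame, user: str) -> pd.DataFrame:
    """Rename the per-user score columns to shared names and tag each row with its user."""
    out = df.copy()
    # Assumes the column order is created_at, positive, negative, neutral, compound
    out.columns = ["created_at", *KINDS]
    out.insert(0, "user", user)
    return out

user1 = pd.DataFrame(
    [["23/2/2011 10:00", 1.12, 1.3, 1.0, 3.3],
     ["24/2/2011 11:00", 1.20, 1.1, 0.9, 2.5]],
    columns=make_cols(1),
)
user2 = pd.DataFrame(
    [["25/3/2011 23:00", 0.12, 1.1, 0.1, 1.1],
     ["26/3/2011 08:00", 1.40, 1.5, 0.4, 1.5],
     ["01/4/2011 19:00", 1.80, 0.1, 1.9, 3.9]],
    columns=make_cols(2),
)

# One row per (user, timestamp); adding users adds rows, not columns
long_frame = pd.concat(
    [to_long(user1, "user 1"), to_long(user2, "user 2")],
    ignore_index=True,
)
print(long_frame)
```

With this layout, 100 users produce more rows instead of 400 columns, and filtering or grouping by user becomes a one-liner.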

all_frames = []
for f in allFiles:
    # Parse created_at as day-first dates (e.g. 23/2/2011) and use it as the index
    df = pd.read_csv(f, parse_dates=['created_at'], dayfirst=True, index_col='created_at')
    all_frames.append(df)

# This will create a dataframe of size n * 400,
# where n is the total number of rows across all files
frame = pd.concat(all_frames, join='outer', sort=False)

# If you want to line up the hour across all users, collapse rows that share
# a timestamp, keeping the first non-null value in each column
frame = frame.groupby(level=0).first()
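
To see what the concat and groupby steps do, here is a minimal self-contained sketch that uses two small in-memory frames in place of the CSV files. The frames, their shortened column sets and the shared timestamp are made up for illustration, and the explicit day/month/year format string is an assumption based on the dates shown in the question.

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-user CSV files
user1 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "24/2/2011 11:00"],
    "positive score of user 1 tweets": [1.12, 1.20],
    "compound score of user 1 tweets": [3.3, 2.5],
})
user2 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "01/4/2011 19:00"],
    "positive score of user 2 tweets": [0.12, 1.80],
    "compound score of user 2 tweets": [1.1, 3.9],
})

all_frames = []
for df in (user1, user2):
    df = df.copy()
    # Assumed format: day/month/year hour:minute
    df["created_at"] = pd.to_datetime(df["created_at"], format="%d/%m/%Y %H:%M")
    all_frames.append(df.set_index("created_at"))

# Stacking gives one row per (file, timestamp); columns from other files are NaN
stacked = pd.concat(all_frames, join="outer", sort=False)

# Collapse duplicate timestamps, keeping the first non-null value per column
merged = stacked.groupby(level=0).first()
print(merged)
```

Timestamps that appear in only one file keep NaN in the other users' columns, which is expected for an outer join.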
