
Merging a list of dataframes with different lengths and columns in python

I have a list of 100 dataframes that I am trying to merge into a single dataframe, but am unable to do so. All the dataframes have different columns and different lengths. For context, each dataframe consists of 4 sentiment scores (calculated using VaderSentiment). The dataframes have the following representation:

USER 1 DATAFRAME

created_at      | positive score of user 1 tweets | negative score of user 1 tweets | neutral score of user 1 tweets | compound score of user 1 tweets
23/2/2011 10:00 | 1.12                            | 1.3                             | 1.0                            | 3.3
24/2/2011 11:00 | 1.20                            | 1.1                             | 0.9                            | 2.5

USER 2 DATAFRAME

created_at      | positive score of user 2 tweets | negative score of user 2 tweets | neutral score of user 2 tweets | compound score of user 2 tweets
25/3/2011 23:00 | 0.12                            | 1.1                             | 0.1                            | 1.1
26/3/2011 08:00 | 1.40                            | 1.5                             | 0.4                            | 1.5
01/4/2011 19:00 | 1.80                            | 0.1                             | 1.9                            | 3.9

All the dataframes have one column in common, namely created_at. What I am trying to achieve is to merge all the dataframes on the created_at column, such that I get only one created_at column plus all the other columns from all the other dataframes. The result should have 400 columns of sentiment scores along with one created_at column.
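To illustrate the desired shape, here is a minimal sketch of that merge using `functools.reduce` with an outer `pd.merge` on `created_at` (toy frames with shortened column names standing in for two users' scores):

```python
import pandas as pd
from functools import reduce

# Toy stand-ins for two users' sentiment dataframes.
df1 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "24/2/2011 11:00"],
    "pos_u1": [1.12, 1.20],
    "neg_u1": [1.3, 1.1],
})
df2 = pd.DataFrame({
    "created_at": ["23/2/2011 10:00", "25/3/2011 23:00"],
    "pos_u2": [0.12, 1.40],
    "neg_u2": [1.1, 1.5],
})

# Outer-merge every frame on created_at: one shared date column,
# all score columns side by side, NaN where a user has no row.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="created_at", how="outer"),
    [df1, df2],
)
print(merged.shape)  # (3, 5): 1 date column + 2 score columns per user
```

With 100 files of 4 score columns each, the same pattern would yield the 1 + 400 columns described above.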

My code is as follows :

import pandas as pd
import glob
import numpy as np
import os
from functools import reduce


path = r'C:\Users\Desktop\Tweets'
allFiles = glob.glob(path + "/*.csv")
list = []
frame = pd.DataFrame()

count=0

for f in allFiles:
    file = open(f, 'r')
    count=count+1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    #print(df)
    list.append(df)

frame = pd.concat(list)
print(frame)

The problem is that when I run the code above, I get the desired arrangement of columns, but instead of the values I get NaN everywhere, so I essentially have a dataframe with 401 columns in which only the created_at column contains values.
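That NaN pattern is what row-wise `pd.concat` produces when the inputs have disjoint columns: every row keeps values only in the columns of the frame it came from. A minimal illustration:

```python
import pandas as pd

a = pd.DataFrame({"created_at": ["23/2/2011"], "score_a": [1.0]})
b = pd.DataFrame({"created_at": ["25/3/2011"], "score_b": [2.0]})

# Row-wise concat keeps every input row and takes the union of columns,
# so each row is NaN in every column that came from another frame.
stacked = pd.concat([a, b], sort=False)
print(stacked)
print(stacked["score_a"].isna().sum())  # 1: b's row has no score_a
```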

Any and all help is appreciated.

Thank you

EDIT

I have tried various solutions to similar questions posted here, but none of them seem to work, so as a last resort I have started this thread.

EDIT 2

I have perhaps come up with a solution to my problem. Using the code below, I can append all the columns into all_frames. However, this creates duplicate created_at columns, which happen to be of type object. If I could merge all the dates into one column, then my troubles would be much closer to being solved.

all_frames = []

for f in allFiles:
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)

    dates = df.iloc[:, 0]
    neut = df.iloc[:, 1]
    pos = df.iloc[:, 2]
    neg = df.iloc[:, 3]
    comp = df.iloc[:, 4]

    all_frames.append(dates)
    all_frames.append(neut)
    all_frames.append(pos)
    all_frames.append(neg)
    all_frames.append(comp)

frame = pd.concat(all_frames, axis=1)

Any help would be appreciated

I strongly suggest you revisit your data model. Having that many columns usually signals something is wrong. Having said that, here's one way to do it. Also, list is a built-in type; don't shadow it with a variable name.

I assume that other than created_at , the columns from each file are unique.

all_frames = []
for f in allFiles:
    # Parse created_at and use it as the index so concat can align on it
    df = pd.read_csv(f, parse_dates=['created_at'], index_col='created_at')
    all_frames.append(df)

# This will create a dataframe of size n * 400
# n is the total number of rows between all files
frame = pd.concat(all_frames, join='outer', copy=False, sort=False)

# If you want to line up rows with the same timestamp across all users
frame = frame.groupby(level=0).first()
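A small self-contained sketch of that approach, using two toy per-user frames in place of the CSV files (column names are my own for illustration):

```python
import pandas as pd

# Toy stand-ins for two per-user CSVs, indexed by created_at.
u1 = pd.DataFrame(
    {"pos_u1": [1.12, 1.20]},
    index=pd.to_datetime(["2011-02-23 10:00", "2011-02-24 11:00"]),
)
u2 = pd.DataFrame(
    {"pos_u2": [0.12]},
    index=pd.to_datetime(["2011-02-23 10:00"]),
)

# Stack rows; the outer join unions the columns, filling NaN elsewhere.
stacked = pd.concat([u1, u2], join="outer", sort=False)

# Collapse duplicate timestamps: first non-null value per column.
frame = stacked.groupby(level=0).first()
print(frame)
```

`GroupBy.first()` skips NaN by default, which is what merges each user's scores onto the shared timestamp row.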
