
Create Multilevel DataFrame by reading in data from multiple files using read_csv() [SOLVED]

I have 10 files with the following identical format and column names (values are different across different files):

    event_code  timestamp   counter
0   9071        1165783     NaN
1   9070        1165883     NaN
2   8071        1166167     NaN
3   7529        NaN         0.0
4   8529        NaN         1.0
5   9529        NaN         1.0

Due to the nature of the files, I am trying to store these data in a multilevel DataFrame like the following (eventually, I want the box_num level to go all the way to 10):

box_num                1                                 2                ...   
col_names   event_code  timestamp   counter |event_code timestamp   counter
      0     9071          1270451     1     |   8529       NaN       1    ...
      1     9070          1270484     0     |   9529       NaN       0    ...
      2     9071          1270736     1     |   5520       3599167   2    ...
      3     9070          1272337     3     |   7171       3599169   1    ...

I initially thought I could make a multilevel DataFrame from a dictionary, using the keys as the outer (hierarchical) index level and each DataFrame as the nested block under its key:

import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']

df_dict = {}
for i, f in enumerate(files):  # files: list of the actual file names
    df = pd.read_csv(f, sep=":", header=None, names=col_names)
    df_dict[i + 1] = df   # 'i + 1' so that the dict key corresponds to the actual box number

But I soon realized that I can't create a multilevel index or dataframe from a dictionary. So to create a Multilevel Index, this is what I did, but now I am stuck on what to do next...

(box_num, col_list) = df_dict.keys(), list(df_dict.values())[0].columns
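
One possible way to continue from that tuple, as a minimal sketch: pd.MultiIndex.from_product() can build the two-level column index from the box numbers and the column names, and the per-box values can then be placed side by side under it. The small df_dict below is a hypothetical stand-in for the one built in the loop, and the sketch assumes every file has the same number of rows.

```python
import numpy as np
import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']

# Hypothetical stand-in for the df_dict built in the loop above.
df_dict = {
    1: pd.DataFrame([[9071, 1165783, np.nan], [7529, np.nan, 0.0]], columns=col_names),
    2: pd.DataFrame([[8529, np.nan, 1.0], [5520, 3599167, 2.0]], columns=col_names),
}

box_num, col_list = list(df_dict.keys()), list(df_dict.values())[0].columns

# Outer level: box number; inner level: column name.
columns = pd.MultiIndex.from_product([box_num, col_list],
                                     names=['box_num', 'col_names'])

# Place the per-box values side by side (assumes equal row counts).
values = np.hstack([df_dict[k].to_numpy() for k in box_num])
multi_df = pd.DataFrame(values, columns=columns)
```

Going through NumPy casts every value to a common dtype (float here), which is one reason pd.concat() is usually the more convenient route.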

If there are other more efficient, concise ways to approach this problem, please let me know as well. Ideally, I would like to create the multilevel dataframe right after the for loop

::UPDATE:: [SOLVED]

So I eventually figured out a way to create a multilevel dataframe from the for loop using pd.concat(). I'll post my answer below. Hopefully it's helpful to someone.

import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']

result = []
box_num = []

for i, f in enumerate(files):  # files: list of the actual file names
    box_num.append(i + 1)  # box number, starting at 1

    df = pd.read_csv(f, sep=":", header=None, names=col_names)
    result.append(df)

# pd.concat() with axis=1 places all the DataFrames in the 'result' list side by side.
# The 'keys' option adds a hierarchical index at the outermost level of the columns,
# and 'names' labels the two column levels.

final_df = pd.concat(result, axis=1, keys=box_num, names=['Box Number', 'Columns'])
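
For what it is worth, pd.concat() also accepts a dictionary of DataFrames and uses its keys as the outer level, so the dictionary from the first attempt could be passed in directly. The sketch below assumes that approach and shows two ways of reading data back out. The in-memory "files" are hypothetical stand-ins for the real colon-separated files.

```python
import io
import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']

# Hypothetical in-memory stand-ins for two of the colon-separated files.
raw_files = [
    io.StringIO("9071:1165783:\n9070:1165883:\n7529::0.0\n"),
    io.StringIO("8529::1.0\n9529::1.0\n5520:3599167:2.0\n"),
]

# Dict keys are the box numbers (1-based).
df_dict = {
    i + 1: pd.read_csv(f, sep=":", header=None, names=col_names)
    for i, f in enumerate(raw_files)
}

# Passing the dict: its keys become the outer column level.
final_df = pd.concat(df_dict, axis=1, names=['Box Number', 'Columns'])

# All columns of box 1.
box_1 = final_df[1]

# The 'timestamp' column of every box, one column per box number.
timestamps = final_df.xs('timestamp', axis=1, level='Columns')
```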


I think you should use a pivot table or the pandas groupby function for this task. Neither will give you exactly the layout you requested above, but both are simpler to use.

Using your code as a starting point:

import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']
frames = []

for i, f in enumerate(files):
    df = pd.read_csv(f, sep=":", header=None, names=col_names)
    # instead of a dictionary, tag each row and build one master DataFrame
    df['box_num'] = i + 1   # box number, starting at 1
    df['idx'] = df.index    # row position within this file, shared across boxes
    frames.append(df)

data = pd.concat(frames, ignore_index=True)

# option 1: create a pivot table (columns become (column name, box_num) pairs)
pivot = data.pivot(index='idx', columns='box_num', values=col_names)

# option 2: use the pandas groupby function
group = data.groupby(['idx', 'box_num']).mean()

Hopefully one of these can help you get going in the right direction or work for what you are trying to accomplish. Good luck!
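
If the pivot output is close but the levels are the wrong way round (column name outside, box_num inside), the two column levels can be swapped and reordered so that box_num is outermost, as in the layout from the question. This is only a sketch: the small long-format data frame is a hypothetical stand-in for the master DataFrame above.

```python
import pandas as pd

col_names = ['event_code', 'timestamp', 'counter']

# Hypothetical long-format data: two boxes, two rows each.
data = pd.DataFrame({
    'event_code': [9071, 9070, 8529, 9529],
    'timestamp': [1270451.0, 1270484.0, float('nan'), float('nan')],
    'counter': [1, 0, 1, 0],
    'box_num': [1, 1, 2, 2],
    'idx': [0, 1, 0, 1],
})

# Columns come out as (column name, box_num).
pivot = data.pivot(index='idx', columns='box_num', values=col_names)

# Swap the levels so box_num is the outer one.
swapped = pivot.swaplevel(0, 1, axis=1)

# Reorder explicitly: box by box, columns in their original order.
order = pd.MultiIndex.from_product(
    [sorted(data['box_num'].unique()), col_names],
    names=['box_num', 'col_names'],
)
boxed = swapped.reindex(columns=order)
```

Reindexing against an explicit MultiIndex keeps the original column order inside each box, which a plain alphabetical sort of the columns would not.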
