
How to load csv with variables in one column into dataframes

I have a csv with just one column:

demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;

How can I load it into Python such that the first dataframe will be:

 index    column1 column2 column3
 1          1       4        3     
 2          2       3        4    
 3          3       3        3    
 4          4       4        3  

And the second will be

 index    column1 column2 column3
 1          160     0        0    
 2          80      0        0   
 3          40      0        0  
 4          80      0        0   

One approach is to read in the whole CSV, create groups that break at the section-header rows ('demand', 'workhours') -- the rows with a non-numeric index, which also have NaN in all three value columns -- and then build a dictionary from those groups:

from io import StringIO
from pprint import pprint

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

# Start a new group at every non-numeric index value (the 'demand'/'workhours' header rows)
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
dfs = {}
for _, sub_df in groups:
    # Use the header row's index value as the key; the rest of the chunk is the data
    dfs[sub_df.index[0]] = sub_df.iloc[1:]

pprint(dfs)

dfs:

{'demand':    column1  column2  column3
1      4.0      3.0      2.0
2      3.0      4.0      2.0
3      3.0      3.0      4.0
4      4.0      3.0      2.0,
 'workhours':    column1  column2  column3
1    160.0      NaN      NaN
2     80.0      NaN      NaN
3     40.0      NaN      NaN
4     80.0      NaN      NaN}

Some groupby options:

  1. Assuming that there are no all-NaN rows except the rows where the table breaks occur:
groups = tuple(df.groupby(df.isna().all(axis=1).cumsum()))
  2. Assuming that all indexes except the rows where the breaks occur are numeric:
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
  3. Assuming there are occasionally both non-numeric indexes and all-NaN rows, but that only rows where both occur are breaks:
groups = tuple(df.groupby((~df.index.str.isnumeric() & df.isna().all(axis=1)).cumsum()))
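
Whichever grouping option you pick, the 'workhours' chunk keeps NaN where the question's desired output shows zeros. A minimal follow-up sketch, assuming you want to reproduce the exact tables from the question, is to fill and cast after splitting:

# Replace NaN with 0 and use integer dtype to match the desired output
dfs = {name: chunk.fillna(0).astype(int) for name, chunk in dfs.items()}

print(dfs['workhours'])   # column1 = 160, 80, 40, 80 and zeros elsewhere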

Given the shape of the data you gave (the first column contains the name of each block, 'demand' or 'workhours', followed by the row indexes), you can try:

df = pd.read_csv("yourcsv.csv",
                 sep=";",
                 names=["index", "column1", "column2", "column3"]).fillna(0)

Then create a new column, label, holding the name of the data for the whole corresponding chunk:

df["label"] = df["index"].apply(lambda x: x if not x.isnumeric() else pd.NA).fillna(method = "ffill")
df = df[df["index"] != df["label"]]

This will give you a dataframe looking like this:

   index  column1  column2  column3      label
1      1      4.0      3.0      2.0     demand
2      2      3.0      4.0      2.0     demand
3      3      3.0      3.0      4.0     demand
4      4      4.0      3.0      2.0     demand
6      1    160.0      0.0      0.0  workhours
7      2     80.0      0.0      0.0  workhours
8      3     40.0      0.0      0.0  workhours
9      4     80.0      0.0      0.0  workhours

You can then separate the data based on the "label" column:

df_demand = df.loc[df.label=="demand"].set_index('index').drop(['label'], axis=1)
df_workhours = df.loc[df.label=="workhours"].set_index('index').drop(['label'], axis=1)

df_demand and df_workhours then hold the two tables separately.
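
If you need to do this for more than one file, the steps above can be wrapped into a small helper. This is just a sketch under the same assumptions as the answers above (semicolon separator, a non-numeric name row starting each block); the function name split_blocks is made up for illustration:

import pandas as pd

def split_blocks(path):
    """Read a ';'-separated file where each block starts with a name row
    (e.g. 'demand;;;') and return {block_name: dataframe}."""
    df = pd.read_csv(path, sep=';', index_col=0,
                     names=['column1', 'column2', 'column3'])
    # A new block starts at every non-numeric index value
    block_id = (~df.index.str.isnumeric()).cumsum()
    return {chunk.index[0]: chunk.iloc[1:].fillna(0)
            for _, chunk in df.groupby(block_id)}

# dfs = split_blocks("yourcsv.csv")
# dfs['demand'], dfs['workhours']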

You can use

pd.read_csv('filename.csv', index_col = 0, sep = ';')

where the 0 in index_col = 0 is the position of the column you want to use as the index.
