Pandas Dataframes: how to build them efficiently

I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:

data = pd.read_csv(r'train.data', sep=" ", header = None) # Not slow
def collectData(row):
    id = row[0]
    df = dictionary[id] # Row content determines which dataframe this row belongs to
    next = len(df.index)
    df.loc[next] = row
data.apply(collectData, axis=1)

It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 sec. With the actual function it takes at least 10 minutes and I'm not sure if it would finish.

Here are a few sample rows from the dataset:

1 1 4
1 2 2
1 3 10
1 4 4

The full dataset is available here (if you click on Matlab version)

Since the full data set fits easily in memory, the following should be fairly quick:

data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
     0   1  2
769  2  12  4
770  2  16  2
771  2  23  4
772  2  27  2
773  2  29  6

You may also want to reset the indices or copy each slice when you split the data frame into smaller data frames.
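As a rough sketch of the same split with the index reset, `groupby` can build every sub-frame in one pass instead of one boolean mask per id. The toy rows below are made-up stand-ins for `train.data`, not values from the real file:

```python
import pandas as pd

# Toy stand-in for train.data (assumption): column 0 holds the id
# that decides which DataFrame a row belongs to.
data = pd.DataFrame([[1, 1, 4], [1, 2, 2], [2, 12, 4], [2, 16, 2], [3, 5, 1]])

# groupby yields (key, sub-frame) pairs, so the data is scanned once.
# reset_index(drop=True) gives each sub-frame its own 0..n-1 index.
data_split = {
    key: group.reset_index(drop=True)
    for key, group in data.groupby(0)
}

print(data_split[2])
```

This also avoids hard-coding `range(1, 21)`, since the keys come from whatever ids actually appear in column 0.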

Your approach is not vectorized, because you apply a Python function row by row.

Rather than creating 20 DataFrames, build a dictionary that maps each key (the value in column 0) to a group index in range(20). Then add this information to your DataFrame:

 data['dict']=data[0].map(dictionary)

Then reorganize:

 data2=data.reset_index().set_index(['dict','index'])

data2 looks like this:

            0  1   2
dict index          
12   0      1  1   4
     1      1  2   2
     2      1  3  10
     3      1  4   4 
     4      1  5   2
     ....

and data2.loc[i] is one of the DataFrames you want.
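The steps above can be tried end to end on a few rows. This is only a sketch: the rows and the id-to-group mapping are invented for illustration, and the `sort_index()` call is an addition that is not part of the steps above (it keeps lookups on the two-level index fast).

```python
import pandas as pd

# Toy rows (assumption): column 0 is the id used as the dictionary key.
data = pd.DataFrame([[1, 1, 4], [1, 2, 2], [2, 3, 10], [3, 4, 4]])

# Hypothetical mapping from id to target group.
dictionary = {1: 12, 2: 7, 3: 12}

# Vectorized lookup: one map call instead of a Python function per row.
data['dict'] = data[0].map(dictionary)

# Move the group label and the original row number into a two-level index.
data2 = data.reset_index().set_index(['dict', 'index']).sort_index()

# Selecting on the first level returns the rows of one group.
group_12 = data2.loc[12]
print(group_12)
```

If separate objects are still needed, `dict(tuple(data.groupby('dict')))` would give one DataFrame per group label.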

EDIT:

It seems that the dictionary is described in train.label.

You can build the dictionary beforehand with:

with open(r'train.label') as f:
    v = [int(x) for x in f]  # one label per line; len(v) = 11269 = data[0].max()
# ids in column 0 start at 1, so pair label number k with id k
dictionary = dict(zip(range(1, len(v) + 1), v))

If you want to build them efficiently, I think you need some good raw materials:

  • wood
  • cement

Are robust and durable. Try to avoid using hay as the dataframe can be blown up with a little wind.

Hope that helps
