Padding one dimension of ndarray with 0s

I have a dataset consisting of IDs, each of which exists over some subset of a range of timestamps. There are 1813 timestamps [0, ..., 1812]; some IDs exist over all timestamps, some over the range (0, n), some over (n, m) and some over (m, 1812). Each ID has 108 features at each timestamp at which it exists.

I currently create an ndarray with the following line:

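# Shape: (1424, ?, 108) = (numIDs, numIDTimestamps, numFeatures)
# Ragged along dimension 1, so this is an object array of 2-D arrays.
# (.as_matrix() was removed in pandas 1.0; .to_numpy() replaces it.)
inputMatrix = np.array([df.loc[df['id'] == ID, features].to_numpy() for ID in IDs], dtype=object)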

Here each element in dimension 1 has a length equal to the number of timestamps over which that ID exists. Instead, I need every element in this dimension to have length 1813, padding any timestamp at which a given ID does not exist with an array of 108 zeros.

In pseudocode:

for each ID:
    for each timestamp:
        if ID exists at timestamp:
            append its array of 108 features
        else:
            append array of 108 0s

What is the most efficient, Pythonic way to achieve this, in a similar fashion to what I have done above?

EDIT

Here is a sample structure of my dataset which I import into a Pandas DataFrame:

id      timestamp   derived_0   ...     technical_108     y
10      0           0.370326    ...     NaN             -0.011753
11      0           0.014765    ...     NaN             -0.001240
12      0           -0.010622   ...     NaN             -0.020940
25      0           NaN         ...     NaN             -0.015959
26      0           0.176693    ...     NaN             -0.007338

...     ...         ...         ...     ...             ...

2150    1812        -0.123364   ...     0.001004        0.004604
2151    1812        -10.437184  ...     0.044597        -0.009241
2154    1812        -0.077930   ...     0.030816        -0.006852
2156    1812        -0.269845   ...     -0.011706       -0.000785
2158    1812        NaN         ...     NaN             0.003497

And this is the processing I have done up to the inputMatrix line above:

df = df.fillna(df.mean())

# SORT BY LAST TIMESTAMP
df = df.assign(start=df.groupby('id')['timestamp'].transform('min'),
               end=df.groupby('id')['timestamp'].transform('max'))\
               .sort_values(by=['end', 'start', 'timestamp'])

cols = list(df)
featureNames = ['derived', 'fundamental', 'technical']
features = [col for col in cols if col.split('_')[0] in featureNames]
numFeatures = len(features)
IDs = list((df['id'].unique()))                 # Sorted by ascending last timestamp
timestamps = list(df['timestamp'].unique())     # Sorted

"Sort by last timestamp" means that the rows of the DataFrame are reordered so that the IDs with the lowest ending timestamp come first, while the rows of each ID stay ordered by timestamp.

For example:

id      timestamp    ...
1314    0            ...
1314    1
1314    2
1699    0
1699    1
1699    2
1699    3

...

You can append a row for every id at every timestamp from 0 to 1812, and then remove the appended rows for which the (id, timestamp) pair is duplicated and the y column is missing.

A rough sketch of this code is below:

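import pandas as pd

# One placeholder row per (id, timestamp) pair; all other columns are NaN
filler = pd.MultiIndex.from_product([IDs, range(1813)], names=['id', 'timestamp']).to_frame(index=False)
df = pd.concat([df, filler], ignore_index=True)

# Drop placeholders that duplicate a real row (pair is duplicated and y is missing)
df = df[~(df.duplicated(subset=['id', 'timestamp'], keep=False) & df['y'].isnull())]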

After this, fill the features of the added rows with 0 (for example with df.fillna(0)), re-sort the rows, and then apply your existing code.
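
As an alternative, here is a minimal sketch that builds the padded array directly by reindexing onto the full (id, timestamp) grid. It assumes each (id, timestamp) pair occurs at most once in the DataFrame (reindexing fails on duplicate labels); the helper name pad_to_dense and the toy data are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def pad_to_dense(df, ids, features, num_timestamps):
    """Return an array of shape (len(ids), num_timestamps, len(features)).

    Rows for (id, timestamp) pairs that are absent from df are all zeros.
    Assumes each (id, timestamp) pair appears at most once in df.
    """
    # Full grid of pairs, with ids as the outer level so the reshape below is valid
    full_index = pd.MultiIndex.from_product(
        [ids, range(num_timestamps)], names=["id", "timestamp"]
    )
    dense = (
        df.set_index(["id", "timestamp"])[features]
        .reindex(full_index, fill_value=0.0)
    )
    return dense.to_numpy().reshape(len(ids), num_timestamps, len(features))


# Toy data (made up): id 1 exists at timestamps 0-1, id 2 only at timestamp 2
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "timestamp": [0, 1, 2],
    "derived_0": [0.1, 0.2, 0.3],
    "technical_0": [1.0, 2.0, 3.0],
})
out = pad_to_dense(toy, ids=[1, 2], features=["derived_0", "technical_0"],
                   num_timestamps=3)
```

With the question's variables this would be called as pad_to_dense(df, IDs, features, 1813). The order of ids is preserved, so the "sorted by last timestamp" ordering of IDs carries over to the first dimension of the result.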
