
Python: from a list of integers, extract a one-hot encoded sequence for every 'x' elements (sliding window)

I have a list of numbers taken from a dataframe:

import pandas as pd
import numpy as np

file=pd.read_csv('chr1.bed', sep='\t',header=None)
file.head()
0       chr1    3037109    3037259
1       chr1    3037323    3037473
2       chr1    3037534    3037684
3       chr1    3037771    3037921
4       chr1    3038073    3038223

centers=(file[2]-file[1])/2+file[1]
centers=centers.apply(lambda x: round(x))

So 'centers' holds the rounded midpoint of each interval, halfway between the second and third columns. I would like to convert these positions into a binary sequence as follows:

start=file.iloc[0][1]
last=file.shape[0]
end=file.iloc[last-1][2]
values=np.zeros(end-start)
for i in centers:
        values[int(i-start+1)]=1

That is, the sequence is an array of zeros spanning from the first value in the second column of the dataframe to the last value in the third column. The 'centers' array is then used as an index to set the position of each center to 1.
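The marking step described above can also be written without a Python loop. The following is a minimal sketch, not the original code: the `bed` frame is a hypothetical stand-in for `chr1.bed` built from the five rows shown earlier, and the `uint8` dtype is an assumption made to save memory.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for chr1.bed: the five rows shown in the question.
bed = pd.DataFrame({
    0: ["chr1"] * 5,
    1: [3037109, 3037323, 3037534, 3037771, 3038073],
    2: [3037259, 3037473, 3037684, 3037921, 3038223],
})

# Midpoint of each interval, kept as integers so it can be used as an index.
centers = ((bed[1] + bed[2]) // 2).to_numpy()

start = bed[1].iloc[0]   # first value of the second column
end = bed[2].iloc[-1]    # last value of the third column

# Assumption: uint8 instead of the default float64, which is 8x smaller.
values = np.zeros(end - start, dtype=np.uint8)

# Same "+1" offset as the loop in the question, applied to all centers at once.
values[centers - start + 1] = 1

print(np.flatnonzero(values))
```

Indexing with an integer array sets every marked position in a single call, so the cost no longer depends on a Python-level loop over 'centers'.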

This works, but I now want to do the same over a sliding window: step through the sequence 100 positions at a time and take a chunk of 10000 values at each step. My first attempt was to walk through the 'values' array in steps of 100, taking the next 10000 values each time:

df=pd.DataFrame.transpose(pd.DataFrame(values[0:10000]))

for i in xrange(100,len(values)-10000,100):
        print(float(i)/float(len(values))) # time assessment
        df=df.append(pd.DataFrame.transpose(pd.DataFrame(values[i:i+10000])))

with open('test.csv', 'w') as f:
        df.to_csv(f, header=False)

According to the progress line I print to assess how long it is taking, this will complete after 4 days. There has to be a faster way of doing this.

Overall, my question is: can one convert a long sequence of unevenly placed integers into a series of one-hot encoded vectors in windows?

Here's a way of doing what you want without using for loops (which tend to be much slower than vectorized numpy operations).

For the record, turning a long sequence of unevenly placed integers into a binary sequence is called "one-hot" encoding, and here it is as easy as writing values[centers-start+1]=1. For the second part, the idea is to wrap your sequence of n values into an array with n+1 columns, which gives you the rolling window effect you are after.
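To see why n+1 columns produce the shift, here is a small sketch on a toy sequence where each value equals its own position. The names `seq`, `n_windows` and `wrapped` are illustrative only, and `np.tile` is used here in place of broadcasting for brevity.

```python
import numpy as np

seq = np.arange(5)                  # [0 1 2 3 4], so n = 5
n = seq.shape[0]
window_size = 3
n_windows = n - window_size + 1     # 3 full windows fit in the sequence

# Repeat the sequence, then keep exactly n_windows rows of n + 1 values.
flat = np.tile(seq, n_windows + 1)[: n_windows * (n + 1)]

# Each row is one value longer than the sequence, so every new row
# starts one position later in the sequence than the previous row.
wrapped = flat.reshape(n_windows, n + 1)
print(wrapped)
# [[0 1 2 3 4 0]
#  [1 2 3 4 0 1]
#  [2 3 4 0 1 2]]

# The first window_size columns are the rolling windows.
windows = wrapped[:, :window_size]
print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
```

The columns past `window_size` contain values that have wrapped around the end of the sequence, which is why they are cropped away.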

One caveat: this method builds several fairly large arrays (close to the initial sequence length squared), so you may have to split the sequence into chunks (a sequence of 10000 works fine with my 8GB of RAM, but 30000 is too much) and/or make the code more memory efficient.
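One possible way around the memory cost is a strided view instead of the reshape approach. This is a different technique from the one in this answer: it assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available, and the small `window_size` and `step` values are placeholders for the 10000 and 100 in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Same toy sequence as the example values used in this answer.
values = np.zeros(10, dtype=np.uint8)
values[[1, 2, 6]] = 1

window_size = 4   # placeholder for 10000
step = 2          # placeholder for 100

# A read-only view over the original buffer: no window is copied.
windows = sliding_window_view(values, window_size)[::step]

print(windows.shape)                      # (4, 4)
print(np.shares_memory(windows, values))  # True
print(windows)
# [[0 1 1 0]
#  [1 0 0 0]
#  [0 0 1 0]
#  [1 0 0 0]]
```

Because the result is only a view, memory use stays at the size of `values` until the windows are copied, for example when they are written out or converted to a DataFrame.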

import numpy as np
import pandas as pd

#example values to have an MCVE
centers =  np.array([5,6,10])
start=5
end=15
values=np.zeros(end-start)

#no need for a loop, we can use an array of indexes to assign the values
values[centers-start+1]=1

print("starting values:",values)

#here I'm choosing the window size
window_size = 4

#we start by duplicating the sequence using broadcasting with an array filled of ones 
broadcasted_arr = np.ones(values.shape[0]-window_size+2)[:,None]*values[None,:]
looped_values = broadcasted_arr.ravel()

#raveled array containing the whole sequence repeated the appropriate number of times to fill our final array
print("looped values :",looped_values)

#to create our rolling window trick we fill an array with one extra column
#that way each line will "eat" a value of the sequence shifting the rest by one column

#to be able to reshape to the desired shape we need to keep the exact number of values to fill the array
size = values.shape[0]+1
cropped_looped_values = looped_values[:size*(looped_values.shape[0]//size)]
#here's where the aforementioned trick happens
rolling_array = cropped_looped_values.reshape(-1,values.shape[0]+1)
#finally we crop the result to discard the part of the array that isn't relevant
result = rolling_array[:,:window_size]

print("final result :",result)

And here's the output:

starting values: 
[ 0.  1.  1.  0.  0.  0.  1.  0.  0.  0.]
looped values : 
[ 0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.
  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.
  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.
  1.  0.  0.  0.  1.  0.  0.  0.]
final result : 
[[ 0.  1.  1.  0.]
 [ 1.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]]
