Python: from a list of integers, extract a one-hot encoded sequence for every 'x' elements (sliding window)

I have a list of numbers taken from a dataframe:

import pandas as pd
import numpy as np

file=pd.read_csv('chr1.bed', sep='\t',header=None)
file.head()
0       chr1    3037109    3037259
1       chr1    3037323    3037473
2       chr1    3037534    3037684
3       chr1    3037771    3037921
4       chr1    3038073    3038223

centers=(file[2]-file[1])/2+file[1]
centers=centers.apply(lambda x: round(x))

Hence the 'centers' array holds the midpoint between the second and third columns (the start and end of each interval), rounded to the nearest integer. I would like to convert this sequence into a binary sequence as follows:

start=file.iloc[0][1]
last=file.shape[0]
end=file.iloc[last-1][2]
values=np.zeros(end-start)
for i in centers:
        values[int(i-start+1)]=1

That is, the sequence must be an array of zeros spanning from the first value in the second column of the dataframe to the last value in the third column. Then, using the 'centers' array as an index, the positions of the centers are marked as 1 in the sequence.

This operation works fine; however, I now want to perform it in a sliding window, moving in steps of 100 and taking chunks of 10000 values from the sequence. Initially I tried doing this by taking the 'values' array, moving through it in steps of 100, and taking the next 10000 values each time:

df=pd.DataFrame.transpose(pd.DataFrame(values[0:10000]))

for i in xrange(100,len(values)-10000,100):
        print(float(i)/float(len(values))) # time assessment
        df=df.append(pd.DataFrame.transpose(pd.DataFrame(values[i:i+10000])))

with open('test.csv', 'w') as f:
        df.to_csv(f, header=False)

According to the line I used to assess how long it is taking, this will complete after 4 days... There has to be a faster way of doing this.

Overall, my question is: can one convert a long sequence of unevenly placed integers into a series of one-hot encoded vectors in windows?

Here's a way of doing what you want without using for loops (which tend to be much slower than using numpy syntax):

Just for the record, "convert a long sequence of unevenly placed integers into a series of continuous binary" is called "one-hot" encoding, and here it is as easy as writing values[centers-start+1]=1. Now for the second part, the idea is to loop your sequence of n values in an array of n+1 columns, so that you get the rolling window effect you are after.

Note, though, that this method builds several fairly large arrays (close to the initial sequence length squared), so you may have to split the sequence into chunks (a sequence of 10000 works just fine on my 8GB of RAM, but 30000 is too much) and/or make the code a bit more memory efficient.
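To get a rough feel for that memory cost, here is a back-of-the-envelope sketch. It assumes the duplicated array built by the broadcasting step in the code below has (n - window_size + 2) rows of n float64 values. The helper name broadcast_bytes and the choice of window_size=100 are only illustrative, not part of the original code.

```python
import numpy as np

def broadcast_bytes(n, window_size, dtype=np.float64):
    """Bytes needed for one (n - window_size + 2, n) array of the given dtype.

    Hypothetical helper: it only mirrors the shape used by the
    broadcasting step, it does not allocate anything.
    """
    rows = n - window_size + 2
    return rows * n * np.dtype(dtype).itemsize

# window_size=100 is an assumed, illustrative value
for n in (10_000, 30_000):
    gib = broadcast_bytes(n, window_size=100) / 2**30
    print(f"n={n}: about {gib:.2f} GiB for one copy of the duplicated array")
```

Since the cost grows roughly with n squared, tripling the sequence length needs about nine times the memory, which is in line with 10000 fitting in 8GB while 30000 does not.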

import numpy as np
import pandas as pd

#example values to have an MCVE
centers =  np.array([5,6,10])
start=5
end=15
values=np.zeros(end-start)

#no need for a loop, we can use an array of indexes to assign the values
values[centers-start+1]=1

print("starting values:",values)

#here I'm choosing the window size
window_size = 4

#we start by duplicating the sequence using broadcasting with an array filled with ones
broadcasted_arr = np.ones(values.shape[0]-window_size+2)[:,None]*values[None,:]
looped_values = broadcasted_arr.ravel()

#raveled array containing the whole sequence repeated the appropriate number of times to fill our final array
print("looped values :",looped_values)

#to create our rolling window trick we fill an array with one extra column
#that way each line will "eat" a value of the sequence shifting the rest by one column

#to be able to reshape to the desired shape we need to keep the exact number of values to fill the array
size = values.shape[0]+1
cropped_looped_values = looped_values[:size*int(looped_values.shape[0]/size)]
#here's where the aforementioned trick happens
rolling_array = cropped_looped_values.reshape(-1,values.shape[0]+1)
#finally we crop the result to discard the part of the array that isn't relevant
result = rolling_array[:,:window_size]

print("final result :",result)

And here's the output:

starting values: 
[ 0.  1.  1.  0.  0.  0.  1.  0.  0.  0.]
looped values : 
[ 0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.
  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.
  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.
  1.  0.  0.  0.  1.  0.  0.  0.]
final result : 
[[ 0.  1.  1.  0.]
 [ 1.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]]
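
As a possible way around the memory issue, the same windows can be produced as views on the original array, so nothing is duplicated. This is a different technique from the reshape trick above: a minimal sketch assuming NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view is available. The step variable is an assumption added here to mirror the step of 100 from the question; the small example values are the same as above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

# same small example sequence as above
values = np.array([0., 1., 1., 0., 0., 0., 1., 0., 0., 0.])
window_size = 4
step = 1  # assumed parameter; the question would use step=100, window_size=10000

# every window of length window_size, as a read-only view (no data is copied)
all_windows = sliding_window_view(values, window_size)

# keep every step-th window; slicing a view gives another view
windows = all_windows[::step]

print(windows.shape)
print(windows)
```

Turning the result into a DataFrame or writing it to CSV will still materialise every window, so for a long sequence it is probably safer to write the rows out in batches rather than all at once.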
