I have a very long list of sequences (say, of length 16 each) consisting of 0s and 1s, e.g.:
s = ['0100100000010111', '1100100010010101', '1100100000010000', '0111100011110111', '1111100011010111']
Now I want to treat each bit as a feature, so I need to convert the list into a NumPy array or a pandas DataFrame. To do that I would have to comma-separate all the bits in each sequence, which is impractical to do by hand for big datasets.
So what I have tried is to generate all the positions in the string:
slices = []
for j in range(len(s[0])):
    slices.append((j, j+1))
print(slices)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16)]
new = []
for i in range(len(s)):
    seq = s[i]
    for j in range(len(s[i])):
        ## I have tried both of these LOC but couldn't figure out
        ## how it could be done
        new.append([s[slice(*slc)] for slc in slices])
        new.append(s[j:j+1])
print(new)
Expected output:
new = [[0,1,0,0,1,0,0,0,0,0,0,1,0,1,1,1], [1,1,0,0,1,0,0,0,1,0,0,1,0,1,0,1], [1,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0], [0,1,1,1,1,0,0,0,1,1,1,1,0,1,1,1], [1,1,1,1,1,0,0,0,1,1,0,1,0,1,1,1]]
Thanks in advance!!
Using the np.array constructor and a list comprehension:
np.array([list(row) for row in s], dtype=int)
array([[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1]])
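Since the question also mentions pandas, the same array can be wrapped in a DataFrame with one column per bit. This is only a minimal sketch: it assumes every string has the same length, and the column names bit_0 ... bit_15 are made up for illustration rather than taken from the original post.

```python
import numpy as np
import pandas as pd

s = ['0100100000010111', '1100100010010101', '1100100000010000',
     '0111100011110111', '1111100011010111']

# One row per sequence, one integer column per character.
arr = np.array([list(row) for row in s], dtype=int)

# Hypothetical column names, chosen only for this example.
columns = [f"bit_{i}" for i in range(arr.shape[1])]
df = pd.DataFrame(arr, columns=columns)

print(df.shape)
print(df.head())
```

Each column can then be used as a separate feature, for example df['bit_0'].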
In one line, without for loops:
np.array(s).view('<U1').astype(int).reshape(len(s), -1)
array([[0, 1, 0, ..., 1, 1, 1],
       [1, 1, 0, ..., 1, 0, 1],
       [1, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])
This is still a bit slower than the list comprehension, though.
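To see what the one-liner does, it may help to split it into its intermediate steps. The sketch below is an illustration only: the function names bits_listcomp and bits_view are hypothetical, it assumes all strings have the same length and contain only '0' and '1', and the '<U1' view assumes a little-endian machine, as in the answer. It also times both versions with timeit so the speed remark can be checked on your own data; results will vary with machine and input size.

```python
import timeit
import numpy as np

s = ['0100100000010111', '1100100010010101', '1100100000010000',
     '0111100011110111', '1111100011010111']

def bits_listcomp(seqs):
    # Split every string into a list of characters, then cast to int.
    return np.array([list(row) for row in seqs], dtype=int)

def bits_view(seqs):
    a = np.array(seqs)         # 1-D array of fixed-width strings, e.g. dtype '<U16'
    chars = a.view('<U1')      # same memory reinterpreted as single characters
    ints = chars.astype(int)   # '0' -> 0, '1' -> 1
    return ints.reshape(len(seqs), -1)  # back to one row per sequence

# Both approaches should give the same matrix.
same = np.array_equal(bits_listcomp(s), bits_view(s))
print("same result:", same)

# Rough timing; no particular outcome is assumed here.
t_list = timeit.timeit(lambda: bits_listcomp(s), number=200)
t_view = timeit.timeit(lambda: bits_view(s), number=200)
print(f"list comprehension: {t_list:.4f}s, view: {t_view:.4f}s")
```

With only five short strings the timing is dominated by fixed overhead, so it is worth repeating the comparison on a list of realistic size before choosing one approach.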