I have been working on this problem for a while and would like some advice from the community. I have a string of 150k characters, all AGCT (genomic data). I used the following to split it into 1000-character chunks:
substrings = []
n = 1000  # length
y = 0  # interval
for i in range(0, len(reference_string), n):
    reference_substrings = reference_string[i+y:i+n]
    substrings.append(reference_substrings)
Similarly, I then split these strings further into chunks of 40 characters, with the goal of turning each chunk into a column, so that each 1000-character string above makes up one row in my data frame, with each column containing a 40-character substring.
# goes through 1000 substrings and breaks down into 25 set of 40 char substrings
# split genome into equal parts of 40 char = 25 col
# per 1000 chars
seeds = []
n = 40  # seed length
y = 0  # seed interval
for j in range(0, len(substrings)):
    for i in range(0, len(genome_string), n):
        reference_seeds = genome_string[i+y:i+n]
        seeds.append(reference_seeds)
The issue I am having is producing either a list of lists or a new list for each chunk iteratively. I have tried similar techniques, iterating through the list or first loading the list into a data frame, but I always end up with either one large string of 25*40 characters or 25 identical columns.
Any directions would be greatly appreciated. Thanks
You can try this:
def chunk_large_string(init_str, first_split_size=1000, second_split_size=40):
    # Split the initial string in substrings of size first_split_size
    substrs_list = [init_str[i:i + first_split_size] for i in range(0, len(init_str), first_split_size)]
    # Split each substring in subsubstrings of size second_split_size
    final_list = [[substr[i:i + second_split_size] for i in range(0, len(substr), second_split_size)] for substr in substrs_list]
    return final_list
In your case, this function returns a list of 150 elements; each element is itself a list of 25 elements, and each of those is a string of 40 characters.
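As a quick sanity check, here is a minimal sketch that runs the same chunking on a made-up 150,000-character AGCT string and confirms the shape. The random test string and the name `rows` are assumptions for illustration only; the function is repeated so the snippet runs on its own.

```python
import random


def chunk_large_string(init_str, first_split_size=1000, second_split_size=40):
    # First pass: cut the full string into pieces of first_split_size
    substrs_list = [init_str[i:i + first_split_size]
                    for i in range(0, len(init_str), first_split_size)]
    # Second pass: cut each piece into pieces of second_split_size
    return [[s[i:i + second_split_size]
             for i in range(0, len(s), second_split_size)]
            for s in substrs_list]


# Hypothetical stand-in for the real genome: 150,000 random bases
random.seed(0)
reference_string = "".join(random.choice("AGCT") for _ in range(150_000))

rows = chunk_large_string(reference_string)

# Number of rows, columns per row, characters per cell
print(len(rows), len(rows[0]), len(rows[0][0]))  # 150 25 40

# Joining everything back together should reproduce the input exactly
assert "".join("".join(row) for row in rows) == reference_string
```

Because each inner list holds the 25 pieces of one 1000-character chunk, passing the list of lists to `pandas.DataFrame(rows)` should give one row per chunk and one column per 40-character piece, assuming pandas is available. If the string length is not a multiple of 1000, the last row will be shorter than the others.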