
Iteratively transposing and appending to a DataFrame for conditional checking using Python

I have been working on this problem for a while and would like some advice from the community. I have a string of 150k characters, all A, G, C, or T (genomic data). I split it into 1000-character chunks as follows:

substrings = []

n = 1000 #length
y = 0 #interval

for i in range(0,len(reference_string),n):
    reference_substrings = reference_string[i+y:i+n]
    substrings.append(reference_substrings)

Similarly, I then split each chunk further into pieces of 40 characters, with the goal of transposing them into columns, so that each 1000-character chunk above makes up one row of my data frame and each column holds a 40-character substring.

# goes through 1000 substrings and breaks down into 25 set of 40 char substrings 
#split genome into equal parts of 40 char = 25 col 
#per 1000 chars
seeds = []
n = 40 #seed length
y = 0 # seed interval 
for j in range(0, len(substrings)):
    for i in range(0, len(genome_string), n):
        reference_seeds = genome_string[i+y:i+n]
        seeds.append(reference_seeds)

The issue I am having is producing either a list of lists or a new list for each chunk iteratively. I have tried similar techniques, iterating through the list or first loading the list into a data frame, but I always end up with either one large string of 25*40 characters or 25 identical columns.

Any directions would be greatly appreciated. Thanks

You can try this:

def chunk_large_string(init_str, first_split_size=1000, second_split_size=40):

    # Split the initial string in substrings of size first_split_size
    substrs_list = [init_str[i:i + first_split_size] for i in range(0, len(init_str), first_split_size)]

    # Split each substring in subsubstrings of size second_split_size
    final_list = [[substr[i:i + second_split_size] for i in range(0, len(substr), second_split_size)] for substr in substrs_list]

    return final_list

In your case, this function returns a list of 150 elements; each element is itself a list of 25 elements, and each of those is a string of 40 characters.
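
For comparison, the nested loop in the question slices `genome_string` instead of `substrings[j]` and appends every piece to a single flat `seeds` list, which is probably why the result was either one long run or repeated columns. Below is a minimal sketch of the same idea written as explicit loops. The name `chunk_rows` and the randomly generated `reference_string` are stand-ins for illustration, not part of the original post.

```python
import random


def chunk_rows(sequence, row_size=1000, seed_size=40):
    """Return a list of rows; each row is a list of seed_size-character pieces."""
    rows = []
    for start in range(0, len(sequence), row_size):
        row_string = sequence[start:start + row_size]
        # Slice the current row (not the whole sequence) and build a new list per row.
        row = [row_string[i:i + seed_size] for i in range(0, len(row_string), seed_size)]
        rows.append(row)
    return rows


# Hypothetical stand-in for the 150k-character reference string.
random.seed(0)
reference_string = "".join(random.choice("AGCT") for _ in range(150_000))

rows = chunk_rows(reference_string)
print(len(rows), len(rows[0]), len(rows[0][0]))  # 150 25 40
```

If pandas is installed, passing such a list of lists to `pandas.DataFrame(rows)` should give one row per 1000-character chunk and 25 columns. Note that if the string length is not a multiple of 1000, the last row will contain fewer or shorter pieces.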
