Extract substrings from a list of strings, where substrings are bounded by consistent characters

I have a list of lists of strings containing the taxonomies of different bacterial species. Each list has a consistent format:

['d__domain;p__phylum;c__class;o__order;f__family;g__genus;s__species','...','...']

I'm trying to pull out the genus from each string in each list to find the unique genera. To do this, my idea was to write nested for loops that split each string by ';' and use a list comprehension to search for 'g__', then lstrip off the g__ and append the associated genus name to a new, complementary list. I attempted this in the code below:

finalList = []

for i in range(32586):
    
    outputList = []
    j = 0
    for j in taxonomyData.loc[i,'GTDB Taxonomy'][j]:
        
        ## Access Taxonomy column of Pandas dataframe and split by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')
        
        ## Use list comprehension to pull out genus by g__
        genus = [x for x in taxa if 'g__' in x]
        if genus == [] :
            genus = 'None'
            
        ## lstrip off g__
        else:
            genus = genus[0].lstrip('g__')
            
            ## Append genus to new list of genera
            outputList.append(genus)
    ## Append new list of genera to larger list    
    finalList.append(outputList)
    print(finalList)
    break
    
    print(genus)

I tested this for loop and it successfully pulled the genus out of the first string of the first list, but when I let the for loop run, it skipped to the next list, leaving all the other items in the first list unprocessed. Any advice on how I can get this loop to iterate through all the strings in the first list before moving on to subsequent lists?

Solved

Final Code:

finalList = []

for i in range(32586):

    ## Access Taxonomy column of Pandas dataframe; missing values become ['None']
    if pd.isna(taxonomyData.loc[i,'GTDB Taxonomy']):
        genus_unique = ['None']
        finalList.append(genus_unique)
    else:
        ## Split the taxonomy string by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')

        ## Use a set comprehension to pull out the unique genera by g__
        genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
        genus_unique = list(genus_unique)

        ## Append new list of genera to larger list
        finalList.append(genus_unique)
print(finalList)
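
For comparison, the same extraction can probably be done without an explicit loop by using pandas string methods. The following is only a sketch under assumptions: the small taxonomyData frame is a made-up stand-in for the real one, each cell is assumed to hold a single taxonomy string (or a missing value), and the regular expression is my own choice rather than anything from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the column name
# 'GTDB Taxonomy' comes from the post, the rows are invented.
taxonomyData = pd.DataFrame({
    "GTDB Taxonomy": [
        "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
        "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
        "s__Escherichia coli",
        None,  # missing taxonomy
        "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
        "f__Bacillaceae;g__Bacillus;s__Bacillus subtilis",
        "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
        "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
        "s__Escherichia fergusonii",
    ]
})

# Capture whatever follows 'g__' up to the next ';'.
# The (?:^|;) part requires 'g__' to start a field, so a 'g__' buried
# inside another name is not picked up. Missing rows stay NaN.
genera = taxonomyData["GTDB Taxonomy"].str.extract(
    r"(?:^|;)g__([^;]*)", expand=False
)

# Drop missing values and collect the distinct genus names.
unique_genera = sorted(genera.dropna().unique())
print(unique_genera)
```

If a cell can contain more than one genus entry, Series.str.findall with the same pattern would return every match per row instead of only the first.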

Here's how you can get unique genus entries from a list with a single set comprehension:

taxa = ['d__abc', 'g__def', 'p__ghi', 'g__jkl', 'd__abc', 'g__def']
genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
print(genus_unique)

Result:

{'def', 'jkl'}

You can also convert it into a list afterwards with list(genus_unique) if you need that.
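
One reason to prefer startswith plus slicing over the lstrip('g__') call in the question: str.lstrip treats its argument as a set of characters to remove, not as a prefix. The sketch below illustrates the difference with a contrived lowercase label; strip_rank_prefix is a hypothetical helper name, not part of any library.

```python
def strip_rank_prefix(label: str, prefix: str = "g__") -> str:
    """Return label without the leading prefix, or unchanged if absent."""
    if label.startswith(prefix):
        return label[len(prefix):]
    return label


# Contrived example: a name that itself begins with 'g'.
label = "g__gemmata"

# lstrip removes every leading 'g' and '_' character, so the first
# letter of the name is lost as well.
print(label.lstrip("g__"))        # emmata

# Slicing after a startswith check removes exactly the prefix.
print(strip_rank_prefix(label))   # gemmata
```

On Python 3.9 or later, label.removeprefix('g__') should give the same result as the helper. With capitalized genus names the lstrip version usually happens to work, which can hide the problem.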


 