
How to not include '\n' and the next index entry while appending values from a certain index to a list

I have a data set that is in the format

100    domain    bacteria    phylum    chloroflexi    genus    caldilinea


200    domain    bacteria    phylum    acuuhgsdiuh    genus    blahblahbl


300

Basically, what I have been trying to do is write a function that scans through the tab-separated entries and, when it finds the desired entry, appends the entry after it to a list (e.g. search for 'domain', append 'bacteria'). What I have works, except for the last entry on a row: when I search for 'genus', it appends 'caldilinea\n\n200'. That makes sense, because there are line breaks after it, but I don't know how to make it append only the last entry ('caldilinea' in this case) instead of the last entry plus the line breaks plus the first entry of the row beneath it.

Here is my code as of now:

in_file = open(input_file,'r')
lines = in_file.read()
segment_tab = lines.split('\t')

next_index = [segment_tab[position + 1] for position, entry in enumerate(segment_tab) if entry == 'genus']

When I print next_index, it should give me

'caldilinea','blahblahbl'

but instead it gives me

'caldilinea\n\n200','blahblahbl\n\n300'

My real data is a lot more complex than this and has hundreds of rows.

How can I get it to exclude the line breaks and the first entry of the next row?

You should either split by lines and then split by tabs, or simultaneously split by both.

The former could be done like this:

lines = in_file.readlines()
segment_tab = [line.split('\t') for line in lines]

More idiomatic would be something like:

segment_tab = [line.split('\t') for line in in_file]

Note that this will give you a list of lists of strings, not just a list of strings. This is different than what you seem to expect, but is the more conventional approach.
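To connect the per-line split back to the original goal, here is a minimal sketch that collects the entry following 'genus' on each row. The helper name entries_after and the in-memory sample (standing in for the real file) are assumptions made for illustration, not part of the question's code.

```python
import io

# Sample data standing in for the question's file:
# tab-separated fields, with blank lines between rows.
sample = (
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n"
    "\n\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n"
    "\n\n"
    "300\n"
)


def entries_after(in_file, key):
    """Return the field that follows `key` on every row of `in_file`."""
    found = []
    for line in in_file:
        # Drop the trailing newline before splitting, so the last
        # field on the row stays clean.
        fields = line.rstrip('\n').split('\t')
        # Skip the last field: nothing follows it on the same row.
        for position, entry in enumerate(fields[:-1]):
            if entry == key:
                found.append(fields[position + 1])
    return found


next_index = entries_after(io.StringIO(sample), 'genus')
print(next_index)  # ['caldilinea', 'blahblahbl']
```

Because each row is split on its own, a match can never pick up a value from the row beneath it, and blank lines simply produce no matches.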

The other approach is to split by both, like this:

lines = in_file.read()
segment_tab = re.split(r'\t|\n+', lines)

This is kind of unconventional (it treats groups of newlines just like a tab), but seems to be what you're asking for.

Note that you'll need to import re for this to work.
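Put together with the list comprehension from the question, the split-by-both approach might look like the sketch below. The inline sample string is an assumption standing in for in_file.read(); the slice segment_tab[:-1] is added so that a match in the very last position cannot raise an IndexError.

```python
import re

# Stand-in for in_file.read(): the whole file as one string.
lines = (
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n\n\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n\n\n"
    "300\n"
)

# Split on a single tab or on any run of newlines.
segment_tab = re.split(r'\t|\n+', lines)

# Same comprehension as in the question, guarded against a match
# at the final position.
next_index = [
    segment_tab[position + 1]
    for position, entry in enumerate(segment_tab[:-1])
    if entry == 'genus'
]
print(next_index)  # ['caldilinea', 'blahblahbl']
```

One caveat of the flat list: if the search key were ever the last field on a row, the entry after it would be the first field of the next row, which the per-line approach avoids.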

with open(input_file, 'r') as in_file:  # input_file is the path variable from the question
    for line in in_file:
        segment_tab = line.strip().split('\t')

For the first line, this gives you segment_tab = ['100', 'domain', 'bacteria', 'phylum', 'chloroflexi', 'genus', 'caldilinea'], and likewise for each following line (a blank line gives ['']). Is this good enough?
