How to not include '\n' and the next index entry while appending values from a certain index to a list

Question

I have a data set that is in the format

100    domain    bacteria    phylum    chloroflexi    genus    caldilinea


200    domain    bacteria    phylum    acuuhgsdiuh    genus    blahblahbl


300

Basically, what I have been trying to do is create a function that scans through the tab-separated entries and, when it finds the desired entry, appends the entry after it to a list (e.g. search for 'domain', append 'bacteria'). What I have works, except for the last entry: when I search for 'genus', it appends 'caldilinea\n\n200'. That makes sense because there are line breaks after it, but I don't know how to make it append only the last entry ('caldilinea' in this case) instead of the last entry + the line breaks + the first entry of the row beneath it.

Here is my code as of now:

in_file = open(input_file,'r')
lines = in_file.read()
segment_tab = lines.split('\t')

next_index = [segment_tab[position + 1] for position, entry in enumerate(segment_tab) if entry == 'genus']

When I print next_index, it should give me

'caldilinea','blahblahbl'

but instead it is giving me

'caldilinea\n\n200','blahblahbl\n\n300'

My data is a lot more complex than this and has hundreds of rows.

How can I get it to not include the line breaks and the first entry of the next row?

Answer 1

You should either split by lines and then split by tabs, or simultaneously split by both.

The former could be done like this:

lines = in_file.readlines()
segment_tab = [line.split('\t') for line in lines]

More idiomatic would be something like:

segment_tab = [line.split('\t') for line in in_file]

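Note that this will give you a list of lists of strings, not just a list of strings. This is different from what you seem to expect, but it is the more conventional approach. Also note that the last string in each inner list keeps its trailing '\n' unless you strip it before splitting, for example with line.rstrip('\n').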
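To make the line-then-tab split concrete, here is a minimal sketch applied to the search from the question. It assumes the sample rows from the question, uses io.StringIO as a stand-in for the real file, and the name rows is only illustrative.

```python
import io

# Stand-in for the real file; the sample rows are taken from the question.
sample = (
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n\n"
    "300\n"
)
in_file = io.StringIO(sample)

# One list of fields per non-blank line. rstrip('\n') removes the line
# ending that would otherwise stay attached to the last field.
rows = [line.rstrip('\n').split('\t') for line in in_file if line.strip()]

# Within each row, take the field that follows 'genus'.
# row[:-1] skips the last field, which has nothing after it.
next_index = [
    row[position + 1]
    for row in rows
    for position, entry in enumerate(row[:-1])
    if entry == 'genus'
]
print(next_index)  # ['caldilinea', 'blahblahbl']
```

Because each row is searched on its own, a match can never pick up a value from the following row.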

The other approach is to split by both, like this:

lines = in_file.read()
segment_tab = re.split(r'\t|\n+', lines)

This is kind of unconventional (it treats groups of newlines just like a tab), but seems to be what you're asking for.

Note that you'll need to import re for this to work.
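As a rough sketch of this second approach, the snippet below runs re.split on a string that mimics the question's sample data and then reuses the list comprehension from the question unchanged. The inline sample string is an assumption standing in for in_file.read().

```python
import re

# Stand-in for in_file.read(); the sample rows are taken from the question.
lines = (
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n\n"
    "300\n"
)

# Split on a single tab or on any run of newlines, giving one flat list.
# A trailing newline leaves an empty string as the last element.
segment_tab = re.split(r'\t|\n+', lines)

# Same comprehension as in the question.
next_index = [
    segment_tab[position + 1]
    for position, entry in enumerate(segment_tab)
    if entry == 'genus'
]
print(next_index)  # ['caldilinea', 'blahblahbl']
```

The flat list keeps the original comprehension working, at the cost of losing the row boundaries: if 'genus' were the last field of a row, the value picked up would be the first field of the next row.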

Answer 2

with open(input_file, 'r') as in_file:  # input_file is the path variable from the question
    for line in in_file:
        segment_tab = line.strip().split('\t')

For the first line, this will give you segment_tab = ['100', 'domain', 'bacteria', 'phylum', 'chloroflexi', 'genus', 'caldilinea'], and likewise for each following line (a blank line gives ['']). Is this good enough?
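One possible way to finish this loop is to wrap it in a small helper that collects the field following a given key on every line. The function name values_after and the inline sample data are hypothetical, introduced here only for illustration.

```python
def values_after(lines, key):
    """Return the field that follows `key` on each line that contains it."""
    found = []
    for line in lines:
        # strip() removes the line ending, so the last field is clean.
        segment_tab = line.strip().split('\t')
        # Ignore a key in the last position, since nothing follows it.
        if key in segment_tab[:-1]:
            found.append(segment_tab[segment_tab.index(key) + 1])
    return found


# Sample rows taken from the question; a real call would pass an open file.
sample = [
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n",
    "\n",
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n",
    "\n",
    "300\n",
]
print(values_after(sample, 'genus'))   # ['caldilinea', 'blahblahbl']
print(values_after(sample, 'domain'))  # ['bacteria', 'bacteria']
```

Note that strip() also removes leading and trailing tabs, so this sketch assumes no row starts or ends with an empty field. Only the first occurrence of the key on each line is used.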

2 answers

solution1
2 ACCEPTED 2012-03-07 03:15:30

solution2
0 2012-03-07 03:25:34
