I have a data set that is in the format
100 domain bacteria phylum chloroflexi genus caldilinea
200 domain bacteria phylum acuuhgsdiuh genus blahblahbl
300
Basically, I have been trying to write a function that scans through the tab-separated fields and, when it finds the desired entry, appends the entry after it to a list (e.g. search for 'domain', append 'bacteria'). What I have works, except for the last entry on each row: when I search for 'genus' it appends 'caldilinea\n\n200'. That makes sense, because the value is followed by line breaks rather than a tab, but I don't know how to make it append only the last field ('caldilinea' in this case) instead of the last field + line breaks + the first field of the row beneath it.
Here is my code as of now:
in_file = open(input_file,'r')
lines = in_file.read()
segment_tab = lines.split('\t')
next_index = [segment_tab[position + 1] for position, entry in enumerate(segment_tab) if entry == 'genus']
When I print next_index it should give me
'caldilinea','blahblahbl'
but instead it is giving me
'caldilinea\n\n200','blahblahbl\n\n300'
My real data is a lot more complex than this and has hundreds of rows.
How can I get it to exclude the line breaks and the first field of the next row?
You should either split by lines and then split by tabs, or simultaneously split by both.
The former could be done like this:
lines = in_file.readlines()
# rstrip('\n') drops the trailing line break so it does not stick to the last field
segment_tab = [line.rstrip('\n').split('\t') for line in lines]
More idiomatic would be something like:
segment_tab = [line.rstrip('\n').split('\t') for line in in_file]
Note that this will give you a list of lists of strings, not just a list of strings. This is different than what you seem to expect, but is the more conventional approach.
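To connect this back to the original goal (collecting the value after 'genus'), here is a minimal sketch of the per-line approach. The helper name values_after and the in-memory sample are made up for illustration; the sample assumes the rows are tab-separated and divided by blank lines, as the output in the question suggests.

```python
import io

def values_after(lines, key):
    """Collect the field that follows `key` on each tab-separated line."""
    found = []
    for line in lines:
        # Drop the trailing line break before splitting on tabs.
        fields = line.rstrip('\n').split('\t')
        # Skip the last field: nothing follows it on this line.
        for position, entry in enumerate(fields[:-1]):
            if entry == key:
                found.append(fields[position + 1])
    return found

# Hypothetical stand-in for the real file.
sample = io.StringIO(
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n"
    "\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n"
    "\n"
    "300\n"
)
genera = values_after(sample, 'genus')
print(genera)
```

Because each line is handled on its own, a value can never pick up the start of the next row, and blank lines simply produce no matches.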
The other approach is to split by both, like this:
lines = in_file.read()
segment_tab = re.split(r'\t|\n+', lines)
This is kind of unconventional (it treats groups of newlines just like a tab), but seems to be what you're asking for.
Note that you'll need to import re for this to work.
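As a rough, self-contained sketch of the split-by-both approach, the following runs the regular expression over a sample string and then applies the list comprehension from the question. The sample text is an assumption that mimics the data shown above (tab-separated fields, rows divided by blank lines).

```python
import re

# Hypothetical stand-in for in_file.read().
text = (
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n"
    "\n"
    "200\tdomain\tbacteria\tphylum\tacuuhgsdiuh\tgenus\tblahblahbl\n"
    "\n"
    "300\n"
)

# Split on a single tab or on any run of newlines.
segment_tab = re.split(r'\t|\n+', text)

# Same comprehension as in the question; [:-1] guards against an
# IndexError if the key happens to be the very last field.
next_index = [
    segment_tab[position + 1]
    for position, entry in enumerate(segment_tab[:-1])
    if entry == 'genus'
]
print(next_index)
```

One thing to keep in mind with this version: the result is a single flat list, so the row boundaries are lost, and a text ending in a newline leaves an empty string as the final element.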
for line in open(input_file, 'r'):
    segment_tab = line.strip().split('\t')
For the first line this will give you segment_tab = ['100', 'domain', 'bacteria', 'phylum', 'chloroflexi', 'genus', 'caldilinea'], and a list of the same shape for each following line. Is this good enough?
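If the rows really do follow the pattern "id, then alternating label/value pairs" (an assumption based on the sample in the question), each per-line list could be turned into a dictionary, so the value can be looked up by label instead of by position. The function name parse_row is hypothetical.

```python
def parse_row(line):
    """Split one row into its leading id and a {label: value} dict."""
    fields = line.strip().split('\t')
    row_id, rest = fields[0], fields[1:]
    # Pair up label/value fields: rest[0] with rest[1], rest[2] with rest[3], ...
    return row_id, dict(zip(rest[0::2], rest[1::2]))

row_id, ranks = parse_row(
    "100\tdomain\tbacteria\tphylum\tchloroflexi\tgenus\tcaldilinea\n"
)
print(row_id, ranks.get('genus'))
```

Using ranks.get('genus') returns None for rows such as "300" that carry no genus entry, rather than raising an error.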