I have been scratching my head pretty hard on this one. I have several text files, all in the same format:
99.00% 2874 2874 U 0 unclassified
1.00% 29 0 R 1 root
1.00% 29 0 R1 131567 cellular organisms
1.00% 29 0 D 2759 Eukaryota
1.00% 29 0 D1 33154 Opisthokonta
1.00% 29 0 K 4751 Fungi
1.00% 29 0 K1 451864 Dikarya
I want to extract the 6th column from all these files and print it to a new file.
Here is the code I have so far:
import sys
import os
import glob
# Usage: python extract_species.py path/to/folder > output.txt
def extractSpecies(fileContent, allSpecies):
    for line in fileContent.split('\n'):
        allSpecies.append(line.split('\t')[0])
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
def listdir_fullpath(d):
    return [os.path.join(d, f) for f in os.listdir(d)]
allFiles = listdir_fullpath(sys.argv[1]) # List all files in the folder provided by system arg.
# Read all files and store content in memory
filesContent = [] # a list is created with one item per file.
for filePath in allFiles:
    filesContent.append(file_get_contents(filePath))
# Extract all species and create a unique list
allSpecies = []
for fileContent in filesContent:
    extractSpecies(fileContent, allSpecies)
print(allSpecies)
But this code provides only the values of the first column of data:
99.00% 1.00% 1.00% 1.00% 1.00% 1.00% 1.00%
If I remove the [0] index in line 7 (after "allSpecies.append(line.split('\t')"), then allSpecies contains all the data in the files:
[' 99.00%', '2874', '2874', 'U', '0', 'unclassified']
[' 1.00%', '29', '0', 'R', '1', 'root']
[' 1.00%', '29', '0', 'R1', '131567', ' cellular organisms']
[' 1.00%', '29', '0', 'D', '2759', ' Eukaryota']
[' 1.00%', '29', '0', 'D1', '33154', ' Opisthokonta']
etc.
I thought I could simply replace the [0] with the index of the column I am interested in (from 1 to 5), but if I do that I get this error:
IndexError: list index out of range
This really baffles me. There must be something I don't get: how can I extract the value of the first column but not of any other column? Any suggestion is welcome at this point...
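I think you're on the right path with removing the [0]. Each item in allSpecies is then the list of fields for one line, so you can iterate through allSpecies and grab the column by its index. The IndexError most likely comes from lines that have fewer than six fields, such as the empty string that split('\n') leaves after the final newline, so skip those:
column6 = []
for x in allSpecies:
    if len(x) > 5:  # skip blank or short lines
        column6.append(x[5])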
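For reference, here is a minimal sketch of the whole extraction in one place. It assumes the files are tab-separated, as in the question, and that every regular file in the folder should be read. The names extract_column, extract_from_folder and sample are made up for this illustration and are not part of the original script.

```python
import os


def extract_column(text, index=5, sep="\t"):
    """Return the stripped value at `index` from every row that has enough fields."""
    values = []
    for line in text.splitlines():
        fields = line.split(sep)
        # Blank or short lines have too few fields; skip them instead of raising IndexError.
        if len(fields) > index:
            values.append(fields[index].strip())
    return values


def extract_from_folder(folder, index=5):
    """Collect the column from every regular file in `folder`, in sorted name order."""
    values = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path) as f:
                values.extend(extract_column(f.read(), index))
    return values


# Hypothetical two-line sample in the same layout as the question's files.
sample = " 99.00%\t2874\t2874\tU\t0\tunclassified\n  1.00%\t29\t0\tR\t1\troot\n"

# The piece after the final newline is an empty string, which splits into a single field.
print(sample.split("\n")[-1].split("\t"))
print(extract_column(sample))
```

To run it over a folder as in the original usage line, something like print("\n".join(extract_from_folder(sys.argv[1]))) (after import sys) should work. Stripping each value also removes the leading spaces visible in entries such as ' cellular organisms'.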