
Python: extract a column of data from several text files

I have been scratching my head pretty hard on this one. I have several text files, all in the same format:

   99.00%   2874    2874    U   0   unclassified
  1.00% 29  0   R   1   root
  1.00% 29  0   R1  131567    cellular organisms
  1.00% 29  0   D   2759        Eukaryota
  1.00% 29  0   D1  33154         Opisthokonta
  1.00% 29  0   K   4751            Fungi
  1.00% 29  0   K1  451864            Dikarya

I want to extract the 6th column from all these files and print it to a new file.

Here is the code I have so far:

import sys
import os
import glob

# Usage: python extract_species.py path/to/folder > output.txt

def extractSpecies(fileContent, allSpecies):
    for line in fileContent.split('\n'):
        allSpecies.append(line.split('\t')[0])

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

def listdir_fullpath(d):
    return [os.path.join(d, f) for f in os.listdir(d)]

allFiles = listdir_fullpath(sys.argv[1]) # List all files in the folder provided by system arg.

# Read all files and store content in memory
filesContent = [] # a list is created with one item per file.
for filePath in allFiles:
    filesContent.append(file_get_contents(filePath))

# Extract all species and create a unique list
allSpecies = []
for fileContent in filesContent:
    extractSpecies(fileContent, allSpecies)

print(allSpecies)

But this code returns only the values of the first column of data:

99.00%   1.00%   1.00%   1.00%   1.00%   1.00%   1.00%

If I remove the [0] index from "allSpecies.append(line.split('\t')[0])" in extractSpecies, then the object allSpecies contains all the data in the files:

[' 99.00%', '2874', '2874', 'U', '0', 'unclassified'] ['  1.00%', '29', '0', 'R', '1', 'root'] ['  1.00%', '29', '0', 'R1', '131567', '  cellular organisms'] ['  1.00%', '29', '0', 'D', '2759', '    Eukaryota'] ['  1.00%', '29', '0', 'D1', '33154', '      Opisthokonta'] etc

I thought I could simply replace the [0] with the index of the column I am interested in (from 1 to 5), but no, if I do that I get an error saying:

IndexError: list index out of range

Which really baffles me. There must be something I really don't get: how can I extract the value of the first column but not of any other column? Any suggestion is welcome at this point...

I think you're on the right track with removing the [0]. You can then iterate through allSpecies and grab the column you want by index.

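column6 = []
for row in allSpecies:
    # Each row is the list of fields from one line; a blank line gives [''], so skip short rows.
    if len(row) > 5:
        column6.append(row[5])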
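The IndexError in the question most likely comes from blank lines: when a file ends with a newline, fileContent.split('\n') yields a final empty string, and ''.split('\t') returns [''], which has an index 0 but no index 5. Below is a minimal sketch of the whole script with that case handled. It assumes the files are tab-separated, as the question's code implies; the helper names extract_column and extract_from_folder are made up for illustration, and the strip() call removes the leading indentation of the taxon names, which you may prefer to keep.

```python
import os
import sys


def extract_column(text, index=5, sep="\t"):
    """Return field `index` (0-based) from every line of `text` that has one."""
    values = []
    for line in text.splitlines():
        fields = line.split(sep)
        # A blank line splits into [''], which has no index 5, so skip it.
        if len(fields) > index:
            values.append(fields[index].strip())
    return values


def extract_from_folder(folder, index=5):
    """Collect the chosen column from every regular file in `folder`."""
    result = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path) as handle:
                result.extend(extract_column(handle.read(), index))
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract_species.py path/to/folder > output.txt
    print("\n".join(extract_from_folder(sys.argv[1])))
```

Printing one value per line, rather than printing the list object, gives an output file that is easier to read or feed into other tools.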

