
Read and export a single column from a tab-separated file in Python

I have many large tab-separated files saved as .txt, each with seven columns and the following headers:

#column_titles = ["col1", "col2", "col3", "col4", "col5", "col6", "text"]    

I would like to extract just the final column, named text, and save it to a new file, with each row of the output corresponding to a row of the original file. All values are strings.

EDIT: This is not a duplicate of a similar problem, as splitlines() was not necessary in my case. Only the order of operations needed to be improved.

Based on several other posts, here is my current attempt:

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    reader = csv.reader(to_read, delimiter = "\t")

    desired_column = [6]        # text column

    for row in reader:
    myColumn = list(row[i] for i in desired_column)

with open(output_file, "wb") as tmp_file:
    writer = csv.writer(tmp_file)

for row in myColumn:
    writer.writerow(row)

What I am getting is just the text field from the 2624th row of my input file, with each letter of that string separated out:

H,o,w, ,t,h,e, ,t,e,a,m, ,d,i,d, ,T,h,u,r,s,d,a,y, ,-, ,s,e,e, ,h,e,r,e

I know very little in the world of programming is random, but this is definitely strange!

This post is pretty similar to my needs, but it leaves out the writing and saving parts, which I am also unsure about.

I have looked into using pandas (as suggested in one of the links above), but I am unable to use it with my Python installation, so please only suggest solutions using csv or other built-in modules!

I would go for this simple solution:

    text_strings = []  # list to store the last-column text
    with open('my_file') as ff:
        ss = ff.readlines()  # read all lines into a list of strings

    for s in ss:
        # strip the line ending, then keep the last tab-separated field
        text_strings.append(s.rstrip('\n').split('\t')[-1])

    with open('out_file', 'w') as outf:  # 'w' is needed to open the file for writing
        outf.write('\n'.join(text_strings))  # write everything to the output file

Using a list comprehension, you can build text_strings from the last column of each line in ss faster and in one line:

    text_strings = [k.rstrip("\n").split("\t")[-1] for k in ss]

Other simplifications are possible, but you get the idea.
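For large files, the same split-based idea can be applied one line at a time instead of reading everything with readlines(). The following is only a sketch: the names last_column and export_last_column are made up for illustration, and UTF-8 encoding is an assumption about the input files.

```python
def last_column(lines):
    """Yield the last tab-separated field of each line, without its line ending."""
    for line in lines:
        yield line.rstrip("\r\n").split("\t")[-1]


def export_last_column(in_path, out_path):
    """Copy the last column of in_path to out_path, one value per line."""
    # encoding="utf-8" is an assumption; adjust it to match the real files
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for value in last_column(src):
            dst.write(value + "\n")
```

Because a file object is iterated lazily, only one line is held in memory at a time.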

The problem in your code is in these two lines:

        for row in reader:
        myColumn = list(row[i] for i in desired_column)

First, the loop body is not indented, so nothing happens inside the loop. On my computer this actually raises an error, so it may just be a typo in the post. Even with the indentation fixed, though, each iteration of the for loop overwrites myColumn with the value from the new row, so in the end you are left with the string from the last row of the file. Second, list applied to a string converts the string to a list of characters, and writer.writerow(row) iterates over a string in the same way:

    In [5]: s = 'AAAA'

    In [6]: list(s)
    Out[6]: ['A', 'A', 'A', 'A']

which is exactly what you see in the output.
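The effect is easy to reproduce in isolation. This small sketch (Python 3; the helper name written is made up for illustration) compares what csv.writer emits for a bare string and for a one-element list:

```python
import csv
import io


def written(row):
    """Return the text csv.writer produces for a single writerow() call."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


as_string = written("AAAA")    # a bare string is iterated character by character
as_list = written(["AAAA"])    # a one-element list becomes a single field

print(repr(as_string))  # 'A,A,A,A\n'
print(repr(as_list))    # 'AAAA\n'
```

So wrapping the string in a list before calling writerow keeps it as one field.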

You must process the file one row at a time: read, parse and write.

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    # Python 3: open in text mode with newline=""; the original "wb" only works on Python 2
    with open(output_file, "w", newline="") as tmp_file:
        reader = csv.reader(to_read, delimiter = "\t")
        writer = csv.writer(tmp_file)

        desired_column = [6]        # text column

        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it
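Note that csv.writer with its default settings writes comma-separated output and quotes any field that contains a comma or a double quote. If the goal is a plain text file with one string per line, a variant along the following lines may be closer to what is wanted. It is a sketch for Python 3, not a drop-in replacement: the function name export_column is made up, and both the UTF-8 encoding and the use of csv.QUOTE_NONE (treating double quotes in the text as ordinary characters) are assumptions about the data.

```python
import csv


def export_column(in_path, out_path, index=6):
    """Write field `index` of every tab-separated row to out_path, one per line."""
    # encoding="utf-8" is an assumption; adjust it to match the real files
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        # QUOTE_NONE (assumption): double quotes are kept as literal characters
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:               # read one row at a time
            if len(row) > index:         # skip blank or short rows
                dst.write(row[index] + "\n")
```

With the file names from the question, this would be called as export_column(input_file, output_file).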
