Extracting data from specific columns of a CSV file in Python

Question

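I need some quick help with reading a CSV file using Python and storing it in a "data-type" file, so that I can use the data for plotting graphs once all the data has been stored in separate files.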

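I have searched for this, but in every case I found, the data had a header. My data has no header row. It is tab-separated, and I only need to store specific columns of it. For example: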

12345601 2345678@abcdef 1 2 365 places

In this case, for example, I only want to store "2345678@abcdef" and "365" in a new Python file so that I can use them later to create a graph.

Also, I have more than one CSV file in a folder, and I need to do this for each of them. The sources I found do not cover this and only mention:

# open csv file
with open(csv_file, 'rb') as csvfile:

Could someone point me to an already answered question or help me out with this?

Answer 1

... storing it in a PY file to use the data to graph after storing all the data in different files ...

... I only want to store "2345678@abcdef" and "365" in a new Python file ...

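Are you sure you want to store your data in a Python file? Python files are supposed to hold Python code, and they should be executable by the Python interpreter. It would be a better idea to store your data in a data-type file (say, preprocessed_data.csv).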

To get a list of files matching a pattern, you can use Python's built-in glob library.

Here is an example of how you could read multiple CSV files in a directory and extract the desired columns from each one:

import glob

# indices of columns you want to preserve
desired_columns = [1, 4]
# change this to the directory that holds your data files
csv_directory = '/path/to/csv/files/*.csv'

# iterate over files holding data
extracted_data = []
for file_name in glob.glob(csv_directory):
    with open(file_name, 'r') as data_file:
        for line in data_file:
            # splits the line by whitespace (tabs included)
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            # only grab the columns we care about
            desired_data = [tokens[i] for i in desired_columns]
            extracted_data.append(desired_data)
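
Since the question says the data is tab-separated, splitting on any whitespace could break a field that itself contains a space. As a rough sketch of an alternative (not part of the original answer), the standard csv module can split strictly on tabs. The helper name extract_columns and the in-memory sample are assumptions made for illustration only:

```python
import csv
import io

# Hypothetical helper: keep only the chosen column indices from each row.
def extract_columns(lines, columns, delimiter='\t'):
    rows = []
    for fields in csv.reader(lines, delimiter=delimiter):
        # skip blank lines and rows too short to hold every wanted column
        if len(fields) <= max(columns):
            continue
        rows.append([fields[i] for i in columns])
    return rows

# In-memory sample standing in for one tab-separated input file
sample = io.StringIO("12345601\t2345678@abcdef\t1\t2\t365\tplaces\n")
rows = extract_columns(sample, [1, 4])
print(rows)
```

With a real file you would pass the object returned by open(file_name, 'r', newline='') instead of the StringIO sample.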

Writing the extracted data to a new file is easy. The following example shows how you might save the data to a CSV file.

output_string = ''
for row in extracted_data:
    output_string += ','.join(row) + '\n'

with open('./preprocessed_data.csv', 'w') as csv_file:
    csv_file.write(output_string)
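
Joining with ',' by hand works as long as no value contains a comma or a quote. If that might not hold for your data, a possible variation (an assumption on my part, not something the original answer used) is to let csv.writer handle the quoting. The write_rows helper and the sample rows are made up for the demonstration:

```python
import csv
import io

# Hypothetical helper: write rows through csv.writer so values are quoted when needed.
def write_rows(rows, out_file):
    writer = csv.writer(out_file)
    writer.writerows(rows)

# The second row contains a comma inside a value to show the quoting
extracted_data = [['2345678@abcdef', '365'], ['a,b', '200']]

# A StringIO buffer stands in for open('./preprocessed_data.csv', 'w', newline='')
buffer = io.StringIO()
write_rows(extracted_data, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that csv.writer ends each row with '\r\n' by default, which is why real files should be opened with newline=''.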

Edit:

If you don't want to combine all the CSV files, here is a version that processes one file at a time:

def process_file(input_path, output_path, selected_columns):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])

    output_string = ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)

# whenever you need to process a file:
process_file(
    '/path/to/input.csv', 
    '/path/to/processed/output.csv',
    [1, 4])

# if you want to process every file in a directory:
target_directory = '/path/to/my/files/*.csv'
for file in glob.glob(target_directory):
    process_file(file, file + '.out', [1, 4])

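Edit 2:

The following example processes every file in a directory and writes the results to a similarly named output file in another directory: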

import os
import glob

input_directory = '/path/to/my/files/*.csv'
output_directory = '/path/to/output'
for file in glob.glob(input_directory):
    file_name = os.path.basename(file) + '.out'
    out_file = os.path.join(output_directory, file_name)
    process_file(file, out_file, [1, 4])
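
One thing to watch for: open(..., 'w') fails if the output directory does not exist yet. A small sketch of how the path building could create it first, assuming a hypothetical helper named build_output_path (it is demonstrated in a temporary directory so nothing is written to your real paths):

```python
import os
import tempfile

# Hypothetical helper: create the output directory if needed and
# return the path of the '.out' file that belongs to input_path.
def build_output_path(input_path, output_directory, suffix='.out'):
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(output_directory, exist_ok=True)
    file_name = os.path.basename(input_path) + suffix
    return os.path.join(output_directory, file_name)

# Demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'output')
    out_path = build_output_path('/path/to/my/files/a.csv', target)
    created = os.path.isdir(target)
    out_name = os.path.basename(out_path)

print(out_name)
```

Inside the loop above, the returned path would be passed to process_file as the output path.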

If you want to add a header to the output, you can modify process_file like this:

def process_file(input_path, output_path, selected_columns, column_headers=None):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])

    # write the header row only when headers were supplied
    output_string = ''
    if column_headers:
        output_string = ','.join(column_headers) + '\n'
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)

Answer 2

Here is an alternative approach using namedtuple, which helps extract the selected fields from a CSV file and then lets you write them out to a new CSV file.

from collections import namedtuple    
import csv

# Setup named tuple to receive csv data
# p1 to p6 are arbitrary field names associated with the csv file
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# Read data from the csv file and create a generator object to hold a reference to the data
# We use a generator object rather than a list to reduce the amount of memory our program will use
# The captured data will only have data from the 2nd & 5th column from the csv file
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))

# Write the data to a new csv file
with open("newdata.csv","w", newline='') as csvfile:
    cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Use the generator created earlier to access the filtered data and write it out to a new csv file
    for d in datagen:
        cvswriter.writerow(d)

Original data in "mydata.csv":

12345601,2345678@abcdef,1,2,365,places  
4567,876@def,0,5,200,noplaces

Output data in "newdata.csv":

2345678@abcdef,365  
876@def,200

Edit 1: For tab-delimited data, make the following changes to the code:
Change
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))
to
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata2.csv", "r"), delimiter='\t', quotechar='"')))
and
cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
to
cvswriter = csv.writer(csvfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
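
Put together, the tab-delimited reading side might look roughly like the sketch below. It is only an illustration: the select_fields generator and the in-memory sample are assumptions, and the length check is added because SomeData._make raises a TypeError for any row that does not have exactly six fields (a blank line, for instance).

```python
import csv
import io
from collections import namedtuple

SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# Hypothetical generator: yield (p2, p5) for every well-formed tab-separated row.
def select_fields(lines):
    reader = csv.reader(lines, delimiter='\t', quotechar='"')
    for row in reader:
        # rows with the wrong number of fields cannot be turned into SomeData
        if len(row) != len(SomeData._fields):
            continue
        d = SomeData._make(row)
        yield (d.p2, d.p5)

# In-memory sample standing in for open("mydata2.csv", "r", newline='')
source = io.StringIO(
    "12345601\t2345678@abcdef\t1\t2\t365\tplaces\n"
    "\n"
    "4567\t876@def\t0\t5\t200\tnoplaces\n"
)
selected = list(select_fields(source))
print(selected)
```

The generator can be passed to the writer loop in place of datagen.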
