Pull out specific columns from multiple CSV files in a directory in Python
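I need some quick help with reading CSV files in Python and storing the data in a "data-type" file, so that I can use it for plotting once all the data has been stored in separate files.
I have searched for this, but every example I found assumed the data had headers. My data has no header row. The fields are tab-separated, and I only need to store specific columns. For example:
12345601 2345678@abcdef 1 2 365 places
In this case, for instance, I would want to store only "2345678@abcdef" and "365" in the new python file so that I can use them later to create a graph.
Also, I have more than one csv file in a folder and I need to do this for each of them. The sources I found did not cover that and only mentioned: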
# open csv file
with open(csv_file, 'rb') as csvfile:
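Could anyone point me to an already answered question or help me out with this?
. . . storing it in a PY file to use the data to graph after storing all the data in different files . . .
. . . I would want to store only "2345678@abcdef" and "365" in the new python file . . .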
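Are you sure you want to store the data in a python file? Python files are supposed to hold python code, and they should be executable by the python interpreter. It would be a better idea to store your data in a data-type file (say, preprocessed_data.csv).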
To get a list of files matching a pattern, you can use python's built-in glob library.
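As a quick illustration of pattern matching with glob, here is a minimal sketch. The helper name `list_csv_files` and the file names are made up for the demonstration, and a temporary directory stands in for your real data folder.

```python
import glob
import os
import tempfile


def list_csv_files(directory):
    """Return the sorted paths of all *.csv files directly inside `directory`."""
    pattern = os.path.join(directory, '*.csv')
    # glob.glob returns matches in arbitrary order, so sort for reproducibility
    return sorted(glob.glob(pattern))


# Demonstration in a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.csv', 'b.csv', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    found = [os.path.basename(path) for path in list_csv_files(tmp)]

print(found)  # ['a.csv', 'b.csv']
```

Only the files whose names end in .csv are returned; notes.txt is ignored.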
Here's an example of how you could read multiple csv files in a directory and extract the desired columns from each one:
import glob

# indices of columns you want to preserve
desired_columns = [1, 4]
# change this to the directory that holds your data files
csv_directory = '/path/to/csv/files/*.csv'

# iterate over files holding data
extracted_data = []
for file_name in glob.glob(csv_directory):
    with open(file_name, 'r') as data_file:
        # iterating over the file stops at the end of the file
        for line in data_file:
            # splits the line by whitespace (tabs included)
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            # only grab the columns we care about
            desired_data = [tokens[i] for i in desired_columns]
            extracted_data.append(desired_data)
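One caveat with splitting on whitespace: a field that itself contains a space gets cut into several tokens, which shifts the column indices. If that could happen in your data, a possible alternative is the standard csv module with a tab delimiter. This is only a sketch; `extract_columns` is a hypothetical helper, and the second sample row (with "no places") is invented to show the difference.

```python
import csv
import io


def extract_columns(lines, column_indices, delimiter='\t'):
    """Yield the selected columns of each non-empty delimited row."""
    for row in csv.reader(lines, delimiter=delimiter):
        # csv.reader yields an empty list for a blank line; skip those
        if not row:
            continue
        yield [row[i] for i in column_indices]


# In-memory stand-in for a tab-separated file (second row is made up)
sample = io.StringIO(
    '12345601\t2345678@abcdef\t1\t2\t365\tplaces\n'
    '4567\t876@def\t0\t5\t200\tno places\n'
)
rows = list(extract_columns(sample, [1, 4]))
print(rows)  # [['2345678@abcdef', '365'], ['876@def', '200']]
```

The same function accepts an open file object in place of the StringIO; opening the file with newline='' is what the csv documentation recommends.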
It would be easy to write the extracted data to a new file. The following example shows how you might save the data to a csv file.
output_string = ''
for row in extracted_data:
    output_string += ','.join(row) + '\n'
with open('./preprocessed_data.csv', 'w') as csv_file:
    csv_file.write(output_string)
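Joining with commas by hand works as long as no field contains a comma. A hedged alternative is to let csv.writer handle the quoting, and to read the file back with csv.reader when it is time to plot. In the sketch below, `save_rows` and `load_columns` are hypothetical helpers, the value '876,def' is invented to exercise the quoting, and the second column is assumed to hold integers.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write rows to a comma-separated file; csv.writer quotes fields as needed."""
    with open(path, 'w', newline='') as out_file:
        csv.writer(out_file).writerows(rows)


def load_columns(path):
    """Read a two-column file back as (labels, integer values)."""
    labels, values = [], []
    with open(path, newline='') as in_file:
        for row in csv.reader(in_file):
            if not row:
                continue
            labels.append(row[0])
            # assumption: the second column always holds an integer
            values.append(int(row[1]))
    return labels, values


# Round trip in a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'preprocessed_data.csv')
    save_rows(path, [['2345678@abcdef', '365'], ['876,def', '200']])
    labels, values = load_columns(path)

print(labels)  # ['2345678@abcdef', '876,def']
print(values)  # [365, 200]
```

The two lists can then be handed to whatever plotting library you prefer.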
EDIT:
If you don't want to combine all the csv files, here's a version that can process one at a time:
import glob

def process_file(input_path, output_path, selected_columns):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])
    output_string = ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'
    with open(output_path, 'w') as out_file:
        out_file.write(output_string)

# whenever you need to process a file:
process_file(
    '/path/to/input.csv',
    '/path/to/processed/output.csv',
    [1, 4])

# if you want to process every file in a directory:
target_directory = '/path/to/my/files/*.csv'
for file in glob.glob(target_directory):
    process_file(file, file + '.out', [1, 4])
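EDIT 2:
The following example will process every file in a directory and write the results to a similarly named output file in another directory: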
import os
import glob

input_directory = '/path/to/my/files/*.csv'
output_directory = '/path/to/output'
for file in glob.glob(input_directory):
    file_name = os.path.basename(file) + '.out'
    out_file = os.path.join(output_directory, file_name)
    process_file(file, out_file, [1, 4])
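Appending '.out' produces names such as data.csv.out. If you would rather keep the .csv extension on the output, one option is to split the name with os.path.splitext. The helper `output_path_for`, its `suffix` parameter, and the example paths below are assumptions for illustration only.

```python
import os


def output_path_for(input_path, output_directory, suffix='_out'):
    """Build an output path that keeps the input file's extension."""
    base_name = os.path.basename(input_path)
    # 'run1.csv' -> ('run1', '.csv')
    stem, extension = os.path.splitext(base_name)
    return os.path.join(output_directory, stem + suffix + extension)


# hypothetical paths, built with os.path.join so they work on any platform
result = output_path_for(os.path.join('data', 'run1.csv'), 'processed')
print(result)
```

On a POSIX system this prints processed/run1_out.csv.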
If you want to add a header to the output, then process_file could be modified like so:
def process_file(input_path, output_path, selected_columns, column_headers=None):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])
    # only write a header row when headers were actually supplied
    output_string = ','.join(column_headers) + '\n' if column_headers else ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'
    with open(output_path, 'w') as out_file:
        out_file.write(output_string)
Here's another approach using a namedtuple that will help extract selected fields from a csv file and then let you write them out to a new csv file.
from collections import namedtuple
import csv

# Setup named tuple to receive csv data
# p1 to p6 are arbitrary field names associated with the csv file
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# Read data from the csv file and create a generator object to hold a reference to the data
# We use a generator object rather than a list to reduce the amount of memory our program will use
# The captured data will only have data from the 2nd & 5th column from the csv file
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))

# Write the data to a new csv file
with open("newdata.csv","w", newline='') as csvfile:
    cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Use the generator created earlier to access the filtered data and write it out to a new csv file
    for d in datagen:
        cvswriter.writerow(d)
Original data in "mydata.csv":
12345601,2345678@abcdef,1,2,365,places
4567,876@def,0,5,200,noplaces
Output data in "newdata.csv":
2345678@abcdef,365
876@def,200
EDIT 1: For tab-separated data, make the following changes to the code:
Change
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))
to
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata2.csv", "r"), delimiter='\t', quotechar='"')))
and
cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
to
cvswriter = csv.writer(csvfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
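Putting the tab-separated changes together, a self-contained sketch might look like the following. It wraps the logic in a hypothetical function `extract_fields`, closes both files with a with block, and runs against a temporary directory so it can be tried without any existing data. Note that SomeData._make raises a TypeError for any row that does not have exactly six fields, blank lines included.

```python
import csv
import os
import tempfile
from collections import namedtuple

# p1 to p6 are arbitrary field names for the six columns
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')


def extract_fields(in_path, out_path):
    """Copy the 2nd and 5th tab-separated columns of in_path to out_path."""
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst, delimiter='\t')
        # _make builds one SomeData per row; each row must have six fields
        for d in map(SomeData._make, reader):
            writer.writerow((d.p2, d.p5))


# Demonstration with the sample rows, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'mydata2.csv')
    out_path = os.path.join(tmp, 'newdata.csv')
    with open(in_path, 'w', newline='') as f:
        f.write('12345601\t2345678@abcdef\t1\t2\t365\tplaces\n')
        f.write('4567\t876@def\t0\t5\t200\tnoplaces\n')
    extract_fields(in_path, out_path)
    with open(out_path, newline='') as f:
        result = list(csv.reader(f, delimiter='\t'))

print(result)  # [['2345678@abcdef', '365'], ['876@def', '200']]
```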