

Reading data from ~150+ (.csv) files into two categories based on the presence of a particular string in the file name

I'm trying to set up a way to automate a lot of data analysis in Python 3. For now, most of the actual analysis is fairly simple (plotting 2 curves based on 4 input files and doing a few calculations). As I will always have a minimum of 4 files, I currently have something like the code below to read in the data from the 4 .csv files I am looking at. Realistically, there are ~150+ files at any given time and I need a way to compare all of them very quickly.

For some background:

1) All files will be located in the same folder with the same path (except for the specific file name).
2) There are two categories (I call them 'Control' and 'NP') and each category has 2 files corresponding to it: Control-A, Control-B, NP-A, and NP-B.
3) There is currently a ton of information located in the file name (lab conditions and so forth that the data acquisition software is reading live during the measurements), but somewhere in the middle of the filename lies either the word "Dark" or "Illuminated".

With this information, I am trying to find a way to import all of the files at once and separate them based on the file name. For example, all files that contain the words "ControlDark" will be grouped together, all files that contain "ControlIlluminated" will be grouped together, and so forth for the other two combinations ("NPDark" and "NPIlluminated").

Right now, all I have is a GUI that allows me to manually select 4 files from a specific path (using askopenfilename()). I'm not aware of any good ways to read in hundreds of .csv files at once.

Right now, I can only accommodate 4 data sets at a time, as I'm not aware of a way to read an entire folder's worth of data without a corresponding askopenfilename() or np.genfromtxt('path\\filename.csv') call for each file:

from tkinter.filedialog import askopenfilename
import pandas as pd

# One manual file-picker dialog per file:
f1 = askopenfilename()
f1_data = pd.read_csv(f1, names = ['A', 'B', 'C'])

f2 = askopenfilename()
f2_data = pd.read_csv(f2, names = ['A', 'B', 'C'])

f3 = askopenfilename()
f3_data = pd.read_csv(f3, names = ['A', 'B', 'C'])

f4 = askopenfilename()
f4_data = pd.read_csv(f4, names = ['A', 'B', 'C'])

Basically I bring up a GUI with the askopenfilename() command and manually find the 4 files in question. However, I want to automate this so that I can dump all ~150+ files into it right from the start.

I have found a way to begin, but I'm getting a bit stuck with reading each file into its own data structure. So far I have:

import glob
import pandas as pd
import os

path = r'full\path\here'
all_files = glob.glob(os.path.join(path, "*.csv"))

#Setting up a list for each of the 4 files I need to generate each plot
DarkControl = []
IllControl = []
DarkNP = []
IllNP = []

for f in all_files:
    if "Control" in f and "Dark" in f:
        DarkControl.append(f)
    elif "Control" in f and "Illuminated" in f:
        IllControl.append(f)
    elif "GoldNP" in f and "Dark" in f:
        DarkNP.append(f)
    elif "GoldNP" in f and "Illuminated" in f:
        IllNP.append(f)

So I have a list for each of the categories, but right now each is a list of strings. Is there a good way (possibly using pandas data frames?) to create a data frame for each file f in all_files? I definitely want to avoid creating one massive structure with all files. In each file, the first column is my x variable and the second is my y variable. I want to make sure that I can plot the y values of any given f, and the y values of some other f, against the x values (the x values are the same for all files).

Usually we would ask for an MCVE (a minimal, complete, verifiable example) we can test our code against. However, I think I do have some understanding of your problem.

If I understand your problem correctly, you have sensor-type data where x is some type of time-like axis and this is repeated across multiple experimental trials.

You are on the right track for sorting the files out, but a Python list comprehension would probably be a cleaner/more Pythonic way of writing this:

Dark_control = [f for f in all_files if "Control" in f and "Dark" in f]

You could also implement your pattern matching in your glob.glob call.
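For example (a sketch; the `path` value is the hypothetical folder from the question, and it assumes "Control" always appears before "Dark" in the file name):

```python
import glob
import os

path = r'full\path\here'  # hypothetical folder, as in the question

# '*' matches the variable parts of the name, so this returns only files
# whose names contain "Control" somewhere before "Dark".
dark_control_files = glob.glob(os.path.join(path, "*Control*Dark*.csv"))
```

One caveat: a glob pattern is order-sensitive, so this only works if the keywords always appear in the same order in the file name; otherwise substring checks on each file name are safer.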

A data frame would be perfect for this type of structure, where depending on your data structure (and how you want it set up) you could use that same list comprehension to read the data as well.

Dark_control = [pd.read_csv(f) for f in all_files if "Control" in f and "Dark" in f]

The code above would create a list of data frames with all of the values together, which you could combine with pd.concat (or DataFrame.join) depending on how you want your final data.
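As a sketch with made-up stand-in frames (the real ones would come from pd.read_csv as above), pd.concat with `keys` stacks the per-file frames while keeping each trial identifiable:

```python
import pandas as pd

# Stand-in frames sharing the same x column 'A' (values are made up).
trial1 = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': [20, 21]})
trial2 = pd.DataFrame({'A': [0, 1], 'B': [12, 13], 'C': [22, 23]})
dark_control = [trial1, trial2]

# `keys` adds an outer index level labelling which file each row came from.
combined = pd.concat(dark_control, keys=['trial1', 'trial2'],
                     names=['trial', 'row'])
```

`combined.loc['trial2']` then recovers a single file's frame, so nothing is lost by stacking them.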

I'm not sure why you couldn't have all of your data together in a single large dataframe for analysis (look into using multi-indexing to keep different experimental trials separate).
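A minimal sketch of that idea, using made-up stand-in frames and two of the category labels from the question:

```python
import pandas as pd

# Stand-in frames for two of the four categories (values are made up).
frames = {
    ('Control', 'Dark'): pd.DataFrame({'A': [0, 1], 'B': [1.0, 1.1]}),
    ('GoldNP', 'Dark'): pd.DataFrame({'A': [0, 1], 'B': [2.0, 2.1]}),
}

# One frame for everything; the first two index levels identify the group,
# so trials never get mixed up even though they share one structure.
all_data = pd.concat(frames, names=['sample', 'condition', 'row'])

# Pulling one experimental group back out:
control_dark = all_data.loc[('Control', 'Dark')]
```

Since all files share the same x values, slicing groups back out like this makes cross-group plots straightforward.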
