
Reading data from ~150+ (.csv) files into two categories based on the presence of a particular string in the file name

I'm trying to set up a way to automate a lot of data analysis in Python 3. For now, most of the actual analysis is fairly simple (plotting 2 curves based on 4 input files and doing a few calculations). Since I will always have a minimum of 4 files, I currently have something like the code below to read in the data from the 4 .csv files I am looking at. Realistically, there are ~150+ files at any given time, and I need a way to compare all of them quickly.

For some background:

1) All files are located in the same folder, so their paths differ only in the file name.
2) There are two categories (I call them 'Control' and 'NP'), and each category has 2 files corresponding to it: Control-A and Control-B, and NP-A and NP-B.
3) There is currently a ton of information in the file name (lab conditions and so forth that the data acquisition software records live during the measurements), but somewhere in the middle of the file name lies either the word "Dark" or the word "Illuminated".

With this information, I am trying to find a way to import all of the files at once and separate them based on the file name. For example, all files that contain the words "ControlDark" will be grouped together, all files that contain "ControlIlluminated" will be grouped together and so forth for the other two combinations ("NPDark" and "NPIlluminated").

Right now, all I have is a GUI that allows me to manually select 4 files from a specific path (using askopenfilename()). I'm not aware of any good way to read in hundreds of .csv files at once.

Right now, I can only accommodate 4 data sets at a time, as I'm not aware of a way to read an entire folder's worth of data without a corresponding askopenfilename() or np.genfromtxt('path\\filename.csv') call for each file:

f1 = askopenfilename()
f1_data = pd.read_csv(f1, names = ['A', 'B', 'C'])

f2 = askopenfilename()
f2_data = pd.read_csv(f2, names = ['A', 'B', 'C'])

f3 = askopenfilename()
f3_data = pd.read_csv(f3, names = ['A', 'B', 'C'])

f4 = askopenfilename()
f4_data = pd.read_csv(f4, names = ['A', 'B', 'C'])

Basically, I bring up a GUI with the askopenfilename() command and manually find the 4 files in question. However, I want to automate this so that I can dump all ~150+ files in right from the start.

I have found a way to begin, but I'm getting a bit stuck with reading each file into its own data structure. So far I have:

import glob
import pandas as pd
import os

path = r'full\path\here'
all_files = glob.glob(os.path.join(path, "*.csv"))

#Setting up a list for each of the 4 files I need to generate each plot
DarkControl = []
IllControl = []
DarkNP = []
IllNP = []

for f in all_files:
    if "Control" in f and "Dark" in f:
        DarkControl.append(f)
    elif "Control" in f and "Illuminated" in f:
        IllControl.append(f)
    elif "GoldNP" in f and "Dark" in f:
        DarkNP.append(f)
    elif "GoldNP" in f and "Illuminated" in f:
        IllNP.append(f)

So I have a list for each of the categories, but right now each is a list of strings. Is there a good way (possibly using pandas DataFrames?) to create a DataFrame for each file f in all_files? I definitely want to avoid creating one massive structure holding all files. In each file, the first column is my x variable and the second is my y variable, and I need to be able to plot the y values of any given file against the y values of any other file over the shared x values (the x values are the same for all files).

Usually we would ask for an MCVE (minimal, complete, verifiable example) we can test our code against. However, I think I do have some understanding of your problem.

If I understand your problem correctly, you have sensor-type data where x is some type of time-like axis and this is repeated across multiple experimental trials.

You are on the right track for sorting the files, but a Python list comprehension would probably be a cleaner/more Pythonic way of writing this:

Dark_control = [f for f in all_files if "Control" in f and "Dark" in f]

You could also implement your pattern matching directly in your glob.glob call.
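For example (the folder path below is a stand-in, as in your own snippet), the wildcards do the filtering before Python ever sees the names:

```python
import glob
import os

# Hypothetical folder path; "*Control*Dark*.csv" matches any .csv whose
# name contains "Control" followed (anywhere later) by "Dark".
path = r'full\path\here'
DarkControl = glob.glob(os.path.join(path, "*Control*Dark*.csv"))
```

One pattern per category gives you the four lists directly, with no if/elif chain.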

A DataFrame would be perfect for this type of structure, and depending on your data (and how you want it set up) you could use that same list comprehension to read the files as well:

Dark_control = [pd.read_csv(f) for f in all_files if "Control" in f and "Dark" in f]

The code above creates a list of DataFrames, which you could then combine with pd.concat (or DataFrame.join) depending on how you want your final data laid out.
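A minimal sketch of that combining step, using two small stand-in frames in place of the ones pd.read_csv would return:

```python
import pandas as pd

# Stand-in DataFrames playing the role of two parsed .csv files;
# column names match those used in the question (A = x, B = y).
df1 = pd.DataFrame({"A": [0, 1, 2], "B": [10, 11, 12]})
df2 = pd.DataFrame({"A": [0, 1, 2], "B": [20, 21, 22]})
Dark_control = [df1, df2]

# Stack the list row-wise into one frame; ignore_index renumbers the rows.
combined = pd.concat(Dark_control, ignore_index=True)
```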

I'm not sure why you couldn't have all of your data together in a single large DataFrame for analysis (look into multi-indexing to keep different experimental trials separate).
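For instance (the file names here are invented stand-ins), passing a dict to pd.concat builds that MultiIndex for you, with one outer level per trial:

```python
import pandas as pd

# Stand-in frames keyed by hypothetical file names; pd.concat turns the
# dict keys into the outer level of a MultiIndex.
frames = {
    "ControlDark_run1.csv": pd.DataFrame({"x": [0, 1], "y": [2.0, 2.5]}),
    "ControlDark_run2.csv": pd.DataFrame({"x": [0, 1], "y": [1.8, 2.2]}),
}
all_data = pd.concat(frames, names=["file", "row"])

# One trial can be pulled back out with .loc on the outer level.
run1 = all_data.loc["ControlDark_run1.csv"]
```

This keeps everything in a single structure while still letting you slice out any individual file's curve for plotting.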
