Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.

Example file:

Events 
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43 
5,0.62,0.21,0.77,0.44,0.16

Here is what I have so far:

import glob
import os
import pandas as pd

# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column 
# (i.e. file1, file2, file3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)

(the indexing isn't working properly)

Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:

df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv", 
                  skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv", 
                  skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])

I have found many related links; however, I am still not able to get this to work.

You need to decide on which axis you want to append your files. Pandas will always try to do the right thing by:

  1. Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
  2. Items that belong to the same row index across files are placed side by side, under their respective columns.
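
As a quick illustration of the second point, here is a small made-up example (not your data; the frames a and b and the keys file1/file2 are just placeholders) showing how concat lines up shared row labels and fills the gaps with NaN:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3], 'y': [4]}, index=[1])

# rows that share an index label end up side by side; missing spots become NaN
side_by_side = pd.concat([a, b], axis=1, keys=['file1', 'file2'])
print(side_by_side)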

The trick to appending efficiently is to tip the files sideways, so that the behaviour you get matches what pandas.concat will be doing. This is my recipe:

from pandas import *
files = !ls *.csv  # IPython magic
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)

Notice that each frame is transposed with .T before it is concatenated, so the files are effectively joined along the column axis while the row labels are preserved. If you need to, you can transpose the resulting DataFrame back with .T.

EDIT:

For a different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:

def reader(f):
    # read one file, tip it sideways, and give it a uniform 0..n-1 header
    d = read_csv(f, index_col=0, header=None).T
    d.columns = range(d.shape[1])
    return d

df = concat([reader(f) for f in files], keys=files)
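
If the rows themselves have a varying number of fields, as in your example file, read_csv also needs to be told the maximum row width up front; the oversized names list from your own script does exactly that. Here is a minimal sketch combining the two ideas (the directory path, the file1/file2 key format and the 100-column ceiling are taken straight from your question, so treat it as a sketch rather than a drop-in answer):

import glob
import os
import pandas as pd

my_dir = "C:\\Data\\"  # placeholder path from the question
files = sorted(glob.glob(os.path.join(my_dir, "*.csv")))

def reader(path, max_cols=100):
    # a names list wider than the longest row lets read_csv accept ragged lines;
    # unused columns come back as all-NaN and can be dropped later if unwanted
    return pd.read_csv(path, skiprows=1, index_col=0, header=None,
                       names=list(range(max_cols)))

keys = ["file%i" % i for i, _ in enumerate(files, start=1)]
df = pd.concat([reader(f) for f in files], keys=keys, names=['fileno'])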
