Creating a pandas DataFrame from multiple files

I am trying to create a pandas DataFrame, and it works fine for a single file. Now I need to build it from multiple files that have the same data structure. So instead of a single file name I have a list of file names from which I would like to create the DataFrame.

I'm not sure what the way is to append to the current DataFrame in pandas, or whether there is a way for pandas to read a list of files into a DataFrame.

The pandas concat function is your friend here. Let's say you have all your files in a directory, targetdir. You can:

  1. make a list of the files
  2. load them as pandas DataFrames
  3. and concatenate them together


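import os
import pandas as pd

# list the files; os.listdir returns bare names, so prepend the directory
# (targetdir is the directory that holds the files)
filelist = [os.path.join(targetdir, name) for name in os.listdir(targetdir)]
# read them into pandas
df_list = [pd.read_table(path) for path in filelist]
# concatenate them together
big_df = pd.concat(df_list)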
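As a self-contained variation on the three steps above, here is a minimal sketch that uses glob to pick up only one extension and records which file each row came from. The temporary directory, the file names, the `.tsv` extension and the `source` column are all assumptions made for the demo, not part of the original answer.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway directory with two small tab-separated files
# (hypothetical stand-ins for the real files in targetdir).
targetdir = tempfile.mkdtemp()
for name, rows in [("a.tsv", "x\ty\n1\t2\n"), ("b.tsv", "x\ty\n3\t4\n")]:
    with open(os.path.join(targetdir, name), "w") as handle:
        handle.write(rows)

# glob returns full paths and lets us restrict to one extension;
# sorting makes the row order predictable
filelist = sorted(glob.glob(os.path.join(targetdir, "*.tsv")))

# read each file and tag its rows with the file they came from
df_list = [
    pd.read_csv(path, sep="\t").assign(source=os.path.basename(path))
    for path in filelist
]

# ignore_index=True gives the result a fresh 0..n-1 index instead of
# repeating each file's own row numbers
big_df = pd.concat(df_list, ignore_index=True)
print(big_df)
```

Whether you want `ignore_index=True` depends on whether the per-file row numbers mean anything to you; if they don't, a fresh index is usually easier to work with.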

Potentially horribly inefficient, but...

Why not use read_csv to build two (or more) DataFrames, then use join to put them together?

That said, it would be easier to answer your question if you provided some data or some of the code you've used thus far.
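A rough sketch of the read_csv plus join idea, using in-memory CSV text with made-up column names in place of real files. Note that join lines frames up side by side on their index, so it fits files that describe the same rows with different columns; for files with identical columns that should be stacked, pd.concat is probably the better match.

```python
import io

import pandas as pd

# Two small CSV sources standing in for files on disk (made-up data)
csv_a = io.StringIO("id,height\n1,170\n2,180\n")
csv_b = io.StringIO("id,weight\n2,75\n3,82\n")

# Read each one, using the shared "id" column as the index
df_a = pd.read_csv(csv_a, index_col="id")
df_b = pd.read_csv(csv_b, index_col="id")

# join aligns on the index; how="outer" keeps ids found in either frame
# and leaves NaN where one side has no value
joined = df_a.join(df_b, how="outer")
print(joined)
```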

I might try to concatenate the files before feeding them to pandas. If you're on Linux or Mac you could use cat; otherwise a very simple Python function could do the job for you.
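One way such a function might look is sketched below. The name `concatenate_csv_files` and the demo files are hypothetical, and the sketch assumes every file starts with exactly one header line and ends with a newline. It keeps only the first header, which a plain cat would not do.

```python
import os
import tempfile

import pandas as pd


def concatenate_csv_files(paths, out_path):
    """Write all files in paths to out_path, keeping only the first header."""
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, newline="") as src:
                header = src.readline()
                if i == 0:
                    out.write(header)  # keep the header once
                for line in src:
                    out.write(line)  # copy the data rows unchanged


# Demo with two throwaway files (made-up data)
workdir = tempfile.mkdtemp()
paths = []
for name, body in [("part1.csv", "x,y\n1,2\n"), ("part2.csv", "x,y\n3,4\n")]:
    path = os.path.join(workdir, name)
    with open(path, "w", newline="") as handle:
        handle.write(body)
    paths.append(path)

combined = os.path.join(workdir, "combined.csv")
concatenate_csv_files(paths, combined)

# a single read_csv call now sees all the rows
df = pd.read_csv(combined)
print(df)
```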

Are these files in CSV format? If so, you could use read_csv: http://pandas.sourceforge.net/io.html

Once you have read the files and saved them in two DataFrames, you could merge the two DataFrames or add additional columns to one of them (assuming a common index). pandas should be able to fill in the missing rows (they come back as NaN).
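To illustrate both options, here is a small sketch with made-up frames that share part of their index. The column names and values are assumptions for the demo only.

```python
import pandas as pd

# Two frames with partly overlapping indexes (made-up data)
df1 = pd.DataFrame({"price": [10.0, 11.0, 12.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"volume": [100, 300]}, index=["a", "c"])

# Option 1: add a column from df2 to df1. Assignment aligns on the index,
# so row "b", which df2 lacks, gets NaN.
df1_with_volume = df1.copy()
df1_with_volume["volume"] = df2["volume"]

# Option 2: merge on the index, keeping every row from both frames
merged = pd.merge(df1, df2, left_index=True, right_index=True, how="outer")

print(df1_with_volume)
print(merged)
```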

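import os
import pandas as pd

data = []
thisdir = os.getcwd()

# walk the current directory tree and collect the names of .docx files
for root, dirs, files in os.walk(thisdir):
    for name in files:
        if name.endswith(".docx"):
            data.append(name)

# one row per matching file name
df = pd.DataFrame(data)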

Here is a simple solution that avoids using a list to hold all the data frames, if you don't need them in a list. It creates a DataFrame for each file; you can then pd.concat them.

import fnmatch
import os

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files
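The snippet above stops at the file list. A possible way to finish it, assuming the goal is a single DataFrame, is to hand pd.concat a generator expression so no intermediate list is named. The temporary directory and file contents below are made up for the demo, and pandas still needs every frame in memory to build the result.

```python
import fnmatch
import os
import tempfile

import pandas as pd

# Throwaway directory with two CSV files and one unrelated file
workdir = tempfile.mkdtemp()
for name, body in [("a.csv", "x\n1\n"), ("b.csv", "x\n2\n"), ("notes.txt", "skip\n")]:
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(body)

# get the CSV files only, sorted so the row order is predictable
files = sorted(fnmatch.filter(os.listdir(workdir), "*.csv"))

# pd.concat accepts any iterable of frames, so a generator expression
# works without building a list of DataFrames first
df = pd.concat(
    (pd.read_csv(os.path.join(workdir, name)) for name in files),
    ignore_index=True,
)
print(df)
```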
