Pandas - Trying to store multiple .txt files in a .csv

Question

I have a folder with about 500 .txt files. I would like to store their content in a CSV file with 2 columns: column 1 holds the file name and column 2 holds the file content as a string. So I'd end up with a CSV file with 501 rows.

I've looked around SO for similar questions and came up with the following code:

import pandas as pd
from pandas.io.common import EmptyDataError
import os


def Aggregate_txt_csv(path):
    for files in os.listdir(path):
            with open(files, 'r') as file:
                try: 
                    df = pd.read_csv(file, header=None, delim_whitespace=True)
                except EmptyDataError:
                    df = pd.DataFrame()
                
            return df.to_csv('file.csv', index=False)

However, it returns an empty .csv file. Am I doing something wrong?

Answer 1

There are several problems in your code. One of them is that pd.read_csv is not opening the file because you're not passing the full path to it (os.listdir returns only the file names). I think you should try starting from this code:

import os
import pandas as pd
from pandas.errors import EmptyDataError  # public location of this exception

def Aggregate_txt_csv(path):
    frames = []
    for file in os.listdir(path):
        try:
            # Join folder and file name so the file is found from any working directory.
            d = pd.read_csv(os.path.join(path, file), header=None, sep=r"\s+")
            d["file"] = file  # remember which file each row came from
        except EmptyDataError:
            # An empty file still gets one row carrying its name.
            d = pd.DataFrame({"file": [file]})
        frames.append(d)
    df = pd.concat(frames, ignore_index=True)
    df.to_csv('file.csv', index=False)
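
To illustrate the path problem described above, here is a minimal sketch. The temporary folder and the file name note.txt are made up for the demonstration. os.listdir returns bare names, so they only resolve when the script happens to run inside that folder; joining them with the folder gives paths that work from anywhere.

```python
import os
import tempfile

# Hypothetical demo folder containing a single text file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "note.txt"), "w") as fh:
        fh.write("hello world\n")

    names = os.listdir(folder)  # bare file names, no directory part
    full_paths = [os.path.join(folder, n) for n in names]

    # A bare name is resolved against the current working directory,
    # a joined path is resolved against the folder itself.
    bare_exists = os.path.exists(names[0])
    full_exists = os.path.exists(full_paths[0])

    print(names, bare_exists, full_exists)
```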

Answer 2

Use pathlib
- Path.glob() to find all the files
- When using path objects, file.stem returns the file name without its extension.
Use pandas.concat to combine the dataframes in df_list

from pathlib import Path
import pandas as pd

p = Path('e:/PythonProjects/stack_overflow')  # path to files
files = p.glob('*.txt')  # get all txt files

df_list = list()  # create an empty list for the dataframes
for file in files:  # iterate through each file
    with file.open('r') as f:
        text = '\n'.join([line.strip() for line in f.readlines()])  # join all rows in list as a single string separated with \n
        
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]}))  # create and append a dataframe


df_all = pd.concat(df_list)  # concat all the dataframes

df_all.to_csv('files.csv', index=False)  # save to csv
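
As a small aside on the path attributes used above, this sketch shows what they return for a made-up path (the folder and file name are purely illustrative). If the extension should appear in the filename column, file.name could be used instead of file.stem.

```python
from pathlib import PurePosixPath

# Hypothetical path, used only to show the attributes.
p = PurePosixPath("some_folder/report.final.txt")

stem = p.stem      # file name without the last extension
name = p.name      # file name including the extension
suffix = p.suffix  # the last extension only

print(stem, name, suffix)
```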

Answer 3

I noticed there's already an answer, but I got it working with a relatively simple piece of code. I only edited the file read-in a little, and the dataframe is output successfully.

import os
import pandas as pd


def Aggregate_txt_csv(path):
    result = []
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue  # skip sub-directories

        # errors='replace' keeps undecodable bytes from aborting the read.
        with open(fullpath, 'r', errors='replace') as file:
            # read() returns the whole file as one string; readlines() keeps each
            # line's '\n', so '\n'.join(file.readlines()) would double the line breaks.
            content = file.read()
        result.append({'title': files, 'body': content})

    # Build the dataframe once from the list of dicts.
    df = pd.DataFrame(result)
    return df

df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv', index=False)  # index=False keeps the CSV to the two columns

Most importantly, I append each file's row to a list instead of calling pandas' concatenate function once per file, which would be bad for performance with hundreds of files. Additionally, reading the files should not need read_csv, as they have no set format. Reading each file as plain text, for example with '\n'.join(file.readlines()), pulls all of its lines into one string. Note that readlines() keeps each line's trailing newline, so joining on '\n' doubles the line breaks; file.read() returns the content unchanged.
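
A small check of how readlines() treats line endings, using an in-memory file with made-up content instead of a file on disk:

```python
import io

# In-memory stand-in for a two-line text file.
sample = "first line\nsecond line\n"

lines = io.StringIO(sample).readlines()  # each element still ends with '\n'
joined = "\n".join(lines)                # adds a second '\n' between lines
whole = io.StringIO(sample).read()       # content exactly as stored

print(repr(joined))
print(repr(whole))
```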

At the end, I convert the list of dictionaries into a final dataframe and return it.
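
One way to sanity-check the output is a round trip through CSV. The sketch below uses made-up rows and an in-memory buffer rather than result.csv; it assumes the same title/body column names as above. Multi-line bodies are quoted when written, so each file should come back as a single row.

```python
import io
import pandas as pd

# Made-up rows standing in for the list built inside the loop.
rows = [
    {"title": "a.txt", "body": "first line\nsecond line"},
    {"title": "b.txt", "body": "single line"},
]

df = pd.DataFrame(rows)  # one dataframe built once, no repeated concat

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # fields containing newlines are quoted
buffer.seek(0)

restored = pd.read_csv(buffer)  # read the CSV text back in
print(restored.shape)
```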

EDIT: for paths that aren't the current directory, I updated the code to join the path to each file name so that it can find the necessary files. Apologies for the confusion.

3 answers

Answer 1
1 2020-06-25 20:13:08

Answer 2
1 2020-06-25 20:18:03

Answer 3
0 Accepted 2020-06-25 20:18:44
