How to add the file names of imported txt files to a dataframe in Python

Question

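I have imported thousands of txt files from a folder into a pandas dataframe. Is there a way to create a column that holds a substring of the file name each text was imported from? The goal is to identify each text file in the dataframe by a unique name.

The text files are named 1001example.txt, 1002example.txt, 1003example.txt, and so on. I would like something like this: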

filename        text
1001            this is an example text
1002            this is another example text
1003            this is the last example text
....

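The code I use to import the data is below. However, I don't know how to create a column from a substring of the file name. Any help would be appreciated. Thanks.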

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))

corpus = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'text':corpus})

Answer 1

This should work. It keeps every digit character in the file name, so 1001example.txt becomes 1001.

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join("K:\\text_all", "*.txt"))

corpus = []
files = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
        files.append(''.join([n for n in os.path.basename(file_path) if n.isdigit()]))

df = pd.DataFrame({'file':files, 'text':corpus})
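
Keeping every digit works for names like 1001example.txt, but it would also pick up digits that appear later in a name. As a minimal sketch of an alternative, assuming the identifier is always the leading run of digits, a regular expression can take only that prefix. The helper names `file_id` and `all_digits` are hypothetical and only illustrate the difference:

```python
import os
import re


def file_id(file_path):
    """Return the leading run of digits in the file name, or None if there is none."""
    match = re.match(r"\d+", os.path.basename(file_path))
    return match.group(0) if match else None


def all_digits(file_path):
    """Keep every digit character in the file name (the approach used in the loop above)."""
    return "".join(ch for ch in os.path.basename(file_path) if ch.isdigit())
```

For a name such as 1001example2.txt, `all_digits` returns "10012" while `file_id` returns "1001". Inside the loop, `files.append(file_id(file_path))` could be used instead of the join.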

Answer 2

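Here is a one-liner. It assumes glob, os, and pandas are imported as in Answer 1. Note that pd.read_csv parses each file as CSV, so the first line of every file is treated as a header row:

df = pd.concat([pd.read_csv(f, encoding='latin-1').
                assign(Filename=os.path.basename(f)) for f in glob.glob('K:\\text_all\\*.txt')])
df['Filename'] = df['Filename'].str.extract(r'(\d+)', expand=False).astype(int)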
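
If each file should become a single row holding its full text, as in the table in the question, reading the files directly avoids the CSV parsing step. The following is a rough sketch under that assumption, using only the standard library. The function name `read_records` is hypothetical, and falling back to the file stem when a name has no leading digits is an illustrative choice, not something from the answers:

```python
import re
from pathlib import Path


def read_records(folder, encoding="latin-1"):
    """Return one {'filename', 'text'} dict per .txt file in folder, sorted by file name."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Use the leading digits as the identifier; fall back to the stem (assumption).
        match = re.match(r"\d+", path.name)
        records.append({
            "filename": match.group(0) if match else path.stem,
            "text": path.read_text(encoding=encoding),
        })
    return records
```

Passing the result to pandas, for example `pd.DataFrame(read_records("K:\\text_all"))`, should give a dataframe with a filename column and a text column.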

2 answers

Answer 1
2 accepted 2020-07-14 05:12:51

Answer 2
1 2020-07-14 05:22:53
