How to save file names in a DataFrame and then extract information from them

I have almost 1,000,000 files, possibly more, in a directory. My final goal is to extract some information from the file names alone. So far I have saved the file names in a list.

What information is in the file names?

The file names follow a format like this:

09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt

Each haha stands for some other text that varies and does not matter.

I want to extract 09066271 and 2016-10-07 from each name and save them in a DataFrame. The first number is always 8 characters long.

So far, I have saved all of the text file names in a list:

import os

path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)

At first I wanted to put all of the txt file names into a DataFrame and then do these operations on them. It seems I have to read the list into a NumPy array first and then reshape it so that pandas can read it. However, I do not know in advance what the reshape dimensions should be.

df = pd.DataFrame(np.array(file_list).reshape(,))

I would appreciate your ideas on an efficient way of doing this :)

You can use os to list all of the files. Then construct a DataFrame directly from the list (no NumPy reshape is needed) and use the string methods to get the parts of the file names you need.

import pandas as pd
import os

path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)

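# Build a one-column DataFrame straight from the list of names
df = pd.DataFrame(file_list, columns=['file_name'])
# The leading number is always the first 8 characters
df['data'] = df.file_name.str[0:8]
# First YYYY-MM-DD pattern in the name; the raw string avoids invalid-escape warnings
df['date'] = df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)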

                                           file_name      data        date
0  09066271_142468576_1_Haha_-Haha-haha_2016-10-0...  09066271  2016-10-07
1  09014271_142468576_1_Haha_-Haha-haha_2013-02-1...  09014271  2013-02-18
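
As a possible variation on the approach above, both fields can be pulled out in one pass with a single regular expression that uses named groups, and the date can then be converted to a real datetime column. This is only a sketch: the two sample names are hypothetical and simply follow the pattern from the question, and the regex assumes the date is always preceded by an underscore.

```python
import pandas as pd

# Hypothetical sample names that follow the pattern described in the question.
file_list = [
    "09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt",
    "09014271_142468576_1_Haha_-Haha-haha_2013-02-18_haha-false_haha2427.txt",
]

names = pd.Series(file_list, name="file_name")

# 8 leading digits, then the first YYYY-MM-DD that follows an underscore.
# Named groups become the column names of the extracted DataFrame.
pattern = r"^(?P<data>\d{8})_.*?_(?P<date>\d{4}-\d{2}-\d{2})"
parts = names.str.extract(pattern)

df = pd.concat([names, parts], axis=1)

# Names that do not match the pattern yield NaN here and NaT after conversion.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```

Keeping `data` as a string preserves the leading zero, which would be lost if the column were converted to an integer.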
