How to save file names in a DataFrame and then extract information from them
I have almost 1,000,000 files, possibly more, in one directory. My final goal is to extract some information from just the names of the files. So far I have saved the file names in a list.
What information is in the file names? The names follow a format like this:
09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt
Every haha stands for some other text that does not matter.
I want to extract 09066271 and 2016-10-07 from each name and save them in a DataFrame. The first number is always 8 characters long.
So far, I have saved all of the text file names in a list:
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
At first I wanted to save all of the txt file names in a DataFrame and then do these operations on them. It seems I first have to read the list into a numpy array and then reshape it so that pandas can read it. However, I do not know in advance what the reshape dimensions should be.
df = pd.DataFrame(np.array(file_list).reshape(,))
I would appreciate it if you could share your ideas and suggest an efficient way of doing this :)
You can use os to list all of the files. Then just construct a DataFrame from the list (no numpy reshape is needed) and use the string methods to get the parts of the file names you need.
import pandas as pd
import os
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
df = pd.DataFrame(file_list, columns=['file_name'])
df['data'] = df.file_name.str[0:8]
# Raw string avoids invalid-escape warnings; expand=False returns a Series for the single group
df['date'] = df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)
                                           file_name      data        date
0  09066271_142468576_1_Haha_-Haha-haha_2016-10-0...  09066271  2016-10-07
1  09014271_142468576_1_Haha_-Haha-haha_2013-02-1...  09014271  2013-02-18
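As a variation on the approach above, both fields can be pulled out in a single pass with one regular expression that uses named groups, and the date can then be converted to a proper datetime column. This is only a sketch: the sample names stand in for the output of os.listdir(path), the tail of the second name is made up for illustration, and the pattern assumes the 8-digit ID always comes first and the date is always surrounded by underscores.

```python
import pandas as pd

# Hypothetical sample names standing in for os.listdir(path);
# the tail of the second name is invented for this example.
file_list = [
    "09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt",
    "09014271_142468576_1_Haha_-Haha-haha_2013-02-18_haha-false_haha2427.txt",
]

# A plain Python list becomes a one-column DataFrame directly.
df = pd.DataFrame(file_list, columns=["file_name"])

# Named groups become the column names of the extracted frame.
# Assumption: 8-digit ID at the start, date delimited by underscores.
pattern = r"^(?P<data>\d{8})_.*?_(?P<date>\d{4}-\d{2}-\d{2})_"
parts = df["file_name"].str.extract(pattern)

# Attach the extracted columns and parse the date strings.
df = df.join(parts)
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```

Names that do not match the pattern end up with missing values in both columns, so a quick df["data"].isna().sum() shows how many files deviate from the expected format.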