I have a folder with multiple txt files. Each file contains information about a client of my friend's business, which he entered manually from a hardcopy document. This information can include e-mails, addresses, request IDs, etc. Each time he gets a new client, he creates a new txt file in that folder.
Using Python, I want to create a CSV file that contains all the information about all clients from the txt files, so that I can open it in Excel. Each file's content looks like this:
Date:24/02/2021
Email:*****@gmail.com
Product:Hard Drives
Type:Sandisk
Size:128GB
Some files have additional information, and each file is labeled with an ID (which is the name of the txt file).
What I'm thinking of is to make the code create a dictionary for each file, named after the txt file. The data types (Date, Email, Product, etc.) would be the keys (keep in mind that not all files have the same number of keys, as some files have more or less data than others), mapped to their values. Then this collection of dicts would be converted into one CSV file that, when opened in Excel, should look like this:
FileID | Date | Address | Product | Type | Color | Size
---|---|---|---|---|---|---
01-2021 | 02-01-2021 | | Hard Drive | SanDisk | | 128GB
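The dict-per-file plan above can be done with just the standard library's `csv.DictWriter`, which handles the "not all files have the same fields" problem via its `restval` parameter. A minimal, self-contained sketch (the folder, file names, and contents here are made-up stand-ins for the real client files):

```python
import csv
import os
import tempfile

# Build two sample client files matching the question's format
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "01-2021.txt"), "w") as f:
    f.write("Date:24/02/2021\nEmail:*****@gmail.com\nProduct:Hard Drives\nType:Sandisk\nSize:128GB\n")
with open(os.path.join(folder, "02-2021.txt"), "w") as f:
    f.write("Date:25/02/2021\nProduct:Hard Drives\nColor:Black\n")

rows = []
fieldnames = ["FileID"]  # FileID column first, then fields in the order seen

for filename in sorted(os.listdir(folder)):
    record = {"FileID": os.path.splitext(filename)[0]}
    with open(os.path.join(folder, filename)) as fp:
        for line in fp:
            line = line.strip()
            if ":" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split(":", 1)
            record[key] = value
            if key not in fieldnames:
                fieldnames.append(key)
    rows.append(record)

# DictWriter fills missing fields with restval, so files with fewer
# fields than others still produce a well-formed row
out_path = os.path.join(folder, "output.csv")
with open(out_path, "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

with open(out_path) as f:
    print(f.read())
```

Opened in Excel, the resulting CSV has one row per txt file, with blanks in the columns a given file didn't have.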
Is this a good way to achieve this goal, or is there a shorter and more effective one?
This code by @dukkee seems to logically fulfill the task required:
import os
import pandas as pd
FOLDER_PATH = "folder_path"
raw_data = []
for filename in os.listdir(FOLDER_PATH):
    with open(os.path.join(FOLDER_PATH, filename)) as fp:
        file_data = dict(line.split(":", 1) for line in fp if line)
    file_data["FileID"] = filename
    raw_data.append(file_data)
frame = pd.DataFrame(raw_data)
frame.to_csv("output.csv", index=False)
However, it keeps showing me this error:
The following code by @dm2 should also work, but it also shows me an error which I couldn't figure out:
import pandas as pd
import os
files = os.listdir('test/')
df_list = [pd.read_csv(f'test/{file}', sep = ':', header = None).set_index(0).T for file in files]
df_out = pd.concat(df_list)
# to reindex by filename
df_out.index = [file.strip('.txt') for file in files]
I made sure that none of the txt files has empty lines, but this wasn't the solution for these errors.
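One likely cause of the first error, worth checking: when iterating over a file, a "blank" line is the string `"\n"`, not `""`, so the `if line` filter does not drop it, and `dict()` then raises `ValueError: dictionary update sequence element ... has length 1; 2 is required`. A standalone reproduction (hard-coded lines, not the actual files):

```python
# A "blank" line read from a file is "\n", which is truthy,
# so `if line` keeps it and split(":") yields a 1-element sequence
lines = ["Date:24/02/2021\n", "\n", "Size:128GB\n"]

try:
    dict(line.split(":", 1) for line in lines if line)
except ValueError as e:
    print("failed:", e)

# Filtering on the presence of ":" (and stripping the newline) avoids it
parsed = dict(line.strip().split(":", 1) for line in lines if ":" in line)
print(parsed)
```

The same symptom also appears if any file contains a line without a colon (e.g. a stray note), so the `":" in line` guard is the more robust filter either way.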
You can use something like this:
import os
import pandas as pd
FOLDER_PATH = "folder_path"
raw_data = []
for filename in os.listdir(FOLDER_PATH):
    with open(os.path.join(FOLDER_PATH, filename), errors="ignore") as fp:
        file_data = dict(line.split(":", 1) for line in fp if line)
    file_data["FileID"] = filename
    raw_data.append(file_data)
frame = pd.DataFrame(raw_data)
frame.to_csv("output.csv", index=False)
You can actually read these files into pandas DataFrames and then concatenate them into one single DataFrame.
I've made a test folder with 5 slightly different test files (named '1.txt', '2.txt', ...).
Code:
import pandas as pd
import os
files = os.listdir('test/')
df_list = [pd.read_csv(f'test/{file}', sep = ':', header = None).set_index(0).T for file in files]
df_out = pd.concat(df_list)
# to reindex by filename
df_out.index = [file.strip('.txt') for file in files]
df_out:
0 Date Email Product Type Size Size2 Type2 Test
1 24/02/2021 *****@gmail.com Hard Drives Sandisk 128GB 128GB NaN NaN
2 24/02/2021 *****@gmail.com Hard Drives Sandisk 128GB NaN Sandisk NaN
3 24/02/2021 *****@gmail.com Hard Drives Sandisk 128GB NaN NaN Test
4 24/02/2021 *****@gmail.com Hard Drives Sandisk 128GB NaN NaN 2
5 24/02/2021 *****@gmail.com Hard Drives Sandisk 128GB NaN NaN NaN
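To finish, `df_out.to_csv('output.csv')` would write the combined frame with the filename index as the first column. One pitfall in the reindexing step worth knowing about: `str.strip('.txt')` treats its argument as a *set of characters* to remove from both ends, not a literal suffix, so it happens to work for names like `1.txt` but mangles names containing those characters, such as `txt7.txt`. A small demonstration (the file names here are hypothetical):

```python
import os

names = ["1.txt", "txt7.txt"]

# str.strip removes any of the characters '.', 't', 'x' from both ends,
# so the leading "txt" of "txt7.txt" is stripped along with the extension
print("txt7.txt".strip(".txt"))

# os.path.splitext removes only the extension itself
print([os.path.splitext(n)[0] for n in names])
```

On Python 3.9+, `name.removesuffix('.txt')` is another safe alternative that removes only the literal suffix.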