I want to convert a text file to a CSV file with the columns name, date, and description. I'm new to Python and haven't found a proper way to do this. Can someone guide me? Below is the sample text file.
================================================== ====
Title: Whole case
Location: oyuri
From: Aki
Date: 2018/11/30 (Friday) 11:55:29
================================================== =====
1: Aki
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
-------------------------------------------------- ------------------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.
-------------------------------------------------- ------------------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
-------------------------------------------------- ------------------
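I outline below a very simplistic approach to achieving your task. The general idea is to:
1. read the file's text using open()
2. split the text into a list of lines
3. group those lines into a list of records, one per message
4. build a pandas DataFrame from the records and export it to CSV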
I would recommend using Jupyter Notebooks to get a better idea of what I have done here.
import pandas as pd

# open file and extract text
text_path = 'text.txt'
with open(text_path) as f:
    text = f.read()

# split text into a list
lines = text.split('\n')

# remove heading (two '=====' dividers plus four header fields)
len_heading = 6
lines = lines[len_heading:]

# separate information using divider
divider = '-----'
data = []
start = 0
for i, line in enumerate(lines):
    # add elements to data if divider found
    if line.startswith(divider):
        data.append(lines[start:i])
        start = i + 1

# extract name, date and description from data
names, dates, description = [], [], []
for info in data:
    # this is a very simplistic approach, please add checks
    # to make sure you are getting the right data
    name = info[0].split(':', 1)[1].strip()  # drop the "1:" index prefix
    date = info[1][:10]                      # keep only YYYY/MM/DD
    desc = info[2]
    names.append(name)
    dates.append(date)
    description.append(desc)

# create pandas dataframe
df = pd.DataFrame({'name': names, 'date': dates, 'description': description})

# export dataframe to csv
df.to_csv('converted_text.csv', index=False)
You should get a CSV file with the columns name, date and description, and one row per message.
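The comment in the code above asks for checks on each record. A minimal sketch of what such checks could look like, using only the standard library, is shown here. The names parse_record and records_to_csv are hypothetical, and the assumed record layout ("index: name", then a date line, then the text) is taken from the sample file only.

```python
import csv
import io
import re

# Assumed layout, based on the sample: "1: Aki" then "2018/12/05 (Wed) 17:33:17"
NAME_RE = re.compile(r'^(\d+):\s*(.+)$')
DATE_RE = re.compile(r'^(\d{4}/\d{2}/\d{2})\b')


def parse_record(info):
    """Return (name, date, description), or None if the record looks malformed."""
    if len(info) < 3:
        return None
    name_match = NAME_RE.match(info[0].strip())
    date_match = DATE_RE.match(info[1].strip())
    if not name_match or not date_match:
        return None
    # Join any extra lines so multi-line descriptions are not lost
    description = ' '.join(line.strip() for line in info[2:])
    return name_match.group(2).strip(), date_match.group(1), description


def records_to_csv(records):
    """Write the valid records to CSV text, skipping malformed ones."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['name', 'date', 'description'])
    for info in records:
        row = parse_record(info)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()


sample = [
    ['1: Aki', '2018/12/05 (Wed) 17:33:17', 'An approval notice has been sent.'],
    ['3: kano, etc.', '2018/12/07 (Friday) 11:44:45', 'Please call rito.'],
    ['not a record'],  # too short, so it is skipped
]
csv_text = records_to_csv(sample)
print(csv_text)
```

The csv module quotes "kano, etc." automatically, so the comma inside the name does not break the columns.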
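Use np.where(cond, 1, 0).cumsum() to tag every separate message.

import numpy as np
import pandas as pd

# read the file as a single column, one row per line
# (file is the path to the text file; newer pandas versions may reject
# sep='\n', in which case build the frame from the file's lines instead,
# e.g. pd.DataFrame(open(file).read().splitlines()))
df = pd.read_csv(file, sep='\n', header=None)

# locate the rows that contain ------ or ======
cond = df[0].str.contains('-----|======')
df['tag'] = np.where(cond, 1, 0).cumsum()

# keep only the message lines (drop dividers and the header block)
cond2 = df['tag'] >= 2
dfn = df[(~cond & cond2)].copy()

# output: one row per message, split into three columns
df_output = (dfn.groupby('tag')[0]
                .apply('\n'.join)
                .str.split('\n', n=2, expand=True))
df_output.columns = ['name', 'date', 'Description']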
output:
     name           date                            Description
tag
2.0  1: Aki         2018/12/05 (Wed) 17:33:17       An approval notice has been sent.
3.0  2: Aki         2018/12/06 (Thursday) 17:14:30  I was notified by Mr. Id, the agent of the oth...
4.0  3: kano, etc.  2018/12/07 (Friday) 11:44:45    Please call rito.
df:
0 tag
0 ==============================================... 1
1 Title: Whole case 1
2 Location: oyuri 1
3 From: Aki 1
4 Date: 2018/11/30 (Friday) 11:55:29 1
5 ==============================================... 2
6 1: Aki 2
7 2018/12/05 (Wed) 17:33:17 2
8 An approval notice has been sent. 2
9 ----------------------------------------------... 3
10 2: Aki 3
11 2018/12/06 (Thursday) 17:14:30 3
12 I was notified by Mr. Id, the agent of the oth... 3
13 ----------------------------------------------... 4
14 3: kano, etc. 4
15 2018/12/07 (Friday) 11:44:45 4
16 Please call rito. 4
17 ----------------------------------------------... 5
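To see the cumsum tagging in isolation, here is a minimal sketch on a shortened, in-memory copy of the sample, so no file is needed. The variable names (demo, is_divider, messages) are illustrative only and the divider lines are abbreviated.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample; dividers are abbreviated for readability
lines = [
    '=====',
    'Title: Whole case',
    '=====',
    '1: Aki',
    '2018/12/05 (Wed) 17:33:17',
    'An approval notice has been sent.',
    '-----',
    '2: Aki',
    '2018/12/06 (Thursday) 17:14:30',
    'I was notified by Mr. Id, the agent of the other party.',
    '-----',
]
demo = pd.DataFrame({0: lines})

# Each divider contributes 1 to the running sum, so every line after a
# divider shares that divider's tag until the next divider appears
is_divider = demo[0].str.contains('-----|=====')
demo['tag'] = np.where(is_divider, 1, 0).cumsum()
tags = demo['tag'].tolist()

# Drop the dividers and the header block (tag 1), then collect lines per tag
body = demo[~is_divider & (demo['tag'] >= 2)]
messages = {tag: group[0].tolist() for tag, group in body.groupby('tag')}
print(tags)
print(messages)
```

The header block gets tag 1 and each message gets its own tag from 2 upwards, which is why the filter keeps only tags of 2 or more.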
You can then split the leading index off the name:
obj = df_output['name'].str.strip().str.split(r':\s*')
df_output['name'] = obj.str[-1]
df_output['idx'] = obj.str[0]
df_output = df_output.set_index('idx')
     name        date                            Description
idx
1    Aki         2018/12/05 (Wed) 17:33:17       An approval notice has been sent.
2    Aki         2018/12/06 (Thursday) 17:14:30  I was notified by Mr. Id, the agent of the oth...
3    kano, etc.  2018/12/07 (Friday) 11:44:45    Please call rito.
To add the header fields as extra columns:
cond = (df['tag'] == 1) & (df[0].str.contains(':'))
header_dict = dict(df.loc[cond, 0].str.split(': ', n=1).values)
# {'Title': 'Whole case',
#  'Location': 'oyuri',
#  'From': 'Aki ',
#  'Date': '2018/11/30 (Friday) 11:55:29'}
for k, v in header_dict.items():
    df_output[k] = v
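The question asks for a CSV file, and the steps above stop at a DataFrame. The following is a minimal, self-contained sketch of the last two steps: attaching header fields as constant columns and exporting with to_csv. The header lines and message rows are hard-coded from the sample, and the names header and messages are illustrative stand-ins for header_dict and df_output.

```python
import pandas as pd

header_lines = [
    'Title: Whole case',
    'Location: oyuri',
    'From: Aki',
    'Date: 2018/11/30 (Friday) 11:55:29',
]

# partition() splits on the first colon only, so the colons inside the
# time stamp stay in the value
header = {}
for line in header_lines:
    key, sep, value = line.partition(':')
    if sep:
        header[key.strip()] = value.strip()

# Stand-in for df_output, with two rows taken from the sample
messages = pd.DataFrame({
    'name': ['Aki', 'kano, etc.'],
    'date': ['2018/12/05 (Wed) 17:33:17', '2018/12/07 (Friday) 11:44:45'],
    'Description': ['An approval notice has been sent.', 'Please call rito.'],
})

# Assigning a scalar repeats the header value on every message row
for key, value in header.items():
    messages[key] = value

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = messages.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'converted_text.csv' as the first argument of to_csv writes the same content to disk.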