
Converting a text file to csv with columns

I want to convert a text file to a CSV file with the columns name, date, and description. I'm new to Python and can't find a proper way to do this. Can someone guide me? Below is the sample text file.

================================================== ====
Title: Whole case
Location: oyuri
From: Aki 
Date: 2018/11/30 (Friday) 11:55:29
================================================== =====
1: Aki 
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
-------------------------------------------------- ------------------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

-------------------------------------------------- ------------------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
-------------------------------------------------- ------------------

I outline below a very simplistic approach to achieving your task. The general idea is to:

  1. Read in your text file using open()
  2. Split the text into a list
  3. Isolate the information in each element of the list
  4. Export the information to a csv using pandas

I would recommend using Jupyter Notebooks to get a better idea of what I have done here.

import pandas as pd

# open file and extract text
text_path = 'text.txt'
with open(text_path) as f:
    text = f.read()

# split text into a list
lines = text.split('\n')

# remove the heading (divider, Title, Location, From, Date, divider)
len_heading = 6
lines = lines[len_heading:]

# separate information using divider
divider = '-----'
data = []
start = 0
for i, line in enumerate(lines):
    
    # add elements to data if divider found
    if line.startswith(divider):
        data.append(lines[start:i])
        start = i+1

# extract name, date and description from data
names, dates, description = [], [], []
for info in data:
    
    # this is a very simplistic approach, please add checks
    # to make sure you are getting the right data
    name = info[0].split(':', 1)[1].strip()  # drop the "1:" prefix and stray spaces
    date = info[1][:10]                       # keep only yyyy/mm/dd
    desc = info[2]
    
    names.append(name)
    dates.append(date)
    description.append(desc)

# create pandas dataframe
df = pd.DataFrame({'name': names, 'date': dates, 'description': description})

# export dataframe to csv
df.to_csv('converted_text.csv', index=False)

You should get a CSV file with the columns name, date and description, and one row per entry in the text file.
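The comments in the code above ask for checks on the extracted data. A minimal sketch of what those checks could look like, using only the standard library, is shown here. The helper name parse_entries and the two regular expressions are assumptions based on the sample file (a numbered name line, a date line starting with yyyy/mm/dd, then the description), not part of the original answer.

```python
import csv
import io
import re

# Assumed entry layout: "<n>: <name>", "<yyyy/mm/dd> (<weekday>) <time>", description.
NAME_RE = re.compile(r'^\d+:\s*(.*\S)\s*$')
DATE_RE = re.compile(r'^(\d{4}/\d{2}/\d{2})\b')


def parse_entries(text):
    """Hypothetical helper: return (name, date, description) tuples."""
    blocks, block = [], []
    for line in text.splitlines():
        if line.startswith('-----') or line.startswith('====='):
            # a divider closes the block collected so far
            if block:
                blocks.append(block)
            block = []
        elif line.strip():
            block.append(line.strip())
    if block:
        blocks.append(block)

    entries = []
    for block in blocks:
        if len(block) < 3:
            continue  # not enough lines for name, date and description
        name = NAME_RE.match(block[0])
        date = DATE_RE.match(block[1])
        if not (name and date):
            continue  # e.g. the Title/Location/From/Date header block
        entries.append((name.group(1), date.group(1), ' '.join(block[2:])))
    return entries


sample = """\
=====
Title: Whole case
From: Aki
Date: 2018/11/30 (Friday) 11:55:29
=====
1: Aki
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
-----
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

-----
"""

entries = parse_entries(sample)

# write to an in-memory buffer here; use open('converted_text.csv', 'w', newline='')
# instead to write a real file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['name', 'date', 'description'])
writer.writerows(entries)
csv_text = buf.getvalue()
```

Blocks that do not match the expected pattern are skipped silently in this sketch; raising an error or logging them may suit real data better.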

  1. Find the rows that contain a message separator line, e.g. '-----' or '======'.
  2. Use np.where(cond, 1, 0).cumsum() to tag every separate message.
  3. Keep only the lines without '-----' or '======'.
  4. Group by tag and join with the separator '\n', then use str.split to expand the columns.
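
import numpy as np
import pandas as pd

# read the file into a single column, one row per non-empty line
# (newer pandas versions reject sep='\n' in read_csv, so read the lines directly)
file = 'text.txt'  # path to the text file (assumed)
with open(file) as f:
    df = pd.DataFrame([line.rstrip('\n') for line in f if line.strip()])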

# locate the rows that contain ----- or ======
cond = df[0].str.contains('-----|======')
df['tag'] = np.where(cond, 1, 0).cumsum()

# keep only the message lines: drop separators and the header block (tag 1)
cond2 = df['tag'] >= 2
dfn = df[(~cond & cond2)].copy()

# output
df_output = (dfn.groupby('tag')[0]
            .apply('\n'.join)
            .str.split('\n', n=2, expand=True))
df_output.columns = ['name', 'date', 'Description']

output:

              name                            date  \
tag                                                  
2.0        1: Aki        2018/12/05 (Wed) 17:33:17   
3.0         2: Aki  2018/12/06 (Thursday) 17:14:30   
4.0  3: kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
tag                                                     
2.0                  An approval notice has been sent.  
3.0  I was notified by Mr. Id, the agent of the oth...  
4.0                                  Please call rito.  

df:

                                                    0  tag
0   ==============================================...    1
1                                   Title: Whole case    1
2                                     Location: oyuri    1
3                                          From: Aki     1
4                  Date: 2018/11/30 (Friday) 11:55:29    1
5   ==============================================...    2
6                                             1: Aki     2
7                           2018/12/05 (Wed) 17:33:17    2
8                   An approval notice has been sent.    2
9   ----------------------------------------------...    3
10                                             2: Aki    3
11                     2018/12/06 (Thursday) 17:14:30    3
12  I was notified by Mr. Id, the agent of the oth...    3
13  ----------------------------------------------...    4
14                                      3: kano, etc.    4
15                       2018/12/07 (Friday) 11:44:45    4
16                                  Please call rito.    4
17  ----------------------------------------------...    5
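The tagging step is the core of this approach, so here is a small sketch of it in isolation. It assumes a shortened, made-up list of lines rather than the real file, and the column name 'line' is chosen for illustration only.

```python
import numpy as np
import pandas as pd

# shortened stand-in for the file contents (assumed data)
df = pd.DataFrame({'line': ['=====', 'Title: t', '=====',
                            '1: Aki', 'first', '-----',
                            '2: Aki', 'second', '-----']})

# True on separator rows; the running sum then increases by one at each
# separator, so all rows between two separators share the same tag
is_sep = df['line'].str.contains('-----|=====')
df['tag'] = np.where(is_sep, 1, 0).cumsum()

# drop the separators and the header block (tag 1), then collect each message
messages = (df[~is_sep & (df['tag'] >= 2)]
            .groupby('tag')['line']
            .apply(list))
```

Each separator row carries the tag of the message that follows it, which is why the separators have to be filtered out before grouping.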

You can go on to clean up the name column:

obj = df_output['name'].str.strip().str.split(r':\s*')
df_output['name'] = obj.str[-1]
df_output['idx'] = obj.str[0]
df_output = df_output.set_index('idx')
           name                            date  \
idx                                               
1           Aki       2018/12/05 (Wed) 17:33:17   
2           Aki  2018/12/06 (Thursday) 17:14:30   
3    kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
idx                                                     
1                    An approval notice has been sent.  
2    I was notified by Mr. Id, the agent of the oth...  
3                                    Please call rito.
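If the date column should hold real timestamps rather than strings, the weekday in parentheses has to go first. A minimal sketch, assuming every value follows the "yyyy/mm/dd (weekday) hh:mm:ss" pattern seen in the sample:

```python
import pandas as pd

# values copied from the sample output above
dates = pd.Series(['2018/12/05 (Wed) 17:33:17',
                   '2018/12/06 (Thursday) 17:14:30',
                   '2018/12/07 (Friday) 11:44:45'])

# remove the "(weekday)" part together with the space in front of it
cleaned = dates.str.replace(r'\s*\([^)]*\)', '', regex=True)

# parse with an explicit format so unexpected values raise instead of being guessed
parsed = pd.to_datetime(cleaned, format='%Y/%m/%d %H:%M:%S')
```

With the real data this would be applied as df_output['date'] = ... on the column built earlier.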

To add the header fields as extra columns:

cond = (df['tag'] == 1) & (df[0].str.contains(':'))
header_dict = dict(df.loc[cond, 0].str.split(': ', n=1).values)

    # {'Title': 'Whole case',
    #  'Location': 'oyuri',
    #  'From': 'Aki ',
    #  'Date': '2018/11/30 (Friday) 11:55:29'}

for k,v in header_dict.items():
    df_output[k] = v
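The question asks for a CSV file, so the last step is to write the frame out. A short sketch of that step, using a small hand-built stand-in for df_output and header_dict (the values are copied from the sample, the in-memory buffer is only there to keep the example self-contained):

```python
import io

import pandas as pd

# stand-ins for the df_output and header_dict built above (assumed values)
df_output = pd.DataFrame({
    'name': ['Aki', 'kano, etc.'],
    'date': ['2018/12/05 (Wed) 17:33:17', '2018/12/07 (Friday) 11:44:45'],
    'Description': ['An approval notice has been sent.', 'Please call rito.'],
})
header_dict = {'Title': 'Whole case', 'Location': 'oyuri'}

# repeat each header value on every row, stripping stray whitespace
for k, v in header_dict.items():
    df_output[k] = v.strip()

# write to a buffer here; pass a path such as 'converted_text.csv' to write a file
buf = io.StringIO()
df_output.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

Values that contain a comma, such as "kano, etc.", are quoted automatically by to_csv. If idx was set as the index and should appear in the file, drop index=False.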

