Convert a text file to a CSV with columns

Question

I want to convert a text file to a CSV file with columns such as name, date, and description. I'm new to Python and haven't found a proper way to do this. Can someone guide me? Below is the sample text file.

================================================== ====
Title: Whole case
Location: oyuri
From: Aki 
Date: 2018/11/30 (Friday) 11:55:29
================================================== =====
1: Aki 
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
-------------------------------------------------- ------------------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

-------------------------------------------------- ------------------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
-------------------------------------------------- ------------------

Answer 1

Below is a very simple approach to your task. The general idea is to:

Read in your text file using open()
Split the text into a list of lines
Isolate the information in each element of the list
Export the information to a CSV using pandas

I would recommend using Jupyter Notebooks to step through the code and see what each stage does.

import pandas as pd

# open file and extract text
text_path = 'text.txt'
with open(text_path) as f:
    text = f.read()

# split text into a list
lines = text.split('\n')

# remove the heading (the six lines up to and including the second ===== divider)
len_heading = 6
lines = lines[len_heading:]

# separate information using divider
divider = '-----'
data = []
start = 0
for i, line in enumerate(lines):
    
    # add elements to data if divider found
    if line.startswith(divider):
        data.append(lines[start:i])
        start = i+1

# extract name, date and description from data
names, dates, description = [], [], []
for info in data:
    
    # this is a very simplistic approach, please add checks
    # to make sure you are getting the right data
    name = info[0].split(':', 1)[1].strip()  # '1: Aki ' -> 'Aki'
    date = info[1][:10]                      # '2018/12/05 (Wed) ...' -> '2018/12/05'
    desc = info[2]
    
    names.append(name)
    dates.append(date)
    description.append(desc)

# create pandas dataframe
df = pd.DataFrame({'name': names, 'date': dates, 'description': description})

# export dataframe to csv
df.to_csv('converted_text.csv', index=False)

You should get a CSV file with the columns name, date and description.
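The code comments above ask for checks, so here is one possible way to add them: a minimal sketch that uses only the standard library (re and csv) instead of pandas. parse_entries and to_csv_text are hypothetical helper names, SAMPLE is a shortened copy of the question's file, and the patterns assume every entry is an "N: name" line, then a line starting with a YYYY/MM/DD date, then the description.

```python
import csv
import io
import re

# Shortened copy of the question's file (dividers trimmed for readability).
SAMPLE = """\
======
Title: Whole case
Location: oyuri
From: Aki
Date: 2018/11/30 (Friday) 11:55:29
======
1: Aki
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
------
"""

DIVIDER = re.compile(r'^(-{5,}|={5,})')       # line starting with 5+ dashes or equals signs
NAME = re.compile(r'^(\d+):\s*(.*)$')         # e.g. "1: Aki"
DATE = re.compile(r'^(\d{4}/\d{2}/\d{2})\b')  # leading "2018/12/05"


def parse_entries(text):
    """Return a list of {'name', 'date', 'description'} dicts."""
    blocks, current = [], []
    for line in text.splitlines():
        if DIVIDER.match(line):
            blocks.append(current)   # a divider closes the current block
            current = []
        elif line.strip():           # ignore blank lines
            current.append(line.strip())

    entries = []
    for block in blocks:
        # Skip the header block and anything not shaped like an entry.
        if len(block) < 3:
            continue
        name, date = NAME.match(block[0]), DATE.match(block[1])
        if not (name and date):
            continue
        entries.append({
            'name': name.group(2),
            'date': date.group(1),
            'description': ' '.join(block[2:]),
        })
    return entries


def to_csv_text(entries):
    """Serialise the entries to CSV text; fields containing commas get quoted."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['name', 'date', 'description'])
    writer.writeheader()
    writer.writerows(entries)
    return buf.getvalue()


rows = parse_entries(SAMPLE)
csv_text = to_csv_text(rows)
print(csv_text)
```

The header block is skipped because its first line does not match the "N: name" pattern, so no fixed heading length is needed. To write a real file, open it with newline='' and pass the file object to csv.DictWriter in place of the StringIO buffer.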

Answer 2

Find the rows that contain a message separator line, e.g. '-----' or '======'.
Then use np.where(cond, 1, 0).cumsum() to tag each separate message.
Keep only the lines without '-----' or '======'.
Group by tag and join the lines with '\n', then use str.split to expand them into columns.
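To see why the cumsum trick works, here is a small sketch on made-up data (the toy frame and its values are illustrative only, not from the question's file). Each separator row bumps the running sum by one, so every row between two separators ends up with the same tag.

```python
import numpy as np
import pandas as pd

# Toy single-column frame: 'a' and 'b' form one message, 'c' the next.
toy = pd.DataFrame({0: ['=====', 'header', '=====', 'a', 'b', '-----', 'c', '-----']})

# True on separator rows; the running sum increases by one at each separator.
is_sep = toy[0].str.contains('-----|=====')
toy['tag'] = np.where(is_sep, 1, 0).cumsum()

# Drop the separators and the header block (tag 1), then collect each message.
messages = toy[~is_sep & (toy['tag'] >= 2)].groupby('tag')[0].apply(list)
print(messages)
```

Here 'a' and 'b' both get tag 2 and 'c' gets tag 3, which is what lets groupby collect the lines of each message.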

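import numpy as np
import pandas as pd

# read the file into a single column, one row per non-blank line
# (pd.read_csv(file, sep='\n', header=None) does the same on older pandas,
# but recent versions raise a ValueError for '\n' as a separator)
file = 'text.txt'
with open(file) as f:
    df = pd.DataFrame([line.rstrip('\n') for line in f if line.strip()])

# locate the rows that contain ----- or ======
cond = df[0].str.contains('-----|======')
df['tag'] = np.where(cond, 1, 0).cumsum()

# keep the message lines: not a separator, and past the header block
cond2 = df['tag'] >= 2
dfn = df[~cond & cond2].copy()

# output: one row per message, split into at most three columns
df_output = (dfn.groupby('tag')[0]
             .apply('\n'.join)
             .str.split('\n', n=2, expand=True))
df_output.columns = ['name', 'date', 'Description']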

Output:

              name                            date  \
tag                                                  
2.0        1: Aki        2018/12/05 (Wed) 17:33:17   
3.0         2: Aki  2018/12/06 (Thursday) 17:14:30   
4.0  3: kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
tag                                                     
2.0                  An approval notice has been sent.  
3.0  I was notified by Mr. Id, the agent of the oth...  
4.0                                  Please call rito.

df:

                                                    0  tag
0   ==============================================...    1
1                                   Title: Whole case    1
2                                     Location: oyuri    1
3                                          From: Aki     1
4                  Date: 2018/11/30 (Friday) 11:55:29    1
5   ==============================================...    2
6                                             1: Aki     2
7                           2018/12/05 (Wed) 17:33:17    2
8                   An approval notice has been sent.    2
9   ----------------------------------------------...    3
10                                             2: Aki    3
11                     2018/12/06 (Thursday) 17:14:30    3
12  I was notified by Mr. Id, the agent of the oth...    3
13  ----------------------------------------------...    4
14                                      3: kano, etc.    4
15                       2018/12/07 (Friday) 11:44:45    4
16                                  Please call rito.    4
17  ----------------------------------------------...    5

You can then go on to clean up the name column:

obj = df_output['name'].str.strip().str.split(r':\s*')
df_output['name'] = obj.str[-1]
df_output['idx'] = obj.str[0]
df_output = df_output.set_index('idx')

           name                            date  \
idx                                               
1           Aki       2018/12/05 (Wed) 17:33:17   
2           Aki  2018/12/06 (Thursday) 17:14:30   
3    kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
idx                                                     
1                    An approval notice has been sent.  
2    I was notified by Mr. Id, the agent of the oth...  
3                                    Please call rito.

To add the header fields as extra columns:

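# header rows are the 'key: value' lines in the first block (tag 1)
cond = (df['tag'] == 1) & (df[0].str.contains(':'))
header_dict = dict(df.loc[cond, 0].str.split(': ', n=1).values)

# header_dict is now:
# {'Title': 'Whole case',
#  'Location': 'oyuri',
#  'From': 'Aki ',
#  'Date': '2018/11/30 (Friday) 11:55:29'}

# assigning a scalar repeats the header value on every row
for k, v in header_dict.items():
    df_output[k] = v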
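This answer stops before the CSV is actually written, so here is a minimal sketch of that last step. The df_output built below is only a stand-in with values copied from the sample. Trimming the date to YYYY/MM/DD is an optional extra I am assuming you might want, and the buffer takes the place of a real file path.

```python
import io
import pandas as pd

# Stand-in for df_output with the same columns as above (values from the sample).
df_output = pd.DataFrame(
    {'name': ['Aki', 'kano, etc.'],
     'date': ['2018/12/05 (Wed) 17:33:17', '2018/12/07 (Friday) 11:44:45'],
     'Description': ['An approval notice has been sent.', 'Please call rito.']},
    index=pd.Index(['1', '3'], name='idx'),
)

# Optional: keep only the calendar date, dropping the weekday and time.
df_output['date'] = df_output['date'].str.extract(r'^(\d{4}/\d{2}/\d{2})', expand=False)

# Write to an in-memory buffer here; pass a path such as
# 'converted_text.csv' instead to create a file on disk.
buf = io.StringIO()
df_output.to_csv(buf)  # the idx index is written as the first column
csv_text = buf.getvalue()
print(csv_text)
```

Pass index=False to to_csv if you do not want the idx column in the file.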
