Parsing a text file and converting it to CSV

Question

I need to convert a large text file to a CSV file, parsing the text and keeping only the information needed to build a new table. This is what I have:

# -- name1 --
country: Italy
age:30
height: 1,8
weight: 80
# -- name2 --
age:20
height: 1,6
weight: 50
# -- name3 --
City: Berlin
country: Italy
age:33
height: 1,7
weight: 82

And the output I need is:

Name    Age    Height
name1   30      1,8
name2   20      1,6
name3   33      1,7

I think this is doable with Pandas, but I am having some trouble getting started with the code. Can you help me with this? Thanks.

Answer 1

You can also do this without pandas.

import csv

names, ages, heights = [], [], []
with open('res.txt') as file:
    for line in file:
        line = line.strip()
        if line.startswith('#'):
            # '# -- name1 --' -> 'name1'
            names.append(line.strip('#- '))
        elif line.startswith('age'):
            ages.append(line.split(':', 1)[1].strip())
        elif line.startswith('height'):
            heights.append(line.split(':', 1)[1].strip())

# Open with 'w' so that re-running the script does not append a second header.
# Note: zip assumes every block contains both an age and a height line.
with open('output_csv.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Age', 'Height'])
    # csv.writer quotes values such as 1,8 that contain the delimiter
    writer.writerows(zip(names, ages, heights))

Answer 2

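The previous answer requires adding every key by hand and ignores the country and city key-value pairs.

If you want to use pandas, you could do the following: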

import re
import pandas as pd

# Read the whole file into one string (the file name is assumed here)
with open('res.txt') as f:
    string = f.read()

# Create empty list to store dicts for every block
dict_list = []
# Split string into blocks using the comment at the beginning.
# The capture group keeps the name, so re.split returns
# [text before the first header, name1, block1, name2, block2, ...]
parts = re.split(r'#\s*--\s*(.*?)\s*--', string)
# Iterate over all (name, block) pairs
for name, block in zip(parts[1::2], parts[2::2]):
    # Start the dictionary with the name taken from the header line
    d = {'name': name}
    # Use regex to find all key-value pairs, assuming they are split by
    # either ':' or ': '.
    for key, value in re.findall(r'(.*):[ \t]?(.*)', block):
        # Create a key-value pair from the tuple
        d[key.strip().lower()] = value.strip()
    # Append the dictionary to dict_list
    dict_list.append(d)

# Create a data frame from the list of dicts.
df = pd.DataFrame(dict_list)

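The general challenge is that your data comes as key-value pairs rather than in a tabular format that pd.read_csv() can handle. Assuming it is not actually a .yaml file (in which case you could just use pyyaml and create the data frame from the dict), you need to parse the file.

See the inline comments for further explanation.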
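
The answer stops once the data frame exists, so here is a minimal sketch of the remaining step: keeping only the wanted columns and writing them out as CSV. It assumes each parsed block ended up as a dict that also carries the block's name under a `name` key; the `records` list below is hypothetical sample data standing in for the parsed result, not output produced by the code above.

```python
import io
import pandas as pd

# Hypothetical parsed records: one dict per block, keys lower-cased.
records = [
    {"name": "name1", "country": "Italy", "age": "30", "height": "1,8", "weight": "80"},
    {"name": "name2", "age": "20", "height": "1,6", "weight": "50"},
    {"name": "name3", "city": "Berlin", "country": "Italy", "age": "33",
     "height": "1,7", "weight": "82"},
]

# Missing keys (for example "country" for name2) simply become NaN.
df = pd.DataFrame(records)

# Keep only the wanted columns and capitalise their headers.
wanted = df[["name", "age", "height"]].rename(columns=str.capitalize)

# Write to an in-memory buffer here; pass a file path instead to create a file.
buffer = io.StringIO()
wanted.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the heights use a decimal comma, `to_csv` wraps them in quotes so that they are not mistaken for two columns.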

Answer 3

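It is not that hard, but it requires a custom parser.

Rules:

A new name starts with a line beginning with a hash sign ( # )
- the format is # -- name -- : the name is the second field when splitting on --
A field line contains:
- a field name
- a colon ( : )
- optional whitespace
- the field value
- optional whitespace, including the end of line
In the end you want a CSV with the fields Name, Age and Height, with a capitalized header

The code could be: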

import csv

with open('file.txt') as fdin, open('file.csv', 'w', newline='') as fdout:
    wr = csv.DictWriter(fdout, fieldnames=['Name', 'Age', 'Height'],
                        extrasaction='ignore')  # ignore unwanted fields
    row = {}
    wr.writeheader()                            # write the header line
    for line in fdin:
        if line.startswith('#'):                # process a name line
            name = line.split('--')[1].strip()
            if len(row) != 0:                   # if we have a row write it
                wr.writerow(row)
            row = {'Name': name}                # initialize a new row
        elif ':' in line:                       # skip blank or malformed lines
            field, value = line.split(':', 1)   # process a field line
            row[field.strip().capitalize()] = value.strip()
    if len(row) != 0:                           # do not forget the last row
        wr.writerow(row)
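
To try the rules above against the sample from the question without touching the disk, the same logic can be wrapped in a function that works on any file-like objects. This is only a sketch: the function name `convert` and the `SAMPLE` constant are introduced here for illustration, and `lineterminator="\n"` is set so that the in-memory result is easy to compare.

```python
import csv
import io

# Sample input copied from the question.
SAMPLE = """\
# -- name1 --
country: Italy
age:30
height: 1,8
weight: 80
# -- name2 --
age:20
height: 1,6
weight: 50
# -- name3 --
City: Berlin
country: Italy
age:33
height: 1,7
weight: 82
"""


def convert(fdin, fdout):
    """Copy Name, Age and Height from key-value text in fdin to CSV in fdout."""
    wr = csv.DictWriter(fdout, fieldnames=["Name", "Age", "Height"],
                        extrasaction="ignore", lineterminator="\n")
    wr.writeheader()
    row = {}
    for line in fdin:
        if line.startswith("#"):          # name line: flush the previous row
            if row:
                wr.writerow(row)
            row = {"Name": line.split("--")[1].strip()}
        elif ":" in line:                 # field line: "key: value"
            field, value = line.split(":", 1)
            row[field.strip().capitalize()] = value.strip()
    if row:                               # flush the last row
        wr.writerow(row)


out = io.StringIO()
convert(io.StringIO(SAMPLE), out)
result = out.getvalue()
print(result)
```

Passing real file objects (opened as in the code above) instead of the `io.StringIO` buffers gives the same conversion on disk.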

3 answers

Answer 1
2 2021-02-19 12:30:58

Answer 2
1 2021-02-19 12:44:51

Answer 3
1 2021-02-19 13:04:44
