How to read a .txt file into a pandas DataFrame from a transposed format

Question

I'm trying to read a dataset into a pandas dataframe. The dataset is currently in a .txt file, and it looks something like this:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

As you can see, each line starts with the column name, followed by the data. Different rows of the dataframe are separated by a blank line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?

Thanks!

Edit: Thanks everyone for the help. It seems that the answer is, yes, you have to do it manually. I've posted the way I did it manually below, though I'm sure there are other, more efficient methods.

Answer 1

data.txt:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

Code:

import pandas as pd
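with open('data.txt', 'rt') as fin:
    # Drop the trailing newline (if any) and skip blank lines.
    # rstrip('\n') is safe even when the last line has no newline.
    lst = [line.rstrip('\n') for line in fin if line.rstrip('\n')]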
print(lst)

# Solution 1: build one list per column
d = dict()
d['name'] = [ele.split(':')[1] for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':')[1] for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':')[1] for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# OR, Solution 2: group every three consecutive lines into one row

data_tuples_lst = [(lst[i].split(':')[1], lst[i+1].split(':')[1], lst[i+2].split(':')[1]) for i in range(0, len(lst), 3)]
df1 = pd.DataFrame(data=data_tuples_lst, columns=['name', 'rating', 'description'])
print(df1)

Output:

['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language

Answer 2

My take, again as part of learning pandas.

import pandas as pd
from io import StringIO

data = '''\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

name: foo
rating: 20
description: bar
'''
buffer = StringIO()
buffer.write('field: value\n')  # add column headers
buffer.write(data)
buffer.seek(0)

df = pd.read_csv(buffer, delimiter=':')

transposed = df.T

_, col_count = transposed.shape

x = []
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]
    tmp = tmp[1:]
    tmp.columns = columns
    x.append(tmp)

out = pd.concat(x)
print(out.to_string(index=False))

I'd really appreciate it if someone experienced with pandas could show a better way.
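One possible pandas-only alternative to the loop above is to read the file as a long (field, value) table and pivot it. This is a minimal sketch, not a definitive method: it assumes values never contain ':' and that every record starts with the same field (here `name`). The names `long_df`, `wide_df`, and `record` are made up for illustration.

```python
import pandas as pd
from io import StringIO

data = """\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
"""

# Read every non-blank line as a two-column (field, value) table.
# Assumption: values never contain ':', otherwise read_csv sees extra fields.
long_df = pd.read_csv(
    StringIO(data),
    sep=":",
    header=None,
    names=["field", "value"],
    skipinitialspace=True,  # drop the space that follows each ':'
)

# Number the records: a new one starts whenever the first field name reappears.
first_field = long_df["field"].iloc[0]
long_df["record"] = (long_df["field"] == first_field).cumsum() - 1

# Pivot the long table so each field becomes a column and each record a row.
wide_df = long_df.pivot(index="record", columns="field", values="value")

# pivot sorts the columns alphabetically; restore the order seen in the file.
wide_df = wide_df[list(long_df["field"].unique())]
print(wide_df)
```

Because the `value` column mixes text and numbers, every value comes back as a string; a column such as `rating` can be converted afterwards with `pd.to_numeric` if needed.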

Answer 3

Here is one way to approach the 'sideways' data set. This code has been edited for efficiency compared with the previous answer.

Sample code:

import pandas as pd
from collections import defaultdict

# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]

# Split each line into a [key, value] pair on the first ':' only,
# so values that contain colons do not break the unpacking below.
d = [i.split(':', 1) for i in text if i]
# Create a data container.
data = defaultdict(list)
# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())

# Load the DataFrame.
df = pd.DataFrame(data)

Output:

          name rating           description
0  hello_world      5         basic program
1       python     10  programming language
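The dict-of-lists approach above assumes every record contains every field; if one record lacks a field, the lists end up with different lengths. A hedged variation is to build one dict per blank-line-separated block instead, so a missing field simply becomes NaN. The helper name `iter_records` and the sample data are hypothetical, chosen only for this sketch.

```python
import pandas as pd
from io import StringIO


def iter_records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # partition splits on the first ':' only, so values may contain colons.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:
        # The input may not end with a blank line.
        yield record


# Illustrative input: the second record has no rating and a value with a colon.
sample = StringIO(
    "name: hello_world\n"
    "rating: 5\n"
    "description: basic program\n"
    "\n"
    "name: python\n"
    "description: url: python.org\n"
)

df = pd.DataFrame(list(iter_records(sample)))
print(df)
```

The same function accepts an open file object in place of the `StringIO` buffer, since it only iterates over lines.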

Answer 4

I think you have to do it manually. If you check the I/O API from pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), there is no way to define a custom reading procedure.

Answer 5

In case anyone comes here later, this is what I did. I simply converted the input file to a csv (except I used '|' as the delimiter because the dataset contained strings). Thanks everyone for their input, but I neglected to mention that it was a 2GB data file, so I didn't want to do anything too intensive for my poor overworked laptop.

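# Stream the input line by line so the whole file never sits in memory.
with open("in_file.txt", 'r', encoding='cp1252') as ifile, \
        open("out_file.csv", 'w') as ofile:
    for l in ifile:
        if l == '\n':
            # A blank line ends the current record.
            ofile.write('\n')
        else:
            # Keep everything after the first ':' and drop the trailing newline.
            ofile.write(l.split(':', 1)[1].rstrip('\n') + '|')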

Then I opened the dataframe using:

import pandas as pd
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False)
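For a file this large, a possible variation on the conversion above is to let the standard `csv` module write a header row and handle quoting, then read the result back in chunks. This is only a sketch under stated assumptions: the field names are known in advance, the demo uses UTF-8 rather than cp1252, and the function name `transposed_to_csv` and the temporary file paths are invented for illustration.

```python
import csv
import os
import tempfile

import pandas as pd


def transposed_to_csv(in_path, out_path, fields, encoding="utf-8"):
    """Stream 'key: value' blocks from in_path into a '|'-delimited CSV."""
    with open(in_path, encoding=encoding) as src, \
            open(out_path, "w", newline="", encoding=encoding) as dst:
        # restval fills missing fields; extrasaction skips unexpected keys.
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="|",
                                restval="", extrasaction="ignore")
        writer.writeheader()
        record = {}
        for line in src:
            line = line.strip()
            if line:
                key, _, value = line.partition(":")
                record[key.strip()] = value.strip()
            elif record:
                # Blank line: the current record is complete.
                writer.writerow(record)
                record = {}
        if record:
            # Last block when the file does not end with a blank line.
            writer.writerow(record)


# Small demonstration with temporary files standing in for the real data.
tmp_dir = tempfile.mkdtemp()
in_file = os.path.join(tmp_dir, "in_file.txt")
out_file = os.path.join(tmp_dir, "out_file.csv")
with open(in_file, "w", encoding="utf-8") as f:
    f.write("name: hello_world\nrating: 5\ndescription: basic program\n\n"
            "name: python\nrating: 10\ndescription: programming language\n")

transposed_to_csv(in_file, out_file, ["name", "rating", "description"])

# With chunksize, read_csv returns an iterator of DataFrames, not one frame.
chunks = pd.read_csv(out_file, sep="|", chunksize=1)
df = pd.concat(chunks, ignore_index=True)
print(df)
```

The chunk size of 1 is only there to exercise the iterator on a two-row sample; a real 2GB file would use a much larger value, and each chunk could be filtered or aggregated before concatenation.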

Answer 6

After building the list proposed by @aaj-kaal with this code:

import pandas as pd
with open('data.txt', 'rt') as fin:
    # rstrip('\n') is safe even when the last line has no trailing newline.
    lst = [line.rstrip('\n') for line in fin if line.rstrip('\n')]

you can obtain the dataframe directly with:

dict_df=pd.DataFrame()
dict_df['name'] = [ele.split(':')[1] for ele in lst if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':')[1] for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':')[1] for ele in lst
                          if ele.startswith('description:')]
dict_df

Output:

           name rating            description
0   hello_world      5          basic program
1        python     10   programming language

Answer 7

A generic proposal:

import pandas as pd

def from_txt_transposed_to_pandas(file):
    """
    Take a txt file like this:

    "
    name: hello_world
    rating: 5
    description: basic program

    name: python
    rating: 10
    description: programming language
    "

    -of any length- and return a dataframe.
    """
    # header=None keeps the first line as data; blank lines are skipped by default.
    tabla = pd.read_table(file, header=None)
    elem = list(tabla.iloc[:, 0])
    # Collect the field names in order of first appearance.
    cols = list(dict.fromkeys(x.split(":")[0] for x in elem))
    tabla_df = pd.DataFrame(columns=cols)
    for col in cols:
        # Split on the first ':' only and strip the surrounding whitespace.
        tabla_df[col] = [x.split(":", 1)[1].strip() for x in elem
                         if x.startswith(col + ":")]
    return tabla_df

7 answers

Answer 1
1 2020-11-27 17:09:05

Answer 2
1 2020-11-27 17:25:04

Answer 3
1 2020-11-27 19:46:27

Answer 4
0 Accepted 2020-11-27 16:33:49

Answer 5
0 2020-11-27 20:21:52

Answer 6
0 2022-07-25 13:02:52

Answer 7
0 2022-07-25 14:12:30
