How to read a .txt file into pandas DataFrame, from transposed format

I'm trying to read a dataset into a pandas DataFrame. The dataset is currently in a .txt file, and it looks something like this:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

As you can see, each line starts with the column name, followed by the data. Different rows of the DataFrame are separated by a blank line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?

Thanks!

Edit: Thanks everyone for the help. It seems that the answer is, yes, you have to do it manually. I've posted the way I did it manually below, though I'm sure there are other, more efficient methods.

data.txt:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

Code:

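import pandas as pd

# Read the file, dropping the trailing newline and skipping blank lines.
with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]
print(lst)

# Soln 1: build one list per column by filtering on the key prefix.
# split(':', 1) splits at the first colon only, so values may contain colons.
# Note that each value keeps the space that follows the colon.
d = dict()
d['name'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# OR

# Soln 2: take the lines three at a time (assumes every record has exactly
# name, rating and description, in that order).
data_tuples_lst = [
    (lst[i].split(':', 1)[1], lst[i+1].split(':', 1)[1], lst[i+2].split(':', 1)[1])
    for i in range(0, len(lst), 3)
]
df1 = pd.DataFrame(data=data_tuples_lst, columns=['name', 'rating', 'description'])
print(df1)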

Output:

['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
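
One thing worth noting about the output above: taking the part after the colon keeps the space that follows it, and `rating` comes out as text rather than numbers. A minimal sketch of a cleanup step follows, assuming the column names from the sample data; the literal values are stand-ins for the parsed result, and `cleaned` is just an illustrative name.

```python
import pandas as pd

# Stand-in for the DataFrame built above; each value carries the leading
# space left behind after splitting on the colon.
df = pd.DataFrame({
    "name": [" hello_world", " python"],
    "rating": [" 5", " 10"],
    "description": [" basic program", " programming language"],
})

# Strip surrounding whitespace from every (string) column.
cleaned = df.apply(lambda col: col.str.strip())

# Convert the rating column from text to integers.
cleaned["rating"] = pd.to_numeric(cleaned["rating"])
```

After this step `cleaned["rating"]` can be used in numeric operations such as sorting or averaging.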

My take. Again, this is part of my learning pandas.

import pandas as pd
from io import StringIO

data = '''\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

name: foo
rating: 20
description: bar
'''
buffer = StringIO()
buffer.write('field: value\n')  # add column headers
buffer.write(data)
buffer.seek(0)

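# Blank lines are skipped by default, so df has one row per "key: value" line.
df = pd.read_csv(buffer, delimiter=':')

# After transposing, each original line becomes a column.
transposed = df.T

_, col_count = transposed.shape

x = []
# Every group of 3 consecutive columns is one record (name, rating, description).
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]  # first row holds the field names
    tmp = tmp[1:]          # second row holds the values
    tmp.columns = columns
    x.append(tmp)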

out = pd.concat(x)
print(out.to_string(index=False))

I'd really appreciate it if someone experienced with pandas could show a better way.
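
One possible alternative that stays within pandas, offered as a sketch rather than a definitive answer: read the file as two columns, number the repeats of each field with `groupby(...).cumcount()`, and pivot. It assumes every record contains every field exactly once and that values contain no further colons. The names `field`, `value`, `record`, and `long_df` are made up for illustration.

```python
import pandas as pd
from io import StringIO

data = """\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
"""

# Two columns per line: the field name and its value. Blank lines are
# skipped by default; skipinitialspace drops the space after the colon.
long_df = pd.read_csv(StringIO(data), sep=":", names=["field", "value"],
                      skipinitialspace=True)

# The n-th occurrence of each field belongs to the n-th record.
long_df["record"] = long_df.groupby("field").cumcount()

# One row per record, one column per field.
out = long_df.pivot(index="record", columns="field", values="value")

# pivot sorts the columns; restore the order of first appearance.
out = out[list(long_df["field"].unique())]
print(out.to_string(index=False))
```

This avoids the hard-coded group size of 3, at the cost of relying on each field appearing once per record.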

Here is one way to approach the 'sideways' data set. This code has been edited for efficiency compared with the previous answer.

Sample code:

import pandas as pd
from collections import defaultdict

# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]

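# Split each non-empty line into a [key, value] pair at the first colon only,
# so that a value containing ':' does not break the unpacking below.
d = [i.split(':', 1) for i in text if i]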
# Create a data container.
data = defaultdict(list)
# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())

# Load the DataFrame.
df = pd.DataFrame(data)

Output:

          name rating           description
0  hello_world      5         basic program
1       python     10  programming language
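
The column-wise dict above relies on every record having every field; if one record lacked, say, `rating`, the per-column lists would end up with different lengths and no longer line up row by row. Below is a sketch of a record-wise variant, assuming blank lines separate records as in the sample. `parse_records` is a hypothetical helper name, and the second record in `sample` has been shortened on purpose to show the gap.

```python
import pandas as pd

def parse_records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: the current record (if any) is complete.
            if record:
                yield record
                record = {}
            continue
        # partition splits at the first colon only.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:
        # Last record when the input does not end with a blank line.
        yield record

sample = """\
name: hello_world
rating: 5
description: basic program

name: python
description: programming language
""".splitlines()

# A list of dicts lets pandas fill missing fields with NaN.
df = pd.DataFrame(list(parse_records(sample)))
```

Because `parse_records` is a generator, it can also be fed an open file object directly instead of a list of lines.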

I think you have to do it manually. If you check the pandas I/O API (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), there is no way to define a custom reading procedure.

In case anyone comes here later, this is what I did. I simply converted the input file to a CSV (except I used '|' as the delimiter because the dataset contained strings). Thanks everyone for their input, but I neglected to mention that it was a 2GB data file, so I didn't want to do anything too intensive for my poor overworked laptop.

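# Stream the input line by line so the 2GB file never has to fit in memory.
# The with-statement closes both files even if an error occurs.
with open("in_file.txt", 'r', encoding='cp1252') as ifile, \
        open("out_file.csv", 'w') as ofile:
    for l in ifile:
        if l == '\n':
            # A blank line ends the current record.
            ofile.write('\n')
        else:
            # Keep everything after the first colon; drop only the newline.
            ofile.write(l.split(':', 1)[1].rstrip('\n') + '|')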

Then I loaded the DataFrame using:

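import pandas as pd
# No header row was written, so name the columns here (names assumed from the sample).
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False,
                 header=None, names=['name', 'rating', 'description'])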
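
For a file this large, a variant of the same streaming idea using the standard library's `csv` module is sketched below. It writes a header row and lets the writer handle quoting, so the delimiter can stay a comma even if values contain commas. The `FIELDS` list is an assumption taken from the sample in the question, and `transposed_to_csv` is a hypothetical function name.

```python
import csv

FIELDS = ["name", "rating", "description"]  # assumed from the sample data

def transposed_to_csv(in_path, out_path, encoding="cp1252"):
    """Stream 'key: value' blocks from in_path into a CSV file at out_path."""
    with open(in_path, "r", encoding=encoding) as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        # restval fills missing fields; extrasaction="ignore" drops keys
        # that are not listed in FIELDS instead of raising an error.
        writer = csv.DictWriter(dst, fieldnames=FIELDS, restval="",
                                extrasaction="ignore")
        writer.writeheader()
        record = {}
        for line in src:
            line = line.strip()
            if not line:
                # Blank line: flush the finished record.
                if record:
                    writer.writerow(record)
                    record = {}
                continue
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        if record:
            # Last record when the file does not end with a blank line.
            writer.writerow(record)
```

Only one record is held in memory at a time. The result can then be loaded with a plain `pd.read_csv(out_path)`, or in pieces via its `chunksize` parameter.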

After building the list proposed by @aaj-kaal with this code:

import pandas as pd
with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]

you can obtain the DataFrame directly with:

dict_df = pd.DataFrame()
dict_df['name'] = [ele.split(':')[1] for ele in lst if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':')[1] for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':')[1] for ele in lst
                          if ele.startswith('description:')]
dict_df

Output:

          name rating           description
0  hello_world      5         basic program
1       python     10  programming language

A generic proposal:

import pandas as pd
def from_txt_transposed_to_pandas(file):
    """
    take a txt file like this:

    "
    name: hello_world
    rating: 5
    description: basic program

    name: python
    rating: 10
    description: programming language 
    "

    -of any length- and returns a dataframe.
    """
    # header=None keeps the first line as data instead of using it as a header.
    tabla = pd.read_table(file, header=None)
    elem = list(tabla.iloc[:, 0])
    # dict.fromkeys keeps the keys unique and in order of first appearance.
    cols = list(dict.fromkeys(x.split(":", 1)[0] for x in elem))
    tabla_df = pd.DataFrame(columns=cols)
    for col in cols:
        tabla_df[col] = [x.split(":", 1)[1] for x in elem
                         if x.startswith(col + ":")]
    return tabla_df
