
How to read a .txt file into pandas DataFrame, from transposed format

I am trying to read a dataset into a pandas dataframe. The dataset is currently in a .txt file, and it looks something like this:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

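As you can see, the column name starts each line, followed by the data. Different rows of the dataframe are separated by an extra blank line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?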

Thanks!

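Edit: Thanks everyone for the help. It seems the answer is: yes, you have to do it manually. I've posted below how I did it manually, though I'm sure there are other, more efficient methods.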

data.txt:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

Code:

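import pandas as pd

# Read the non-blank lines, dropping only the trailing newline
# (line[:-1] would eat the last character of a final line without one).
with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]
print(lst)

# Soln 1: one list per column. split(':', 1) splits on the first colon only,
# so values that contain ':' stay intact. Values keep their leading space;
# add .strip() to remove it.
d = dict()
d['name'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# Or: walk the list three lines at a time. This assumes every record has
# exactly these three fields, in this order.
data_tuples_lst = [(lst[i].split(':', 1)[1], lst[i+1].split(':', 1)[1], lst[i+2].split(':', 1)[1])
                   for i in range(0, len(lst), 3)]
df1 = pd.DataFrame(data=data_tuples_lst, columns=['name', 'rating', 'description'])
print(df1)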

Output:

['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
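
Both variants above rely on every record having the same three fields. If that might not hold, here is a minimal sketch that groups lines by the blank separator instead, so a missing field shows up as NaN rather than shifting the rows. `parse_records` is a made-up helper name, and the sample text is inlined (with one rating deliberately left out) rather than read from data.txt:

```python
import io
import pandas as pd

def parse_records(lines):
    """Yield one dict per blank-line-separated record (hypothetical helper)."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # partition splits on the first colon only, so values may contain ':'.
        key, _, value = line.partition(':')
        record[key.strip()] = value.strip()
    if record:
        # The last record usually has no blank line after it.
        yield record

sample = io.StringIO(
    "name: hello_world\n"
    "rating: 5\n"
    "description: basic program\n"
    "\n"
    "name: python\n"
    "description: programming language\n"  # rating deliberately missing
)
df = pd.DataFrame(list(parse_records(sample)))
print(df)
```

Since the helper only needs an iterable of lines, passing an open file handle instead of the StringIO should work the same way.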

My take. Again, as part of my learning pandas.

import pandas as pd
from io import StringIO

data = '''\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

name: foo
rating: 20
description: bar
'''
buffer = StringIO()
buffer.write('field: value\n')  # add column headers
buffer.write(data)
buffer.seek(0)

df = pd.read_csv(buffer, delimiter=':')

transposed = df.T

_, col_count = transposed.shape

x = []
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]
    tmp = tmp[1:]
    tmp.columns = columns
    x.append(tmp)

out = pd.concat(x)
print(out.to_string(index=False))

I'd really appreciate it if someone could show a better way to do this using pandas.
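
One possible way to stay inside pandas is to read the file in "long" form and pivot it. This is only a sketch under two assumptions that are not stated in the question: values never contain a colon, and every record begins with the same field (`name` here). The `record` column and the variable names are mine:

```python
import io
import pandas as pd

raw = """\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
"""

# Long format: one row per "field: value" line. Blank lines are skipped
# by default, and skipinitialspace drops the space after the colon.
long_df = pd.read_csv(io.StringIO(raw), sep=":", header=None,
                      names=["field", "value"], skipinitialspace=True)

# Assumption: each occurrence of the first field starts a new record.
long_df["record"] = (long_df["field"] == long_df["field"].iloc[0]).cumsum() - 1

wide_df = long_df.pivot(index="record", columns="field", values="value")
# pivot sorts the columns alphabetically; restore first-seen order.
wide_df = wide_df[long_df["field"].unique()]
print(wide_df)
```

All values come back as strings, because they share a single `value` column before the pivot.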

Here is one way to handle a "sideways" dataset. Compared with the previous answer, this code has been edited to be more efficient.

Sample code:

import pandas as pd
from collections import defaultdict

# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]

# Split each non-empty line into a [key, value] pair on the first colon only,
# so values that contain ':' do not break the unpacking below.
d = [i.split(':', 1) for i in text if i]
# Create a data container.
data = defaultdict(list)
# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())

# Load the DataFrame.
df = pd.DataFrame(data)

Output:

          name rating           description
0  hello_world      5         basic program
1       python     10  programming language
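
Note that every value collected this way is a string, including rating. If numbers are needed, a small follow-up sketch could convert that column afterwards. The dict below just mirrors what the loop above builds, inlined so the snippet runs on its own:

```python
import pandas as pd

# Same shape of dict that the loop above produces, inlined for the example.
data = {
    'name': ['hello_world', 'python'],
    'rating': ['5', '10'],
    'description': ['basic program', 'programming language'],
}
df = pd.DataFrame(data)

# errors='coerce' turns anything unparseable into NaN instead of raising.
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
print(df.dtypes)
```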

I think you have to do it manually. If you check the I/O API from pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), there is no way to define a custom reading procedure.

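In case anyone comes here later, this is what I did. I simply converted the input file to a csv (except I used "|" as the delimiter, because the dataset contains strings). Thanks everyone for the input, but I forgot to mention that it's a 2GB data file, so I didn't want to do anything too intensive on my poor overworked laptop.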

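# Stream the input line by line so the 2GB file never has to fit in memory.
# The with block closes both files even if an error occurs.
# The output is written as UTF-8, which is what pd.read_csv expects by default.
with open("in_file.txt", 'r', encoding='cp1252') as ifile, \
     open("out_file.csv", 'w', encoding='utf-8') as ofile:
    for l in ifile:
        if l == '\n':
            # A blank line ends the current record.
            ofile.write('\n')
        else:
            # Keep everything after the first colon. Each value is followed
            # by '|', so every row ends with a trailing delimiter.
            ofile.write(l.split(':', 1)[1].rstrip('\n') + '|')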

Then I opened the dataframe using:

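import pandas as pd

# The converted file has no header row, so pass header=None plus the column
# names (taken from the sample in the question); otherwise the first record
# is consumed as the header. index_col=False copes with the trailing '|'.
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False,
                 header=None, names=['name', 'rating', 'description'])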
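
If even the converted file is too much to hold at once, `read_csv` can hand it back in pieces via `chunksize`. A rough sketch follows. The tiny stand-in file, its contents, the absence of a trailing "|", and the chunk size of 2 are all made up for illustration:

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the converted file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), 'out_file.csv')
with open(path, 'w', encoding='utf-8') as f:
    f.write('hello_world|5|basic program\n')
    f.write('python|10|programming language\n')
    f.write('foo|20|bar\n')

total_rows = 0
rating_sum = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file.
reader = pd.read_csv(path, sep='|', header=None,
                     names=['name', 'rating', 'description'], chunksize=2)
for chunk in reader:
    total_rows += len(chunk)
    rating_sum += chunk['rating'].sum()

print(total_rows, rating_sum)
```

For a real 2GB file a much larger chunk size would make sense; 2 is only there so the loop runs more than once on three rows.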

After getting the list with this code, as proposed by @aaj-kaal:

import pandas as pd
with open('data.txt', 'rt') as fin:
    lst = [line[:-1] for line in fin if line[:-1]]

you can get the dataframe directly like this:

dict_df = pd.DataFrame()
dict_df['name'] = [ele.split(':')[1] for ele in lst if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':')[1] for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':')[1] for ele in lst
                          if ele.startswith('description:')]
dict_df

output

           name rating            description
0   hello_world      5          basic program
1        python     10   programming language

A general proposal:

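import pandas as pd

def from_txt_transposed_to_pandas(file):
    """
    Take a txt file like this:

    "
    name: hello_world
    rating: 5
    description: basic program

    name: python
    rating: 10
    description: programming language
    "

    -of any length- and return a dataframe.
    """
    # header=None keeps the first line as data instead of using it as the
    # column name; blank lines are skipped by default.
    tabla = pd.read_table(file, header=None)
    elem = list(tabla.iloc[:, 0])
    # dict.fromkeys removes duplicates while keeping first-seen order
    # (a set would return the columns in arbitrary order).
    cols = list(dict.fromkeys(x.split(":", 1)[0] for x in elem))
    tabla_df = pd.DataFrame(columns=cols)
    for col in cols:
        # Match on "col:" so one field name cannot match a longer one,
        # and split on the first colon only.
        tabla_df[col] = [x.split(":", 1)[1].strip() for x in elem
                         if x.startswith(col + ":")]
    return tabla_df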
