How to read a .txt file into pandas DataFrame, from transposed format
I am trying to read a dataset into a pandas DataFrame. The dataset is currently in a .txt file, and it looks like this:
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
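As you can see, each line starts with the column name, followed by the data. The different rows of the DataFrame are separated by an extra blank line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?
Thanks!
EDIT: Thanks, everyone, for the help. The answer seems to be that yes, you have to do it manually. I have posted the way I did it manually below, but I am sure there are other, more efficient methods.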
data.txt:
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
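Code:
import pandas as pd

# Read the file and drop blank lines.
with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]
print(lst)

# Soln 1: collect the values of each field by its prefix.
# split(':', 1) splits on the first colon only; strip() drops the leading space.
d = dict()
d['name'] = [ele.split(':', 1)[1].strip() for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':', 1)[1].strip() for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':', 1)[1].strip() for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# Or, Soln 2: take the lines three at a time.
# This assumes every record has all three fields, in this order.
data_tuples_lst = [
    (lst[i].split(':', 1)[1].strip(),
     lst[i + 1].split(':', 1)[1].strip(),
     lst[i + 2].split(':', 1)[1].strip())
    for i in range(0, len(lst), 3)
]
df1 = pd.DataFrame(data=data_tuples_lst, columns=['name', 'rating', 'description'])
print(df1)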
Output:
['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
name rating description
0 hello_world 5 basic program
1 python 10 programming language
name rating description
0 hello_world 5 basic program
1 python 10 programming language
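Both variants above rely on every record having all three fields. If the records really are separated by blank lines, as the question describes, one dict per record can be built instead, so that a missing field simply shows up as NaN. This is only a sketch: the helper name `parse_records` and the in-memory sample are assumptions for illustration, not part of the original answer.

```python
import io
import pandas as pd

def parse_records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # partition splits on the first colon only, so values may contain ':'.
        key, _, value = line.partition(':')
        record[key.strip()] = value.strip()
    # Emit the last record when the file does not end with a blank line.
    if record:
        yield record

# Hypothetical sample; the second record has no description on purpose.
sample = io.StringIO(
    "name: hello_world\nrating: 5\ndescription: basic program\n"
    "\n"
    "name: python\nrating: 10\n"
)
df = pd.DataFrame(list(parse_records(sample)))
print(df)
```

With a real file, passing the open file object (`with open('data.txt') as fin: ...`) in place of `sample` should work the same way, since the function only iterates over lines.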
My take. Again, this is part of my learning pandas.
import pandas as pd
from io import StringIO
data = '''\
name: hello_world
rating: 5
description: basic program
name: python
rating: 10
description: programming language
name: foo
rating: 20
description: bar
'''
buffer = StringIO()
buffer.write('field: value\n') # add column headers
buffer.write(data)
buffer.seek(0)
df = pd.read_csv(buffer, delimiter=':')
transposed = df.T
_, col_count = transposed.shape
x = []
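# Each record occupies three consecutive columns of the transposed frame.
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]   # first row holds the field names
    tmp = tmp[1:]           # second row holds the values
    tmp.columns = columns
    x.append(tmp)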
out = pd.concat(x)
print(out.to_string(index=False))
I would really appreciate it if someone could show a better way to do this using pandas.
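One possible pandas-only route, offered as a sketch rather than a definitive answer: split each line into a field and a value, number the records with `groupby(...).cumcount()`, and `pivot` to wide form. It assumes every record contains every field exactly once; the variable names (`parts`, `wide`) and the sample text are illustrative.

```python
import pandas as pd

data = """\
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
"""

# One string per line, with blank separator lines dropped.
lines = pd.Series(data.splitlines())
lines = lines[lines.str.strip() != '']

# Split on the first colon only, giving a 'field' and a 'value' column.
parts = lines.str.split(':', n=1, expand=True)
parts.columns = ['field', 'value']
parts['value'] = parts['value'].str.strip()

# The k-th occurrence of a field belongs to record k.
parts['record'] = parts.groupby('field').cumcount()

# Long to wide: one row per record, one column per field.
wide = parts.pivot(index='record', columns='field', values='value')

# pivot sorts the columns alphabetically; restore first-seen order.
wide = wide[parts['field'].unique()]
print(wide)
```

All values come out as strings; something like `wide['rating'].astype(int)` would be needed afterwards for numeric columns.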
Here is one way to handle a "sideways" dataset. Compared with the earlier answer, this code has been edited to be more efficient.
Sample code:
import pandas as pd
from collections import defaultdict
# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]
# Split the list into lists of key, value pairs.
# split(':', 1) splits on the first colon only, so values may contain ':'.
d = [i.split(':', 1) for i in text if i]
# Create a data container.
data = defaultdict(list)
# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())
# Load the DataFrame.
df = pd.DataFrame(data)
Output:
name rating description
0 hello_world 5 basic program
1 python 10 programming language
I think you have to do it manually. If you check the I/O API in the pandas user guide (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), there is no way to define a custom reader.
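In case anyone comes here later, this is what I did. I simply converted the input file to a csv (except that I used "|" as the delimiter, because the dataset contains strings). Thanks, everyone, for the input, but I forgot to mention that it is a 2GB data file, so I did not want to do anything intensive on my poor overworked laptop.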
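import pandas as pd

# Stream the input line by line so the 2GB file is never held in memory.
with open("in_file.txt", 'r', encoding='cp1252') as ifile, \
     open("out_file.csv", 'w') as ofile:
    for l in ifile:
        if l == '\n':
            # A blank line ends the current record.
            ofile.write('\n')
        else:
            # Keep everything after the first colon, minus the trailing newline.
            ofile.write(l.split(':', 1)[1].rstrip('\n') + '|')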
Then I open the DataFrame using:
import pandas as pd
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False)
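A variation on the same streaming idea, sketched with the standard `csv` module so that the output gets a header row and any value containing "|" is quoted automatically. The function name `transposed_to_csv`, its parameters, and the temporary demo files are assumptions made for illustration; the original answer used `encoding='cp1252'`, which can be passed through the `encoding` argument.

```python
import csv
import os
import tempfile

import pandas as pd

def transposed_to_csv(src_path, dst_path, fields, encoding='utf-8'):
    """Stream blank-line-separated 'key: value' blocks into a '|'-delimited file."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, 'w', newline='', encoding=encoding) as dst:
        # extrasaction='ignore' silently drops keys that are not listed in fields.
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter='|',
                                extrasaction='ignore')
        writer.writeheader()
        row = {}
        for line in src:
            line = line.strip()
            if not line:
                # A blank line ends the current record.
                if row:
                    writer.writerow(row)
                    row = {}
                continue
            key, _, value = line.partition(':')
            row[key.strip()] = value.strip()
        # Write the last record when the file does not end with a blank line.
        if row:
            writer.writerow(row)

# Small demo on a temporary file standing in for the real 2GB input.
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, 'in_file.txt')
dst_file = os.path.join(tmp_dir, 'out_file.csv')
with open(src_file, 'w', encoding='utf-8') as f:
    f.write("name: hello_world\nrating: 5\ndescription: basic program\n\n"
            "name: python\nrating: 10\ndescription: programming language\n")

transposed_to_csv(src_file, dst_file, ['name', 'rating', 'description'])
df = pd.read_csv(dst_file, sep='|')
print(df)
```

Only one record is held in memory at a time, so the approach should scale to large inputs; for the read step, `pd.read_csv` also accepts a `chunksize` argument if the resulting csv is itself too large to load at once.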
After getting the list proposed by @aaj-kaal with this code:
import pandas as pd
with open('data.txt', 'rt') as fin:
    lst = [line[:-1] for line in fin if line[:-1]]
you can build the DataFrame directly like this:
dict_df = pd.DataFrame()
# split(':', 1) splits on the first colon only; strip() drops the leading space.
dict_df['name'] = [ele.split(':', 1)[1].strip() for ele in lst
                   if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':', 1)[1].strip() for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':', 1)[1].strip() for ele in lst
                          if ele.startswith('description:')]
dict_df
Output:
name rating description
0 hello_world 5 basic program
1 python 10 programming language
A general proposal:
import pandas as pd

def from_txt_transposed_to_pandas(file):
    """
    Take a txt file like this:
    "
    name: hello_world
    rating: 5
    description: basic program
    name: python
    rating: 10
    description: programming language
    "
    of any length, and return a DataFrame.
    """
    # header=None keeps the first line as data instead of using it as a column name.
    tabla = pd.read_table(file, header=None)
    elem = list(tabla.iloc[:, 0])
    # dict.fromkeys de-duplicates while keeping first-seen order (set() does not).
    cols = list(dict.fromkeys(x.split(":", 1)[0] for x in elem))
    # Match on "name:" rather than "name" so one field name cannot match
    # another field that merely starts with the same letters.
    return pd.DataFrame({
        col: [x.split(":", 1)[1].strip() for x in elem if x.startswith(col + ":")]
        for col in cols
    })