Create pandas df from raw text file

I have a text file that I would like to format into a pandas DataFrame. It is read as a string; print(text) gives:

product: 1
description: product 1 desc
rating: 7.8
review: product 1 review

product: 2
description: product 2 desc
rating: 4.5
review: product 2 review

product: 3
description: product 3 desc
rating: 8.5
review: product 3 review

I figured I would split the records with text.split('\n\n') to group them into lists. I assume that turning each group into a dict and then loading the dicts into a pandas df would be a good route, but I am having trouble doing so. Is this the best route, and could someone please help me get this into a pandas df?

You can use read_csv, create a group id by comparing the first column with the string product and taking the cumulative sum, and then pivot:

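import pandas as pd

df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')
# each 'product' row starts a new record; cumsum numbers the records 1, 2, 3, ...
# pivot takes keyword arguments only in pandas 2.0 and later
df = df.assign(g=df[0].eq('product').cumsum()).pivot(index='g', columns=0, values=1)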
print (df)
0      description product rating             review
g                                                   
1   product 1 desc       1    7.8   product 1 review
2   product 2 desc       2    4.5   product 2 review
3   product 3 desc       3    8.5   product 3 review
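To make the grouping step easier to follow, here is a minimal sketch that exposes the intermediate group ids before pivoting. It assumes the same "key: value" layout as in the question; the names raw, long_df, group_ids and wide_df are only illustrative, and io.StringIO stands in for file.txt so the snippet runs on its own.

```python
import io
import pandas as pd

# Sample data in the layout from the question; io.StringIO replaces 'file.txt'.
raw = """product: 1
description: product 1 desc
rating: 7.8
review: product 1 review

product: 2
description: product 2 desc
rating: 4.5
review: product 2 review
"""

# Blank lines are skipped by default, so each remaining row is one key/value pair.
long_df = pd.read_csv(io.StringIO(raw), header=None, sep=': ', engine='python')

# True on every 'product' row; the cumulative sum turns that into a record id.
long_df['g'] = long_df[0].eq('product').cumsum()
group_ids = long_df['g'].tolist()

# One row per record id, one column per key.
wide_df = long_df.pivot(index='g', columns=0, values=1)
print(group_ids)
print(wide_df)
```

Note that pivot sorts the resulting columns alphabetically, which is why description comes before product in the output above.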

Or create a list of dictionaries:

#https://stackoverflow.com/a/18970794/2901002
data = []
current = {}
with open('file.txt') as f:
    for line in f:
        pair = line.split(':', 1)
        if len(pair) == 2:
            if pair[0] == 'product' and current:
                # start of a new block
                data.append(current)
                current = {}
            current[pair[0]] = pair[1].strip()
    if current:
        data.append(current)
        
df = pd.DataFrame(data)
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review
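If the text is already in memory as a string, the text.split('\n\n') idea from the question also works. The following is a rough sketch under the assumption that records are separated by exactly one blank line and every line has the form "key: value"; the variable names are illustrative.

```python
import pandas as pd

# Sample string in the layout from the question.
text = """product: 1
description: product 1 desc
rating: 7.8
review: product 1 review

product: 2
description: product 2 desc
rating: 4.5
review: product 2 review
"""

records = []
for block in text.strip().split('\n\n'):
    record = {}
    for line in block.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(':')
        if sep:  # ignore lines that have no colon at all
            record[key.strip()] = value.strip()
    if record:
        records.append(record)

df = pd.DataFrame(records)
print(df)
```

Unlike the pivot approach, this keeps the columns in the order in which the keys appear in the text.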

Or reshape every 4 values into a 2D NumPy array and pass it to the DataFrame constructor (this assumes every record has exactly 4 lines, in the same order):

df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')

df = pd.DataFrame(df[1].to_numpy().reshape(-1, 4), columns=df[0].iloc[:4].tolist())
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review
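All three approaches leave every value as a string, because the value column mixes numbers and free text. If numeric columns are needed, a possible follow-up step is sketched below; the frame is rebuilt by hand here only so the snippet is self-contained, and the choice of which columns to convert is an assumption based on the sample data.

```python
import pandas as pd

# Stand-in for the result of any of the approaches above (all values are strings).
df = pd.DataFrame({
    'product': ['1', '2', '3'],
    'description': ['product 1 desc', 'product 2 desc', 'product 3 desc'],
    'rating': ['7.8', '4.5', '8.5'],
    'review': ['product 1 review', 'product 2 review', 'product 3 review'],
})

# Convert the columns that hold numbers; text columns are left untouched.
df['product'] = pd.to_numeric(df['product'])
df['rating'] = pd.to_numeric(df['rating'])

mean_rating = df['rating'].mean()
print(df.dtypes)
print(mean_rating)
```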
