Create pandas df from raw text file
I have a text file that I would like to format into a pandas dataframe. It is read in as a string of the following form:
print(text)=
product: 1
description: product 1 desc
rating: 7.8
review: product 1 review
product: 2
description: product 2 desc
rating: 4.5
review: product 2 review
product: 3
description: product 3 desc
rating: 8.5
review: product 3 review
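I figured I would split the blocks by using text.split('\n\n') to group them into lists. I assume that iterating over each block to build a dictionary and then loading the result into a pandas df would be a good route, but I am having trouble doing so. Is this the best route, and could someone help me get this into a pandas df?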
You can use read_csv, create groups by comparing the first column with the string product, and then pivot:
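import pandas as pd
df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')
# each 'product' row starts a new group; pandas 2.0+ requires keyword arguments for pivot
df = df.assign(g = df[0].eq('product').cumsum()).pivot(index='g', columns=0, values=1)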
print (df)
0      description product rating            review
g
1   product 1 desc       1    7.8  product 1 review
2   product 2 desc       2    4.5  product 2 review
3   product 3 desc       3    8.5  product 3 review
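Note that every value in the pivoted frame is still a string, because the second column of the raw file mixes numbers and text. The following is a minimal, self-contained sketch of the same approach that reads from an in-memory string instead of file.txt and converts two columns afterwards. The names raw and wide, the shortened two-product sample, and the choice to convert product and rating are assumptions for illustration only.

```python
import io
import pandas as pd

# Shortened sample in the same "key: value" layout as the question (assumed data).
raw = """product: 1
description: product 1 desc
rating: 7.8
review: product 1 review
product: 2
description: product 2 desc
rating: 4.5
review: product 2 review
"""

# Two columns: 0 holds the keys, 1 holds the values (all strings).
df = pd.read_csv(io.StringIO(raw), header=None, sep=": ", engine="python")

# Every 'product' row starts a new group, so the running count is a group id.
df["g"] = df[0].eq("product").cumsum()

# One row per group, one column per key.
wide = df.pivot(index="g", columns=0, values=1)

# Convert the columns that are meant to be numeric.
wide["rating"] = pd.to_numeric(wide["rating"])
wide["product"] = pd.to_numeric(wide["product"])

print(wide.dtypes)
```

If a value might not parse as a number, pd.to_numeric accepts errors='coerce' to turn such entries into NaN instead of raising.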
Or create a list of dictionaries:
#https://stackoverflow.com/a/18970794/2901002
import pandas as pd

data = []
current = {}
with open('file.txt') as f:
    for line in f:
        # split only on the first colon so values may contain ':'
        pair = line.split(':', 1)
        if len(pair) == 2:
            if pair[0] == 'product' and current:
                # start of a new block
                data.append(current)
                current = {}
            current[pair[0]] = pair[1].strip()
    # append the last block, which has no following 'product' line
    if current:
        data.append(current)
df = pd.DataFrame(data)
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review
Or reshape every 4 values into a 2D numpy array and pass it to the DataFrame constructor:
df = pd.read_csv('file.txt', header=None, sep=': ', engine='python')
df = pd.DataFrame(df[1].to_numpy().reshape(-1, 4), columns=df[0].iloc[:4].tolist())
print (df)
  product     description rating            review
0       1  product 1 desc    7.8  product 1 review
1       2  product 2 desc    4.5  product 2 review
2       3  product 3 desc    8.5  product 3 review
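The route sketched in the question, splitting the string on blank lines and turning each block into a dictionary, can also work when the data is already in memory. Below is a minimal sketch under the assumption that blocks really are separated by one empty line, as the use of text.split('\n\n') suggests. The helper name blocks_to_frame and the shortened sample text are hypothetical.

```python
import pandas as pd

# Shortened sample; blocks are assumed to be separated by one blank line.
text = (
    "product: 1\n"
    "description: product 1 desc\n"
    "rating: 7.8\n"
    "review: product 1 review\n"
    "\n"
    "product: 2\n"
    "description: product 2 desc\n"
    "rating: 4.5\n"
    "review: product 2 review\n"
)

def blocks_to_frame(raw):
    """Build one row per blank-line-separated block of 'key: value' lines."""
    records = []
    for block in raw.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # partition splits on the first colon only
            key, sep, value = line.partition(":")
            if sep:  # skip lines that contain no colon
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return pd.DataFrame(records)

df = blocks_to_frame(text)
print(df)
```

Unlike the reshape approach, this does not require every block to contain exactly four fields; a key missing from a block simply becomes NaN in that row.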