Populating a pandas DataFrame from a .txt file, reading each DataFrame row from multiple .txt lines

Question

I want to read data from a large .txt file into a pandas DataFrame. The file is laid out like this:

    elm1 x1 x2 x3 
    cont x4 x5 x6
    cont x7 x8
    elm2 x9 x10 x11
    cont x12 x13 x14
    cont x15 x16 
....

The DataFrame should be arranged as follows:

elm_ID col1 col2 col3 col4 col5 col6 col7 col8
elm_1 x1 x2 x3 x4 x5 x6 x7 x8
elm_2 x9 x10 x11 x12 x13 x14 x15 x16
.......

Does anyone have an idea? Thank you very much.

JA

Answer 1

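Yes, you can easily convert the data into a DataFrame. First, we read the text file line by line to build the list of rows that will be converted into a DataFrame: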

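import re

infile = 'data.txt'  # placeholder path, replace with your own .txt file

df_list = []  # one list per element: [elm_ID, value1, value2, ...]
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        if not line:
            continue
        # split each column on whitespace
        columns = re.split(r'\s+', line)
        if columns[0] == 'cont' and df_list:
            # continuation line: append its values to the current element's row
            df_list[-1].extend(columns[1:])
        else:
            df_list.append(columns)

We can then simply convert that list into a DataFrame with the following:

import pandas as pd
# assumes every element contributes exactly eight values, as in the sample
df = pd.DataFrame(df_list, columns=['elm_ID', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'])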
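
A minimal sketch of the list-building step, run against the sample from the question instead of a file so it can be checked quickly. It assumes that every line starting with cont continues the element line before it and that each element ends up with exactly eight values; the helper name rows_from_lines is hypothetical.

```python
import io

import pandas as pd


def rows_from_lines(lines):
    """Fold each 'cont' line into the row of the element line before it."""
    rows = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if not fields:
            continue  # skip blank lines
        if fields[0] == "cont" and rows:
            rows[-1].extend(fields[1:])  # continuation: append its values
        else:
            rows.append(fields)  # new element: ID plus its first values
    return rows


# The sample from the question, held in memory instead of a file.
sample = io.StringIO(
    "elm1 x1 x2 x3\n"
    "cont x4 x5 x6\n"
    "cont x7 x8\n"
    "elm2 x9 x10 x11\n"
    "cont x12 x13 x14\n"
    "cont x15 x16\n"
)

columns = ["elm_ID"] + [f"col{i}" for i in range(1, 9)]
df = pd.DataFrame(rows_from_lines(sample), columns=columns)
print(df)
```

Because the helper takes any iterable of lines, an open file handle can be passed in place of the StringIO object, and a large file is still read one line at a time.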

Answer 2

First, read the txt file with pd.read_csv(path_to_file, sep='\t').
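
The sample in the question looks whitespace-separated rather than tab-separated and has no header row, so the read step may need adjusting. The following is only a sketch under those assumptions: the column names a to d are placeholders, and the text is read from an in-memory string standing in for the real file.

```python
import io

import pandas as pd

# Stand-in for the real file; note the last line has one field fewer.
text = (
    "elm1 x1 x2 x3\n"
    "cont x4 x5 x6\n"
    "cont x7 x8\n"
)

raw = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",                   # split on any run of whitespace
    header=None,                  # the file has no header row
    names=["a", "b", "c", "d"],   # placeholder names; short rows are padded with NaN
)
print(raw)
```

Passing explicit names lets read_csv accept lines with fewer fields than the widest line; the missing trailing fields come back as NaN.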

Then, suppose we have this DataFrame:

      a    b    c
0  elm1   x1   x2
1  cont   x4   x5
2  cont   x7   x8
3  elm2   x9  x10
4  cont  x12  x13
5  cont  x15  x16

We want the following output:

       0    1    2    3    4    5                      
elm1  x1   x4   x7   x2   x5   x8
elm2  x9  x12  x15  x10  x13  x16

I tried to solve it entirely with pandas functions:

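import pandas as pd

df = pd.DataFrame([("elm1", "x1", "x2"),
    ("cont", "x4", "x5"),
    ("cont", "x7", "x8"),
    ("elm2", "x9", "x10"),
    ("cont", "x12", "x13"),
    ("cont", "x15", "x16")], columns=list('abc'))
# True on rows that start a new element, False on 'cont' continuation rows
df['d'] = df['a'] != 'cont'
# element ID for every row: keep it on element rows, forward-fill it over 'cont' rows
df['e'] = df['a'].where(df['d']).ffill()
# per element, stack the values of column b followed by the values of column c
df2 = df.groupby('e')[['b', 'c']].apply(lambda x: pd.concat([x['b'], x['c']])).to_frame().reset_index()
# position of each value within its element
df2['ct'] = df2.reset_index().groupby('e').cumcount()
# one row per element, one column per position
df3 = df2.pivot(index='e', values=[0], columns='ct')
df3.columns = range(len(df3.columns))
df3.index.name = ''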
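
The output above lists all values of column b before those of column c, whereas the question asks for the values in file order (x1, x2, x3, x4, ...). A possible variant that keeps file order is sketched below. It is not part of the original answer; the names block_id, rows and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("elm1", "x1", "x2"), ("cont", "x4", "x5"), ("cont", "x7", "x8"),
     ("elm2", "x9", "x10"), ("cont", "x12", "x13"), ("cont", "x15", "x16")],
    columns=list("abc"),
)

# Running count of element rows: each 'cont' row gets the number of its element.
block_id = (df["a"] != "cont").cumsum().rename("block")

rows = {}
for _, block in df.groupby(block_id):
    elm_id = block["a"].iloc[0]  # the first row of a block carries the element ID
    # ravel() walks the block row by row, which preserves file order
    values = block[["b", "c"]].to_numpy().ravel()
    rows[elm_id] = [v for v in values if pd.notna(v)]  # drop padding from short lines

result = pd.DataFrame.from_dict(rows, orient="index")
print(result)
```

Building a plain dict first keeps the reshaping easy to follow, and elements with different numbers of values are padded by from_dict rather than raising an error.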

2 solutions

Solution 1
0 2019-03-19 12:48:15

Solution 2
-1 2019-03-19 13:19:10
