
Python regex to read text file and split row to column

My text file has the following format:

ID  col_A   col_B   col_C
1   0.26    0.11    0.18
2   0.27    0.12    0.17
3   0.21    0.10    0.15
----------------------------
AVG 0.25    0.11    0.17
----------------------------
ID  col_D   col_E   col_F
1   0.23    0.18    0.20
2   0.24    0.14    0.17
3   0.23    0.10    0.13
----------------------------
AVG 0.23    0.14    0.17
----------------------------

I am trying to use Python and regex to export two separate CSV files in the following format:

Table 1

ID col_A col_B col_C col_D col_E col_F
1 0.26 0.11 0.18 0.23 0.18 0.20
2 0.27 0.12 0.17 0.24 0.14 0.17
3 0.21 0.10 0.15 0.23 0.10 0.13

Table 2

col_A col_B col_C col_D col_E col_F
AVG 0.25 0.11 0.17 0.23 0.14 0.17

Here is my code:

import re
import pandas as pd


with open('test.txt') as file:
        lines = file.readlines()
        regex = r'\A(?P<ID>\S+)\s*(?P<COL_A>\S+)\s*(?P<COL_B>\S+)\s*(?P<COL_C>\S+)'
        data = []
        
        for line in lines:
            m = re.search(regex, line)
            if m != None:
                data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
        
        df = pd.DataFrame(data)
        
df.to_csv('test.csv', index = False)

My code produces an odd format, like this:

0 1 2 3
ID col_A col_B col_C
1 0.26 0.11 0.18
2 0.27 0.12 0.17
3 0.21 0.10 0.15
------ --------- --------- ---------
AVG 0.25 0.11 0.17
------ --------- --------- ---------
ID col_D col_E col_F
1 0.23 0.18 0.20
2 0.24 0.14 0.17
3 0.23 0.10 0.13
------ --------- --------- ---------
AVG 0.23 0.14 0.17
------ --------- --------- ---------

How can I modify my code to get the output I want? Thanks!

You can use

import itertools
import pandas as pd

data = []
colnames = []
avg = []
start = True
is_next_avg = False
id_start_label = ''

with open('test.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if start and line.replace('-', '').strip():
            id_start_label = line.split()[0]
            colnames.extend(line.split()[1:])
            start = False
        elif '---' in line and not line.replace('-', '').strip():
            if start:
                continue
            is_next_avg = not is_next_avg
        elif is_next_avg:
            avg.extend(line.split()[1:])
            is_next_avg = not is_next_avg
            start = not start
        else:
            data.append(line.split())

colnames.insert(0, id_start_label)
data = sorted(data, key=lambda x: x[0])
new_data = []
for key, g in itertools.groupby(data, key=lambda x: x[0]):
    entry = [key]
    entry.extend( [i for x in g for i in x[1:]] )
    new_data.append(entry)

df = pd.DataFrame(new_data, columns=colnames)
df2 = pd.DataFrame(avg, columns=['AVG'])
df2 = df2.transpose()
df2.columns = df.columns[1:]

Output:

>>> df
  ID col_A col_B col_C col_D col_E col_F
0  1  0.26  0.11  0.18  0.23  0.18  0.20
1  2  0.27  0.12  0.17  0.24  0.14  0.17
2  3  0.21  0.10  0.15  0.23  0.10  0.13

>>> df2
    col_A col_B col_C col_D col_E col_F
AVG  0.25  0.11  0.17  0.23  0.14  0.17

Something like this should work for you:

import re
import pandas as pd
import numpy as np

data = []
headers = []
averages = []
ids = []
with open('/content/text.txt') as file:
  for l in file.readlines():
      s = re.findall(r'\S+', l.rstrip().split(',')[0])
      if '-' not in s[0]:
        if 'ID' in s[0]: headers.append(s)
        elif 'AVG' in s[0]: averages.append(s)
        else: 
          data.append(s[1:])
          ids.append(s[0])

data = np.hstack([data[:len(data)//len(headers)], data[len(data)//len(headers):]])
data = np.concatenate([np.expand_dims(np.transpose(sorted(set(ids))), axis=-1), data], axis=1)
columns = sorted(set(np.concatenate(headers)))
df = pd.DataFrame(data, columns=columns).astype({'ID': 'int64'})
main_df = df.set_index('ID')

Table 1

   col_A col_B col_C col_D col_E col_F
ID                                    
1   0.26  0.11  0.18  0.23  0.18  0.20
2   0.27  0.12  0.17  0.24  0.14  0.17
3   0.21  0.10  0.15  0.23  0.10  0.13

Table 2

average_df = pd.DataFrame(np.expand_dims(list(map(float, [d for d in np.concatenate(averages) if 'AVG' not in d])), axis=0), columns=columns[1:])
average_df = average_df.rename_axis('AVG')

     col_A  col_B  col_C  col_D  col_E  col_F
AVG                                          
0     0.25   0.11   0.17   0.23   0.14   0.17
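
Neither answer above shows the CSV export the question asks for, so here is a minimal standard-library sketch of that last step. It assumes the file looks exactly like the sample in the question (whitespace-separated fields, header lines starting with ID, average lines starting with AVG, separator lines made only of dashes). The names SAMPLE, parse_blocks and to_csv_text are hypothetical and introduced only for illustration; values are kept as strings so that trailing zeros such as 0.20 survive.

```python
import csv
import io
import re

# Sample input copied from the question; in practice, read it from the file.
SAMPLE = """\
ID  col_A   col_B   col_C
1   0.26    0.11    0.18
2   0.27    0.12    0.17
3   0.21    0.10    0.15
----------------------------
AVG 0.25    0.11    0.17
----------------------------
ID  col_D   col_E   col_F
1   0.23    0.18    0.20
2   0.24    0.14    0.17
3   0.23    0.10    0.13
----------------------------
AVG 0.23    0.14    0.17
----------------------------
"""


def parse_blocks(text):
    """Merge all blocks side by side; return (header, rows, avg_row)."""
    header = ['ID']
    rows = {}        # ID -> values collected across blocks, in first-seen order
    avg = ['AVG']
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and separator lines made only of dashes.
        if not stripped or re.fullmatch(r'-+', stripped):
            continue
        key, *values = stripped.split()
        if key == 'ID':
            header.extend(values)
        elif key == 'AVG':
            avg.extend(values)
        else:
            rows.setdefault(key, []).extend(values)
    table = [[key] + values for key, values in rows.items()]
    return header, table, avg


def to_csv_text(header, rows):
    """Render a header and rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


header, table, avg = parse_blocks(SAMPLE)
table1_csv = to_csv_text(header, table)
# Table 2 has no label for its first column, so that header cell is left empty.
table2_csv = to_csv_text([''] + header[1:], [avg])
print(table1_csv)
print(table2_csv)
```

To produce the two files, the same strings can be written out with open('table1.csv', 'w', newline='') and open('table2.csv', 'w', newline=''); those file names are placeholders.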
