Python Pandas - Read csv with commented header lines

Question

I want to read and process a csv file with pandas. The file (shown below) contains several header lines, marked with #. I can easily import the file using

import pandas as pd

file = "data.csv"
data = pd.read_csv(file, delimiter=r"\s+",
                   names=["Time", "Cd", "Cs", "Cl", "CmRoll", "CmPitch", "CmYaw", "Cd(f)",
                           "Cd(r)", "Cs(f)", "Cs(r)", "Cl(f)", "Cl(r)"],
                   skiprows=13)

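However, I have many such files with different header names, and I do not want to name the columns by hand (Time Cd Cs...). The number of comment lines also differs from file to file, so I would like to automate the task.

Do I have to use something like a regular expression here before passing the data to a pandas DataFrame?

Thanks for any suggestions.

And yes, the line with the header names also starts with #.

data.csv: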

# Force coefficients    
# dragDir               : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# sideDir               : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# liftDir               : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# rollAxis              : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# pitchAxis             : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# yawAxis               : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# magUInf               : 4.5000000000000000e+01
# lRef                  : 5.9399999999999997e-01
# Aref                  : 3.5639999999999999e-03
# CofR                  : (1.4999999999999999e-01 0.0000000000000000e+00 0.0000000000000000e+00)
#
# Time                      Cd                          Cs                          Cl                          CmRoll                      CmPitch                     CmYaw                       Cd(f)                       Cd(r)                       Cs(f)                       Cs(r)                       Cl(f)                       Cl(r)                   
5e-06                       1.8990180226147195e+00  1.4919925634649792e-11  2.1950119509976829e+00  -1.1085971520784955e-02 -1.0863798447281650e+00 9.5910040927874810e-03  9.3842303978657482e-01  9.6059498282814471e-01  9.5910041002474442e-03  -9.5910040853275178e-03 1.1126130770676479e-02  2.1838858202270064e+00
1e-05                       2.1428508927716594e+00  1.0045114197556737e-08  2.5051633252700962e+00  -1.2652317494411272e-02 -1.2367567798452046e+00 1.0822379290263353e-02  1.0587731288914184e+00  1.0840777638802410e+00  1.0822384312820453e-02  -1.0822374267706254e-02 1.5824882789843508e-02  2.4893384424802525e+00
...

Answer 1

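How about extracting the header before reading the file? We only assume that your header lines start with #. Both the header and its position in the file are found automatically. We also make sure that no more lines are read than necessary (apart from the first data line).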

with open(file) as f:
    line = f.readline()
    cnt = 0  # number of leading comment lines
    while line.startswith('#'):
        prev_line = line  # remember the most recent comment line
        line = f.readline()
        cnt += 1

# the last comment line holds the column names
header = prev_line.strip().lstrip('# ').split()

df = pd.read_csv(file, delimiter=r"\s+",
                 names=header,
                 skiprows=cnt)

With this, you can also process the other header lines. It also gives you the position of the header in the file.
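As a rough illustration of processing the other header lines, the sketch below collects the `key : value` comment lines into a dictionary. The function name `parse_comment_metadata` and the shortened sample are made up for this example, and it assumes every metadata line has the form `# key : value`, as in the data.csv shown in the question.

```python
import io

def parse_comment_metadata(lines):
    """Collect 'key : value' pairs from the leading '#' lines (hypothetical helper)."""
    meta = {}
    for line in lines:
        if not line.startswith('#'):
            break  # first data line reached, stop reading
        body = line.lstrip('#').strip()
        if ':' in body:
            key, _, value = body.partition(':')
            meta[key.strip()] = value.strip()
    return meta

# shortened stand-in for data.csv (assumption: same layout as in the question)
sample = io.StringIO(
    "# Force coefficients\n"
    "# magUInf               : 4.5000000000000000e+01\n"
    "# lRef                  : 5.9399999999999997e-01\n"
    "#\n"
    "# Time   Cd   Cs\n"
    "5e-06  1.89  1.49e-11\n"
)
meta = parse_comment_metadata(sample)
```

The values are kept as strings here; converting them (for example with `float(meta['magUInf'])`) is left to the caller, because vector entries such as `dragDir` need different handling.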

Answer 2

This should work. It is simple and efficient, keeps the number of variables to a minimum, and needs no input other than the file name.

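with open(file, 'r') as f:
    for line in f:
        if line.startswith('#'):
            header = line  # remember the most recent comment line
        else:
            break  # stop at the first line that is not a comment

# drop the leading '#' and split the names on whitespace
header = header[1:].strip().split()

data = pd.read_csv(file, delimiter=r"\s+", comment='#', names=header)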

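You first open the file and read only the comment lines (this is fast and memory-efficient). The last comment line found is the final header, which is cleaned up and converted into a list. Finally, you open the file with pandas.read_csv() using comment='#', which skips the comment lines, and names=header.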
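The same steps can be wrapped in a function so they can be reused across many files. This is only a sketch: the name `read_commented_csv`, the explicit error for files without a header, and the small temporary sample file are additions for illustration and are not part of the answer.

```python
import os
import tempfile

import pandas as pd

def read_commented_csv(path, comment="#"):
    """Use the last leading comment line as column names (hypothetical helper)."""
    header = None
    with open(path) as f:
        for line in f:
            if not line.startswith(comment):
                break  # first data line reached
            header = line
    if header is None:
        raise ValueError(f"no {comment!r} header line found in {path}")
    names = header[len(comment):].split()
    # comment=... makes pandas skip the comment lines itself
    return pd.read_csv(path, sep=r"\s+", comment=comment, names=names)

# shortened stand-in for data.csv, written to a temporary file
sample = (
    "# Force coefficients\n"
    "# magUInf : 4.5e+01\n"
    "#\n"
    "# Time Cd Cs\n"
    "5e-06 1.9 1.5e-11\n"
    "1e-05 2.1 1.0e-08\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

df = read_commented_csv(tmp.name)
os.remove(tmp.name)
```

Raising an error when no comment line is present avoids the `NameError` the inline version would hit on a file that starts directly with data.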

Answer 3

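A little regex might help.

This is not the prettiest solution, so feel free to post a better one.

Let's read the first 50 lines of any file to find the last occurrence of a hash, which should be the line with the column names.

^ asserts the position at the start of a line
# matches the character # literally (case sensitive)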

import re
from itertools import islice

n_rows = 50

path_ = 'your_file_location'

with open(path_, 'r') as f:
    # keep only the comment lines among the first n_rows lines
    data = [line for line in islice(f, n_rows) if re.match('^#', line)]

# index of the last comment line, i.e. the line with the column names
start_col = len(data) - 1

df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col)  # use your actual delimiter.

          #      Time            Cd        Cs        Cl    CmRoll   CmPitch  \
0  0.000005  1.899018  1.491993e-11  2.195012 -0.011086 -1.086380  0.009591   
1  0.000010  2.142851  1.004511e-08  2.505163 -0.012652 -1.236757  0.010822   

      CmYaw     Cd(f)     Cd(r)     Cs(f)     Cs(r)     Cl(f)  Cl(r)  
0  0.938423  0.960595  0.009591 -0.009591  0.011126  2.183886    NaN  
1  1.058773  1.084078  0.010822 -0.010822  0.015825  2.489338    NaN

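Edit, to handle the `#` in the column names.

We can do this in two steps.

We can read 0 rows, but slice the header columns.

First read the file starting from the line after the header, with the header argument set to None so that no header is set.

Then we can set the column headers manually.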

df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col + 1, header=None)
df.columns = pd.read_csv(path_, sep=r'\s+', skiprows=start_col, nrows=0).columns[1:]

print(df)

       Time        Cd            Cs        Cl    CmRoll   CmPitch     CmYaw  \
0  0.000005  1.899018  1.491993e-11  2.195012 -0.011086 -1.086380  0.009591   
1  0.000010  2.142851  1.004511e-08  2.505163 -0.012652 -1.236757  0.010822   

      Cd(f)     Cd(r)     Cs(f)     Cs(r)     Cl(f)     Cl(r)  
0  0.938423  0.960595  0.009591 -0.009591  0.011126  2.183886  
1  1.058773  1.084078  0.010822 -0.010822  0.015825  2.489338
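The NaN column in the first attempt above comes from the bare `#` being parsed as a column name of its own, which leaves the header with one more field than each data row. A minimal sketch in plain Python, using a shortened, made-up header and data line:

```python
# shortened stand-ins for the header line and the first data line
header_line = "# Time    Cd    Cs    Cl"
data_line = "5e-06  1.899  1.49e-11  2.195"

tokens = header_line.split()  # the '#' becomes the first token
values = data_line.split()

# dropping the first token realigns names and values,
# which is what .columns[1:] does in the two-step version
names = tokens[1:]
row = dict(zip(names, values))
```

Here `tokens` has five entries while `values` has four, so without the slice every name would sit one column to the left of its data.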

Answer 4

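To simplify this and save time by not using loops, you can create 2 data frames: one for the # comment lines and one for the remaining lines. Take the last of the comment lines, which is your header, then use concat() to merge the data frame of data with this header. If you need to set the first row as the header, you can use df.columns=df.iloc[0]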

df = pd.DataFrame({
    'A':['#test1 : (000000)','#test1 (000000)','#test1 (000000)','#test1 (000000)','#Time (000000)','5e-06','1e-05'],
})
print(df)
   

                A
0  #test1 : (000000)
1    #test1 (000000)
2    #test1 (000000)
3    #test1 (000000)
4     #Time (000000)
5              5e-06
6              1e-05

df_header = df[df.A.str.contains('^#')]
print(df_header)
         

          A
0  #test1 : (000000)
1    #test1 (000000)
2    #test1 (000000)
3    #test1 (000000)
4     #Time (000000)
df_data = df[~df.A.str.contains('^#')]
print(df_data)
       A
5  5e-06
6  1e-05

df = (pd.concat([df_header.iloc[[-1]],df_data])).reset_index(drop=True)
df.A = df.A.str.replace(r'^#', "", regex=True)  # regex=True is required for the ^ anchor in pandas 2.0+



print(df)
          

     A
0  Time (000000)
1          5e-06
2          1e-05
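To get from a single text column like this to a real table, the data rows still have to be split into fields. The sketch below shows one way to do that, assuming whitespace-separated fields as in the question's data.csv; the sample lines and variable names are made up for illustration.

```python
import pandas as pd

# each element is one raw line of the file (shortened stand-in for data.csv)
lines = pd.Series([
    "# Force coefficients",
    "# lRef : 5.94e-01",
    "# Time Cd Cs",
    "5e-06 1.9 1.5e-11",
    "1e-05 2.1 1.0e-08",
])

is_comment = lines.str.startswith("#")

# the last comment line holds the column names
header = lines[is_comment].iloc[-1].lstrip("#").split()

# split the remaining lines on whitespace into separate columns
table = lines[~is_comment].str.split(expand=True).astype(float)
table.columns = header
table = table.reset_index(drop=True)
```

This keeps the whole file in memory as strings before converting, so for large files the approaches that pass `names` and `skiprows` or `comment` straight to `pd.read_csv` are likely the lighter option.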

Answer 5

Assuming comments always start with a single '#' and the header is in the last comment line:

import csv

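def read_comments(csv_file):
    # yield each comment line without its leading '#' and surrounding whitespace
    for row in csv_file:
        if row.startswith('#'):  # also safe for empty strings, unlike row[0]
            yield row.split('#')[1].strip()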

def get_last_commented_line(filename):
    with open(filename, 'r', newline='') as f:
        decommented_lines = [line for line in csv.reader(read_comments(f))]
        header = decommented_lines[-1]
        skiprows = len(decommented_lines)
        return header, skiprows

header, skiprows = get_last_commented_line(path)
pd.read_csv(path, names=header, skiprows=skiprows)

Answer 6

# Read the lines in file
with open(file) as f:
    lines = f.readlines()

# Last commented line is header
header = [line for line in lines if line.startswith('#')][-1]

# Strip line and remove '#' 
header = header[1:].strip().split()

df = pd.read_csv(file, delimiter=r"\s+", names=header, comment='#')

6 solutions

Answer 1
2 accepted 2020-08-26 11:34:52

Answer 2
2 2020-08-26 12:03:23

Answer 3
1 2020-08-26 11:32:36

Answer 4
0 2020-08-26 11:50:27

Answer 5
0 2020-08-26 11:54:06

Answer 6
0 2021-09-24 08:15:21
