Add a new column to a DataFrame based on the first value in the row

Question

I have a DataFrame like this:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

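I want to add another field that indicates whether the first field's value starts with the comment characters //. So far I have something like this: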

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')

What is the correct way to add a new column with this value, so that the result looks like this:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1  _starts_with_comment
0                //field  field2                  True
1  //first field is time     NaN                  True
2                 132605       1                 False
3                 132750       2                 False
4                 132772       3                 False

Answer 1

One way is to use pd.to_numeric, assuming that any non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

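Note that mixed types within a Series like this are strongly discouraged. Your first two Series will no longer support vectorized operations, because they will be stored as object dtype Series. You lose some of the main benefits of pandas.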
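To make the dtype point concrete, here is a minimal sketch. The in-memory sample text is an assumed stand-in for the asker's file, and the `data` variable is only illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV in the question (comma-separated is an assumption).
raw = "//field,field2\n//first field is time,\n132605,1\n132750,2\n"

df = pd.read_csv(io.StringIO(raw), header=None)

# Column 0 mixes comment strings and numbers, so it is not stored with a numeric dtype.
print(df[0].dtype)

# Flag rows whose first value is not numeric, as in the answer above.
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

# Keeping only the data rows and converting them restores numeric dtypes.
data = df.loc[~df['_starts_with_comment'], [0, 1]].apply(pd.to_numeric)
print(data.dtypes)
```

Once the comment rows are filtered out, the remaining columns can be converted and used with vectorized numeric operations again.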

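A better idea is to use the csv module to extract these attributes from the top of the file and store them as separate variables. Here is an example of how this can be done.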
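One possible reading of the csv-module suggestion, as a rough sketch rather than the answerer's own code. The sample text, the `comment_prefix` name, and the choice to take column names from the first comment row are all assumptions made for illustration.

```python
import csv
import io
import pandas as pd

# Hypothetical stand-in for the file; a real file would be opened with open(path, newline='').
raw = "//field,field2\n//first field is time,\n132605,1\n132750,2\n"
comment_prefix = '//'

header_rows, data_rows = [], []
for row in csv.reader(io.StringIO(raw)):
    # A row is treated as a comment when its first cell starts with the prefix.
    if row and row[0].startswith(comment_prefix):
        header_rows.append(row)
    elif row:
        data_rows.append(row)

# Use the first comment row for column names, stripping the prefix from its first cell.
names = [header_rows[0][0][len(comment_prefix):]] + header_rows[0][1:]
df = pd.DataFrame(data_rows, columns=names).apply(pd.to_numeric)
print(df.dtypes)
```

The comment rows stay available in `header_rows` as plain Python lists, while the DataFrame holds only numeric data.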

Answer 2

What's wrong with your command? Just assign it to a new column:

df['comment_flag'] = df[0].str.startswith('//')

Or do you actually have the mixed-type column that jpp mentioned?
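One caveat worth sketching: if the first column contains missing values, `str.startswith` returns NaN for those rows unless `na=False` is passed. The sample data below is an assumption chosen to show that case.

```python
import io
import pandas as pd

# Hypothetical sample with a missing first value in the last row.
raw = "//field,field2\n132605,1\n,2\n"
df = pd.read_csv(io.StringIO(raw), header=None)

# na=False makes a missing first value count as "not a comment" instead of NaN.
df['comment_flag'] = df[0].str.startswith('//', na=False)
print(df)
```

With `na=False` the new column is purely boolean, which makes it safe to use directly as a filter mask.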

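Edit:
I'm not entirely sure, but from your comments I get the impression that you don't really need an extra comment-flag column. If you want to load the data without the comments into a DataFrame, but still use the field names hidden in the comment header as column names, you may want to take a look at this:
So, based on this text file: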

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

you can do this:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # leave that out if collecting all comments of entire file is ok/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']

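This way you have the header information ready to use, e.g. for column names.
Getting the names from the first header line and using them in the pandas import would look like this: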

nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')

    field  field2
0  132605     1.0
1  132750     2.0
2  132772     3.0
3  132773     4.0
4  133065     5.0
5  133150     6.0
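A self-contained variant of the same idea, offered as a sketch under assumptions: it uses an in-memory copy of the text file and swaps the `comment=` argument for `skiprows=len(header)`, which only works when all comment lines sit at the top of the file.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the first lines of the text file shown above.
raw = (
    "//field  field2\n"
    "//first field is time     NaN\n"
    "132605     1.0\n"
    "132750     2.0\n"
)
cmt = '//'

# Collect the leading comment lines, stopping at the first data line.
header = []
for line in io.StringIO(raw):
    if not line.startswith(cmt):
        break
    header.append(line)

# Column names come from the first comment line, minus the comment prefix.
nms = header[0][len(cmt):].split()

# Skip exactly the comment lines collected at the top of the file.
df = pd.read_csv(io.StringIO(raw), skiprows=len(header), names=nms, sep=r'\s+')
print(df)
```

Skipping by line count avoids any dependence on how the parser handles a two-character comment marker, at the cost of not handling comments further down in the file.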

Answer 3

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)
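As a quick check on a small made-up DataFrame (the sample values are assumptions), `np.where(mask, True, False)` yields the same flags as the boolean mask itself when the column has no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with string values only, so .str works on column 0.
df = pd.DataFrame({0: ['//field', '132605', '132750'], 1: ['field2', '1', '2']})

mask = df[0].str.startswith('//')
df.loc[:, '_starts_with_comment'] = np.where(mask, True, False)

# Both forms give the same flags on this sample.
print(df['_starts_with_comment'].tolist() == mask.tolist())
```

So on data without NaN, assigning the mask directly is equivalent, and the `np.where` wrapper is a matter of preference.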

3 answers

Answer 1
1 2018-12-20 00:10:05

Answer 2
1 Accepted 2018-12-20 00:42:32

Answer 3
1 2018-12-20 01:53:06