
Add a new column to a dataframe based on first value in row

I have a dataframe like this:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

I would like to add another field that says whether the value in the first field starts with the comment marker, //. So far I have something like this:

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')  

What would be the correct way to add a new column with this value, so that the result looks something like this:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False

One way is to use pd.to_numeric, assuming that non-numeric data in the first column always indicates a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()
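The assumption matters: any non-numeric value is flagged, not only values that start with //. A small sketch with made-up values (the series below is hypothetical and not taken from the question's file) shows where the two checks can disagree:

```python
import pandas as pd

# Made-up values: 'abc' is neither numeric nor a comment.
s = pd.Series(['//field', '132605', 'abc'])

# Flags every value that cannot be parsed as a number.
by_numeric = pd.to_numeric(s, errors='coerce').isnull()

# Flags only values that literally start with the comment marker.
by_prefix = s.str.startswith('//')

print(by_numeric.tolist())  # [True, False, True]
print(by_prefix.tolist())   # [True, False, False]
```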

Just note that mixing types within a series like this is strongly discouraged. Your first two series will no longer support vectorised operations, as they will be stored as object dtype series. You lose some of the main benefits of Pandas.
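To illustrate the dtype point, here is a minimal sketch. It assumes a comma-separated, in-memory copy of the sample data (io.StringIO stands in for the real file) and compares reading the comment rows with skipping them:

```python
import io
import pandas as pd

# In-memory stand-in for the file from the question (assumed comma-separated).
raw = """//field,field2
//first field is time,
132605,1
132750,2
132772,3
"""

# Comment rows included: column 0 holds strings, so it is not a numeric dtype.
mixed = pd.read_csv(io.StringIO(raw), header=None)

# The two leading comment rows skipped: both columns are parsed as numbers.
clean = pd.read_csv(io.StringIO(raw), header=None, skiprows=2)

print(mixed.dtypes)
print(clean.dtypes)
```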

A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. Here's an example of how you can achieve this.
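One possible sketch of that idea follows. It is not necessarily what the answer had in mind: the comma delimiter, the in-memory sample, and the variable names comments, rows and names are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-in for the file; assumed comma-separated.
raw = """//field,field2
//first field is time,
132605,1
132750,2
"""

comments = []  # leading rows whose first cell starts with '//'
rows = []      # remaining data rows

for row in csv.reader(io.StringIO(raw)):
    # Only rows before the first data row are treated as comments.
    if row and row[0].startswith('//') and not rows:
        comments.append(row)
    else:
        rows.append(row)

# Field names recovered from the first comment row, minus the '//' marker.
names = [comments[0][0][2:]] + comments[0][1:]
print(names)  # ['field', 'field2']
print(rows)   # [['132605', '1'], ['132750', '2']]
```

The data rows can then be turned into a dataframe with purely numeric columns, while the comment text stays available in its own variable.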

What is wrong with your own command, simply assigned to a new column?

df['comment_flag'] = df[0].str.startswith('//')

Or do you indeed have mixed-type columns, as mentioned by jpp?
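If the column does hold mixed types, .str.startswith returns NaN for the non-string entries. The na parameter of startswith can fill those in. A small sketch with a made-up column:

```python
import pandas as pd

# Hypothetical column mixing strings and integers (object dtype).
df = pd.DataFrame({0: ['//field', '//first field is time', 132605, 132750]})

# Non-string entries would give NaN; na=False turns them into False instead.
df['comment_flag'] = df[0].str.startswith('//', na=False)

print(df['comment_flag'].tolist())  # [True, True, False, False]
```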


EDIT:
I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. In case you want to load the data without the comments into a dataframe, but still use the field names hidden in the commented header as column names, you might want to check this out.
So, based on this text file:

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You could do:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # leave that out if collecting all comments of entire file is ok/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']  

This way you have the header information ready to be used for, e.g., column names.
Getting the names from the first header line and using them for the pandas import would look like this:

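# drop the leading '//' from the first header line and split it into column names
nms = header[0][2:].split()
# raw string for the whitespace regex; comment lines are skipped during parsing
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')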

    field  field2                                           
0  132605     1.0                                         
1  132750     2.0                                       
2  132772     3.0                                      
3  132773     4.0                                       
4  133065     5.0                                       
5  133150     6.0                                       

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)
