Dataframe and the read_csv function - Python

Question

I'm using the pandas library to write a simple program.

First of all, I have a .csv file called small.csv with the following contents:

1,4.0,?,?,none,?
2,2.0,3.0,?,none,38
2,2.5,2.5,?,tc,39

In my main function I have the following code:

def main():
    # my code here
    fname = "/home/sergio/PycharmProjects/practica2/small.csv"
    sep = ","
    vars = ["x1", "x2", "x3", "x4", "x5", "x6"]
    na_values = ["?", "none"]
    prefix = "col_"

    df = da.load_data(fname, delimiter=sep, nan=na_values,
                      header=False, pref=prefix)
    print df

Depending on the parameters that I pass to the load_data function, it has to load the data from my .csv file in one way or another.

These are the possible arguments and what each one does:

inputFile: The name of the csv file that contains the data.
delimiter: The character that delimits the data. By default the function must use the comma character (",").
nan: A list of strings that will be treated as missing values. Any occurrence in the input file of one of the strings in this list will be interpreted as NaN. The default value is None.
header: A Boolean flag that indicates whether the file contains a header (True) or not (False). The default is True.
varNames: A list of strings that will be used as variable names, only when header is False. The default value is None.
pref: A string that will be used as a prefix for the variable names, only when header is False and varNames has not been defined. For example, if pref = "x", the names of the variables will be "x0", "x1", "x2", etc. The default value is "var_".

My load_data function:

def load_data(inputFile, delimiter=",", nan=None, header=True,
              varNames=None, pref="var_"):

    data = DataFrame()

    if header == False:
        if not varNames:
            print "header=false and varNames not defined"
            data = pd.read_csv(inputFile, sep=delimiter, na_values=nan,  prefix=pref, header=None)
            listaNum = list(range(len(data.columns)))
            data.columns = listaNum
        else: # varNames defined
            data = pd.read_csv(inputFile, sep=delimiter, na_values=nan,  prefix=pref)
    else:
        return data

This function is responsible for loading the data based on the parameters passed in; the output varies depending on the case.

One of the cases that I have to handle is the following.

If header = False and varNames, which gives the column names, is not passed to the function (None), I have to number the columns starting from 0, that is, 0 1 2 ... up to the last column.

Also in this case I have to put the prefix passed to the function in front of each column number; here it would be "col_".

The result would be the following:

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0

Here is my problem. To add the prefix to each of the numeric column names, I could do it by hand, that is, prepend the string "col_" to each element of my column list.

However, I think that is wrong, since I would not be using the prefix option that can be passed to read_csv. I have tried that option nevertheless, and it does not work correctly.

This is my result. As you can see, although I pass the prefix argument to read_csv, it is ignored.

   0    1    2   3    4     5
0  1  4.0  NaN NaN  NaN   NaN
1  2  2.0  3.0 NaN  NaN  38.0
2  2  2.5  2.5 NaN   tc  39.0

Another doubt: since I compute the numeric names for the columns myself, I do it by modifying the DataFrame after it has already been created, and I don't think that is the best way to do it.

Answer 1

This works well for me on v0.21.

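import io
import pandas as pd

text = \
'''1,4.0,?,?,none,?
2,2.0,3.0,?,none,38
2,2.5,2.5,?,tc,39'''

buf = io.StringIO(text)

# header=None: the file has no header row; prefix names the columns col_0, col_1, ...
df = pd.read_csv(buf, na_values=['?', 'none'], header=None, prefix='col_')
df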

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0
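
A likely reason the question's version prints bare numbers: load_data assigns data.columns = listaNum right after read_csv, and that assignment replaces whatever names were there, prefixed or not. The sketch below illustrates the effect. The prefixed names are built by hand here (an assumption, made so the snippet does not depend on the prefix keyword), and the variable names are only illustrative.

```python
import io
import pandas as pd

text = "1,4.0,?,?,none,?\n2,2.0,3.0,?,none,38\n2,2.5,2.5,?,tc,39"

df = pd.read_csv(io.StringIO(text), na_values=["?", "none"], header=None)

# Stand-in for the names that prefix='col_' would produce (built by hand here).
df.columns = ["col_" + str(c) for c in df.columns]
prefixed = list(df.columns)

# Same step as in the question's load_data: every column name is replaced.
df.columns = list(range(len(df.columns)))
overwritten = list(df.columns)

print(prefixed)
print(overwritten)
```

Dropping the two listaNum lines from load_data should therefore be enough for the prefix to survive.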

Another trick (if this still doesn't work) would be to use add_prefix:

df

   0    1    2   3    4     5
0  1  4.0  NaN NaN  NaN   NaN
1  2  2.0  3.0 NaN  NaN  38.0
2  2  2.5  2.5 NaN   tc  39.0

df = df.add_prefix('col_')    
df

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0
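
Putting this together, here is one possible sketch of load_data covering the three cases from the question. It is not the only way to write it. It relies on add_prefix rather than the prefix keyword, because that keyword was deprecated in pandas 1.4 and removed in 2.0. Passing varNames through the names argument of read_csv is my assumption about the intended behaviour.

```python
import io
import pandas as pd


def load_data(inputFile, delimiter=",", nan=None, header=True,
              varNames=None, pref="var_"):
    if header:
        # The first row of the file holds the column names.
        return pd.read_csv(inputFile, sep=delimiter, na_values=nan)
    if varNames:
        # No header row: use the names supplied by the caller.
        return pd.read_csv(inputFile, sep=delimiter, na_values=nan,
                           header=None, names=varNames)
    # No header row and no names: pandas numbers the columns 0..n-1,
    # and add_prefix turns them into pref + "0", pref + "1", ...
    data = pd.read_csv(inputFile, sep=delimiter, na_values=nan, header=None)
    return data.add_prefix(pref)


text = "1,4.0,?,?,none,?\n2,2.0,3.0,?,none,38\n2,2.5,2.5,?,tc,39"

df = load_data(io.StringIO(text), nan=["?", "none"], header=False, pref="col_")
named = load_data(io.StringIO(text), nan=["?", "none"], header=False,
                  varNames=["x1", "x2", "x3", "x4", "x5", "x6"])
print(df)
print(named)
```

With header=None pandas already numbers the columns from 0, so there is no need to compute the list of numbers or to modify the columns by hand afterwards.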
