
Dataframe and read_csv function - Python

I'm using the pandas library to write a simple program.

First of all, I have a .csv file called small.csv with the following contents:

1,4.0,?,?,none,?
2,2.0,3.0,?,none,38
2,2.5,2.5,?,tc,39

In my main function I have the following code:

def main():
    # my code here
    fname = "/home/sergio/PycharmProjects/practica2/small.csv"
    sep = ","
    vars = ["x1", "x2", "x3", "x4", "x5", "x6"]
    na_values = ["?", "none"]
    prefix = "col_"

    df = da.load_data(fname, delimiter=sep, nan=na_values,
                      header=False, pref=prefix)
    print df

To explain the main function: depending on the parameters I pass to load_data, it has to load the data from my .csv file in one way or another.

These are the possible arguments and what each one does:

  • inputFile: The name of the csv file that contains the data.
  • delimiter: The character that delimits the data. By default the function must use the comma character (",").
  • nan: A list of strings that will be treated as missing values. Any occurrence of one of these strings in the input file will be interpreted as NaN. The default value is None.
  • header: A Boolean flag that indicates whether the file contains a header (True) or not (False). The default is True.
  • varNames: A list of strings that will be used as variable names, only when header is False. The default value is None.
  • pref: A string that will be used as a prefix for the variable names, only when header is False and varNames has not been defined. For example, if pref = "x", the names of the variables will be "x0", "x1", "x2", etc. The default value is "var_".
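The naming rule in the last bullet can be sketched as a small helper. The name default_names is hypothetical and only illustrates the rule; it is not part of pandas or of the original program.

```python
def default_names(pref, n_columns):
    """Build "<pref>0", "<pref>1", ... for n_columns columns (hypothetical helper)."""
    return [pref + str(i) for i in range(n_columns)]


print(default_names("x", 3))      # ['x0', 'x1', 'x2']
print(default_names("col_", 6))
```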

My load_data function:

def load_data(inputFile, delimiter=",", nan=None, header=True,
              varNames=None, pref="var_"):

    data = DataFrame()

    if header == False:
        if not varNames:
            print "header=false and varNames not defined"
            data = pd.read_csv(inputFile, sep=delimiter, na_values=nan,  prefix=pref, header=None)
            listaNum = list(range(len(data.columns)))
            data.columns = listaNum
        else: # varNames defined
            data = pd.read_csv(inputFile, sep=delimiter, na_values=nan,  prefix=pref)
    else:
        return data

This function is responsible for loading the data according to the parameters passed in, so the output varies from case to case.

One of the cases I have to handle is the following.

If header = False and varNames, which gives the column names, is not passed to the function (None), I have to name each column with a number starting from 0, that is, 0 1 2 ... up to the last column.

In this case I also have to put the prefix that was passed in front of each column number; here it would be "col_".

The result would be the following:

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0

Here is my problem. To add the prefix to each of the numeric column names, I could do it by hand, that is, prepend the string "col_" to each element of my column list.

However, I think that is the wrong approach, since it does not use the prefix option that read_csv accepts. I have tried that option, but it does not work as I expected.

This is my result. As you can see, although I pass the prefix argument to read_csv, it is ignored.

   0    1    2   3    4     5
0  1  4.0  NaN NaN  NaN   NaN
1  2  2.0  3.0 NaN  NaN  38.0
2  2  2.5  2.5 NaN   tc  39.0

My other doubt is that I compute the numeric column names by modifying the DataFrame after it has already been built, and I don't think that is the best way to do it.

This works well for me on pandas v0.21.

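import io
import pandas as pd

# Same rows as small.csv, held in memory
text = '''1,4.0,?,?,none,?
2,2.0,3.0,?,none,38
2,2.5,2.5,?,tc,39'''

buf = io.StringIO(text)

# header=None: no header row; prefix names the columns col_0, col_1, ...
df = pd.read_csv(buf, na_values=['?', 'none'], header=None, prefix='col_')
df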

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0
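One thing worth checking in the load_data function from the question: after read_csv returns, it assigns list(range(len(data.columns))) to data.columns. That assignment replaces whatever names the frame had, prefixed or not, which would produce the plain 0 to 5 header shown in the question. A minimal sketch of the effect, using a hand-built frame (the data here is illustrative only):

```python
import pandas as pd

# Illustrative frame that already carries prefixed column names.
df = pd.DataFrame([[1, 4.0], [2, 2.0]], columns=["col_0", "col_1"])
print(list(df.columns))   # ['col_0', 'col_1']

# Same reassignment as in the question's load_data.
df.columns = list(range(len(df.columns)))
print(list(df.columns))   # [0, 1]
```

If that is what happened, dropping the two lines that build and assign listaNum should be enough for the prefix to survive.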

Another trick (if this still doesn't work) would be to use add_prefix:

df

   0    1    2   3    4     5
0  1  4.0  NaN NaN  NaN   NaN
1  2  2.0  3.0 NaN  NaN  38.0
2  2  2.5  2.5 NaN   tc  39.0

df = df.add_prefix('col_')    
df

   col_0  col_1  col_2  col_3 col_4  col_5
0      1    4.0    NaN    NaN   NaN    NaN
1      2    2.0    3.0    NaN   NaN   38.0
2      2    2.5    2.5    NaN    tc   39.0
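Putting the pieces together, here is one possible sketch of load_data for the cases described in the question. It is an assumption about how the function could be written, not the asker's final code. It relies on add_prefix rather than the prefix argument because, as far as I know, that argument was deprecated in pandas 1.4 and removed in 2.0, so this version should not depend on the pandas release. The in-memory sample stands in for small.csv.

```python
import io
import pandas as pd


def load_data(inputFile, delimiter=",", nan=None, header=True,
              varNames=None, pref="var_"):
    """Sketch of the loader described in the question."""
    if header:
        # The first row of the file holds the column names.
        return pd.read_csv(inputFile, sep=delimiter, na_values=nan)

    if varNames:
        # No header row: use the caller-supplied names.
        return pd.read_csv(inputFile, sep=delimiter, na_values=nan,
                           header=None, names=varNames)

    # No header row and no names: pandas labels the columns 0, 1, 2, ...
    # and add_prefix turns those labels into "<pref>0", "<pref>1", ...
    data = pd.read_csv(inputFile, sep=delimiter, na_values=nan, header=None)
    return data.add_prefix(pref)


# Stand-in for small.csv so the example is self-contained.
sample = io.StringIO("1,4.0,?,?,none,?\n"
                     "2,2.0,3.0,?,none,38\n"
                     "2,2.5,2.5,?,tc,39")
df = load_data(sample, nan=["?", "none"], header=False, pref="col_")
print(df)
```

With header=None pandas already numbers the columns from 0, so there is no need to compute the numbers separately, and every branch returns the frame it built.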
