
Replace missing values in a CSV file with the average of the values in that column (in Python)

I have a CSV file that contains both categorical and numeric values, and some of the values are missing. I want to calculate the average of each column and write that average in place of the missing values in the column. I have already imported the necessary libraries and loaded the file with pd.read_csv. For example:

ABCD

 1,2,1,  

  ,1,,  

 2,1,1,2  

I want to write 1 in row 2 of column A in a CSV file like the one above. I will apply this to the other columns in the same way, so the CSV table I want to get looks like this:

    A B C D  

    1,2,1,0.66  

    1,1,0.66,0.66  

    2,1,1,2  

For example, there is one missing value in column A, and I want to write the average I calculated for column A in its place (so I want to write 1 in the second row of column A, since (2 + 1) / 3 = 1). I want to apply the same operation to the other columns. This is the code I tried to write:

    rows=list()
    column=list(myfile.columns.values)
    average = 0
    Sum = 0
    row_count = 1
    for row in myfile:
       for row in column:
           n = column
           Sum += n
           row_count += 1
    average = Sum / len(column)
    print('The average is:', average)  

The code is not working correctly. How can I fix it, or is the code completely wrong?

Your example is unclear due to bad formatting. No worries, I also have problems with formatting. Are you sure that you are using pandas?

Dummy dataframe.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50,4), columns=['A', 'B', 'C', 'D'])
df.iloc[2:4,0] = np.nan
df.iloc[3:5,1] = np.nan
df.iloc[4:6,2] = np.nan
df.iloc[5:7,3] = np.nan
df.head(10).round(2)

Results with

       A      B      C      D
0  -0.09   1.77   1.14   1.00
1  -1.24  -2.21  -0.21  -0.36
2    NaN  -0.59  -0.77  -0.74
3    NaN    NaN   0.37  -1.07
4  -0.19    NaN    NaN   1.39
5   0.20   1.08    NaN    NaN
6  -0.15   0.64   0.04    NaN
7   0.92  -1.01   1.81  -0.83
8  -0.79   0.13  -0.24   1.96
9   0.11   0.97  -0.97  -1.32

You load your dataframe with

df = pd.read_csv('path/to/your/file.csv')
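A quick way to check how many cells are actually missing after loading is sketched below. It assumes a cleanly formatted version of the question's data (the `csv_text` string is made up for illustration and stands in for a file on disk). With default settings, `pd.read_csv` parses empty fields as NaN; cells that contain spaces or other placeholder text may not be recognized this way.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question
csv_text = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing cells in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```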

Additionally, there's no NaN in your df, so you may want to replace empty cells with NaN.

from numpy import nan
df = df.replace('', nan)  # replace returns a new frame, so assign the result

Or replace empty or whitespace-only strings in these columns

df.loc[:, 'A':'D'] = df.loc[:, 'A':'D'].replace(r'^\s*$', nan, regex=True)
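If the columns were read as strings (for example because some cells contain stray spaces), another option is to coerce everything to numbers. This is only a minimal sketch, assuming every column is meant to be numeric; the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame where blanks arrived as empty or whitespace-only strings
df = pd.DataFrame({
    "A": ["1", " ", "2"],
    "B": ["2", "1", "1"],
    "C": ["1", "", "1"],
})

# Anything that cannot be parsed as a number becomes NaN
numeric = df.apply(pd.to_numeric, errors="coerce")
print(numeric)
```

Categorical columns would be turned into NaN by this, so they would have to be left out of the `apply` call.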

Filling nans with column-wise mean:

df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
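Since the question mentions categorical columns as well, here is a sketch of the column-wise fill that only touches numeric columns. The frame and the `kind` column are made up for illustration; `fillna` accepts a Series and fills each column with the value stored under its label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 2.0],
    "B": [2, 1, 1],
    "kind": ["x", None, "y"],  # hypothetical categorical column
})

# Means of the numeric columns only; NaNs are skipped when averaging
col_means = df.mean(numeric_only=True)

# Columns without an entry in col_means (here: kind) are left untouched
filled = df.fillna(col_means)
print(filled)
```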

Filling nans with row-wise mean:

df = df.apply(lambda x: x.fillna(x.mean()), axis=1)
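To make the difference between the two `axis` values concrete, here is a small sketch on a made-up frame (the numbers are chosen only so the means are easy to verify by hand).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan],
    "B": [3.0, 4.0],
    "C": [np.nan, 8.0],
})

# axis=0: each column is filled with its own mean
by_column = df.apply(lambda x: x.fillna(x.mean()), axis=0)

# axis=1: each row is filled with the mean of the values in that row
by_row = df.apply(lambda x: x.fillna(x.mean()), axis=1)

print(by_column)
print(by_row)
```

With `axis=0` the missing cell in A becomes 1.0 (the only other value in A), while with `axis=1` it becomes 6.0, the mean of 4.0 and 8.0 in the same row.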

Is that what you were looking for?

Edit after OP's edit:

import pandas as pd
df = pd.DataFrame({
    'A': [1, '', 2],
    'B': [2, 1, 1],
    'C': [1, '', 1],
    'D': ['', '', 2]
})

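def isnumber(x):
    # True if x can be converted to a float, False otherwise
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# keep numeric cells, turn everything else into NaN
# (on pandas >= 2.1 use df.map(isnumber); applymap is deprecated there)
df = df[df.applymap(isnumber)]
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
df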

is all you need.

Output

     A  B    C    D
0  1.0  2  1.0  2.0
1  1.5  1  1.0  2.0
2  2.0  1  1.0  2.0

And I think it's the right answer. The mean of column A, ignoring the NaNs, is (2 + 1) / 2 = 1.5: the missing value is not known yet, so it can't be counted in the average.
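If you really do want the averaging convention from the question, where the column sum is divided by the total number of rows so that (2 + 1) / 3 = 1, a sketch could look like this. It assumes a cleanly formatted version of the question's data; `csv_text` is a made-up stand-in for the file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question
csv_text = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"
df = pd.read_csv(io.StringIO(csv_text))

# Convention 1: average over the values that are present (NaNs skipped)
mean_present = df.mean()

# Convention 2 (the question's): column sum divided by the total row count,
# which is the same as counting every missing cell as 0
mean_all_rows = df.sum() / len(df)

filled = df.fillna(mean_all_rows).round(2)
print(filled)
```

Note that `round(2)` rounds 0.666... to 0.67, whereas the table in the question shows the value cut off at 0.66.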

You don't even need pandas for such a simple task; the built-in csv module is more than enough:

import csv

# Python 3 version (on Python 2, open the files with "rb" / "wb" and no newline argument)
with open("input.csv", "r", newline="") as f_in:  # open input.csv for reading
    r = csv.reader(f_in)  # create a CSV reader
    header = next(r)  # store the header to recreate in the output
    columns_num = len(header)  # max number of columns
    # read in rows and fill potentially missing elements with 0 to ensure a perfect 2D list
    rows = []  # a storage for our rows
    for row in r:  # go through each CSV row
        columns = []  # a storage for our columns
        for index in range(columns_num):  # loop through each column index
            try:
                columns.append(int(row[index]))  # convert to integer and store in `columns`
            except (IndexError, ValueError, TypeError):  # invalid column value
                columns.append(0)  # store 0 to `columns` as an 'empty' value
        rows.append(columns)  # store the processed columns to the `rows` storage

total_rows = float(len(rows))  # a number to take into the account for average
rows = list(zip(*rows))  # flip the CSV columns and rows (list() so we can assign by index)
for i, row in enumerate(rows):
    average_real = sum(row) / total_rows  # calculate the real average
    average = int(average_real)  # integer average, use as an average for non-floats
    if average_real - average != 0:  # the average is not an integer
        average = int(average_real * 100) / 100.0  # shorten the float to 2 decimals
    rows[i] = [column or average for column in row]  # apply to empty fields (stored as 0) and update

with open("output.csv", "w", newline="") as f_out:  # open output.csv for writing
    writer = csv.writer(f_out)
    writer.writerow(header)  # write the header to output CSV
    writer.writerows(zip(*rows))  # flip back rows and columns and write them to output CSV

For an input.csv file with contents as:

A,B,C,D
1,2,1,
,1,,
2,1,1,2

It will produce output.csv as:

A,B,C,D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2

(NOTE: I've fixed the CSV headers to make it a valid CSV, but it will work even without them provided a perfect 2D list, i.e. every row having the same number of columns)
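One caveat: the code above stores missing cells as 0, so a genuine 0 in the data would be overwritten with the average as well. Below is a sketch of a variant that marks missing cells with None instead. It reads from an in-memory string (a made-up stand-in for input.csv), keeps the same "sum divided by total rows" average, and uses `round` rather than truncation, so it yields 0.67 where the output above shows 0.66. The `parse` helper is hypothetical.

```python
import csv
import io

# Hypothetical in-memory stand-in for input.csv
csv_text = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)


def parse(cell):
    """Return the cell as a float, or None when it is blank or not a number."""
    try:
        return float(cell)
    except ValueError:
        return None


# Pad short rows so every row has one entry per header column
rows = [
    [parse(cell) for cell in row + [""] * (len(header) - len(row))]
    for row in reader
]

filled_columns = []
for column in zip(*rows):  # iterate column by column
    present = [value for value in column if value is not None]
    # Missing cells count as 0 in the average, as in the code above
    average = round(sum(present) / len(column), 2)
    filled_columns.append([average if value is None else value for value in column])

# Flip back to row order
filled_rows = [list(row) for row in zip(*filled_columns)]
print(filled_rows)
```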


 