
Instead of missing values in the CSV file, write the average of the values in that column (in Python)

I have a CSV file that contains categorical and numeric values, and some of the values are missing. I want to calculate the average of each column and write that average in place of the missing values in the column. I have already imported the necessary libraries and loaded the file with pd.read_csv. For example:

    A,B,C,D
    1,2,1,
    ,1,,
    2,1,1,2

I want to write 1 in row 2 of column A of a CSV file like the one above. I will apply this to the other columns in the same way, so the table I want to get looks like this:

    A,B,C,D
    1,2,1,0.66
    1,1,0.66,0.66
    2,1,1,2

For example, column A has one missing value. I want to write the average I calculated for column A in its place, so I want to write 1 in the second row of column A because (2 + 1) / 3 = 1. I want to apply the same operation to the other columns. I tried to write this code to do it:

    rows=list()
    column=list(myfile.columns.values)
    average = 0
    Sum = 0
    row_count = 1
    for row in myfile:
       for row in column:
           n = column
           Sum += n
           row_count += 1
    average = Sum / len(column)
    print('The average is:', average)  

The code is not working correctly. How can I fix this code, or is it completely wrong?

Your example is unclear due to bad formatting. No worries, I also have problems with formatting. Are you sure that you are using pandas?

Dummy dataframe:

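import numpy as np
import pandas as pd

# 50 rows of random numbers with a few NaNs punched into each column
df = pd.DataFrame(np.random.randn(50, 4), columns=['A', 'B', 'C', 'D'])
df.iloc[2:4, 0] = np.nan
df.iloc[3:5, 1] = np.nan
df.iloc[4:6, 2] = np.nan
df.iloc[5:7, 3] = np.nan
df.head(10).round(2)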

This results in:

       A     B     C     D
0  -0.09  1.77  1.14  1.00
1  -1.24 -2.21 -0.21 -0.36
2    NaN -0.59 -0.77 -0.74
3    NaN   NaN  0.37 -1.07
4  -0.19   NaN   NaN  1.39
5   0.20  1.08   NaN   NaN
6  -0.15  0.64  0.04   NaN
7   0.92 -1.01  1.81 -0.83
8  -0.79  0.13 -0.24  1.96
9   0.11  0.97 -0.97 -1.32

You load your dataframe with:

df = pd.read_csv('path/to/your/file.csv')

Additionally, there's no NaN in your df, so you may want to replace empty cells with NaN.

from numpy import nan
df = df.replace('', nan)  # replace() returns a new frame, so assign it back

Or replace any empty or whitespace-only string in these columns:

df.loc[:, 'A':'D'] = df.loc[:, 'A':'D'].replace(r'^\s*$', nan, regex=True)
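
As a quick check of whether the replacement is needed at all: as far as I know, `pd.read_csv` already parses truly empty fields as NaN by default, so the replacement above matters mainly when cells hold whitespace or other placeholder strings. A minimal sketch, assuming the asker's table is plain comma-separated text (the inline `csv_text` string is a stand-in for the real file):

```python
import io

import pandas as pd

# Assumption: the asker's file looks like this; io.StringIO stands in for a real path.
csv_text = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing cells per column to see what read_csv produced.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

If the counts are non-zero, the cells are already NaN and the fill step can be applied directly.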

Filling NaNs with the column-wise mean:

df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
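
The column-wise fill can probably also be written without `apply`, since `DataFrame.fillna` accepts a Series of per-column fill values. A small sketch on a made-up frame (the values are illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd

# Illustrative data only: one gap in column A, one in column B.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 6.0, np.nan]})

# df.mean() skips NaN by default, so A -> (1 + 3) / 2 = 2.0 and B -> (4 + 6) / 2 = 5.0.
filled = df.fillna(df.mean())
print(filled)
```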

Filling NaNs with the row-wise mean:

df = df.apply(lambda x: x.fillna(x.mean()), axis=1)

Is that what you were looking for?

Edit after OP's edit:

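import pandas as pd

df = pd.DataFrame({
    'A': [1, '', 2],
    'B': [2, 1, 1],
    'C': [1, '', 1],
    'D': ['', '', 2]
})

def isnumber(x):
    """Return True if x can be converted to a float."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# keep numeric cells and turn everything else into NaN
# (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1)
df = df[df.applymap(isnumber)]
# fill each column's NaNs with that column's mean
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
df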

is all you need.

Output:

     A  B    C    D
0  1.0  2  1.0  2.0
1  1.5  1  1.0  2.0
2  2.0  1  1.0  2.0

And I think it's the right answer. The mean of column A with NaNs is (2 + 1) / 2 = 1.5, because the third value is missing and so can't be counted.
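
The two figures differ only in the divisor: `Series.mean` skips NaN and divides by the number of present values, while the question divides by the total number of rows. A hedged sketch of both readings, assuming the blanks arrive as empty strings as in the frame above. Note that `pd.to_numeric(..., errors="coerce")` is swapped in here for the `isnumber` helper and is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, "", 2],
    "B": [2, 1, 1],
    "C": [1, "", 1],
    "D": ["", "", 2],
})

# Turn anything non-numeric (here: empty strings) into NaN, column by column.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Variant 1: mean of the present values only (this answer's reading).
by_present = numeric.fillna(numeric.mean())

# Variant 2: sum divided by the total row count (the question's reading).
by_total = numeric.fillna(numeric.sum() / len(numeric))

print(by_present)
print(by_total.round(2))
```

With the second variant, column A's gap becomes 1.0 as the question expects; rounding shows 0.67 where the question shows the truncated 0.66.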

You don't even need pandas for such a simple task; the built-in csv module is more than enough:

import csv

# Python 3; on Python 2.x open the files in "rb" / "wb" mode instead
with open("input.csv", "r", newline="") as f_in:  # open input.csv for reading
    r = csv.reader(f_in)  # create a CSV reader
    header = next(r)  # store the header to recreate in the output
    columns_num = len(header)  # max number of columns
    # read in rows and fill potentially missing elements with 0 to ensure a perfect 2D list
    rows = []  # a storage for our rows
    for row in r:  # go through each CSV row
        columns = []  # a storage for our columns
        for index in range(columns_num):  # loop through each column index
            try:
                columns.append(int(row[index]))  # convert to integer and store in `columns`
            except (IndexError, ValueError, TypeError):  # invalid column value
                columns.append(0)  # store 0 to `columns` as an 'empty' value
        rows.append(columns)  # store the processed columns to the `rows` storage

total_rows = float(len(rows))  # the divisor for the average (all rows, empty ones included)
rows = list(zip(*rows))  # flip the CSV columns and rows
for i, row in enumerate(rows):
    average_real = sum(row) / total_rows  # calculate the real average
    average = int(average_real)  # integer average, use as an average for non-floats
    if average_real - average != 0:  # the average is not an integer
        average = int(average_real * 100) / 100.0  # truncate the float to 2 decimals
    # apply to empty fields and update; note that a genuine 0 is also treated as empty
    rows[i] = [column or average for column in row]

with open("output.csv", "w", newline="") as f_out:  # open output.csv for writing
    writer = csv.writer(f_out)
    writer.writerow(header)  # write the header to output CSV
    writer.writerows(zip(*rows))  # flip back rows and columns and write them to output CSV

For an input.csv file with the following contents:

A,B,C,D
1,2,1,
,1,,
2,1,1,2

it will produce output.csv as:

A,B,C,D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2

(NOTE: I've fixed the CSV header to make it a valid CSV, but it will work even without that fix, provided the data forms a perfect 2D list, i.e. every row has the same number of columns.)
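
One caveat with the approach above is that a cell that really contains 0 is indistinguishable from an empty one. A possible variation, sketched under the assumption that all values are numeric, marks gaps with `None` instead. The `fill_with_column_average` name and the in-memory `io.StringIO` input are illustrative choices, not part of the answer:

```python
import csv
import io


def fill_with_column_average(csv_text):
    """Return rows (header first) with empty cells replaced by sum / row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = []
    for raw in reader:
        # Pad short rows, then map blank or unparsable cells to None.
        padded = raw + [""] * (len(header) - len(raw))
        parsed = []
        for cell in padded[:len(header)]:
            try:
                parsed.append(float(cell))
            except ValueError:
                parsed.append(None)
        rows.append(parsed)

    # Per-column average over ALL rows, gaps counted as 0 (the question's definition).
    averages = []
    for column in zip(*rows):
        total = sum(value for value in column if value is not None)
        averages.append(total / len(rows))

    # Only None cells are replaced, so genuine zeros survive.
    filled = [
        [avg if value is None else value for value, avg in zip(row, averages)]
        for row in rows
    ]
    return [header] + filled


result = fill_with_column_average("A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n")
for line in result:
    print(line)
```

Unlike the answer's code, this sketch keeps full float precision rather than truncating to two decimals, so formatting is left to whoever writes the rows back out.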


 