Replace missing values in a CSV file with the average of the values in that column (in Python)
I have a CSV file that contains both categorical and numeric values, and some of the values are missing. I want to calculate the average of each column in this file and write that average in place of the missing values in the column. I have already imported the necessary libraries and loaded the file with pd.read_csv. For example:
A B C D
1,2,1,
,1,,
2,1,1,2
In a CSV file like the one above, I want to write 1 in row 2 of column A. I will apply the same to the other columns, so the CSV table I want to get looks like this:
A B C D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2
For example, there is one missing value in column A. I want to write the average I calculated for column A in place of this missing value (so I want to write 1 in the second row of column A, because (2 + 1) / 3 = 1). I want to apply the same operation to the other columns. So I tried to write this code:
rows=list()
column=list(myfile.columns.values)
average = 0
Sum = 0
row_count = 1
for row in myfile:
    for row in column:
        n = column
        Sum += n
        row_count += 1
average = Sum / len(column)
print('The average is:', average)
The code is not working correctly. How can I implement this, or is the code completely wrong?
Your example is unclear due to bad formatting. No worries, I also have problems with formatting. Are you sure that you are using pandas?
Dummy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(50,4), columns=['A', 'B', 'C', 'D'])
df.iloc[2:4,0] = np.nan
df.iloc[3:5,1] = np.nan
df.iloc[4:6,2] = np.nan
df.iloc[5:7,3] = np.nan
df.head(10).round(2)
This results in:
A B C D
0 -0.09 1.77 1.14 1.00
1 -1.24 -2.21 -0.21 -0.36
2 NaN -0.59 -0.77 -0.74
3 NaN NaN 0.37 -1.07
4 -0.19 NaN NaN 1.39
5 0.20 1.08 NaN NaN
6 -0.15 0.64 0.04 NaN
7 0.92 -1.01 1.81 -0.83
8 -0.79 0.13 -0.24 1.96
9 0.11 0.97 -0.97 -1.32
You load your dataframe with:
df = pd.read_csv('path/to/your/file.csv')
Additionally, there's no NaN in your df, so you may want to replace empty cells with NaN:
from numpy import nan
df = df.replace('', nan)  # replace() returns a new frame, so assign the result
Or replace empty or whitespace-only strings in these columns:
df.loc[:, 'A':'D'] = df.loc[:, 'A':'D'].replace(r'^\s*$', nan, regex=True)
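Whether the replacement step is needed depends on how the data was loaded. The following is a small check, assuming the sample data from the question is parsed by pd.read_csv with default options; the in-memory string csv_text stands in for the real file and is an assumption for illustration.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of a file on disk.
csv_text = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# With default options, empty fields are parsed as NaN,
# so columns that contain gaps come back as float columns.
missing_per_column = df.isna().sum()
print(df)
print(missing_per_column)
```

If the counts printed here are all zero for a real file, the gaps are probably stored as strings and the replace step is the way to turn them into NaN.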
Filling NaNs with the column-wise mean:
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
Filling NaNs with the row-wise mean:
df = df.apply(lambda x: x.fillna(x.mean()), axis=1)
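For the column-wise case, a shorter form is possible because DataFrame.fillna accepts a Series and aligns it with the column labels. This is a minimal sketch on a hand-built frame that mirrors the question's data; the frame itself is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 2.0],
    "B": [2.0, 1.0, 1.0],
    "C": [1.0, np.nan, 1.0],
    "D": [np.nan, np.nan, 2.0],
})

# df.mean() returns one mean per column (NaN is skipped by default);
# fillna matches each mean to its column by label.
filled = df.fillna(df.mean())
print(filled)
```

This should give the same result as the column-wise apply call, without the lambda.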
Is that what you were looking for?
Edit after OP's edit:
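import pandas as pd
df = pd.DataFrame({
    'A': [1, '', 2],
    'B': [2, 1, 1],
    'C': [1, '', 1],
    'D': ['', '', 2]
})
def isnumber(x):
    # True if x can be converted to a float, False otherwise
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False
# keep numeric cells, turn everything else into NaN
# (applymap is deprecated in newer pandas in favor of DataFrame.map)
df = df[df.applymap(isnumber)]
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
df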
is all you need.
Output:
A B C D
0 1.0 2 1.0 2.0
1 1.5 1 1.0 2.0
2 2.0 1 1.0 2.0
And I think it's the right answer. The mean of column A with NaNs is (2 + 1) / 2 = 1.5, because you don't have the third value yet, so you can't count it in.
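The two conventions differ only in the divisor: pandas divides by the number of values that are present, while the question divides by the total number of rows, which amounts to counting a missing cell as 0. Here is a sketch of both side by side, assuming a hand-built frame that mirrors the question's data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 2.0],
    "B": [2.0, 1.0, 1.0],
    "C": [1.0, np.nan, 1.0],
    "D": [np.nan, np.nan, 2.0],
})

# Mean over the values that are present: sum / number of non-missing cells.
mean_present = df.mean()

# Convention from the question: sum / total number of rows.
# df.sum() skips NaN by default, so only the divisor changes.
mean_all_rows = df.sum() / len(df)

filled_present = df.fillna(mean_present)
filled_all_rows = df.fillna(mean_all_rows)
print(filled_all_rows.round(2))
```

Note that round(2) rounds 2/3 to 0.67, whereas the table in the question shows the truncated value 0.66.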
You don't even need pandas for such a simple task; the built-in csv module is more than enough:
import csv

# written for Python 3.x
with open("input.csv", "r", newline="") as f_in:  # open input.csv for reading
    r = csv.reader(f_in)  # create a CSV reader
    header = next(r)  # store the header to recreate in the output
    columns_num = len(header)  # max number of columns
    # read in rows and fill potentially missing elements with 0 to ensure a perfect 2D list
    rows = []  # a storage for our rows
    for row in r:  # go through each CSV row
        columns = []  # a storage for our columns
        for index in range(columns_num):  # loop through each column index
            try:
                columns.append(int(row[index]))  # convert to integer and store in `columns`
            except (IndexError, ValueError, TypeError):  # invalid column value
                columns.append(0)  # store 0 to `columns` as an 'empty' value
        rows.append(columns)  # store the processed columns to the `rows` storage

total_rows = float(len(rows))  # a number to take into account for the average
rows = list(zip(*rows))  # flip the CSV columns and rows
for i, row in enumerate(rows):
    average_real = sum(row) / total_rows  # calculate the real average
    average = int(average_real)  # integer average, use as an average for non-floats
    if average_real - average != 0:  # the average is not an integer
        average = int(average_real * 100) / 100.0  # shorten the float to 2 decimals
    rows[i] = [column or average for column in row]  # apply to empty (0) fields and update

with open("output.csv", "w", newline="") as f_out:  # open output.csv for writing
    writer = csv.writer(f_out)
    writer.writerow(header)  # write the header to output CSV
    writer.writerows(zip(*rows))  # flip back rows and columns and write them to output CSV
For an input.csv file with the contents:
A,B,C,D
1,2,1,
,1,,
2,1,1,2
it will produce output.csv as:
A,B,C,D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2
(NOTE: I've fixed the CSV headers to make it a valid CSV, but it will work even without them, provided the input is a perfect 2D list, i.e. every row has the same number of columns.)
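One caveat of the csv script in this answer is that it uses 0 as the marker for an empty cell, so a genuine 0 in the data would be replaced as well. A possible variant, sketched here under assumptions, marks missing cells with None instead. The function name fill_missing_with_average is hypothetical, it works on in-memory text rather than files, it parses values as floats, and it rounds the average to 2 decimals instead of truncating.

```python
import csv
import io


def fill_missing_with_average(csv_text):
    """Return CSV text with empty cells replaced by sum(column) / total rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = []
    for row in reader:
        cells = []
        for index in range(len(header)):
            try:
                cells.append(float(row[index]))
            except (IndexError, ValueError):
                cells.append(None)  # None marks a missing cell; a real 0 stays 0
        rows.append(cells)

    total_rows = len(rows)
    averages = []
    for column in zip(*rows):
        present = [value for value in column if value is not None]
        averages.append(sum(present) / total_rows if total_rows else 0.0)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for cells in rows:
        writer.writerow(
            [round(averages[i], 2) if value is None else value
             for i, value in enumerate(cells)]
        )
    return out.getvalue()


# The last row starts with a real 0, which must survive unchanged.
result = fill_missing_with_average("A,B,C,D\n1,2,1,\n,1,,\n0,1,1,2\n")
print(result)
```

The divisor is still the total number of rows, matching the convention in the question; dividing by len(present) instead would give the mean of the values that are actually there.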