I have a CSV file that contains both categorical and numeric values, and some of the values are missing. I want to calculate the average of each column and write that average in place of the missing values in that column. I have already imported the necessary libraries and loaded the file with pd.read_csv. For example:
ABCD
1,2,1,
,1,,
2,1,1,2
In a CSV file like the one above, I want to write 1 in row 2 of column A. I will apply the same approach to the other columns, so the table I want to get is:
A B C D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2
For example, there is one missing value in column A. I want to replace it with the average I calculated for column A (so I want to write 1 in the second row of column A, because (2 + 1) / 3 = 1). I would like to do the same for the other columns. This is the code I tried:
rows=list()
column=list(myfile.columns.values)
average = 0
Sum = 0
row_count = 1
for row in myfile:
    for row in column:
        n = column
        Sum += n
        row_count += 1
average = Sum / len(column)
print('The average is:', average)
The code is not working correctly. How can I implement this code or is the code completely wrong?
Your example is unclear due to bad formatting. No worries, I also have problems with formatting. Are you sure that you are using pandas?
Dummy dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(50,4), columns=['A', 'B', 'C', 'D'])
df.iloc[2:4,0] = np.nan
df.iloc[3:5,1] = np.nan
df.iloc[4:6,2] = np.nan
df.iloc[5:7,3] = np.nan
df.head(10).round(2)
Results with
A B C D
0 -0.09 1.77 1.14 1.00
1 -1.24 -2.21 -0.21 -0.36
2 NaN -0.59 -0.77 -0.74
3 NaN NaN 0.37 -1.07
4 -0.19 NaN NaN 1.39
5 0.20 1.08 NaN NaN
6 -0.15 0.64 0.04 NaN
7 0.92 -1.01 1.81 -0.83
8 -0.79 0.13 -0.24 1.96
9 0.11 0.97 -0.97 -1.32
You load your dataframe with
df = pd.read_csv('path/to/your/file.csv')
Additionally, if your df holds empty strings rather than NaN, you may want to replace the empty cells with NaN.
from numpy import nan
df = df.replace('', nan)  # replace() returns a new dataframe, so assign it back
Or replace whitespace-only strings in these columns:
df.loc[:, 'A':'D'] = df.loc[:, 'A':'D'].replace(r'^\s*$', nan, regex=True)
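Whether this replacement is needed at all depends on how the data was loaded. As a quick check, here is a minimal sketch that assumes the question's sample data with a comma-separated A,B,C,D header (the names sample, df_check and missing_per_column are made up for illustration). With default settings, pd.read_csv should already parse empty fields as NaN:

```python
import io

import pandas as pd

# The question's sample data; the comma-separated header is an assumption.
sample = "A,B,C,D\n1,2,1,\n,1,,\n2,1,1,2\n"

# Read from an in-memory string instead of a file on disk.
df_check = pd.read_csv(io.StringIO(sample))

# Count the missing values in each column.
missing_per_column = df_check.isna().sum()
print(missing_per_column)
```

If the counts are non-zero, the empty cells are already NaN and the replace step can be skipped.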
Filling nans with column-wise mean:
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
Filling nans with row-wise mean:
df = df.apply(lambda x: x.fillna(x.mean()), axis=1)
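For the column-wise case, fillna also accepts a Series of per-column values, so the apply/lambda can be avoided. A small sketch on a made-up dataframe (demo and filled are illustrative names, not from the question):

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataframe with one NaN per column.
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 8.0]})

# demo.mean() gives one mean per column (NaN is skipped by default);
# fillna matches those means to the columns by label.
filled = demo.fillna(demo.mean())
print(filled)
```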
Is that what you were looking for?
Edit after OP's edit:
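import pandas as pd

df = pd.DataFrame({
    'A': [1, '', 2],
    'B': [2, 1, 1],
    'C': [1, '', 1],
    'D': ['', '', 2]
})

def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# keep numeric cells only; everything else becomes NaN
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
df = df[df.applymap(isnumber)]
df = df.apply(lambda x: x.fillna(x.mean()), axis=0)
df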
is all you need.
Output
A B C D
0 1.0 2 1.0 2.0
1 1.5 1 1.0 2.0
2 2.0 1 1.0 2.0
And I think it's the right answer. The mean of column A with NaNs is (2 + 1) / 2 = 1.5, because you don't have the third value yet, so you can't count it in.
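If you do want the numbers from the question, where the missing cell still counts in the denominator ((2 + 1) / 3 = 1), one possible sketch is to divide each column sum by the total number of rows. The names sample_df, averages and result are only for illustration:

```python
import numpy as np
import pandas as pd

# The question's sample data, with NaN for the empty cells.
sample_df = pd.DataFrame({
    "A": [1, np.nan, 2],
    "B": [2, 1, 1],
    "C": [1, np.nan, 1],
    "D": [np.nan, np.nan, 2],
})

# sum() skips NaN by default; dividing by len(sample_df) counts every row,
# including the rows where the value is missing.
averages = sample_df.sum() / len(sample_df)
result = sample_df.fillna(averages)
print(result.round(2))
```

Column A is filled with 1.0, and C and D with 2/3, which rounds to 0.67 (the question truncates it to 0.66).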
You don't even need Pandas for such a simple task; the built-in csv module is more than enough:
import csv

# on Python 2.x use: open("input.csv", "rb")
with open("input.csv", "r", newline="") as f_in:  # open input.csv for reading
    r = csv.reader(f_in)  # create a CSV reader
    header = next(r)  # store the header to recreate in the output
    columns_num = len(header)  # max number of columns
    # read in rows and fill potentially missing elements with 0 to ensure a perfect 2D list
    rows = []  # a storage for our rows
    for row in r:  # go through each CSV row
        columns = []  # a storage for our columns
        for index in range(columns_num):  # loop through each column index
            try:
                columns.append(int(row[index]))  # convert to integer and store in `columns`
            except (IndexError, ValueError, TypeError):  # invalid column value
                columns.append(0)  # store 0 to `columns` as an 'empty' value
        rows.append(columns)  # store the processed columns to the `rows` storage

total_rows = float(len(rows))  # a number to take into the account for average
rows = list(zip(*rows))  # flip the CSV columns and rows, on Python 2.x use: zip(*rows)
for i, row in enumerate(rows):
    average_real = sum(row) / total_rows  # calculate the real average
    average = int(average_real)  # integer average, use as an average for non-floats
    if average_real - average != 0:  # the average is not an integer
        average = int(average_real * 100) / 100.0  # shorten the float to 2 decimals
    rows[i] = [column or average for column in row]  # apply to empty fields and update

# on Python 2.x use: open("output.csv", "wb")
with open("output.csv", "w", newline="") as f_out:  # open output.csv for writing
    writer = csv.writer(f_out)
    writer.writerow(header)  # write the header to output CSV
    writer.writerows(zip(*rows))  # flip back rows and columns and write them to output CSV
For an input.csv file with contents as:
A,B,C,D
1,2,1,
,1,,
2,1,1,2
It will produce output.csv as:
A,B,C,D
1,2,1,0.66
1,1,0.66,0.66
2,1,1,2
(NOTE: I've fixed the CSV headers to make it a valid CSV, but it will work even without them provided the input is a perfect 2D list, i.e. every row has the same number of columns.)
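One caveat with storing 0 for empty cells is that a genuine 0 in the data gets replaced by the average too. A possible variation, sketched here with hypothetical helpers (parse_cell and fill_with_average are not part of the code above), marks missing cells with None instead and keeps the same "divide by the total number of rows" average. It reads from an in-memory string and assumes every row has the same number of columns:

```python
import csv
import io


def parse_cell(text):
    """Return the cell as an int, or None when it is empty or not a number."""
    try:
        return int(text)
    except ValueError:
        return None


def fill_with_average(rows):
    """Replace None entries with the column sum divided by the total row count."""
    total_rows = len(rows)
    filled_columns = []
    for column in zip(*rows):  # iterate column by column
        present = [value for value in column if value is not None]
        average = sum(present) / total_rows if total_rows else 0
        filled_columns.append([average if value is None else value for value in column])
    return [list(row) for row in zip(*filled_columns)]  # flip back to rows


# Sample input with a real 0 in column A (illustrative data).
data = "A,B,C,D\n0,2,1,\n,1,,\n3,1,1,2\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
parsed = [[parse_cell(cell) for cell in row] for row in reader]
filled_rows = fill_with_average(parsed)
print(header)
print(filled_rows)
```

Here the 0 in the first row of column A is kept, and only the empty cells receive the column average.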