I am trying to automatically generate an EDA report for each column in my dataframe, starting with value_counts().
The problem is that my function doesn't return anything. So while it does print to the console, it doesn't write that same output to my text file. Until now I have used it just to generate syntax, which I then run line by line in my IDE to look at all the variables, but that is not a very programmatic solution.
Once this is working, I am going to add some syntax for graphs and the output of df.describe(), but for now I can't even get the basics of what I want.
The output doesn't have to be .txt, but I thought that would be easiest while getting this to work.
import pandas as pd

def EDA(df, name):
    df.name = name  # name == string version of df
    print('#', df.name)
    for val in df.columns:
        print('# ', val, '\n', df[val].value_counts(dropna=False), '\n', sep='')
        print(df[val].value_counts(dropna=False))

path = 'Data/nameofmyfile.csv'
# name of df
activeWD = pd.read_csv(path, skiprows=6)
f = open('Output/outtext.txt', 'a+', encoding='utf-8')
f.write(EDA(activeWD, 'activeWD'))
f.close()
I have tried various versions of replacing print with return:
def EDA(df, name):
    df.name = name  # name == string version of df
    print('#', df.name)
    for val in df.columns:
        print('# ', val, '\n', df[val].value_counts(dropna=False), '\n', sep='')
        return(df[val].value_counts(dropna=False))
I have also tried running the file from the Anaconda prompt:
Python Syntax\newdataEDA.5.py >> Output.outtext.txt
which results in the following codec error:
(base) C:\Users\auracoll\Analytic Projects\IDL Attrition>Python Syntax\newdatanewlife11.5.py >> Output.outtext.txt
sys:1: DtypeWarning: Columns (3,16,39,40,41,42,49) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "Syntax\newdatanewlife11.5.py", line 46, in <module>
EDA(activeWD, name='activeWD')
File "Syntax\newdatanewlife11.5.py", line 38, in EDA
print(df[col].value_counts(dropna=False))
File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 382-385: character maps to <undefined>
I tried encoding='utf-8' and encoding='ISO-8859-1', neither of which resolves this problem.
I have also tried to save intermediate variables, but they come back as NoneType:
testvar = for val in df.columns: df[val].value_counts(dropna=False)
When I do this, testvar is a NoneType object of the builtins module.
Here is a command-line solution, although you can certainly write to a file using pure Python, as your commenters suggested. I'm posting this because you mentioned you had already tried the command prompt and weren't able to get your output into a file. Edit your script, filename.py, as follows:
import pandas as pd

df = pd.DataFrame({'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Fish'],
                   'Color': ['Blue', 'Blue', 'Red', 'Orange', 'Orange'],
                   'Name': ['Henry', 'Bob', 'Mary', 'Doggo', 'Henry']})

def EDA(df, name):
    df.name = name
    print('#{}\n'.format(df.name))
    for col in df.columns:
        print('#{}\n'.format(col))
        print(df[col].value_counts(dropna=False))
        print('\n')

if __name__ == '__main__':
    EDA(df, name='test')
Then you should be able to run python filename.py > output.txt in your terminal.
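As for the pure-Python route mentioned above, here is a minimal sketch of one way it could look. The idea is to have the function build and return a string rather than print, since f.write() needs a string and a function that only prints returns None. The name eda_report and the temp-directory output path are hypothetical and only there to keep the sketch self-contained; it also opens the file with 'w' rather than the 'a+' used in the question.

```python
import os
import tempfile

import pandas as pd


def eda_report(df, name):
    """Build the value_counts() report as one string instead of printing it."""
    parts = ['# {}\n'.format(name)]
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        # str() gives the same text that print() would show for the Series
        parts.append('# {}\n{}\n'.format(col, str(counts)))
    return '\n'.join(parts)


df = pd.DataFrame({'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Fish'],
                   'Color': ['Blue', 'Blue', 'Red', 'Orange', 'Orange']})

report = eda_report(df, 'test')

# Hypothetical output location; swap in your own path (e.g. Output/outtext.txt).
out_path = os.path.join(tempfile.gettempdir(), 'outtext.txt')

# An explicit encoding keeps the write independent of the console code page.
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(report)
```

Because the report is an ordinary string, the same value can still be passed to print() when you want to see it on the console as well.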
For posterity's sake: OP's issue was not with how they were printing to file. Their CSV contained uncommon characters, and, as the traceback shows, the Windows default cp1252 codec could not encode them when the output was printed. The solution was to set Python's I/O encoding to UTF-8 before running the code, as shown here: python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\–' in position 9629: character maps to <undefined>
chcp 65001
set PYTHONIOENCODING=utf-8
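If changing the console settings is not convenient, something similar can probably be done from inside the script. This is a minimal sketch, assuming Python 3.7 or later, where text streams such as sys.stdout have a reconfigure() method. The helper name use_utf8 is hypothetical, and this was not part of the fix OP actually used.

```python
import sys


def use_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (Python 3.7+)."""
    if hasattr(stream, 'reconfigure'):
        stream.reconfigure(encoding='utf-8')
    return stream


# Call this once near the top of the script, before anything is printed.
use_utf8(sys.stdout)

# U+2192 (right arrow) has no cp1252 mapping, so this line would raise
# UnicodeEncodeError on a cp1252 stream that had not been reconfigured.
print('value \u2192 count')
```

On older Python versions the hasattr check simply skips the call, so the environment-variable approach above remains the fallback.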