Python, pandas, df, two-part question: 1. how to add a column to a list based on a certain condition 2. how to remove those columns from the df
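The real question is further down, but here is some background.
My end goal is to build a general-purpose Python file (a script, i.e. a .py file) that opens an Excel file, works out how the data are organized, "cleans" the unusable data, and then runs a multiple linear regression analysis. I am stuck on the "cleaning" part. I have an idea of what to do, I just don't know how to do it.
This is what the data look like (in the Excel file):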
y value  data 1  data 2  data 3  data 4  data 5  data 6
282      1       215     169             14      147
148      0       250     307             232     134
351      1       191     343             189     9
31       0       32      327             8       201
33       0       503     484             85      166
973      0       651     134             128
329      0       300     186             195
271      1       543     18
814      1       544     123
274      1       349     209
425      1
So, here is the code I have so far (with explanations):
import pandas as pd
import numpy as np
import statistics
df = pd.read_excel(r'D:\...data.xlsx')
## find the longest column and longest rows.........https://www.kite.com/python/answers/how-to-count-the-number-of-rows-in-a-pandas-dataframe-in-python
index = df.index
number_of_rows = len(index)
##### find number of columns ............https://www.w3resource.com/python-exercises/pandas/python-pandas-data-frame-exercise-57.php
number_of_columns = len(df.columns)
if number_of_rows > number_of_columns:
    row_based = input("it appears that this is a row based file, in other words, the data go from top to bottom, and the top row are the datas' title. If so, press enter, otherwise press any other keyboard and enter. ")
    if row_based == "":
        ###### https://stackoverflow.com/questions/23979184/how-to-know-if-a-user-has-pressed-the-enter-key-using-python.................You should use \n as enter. It means the newline character.
All I have done so far is determine the orientation of the data. Now I need to "clean" the data.
        ###### this is the real work indent...##########################
        # quantify how many empty values are in each column..............https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
        num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
        # sort the columns based on how many empty values they have............https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
        columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)
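Now that the columns are "sorted" (by how many empty cells each column has), I just need to pick the first one as the "lowest value". That column is either the Y value (used later in the multiple linear regression analysis) or a data field that has as many entries as the y value.
        # find the lowest value (the first element of the already sorted Series)
        lowest_value = columns_pd_sorted.iloc[0]
I also want to take the mean of the empty-cell counts. This mean will be used later (I think).
        mean_empty_cells = statistics.mean(columns_pd_sorted)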
    # if the user says it's horizontal data (instead of vertical data)
    else:
        print(" this code hasn't been built yet..")
# if the number of columns exceeds the number of rows (indicates horizontal data series)
else:
    print(" this code hasn't been built yet..")
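My end goal is:
** !!!!!!!!!!!!!!!!!!!!! How I think I could solve it (this is the real question) !!!!!!!!!!!!!!!!!! **
I think I can solve all of this by doing the following, but I don't know how to write the code. Ultimately, my problem is that I can't find a way to force pandas to give me the column names together with the data.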
y_variable_candidates = []
for col in num_of_empty_cells_in_columns:
    if col == lowest_value:
        y_variable_candidates = y_variable_candidates + col
y_variable = y_variable_candidates[1]
y_variable_confirmation = input('currently your y variable is ' + str(y_variable) + ' it appears that there are many y variable candidates, such as' + str(y_variable_candidates) + 'press enter if the current y variable is okay, otherwise press a number key to indicate which column should be the y variable')
# ... more code later on
mostly_empty_columns = []
for col in num_of_empty_cells_in_columns:
    if col > mean_empty_cells:
        mostly_empty_columns = mostly_empty_columns + col
# some code to get user to confirm to delete all the selected columns
Desired final data:
y value  data 1  data 2  data 3  data 5  data 6
282      1       215     169     14      147
148      0       250     307     232     134
351      1       191     343     189     9
31       0       32      327     8       201
33       0       503     484     85      166
I want to run a multiple linear regression analysis on the desired final data above.
Thank you very much for your help!!!
Here is one approach that uses the dropna() function. First, we have the initial data frame:
print(df) # initial data frame
    y_value  data_1  data_2  data_3  data_4  data_5  data_6
0       282       1   215.0   169.0     NaN    14.0   147.0
1       148       0   250.0   307.0     NaN   232.0   134.0
2       351       1   191.0   343.0     NaN   189.0     9.0
3        31       0    32.0   327.0     NaN     8.0   201.0
4        33       0   503.0   484.0     NaN    85.0   166.0
5       973       0   651.0   134.0     NaN   128.0     NaN
6       329       0   300.0   186.0     NaN   195.0     NaN
7       271       1   543.0    18.0     NaN     NaN     NaN
8       814       1   544.0   123.0     NaN     NaN     NaN
9       274       1   349.0   209.0     NaN     NaN     NaN
10      425       1     NaN     NaN     NaN     NaN     NaN
Next, (a) we drop a column if every value in it is NaN, and then (b) we drop a row if any value in it is NaN:
# un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
# df = df.transpose()
# delete columns with all NaN
df = df.dropna(axis=1, how='all')
# delete rows with 1 or more NaN
df = df.dropna(axis=0, how='any')
print(df)
   y_value  data_1  data_2  data_3  data_5  data_6
0      282       1   215.0   169.0    14.0   147.0
1      148       0   250.0   307.0   232.0   134.0
2      351       1   191.0   343.0   189.0     9.0
3       31       0    32.0   327.0     8.0   201.0
4       33       0   503.0   484.0    85.0   166.0
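To address the two parts of the question as literally asked (collect the column names that satisfy a condition, then drop those columns), here is a minimal sketch. It assumes the empty Excel cells were read in as NaN. The small frame and the variable name empty_counts are made up for illustration, and the "more empty cells than the mean" rule is taken from the question's pseudocode. The point is that the Series returned by isnull().sum() is indexed by column name, so filtering it and reading .index gives the names rather than the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data (NaN marks an empty cell).
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31],
    "data_1": [1, 0, 1, 0],
    "data_2": [215.0, 250.0, 191.0, np.nan],
    "data_4": [np.nan, np.nan, np.nan, np.nan],
})

# Count empty cells per column; the result is a Series indexed by column name.
empty_counts = df.isnull().sum(axis=0)

# Part 1: names of the columns whose count meets a condition.
y_variable_candidates = empty_counts[empty_counts == empty_counts.min()].index.tolist()
mostly_empty_columns = empty_counts[empty_counts > empty_counts.mean()].index.tolist()

# Part 2: drop those columns by name.
cleaned = df.drop(columns=mostly_empty_columns)

print(y_variable_candidates)
print(mostly_empty_columns)
print(cleaned.columns.tolist())
```

If an explicit loop is preferred, iterating with `for name, count in empty_counts.items():` yields the column name and its count together.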
See the pandas documentation for dropna().
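dropna() also accepts a thresh argument, the minimum number of non-NaN values a row or column must have in order to be kept. That may suit the "mostly empty columns" idea from the question better than how='all', which only removes columns that are completely empty. The sketch below is only an illustration: the frame is made up, and the cutoff (keep columns that are at least half filled) is an assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: y_value is full, data_b is half empty, data_c is mostly empty.
df = pd.DataFrame({
    "y_value": [1.0, 2.0, 3.0, 4.0],
    "data_b": [5.0, 6.0, np.nan, np.nan],
    "data_c": [7.0, np.nan, np.nan, np.nan],
})

# Assumed cutoff: keep a column only if at least half of its cells are filled.
min_filled = int(np.ceil(len(df) / 2))

# thresh counts non-NaN values; columns below the cutoff are dropped.
df = df.dropna(axis=1, thresh=min_filled)

# Then drop any row that still has a gap.
df = df.dropna(axis=0, how="any")

print(df)
```

Here data_c is dropped because it has only one filled cell, and the two rows where data_b is empty are then removed.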