
Python, Pandas, df two-part question: 1. how to add columns to a list based on a certain condition 2. how to remove those columns from the df

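The real question is further down; this is the background.

My end goal is to make a general-purpose Python file (a script, a .py file) that opens an Excel file, determines how the data are organized, "cleans" the unusable data, and then runs a multiple linear regression analysis. I am stuck on the "cleaning" part: I have ideas, but I don't know how to carry them out.

Here is what the data look like (in the Excel file):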

y value data 1  data 2  data 3  data 4  data 5  data 6
282     1       215     169             14      147
148     0       250     307             232     134
351     1       191     343             189     9
31      0       32      327             8       201
33      0       503     484             85      166
973     0       651                     134     128
329     0       300                     186     195
271     1       543                             18
814     1       544                             123
274     1       349                             209
425     1                   

So, here is my code so far (with explanations):

import pandas as pd
import numpy as np
import statistics

df =pd.read_excel (r'D:\...data.xlsx')

## find the longest column and longest rows.........https://www.kite.com/python/answers/how-to-count-the-number-of-rows-in-a-pandas-dataframe-in-python
index = df.index
number_of_rows = len(index)

#####find number of columns ............https://www.w3resource.com/python-exercises/pandas/python-pandas-data-frame-exercise-57.php
number_of_columns = len(df.columns)

if number_of_rows>number_of_columns:
    row_based = input("it appears that this is a row based file, in other words, the data go from top to bottom, and the top row are the datas' title. If so, press enter, otherwise press any other keyboard and enter. " )
    if row_based == "":
    ######https://stackoverflow.com/questions/23979184/how-to-know-if-a-user-has-pressed-the-enter-key-using-python.................You should use \n as enter. It means the newline character.

So far, all I have done is determine the orientation of the data. Now I need to "clean" the data.

        ######this is the real work indent...##########################

        # count how many empty values are in each column..............https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
        num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
        # sort the columns by how many empty values they have............https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
        columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)

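Now that the columns are "sorted" (by how many empty cells each column has), I just need to take the first entry as the "lowest value". That column is either the Y value (for the multiple linear regression analysis later) or a data field that has as many entries as the y value.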

        # find the lowest value (the first entry of the already sorted Series; .iloc selects by position)
        lowest_value = columns_pd_sorted.iloc[0]

I also want the mean of the empty-cell counts across all columns. This mean will be used later (I think).

        mean_empty_cells=statistics.mean(columns_pd_sorted)

    #if the user says its horizontal data (instead of vertical data)
    else:
        print(" this code hasn't been built yet..")

#if the number of columns exceeds the number of rows (indicates horizontal data series)
else:
    print(" this code hasn't been built yet..")

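My end goals are:

  1. Ask the user to verify (or choose) the y variable.
  2. Eliminate the columns that are mostly empty fields.
  3. Eliminate the rows that do not have complete data (data in every column).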

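** !!!!!!!!!!!!!!!!!!!!! How I think I can solve it (this is the real question) !!!!!!!!!!!!!!! !!! **

I think I can solve all of this by doing the following, but I don't know how to write the code. Ultimately my problem is that I can't find a way to make pandas give me the column names together with the data.

  1. Make a list of the columns whose number of empty cells equals the lowest value. I assume some kind of iteration? The code below is really a stab in the dark.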
        y_variable_candidates = []
        for col in num_of_empty_cells_in_columns:
           if col=lowest_value:
              y_variable_candidates=y_variable_candidates + col

        y_variable = y_variable_candidates[1]

        y_variable_confirmation = input('currently your y variable is ' + str(y_variable) +' it appears that there are many y variable candidates, such as' + str(y_variable_candidates) + 'press enter if the current y variable is okay, otherwise press a number key to indicate which column should be the y variable')
        #... more code later on
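  2. Likewise, make a list of all columns that have more empty cells than the mean (mean_empty_cells). Again, I think I can do this by iterating.
  3. Set y_variable to the first "lowest_value" column, but ask the user to confirm.
  4. Go back to the df and delete all the matching columns.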
        mostly_empty_columns = []
        for col in num_of_empty_cells_in_columns:
           if col>mean_empty_cells:
              mostly_empty_columns=mostly_empty_columns + col

        #some code to get user to confirm to delete all the selected columns
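
A minimal sketch of the two lists described above, assuming the data have already been loaded into a DataFrame. The sample frame here is typed in by hand as a stand-in for the Excel file, the underscore column names and `df_trimmed` are illustrative, and the user confirmation via input() is left out. The key point is that `df.isnull().sum()` returns a Series indexed by column name, so iterating with `.items()` gives the name and the count together:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data standing in for the Excel file (assumed layout; NaN = empty cell).
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

# isnull().sum() returns a Series whose index holds the column names.
num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
lowest_value = num_of_empty_cells_in_columns.min()
mean_empty_cells = num_of_empty_cells_in_columns.mean()

# .items() yields (column name, empty-cell count) pairs, so the name is
# available inside the loop.
y_variable_candidates = []
mostly_empty_columns = []
for col_name, empty_count in num_of_empty_cells_in_columns.items():
    if empty_count == lowest_value:
        y_variable_candidates.append(col_name)
    if empty_count > mean_empty_cells:
        mostly_empty_columns.append(col_name)

# First candidate becomes the default y variable (user confirmation omitted).
y_variable = y_variable_candidates[0]

# Drop the flagged columns, never dropping the chosen y variable.
to_drop = [c for c in mostly_empty_columns if c != y_variable]
df_trimmed = df.drop(columns=to_drop)
```

The same lists can be built without a loop through boolean indexing, for example `num_of_empty_cells_in_columns[num_of_empty_cells_in_columns > mean_empty_cells].index.tolist()`. Note that on this sample the above-the-mean rule flags data_4, data_5 and data_6, so it is a fairly aggressive cut-off.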

Desired final data:

y value data 1  data 2  data 3  data 5  data 6
282     1       215     169     14      147
148     0       250     307     232     134
351     1       191     343     189     9
31      0       32      327     8       201
33      0       503     484     85      166

I want to run a multiple linear regression analysis on the desired final data above.

Thank you very much for your help!!!

Here is one approach that uses the dropna() function. First, we have the initial data frame:

print(df)   # initial data frame
    y_value  data_1  data_2  data_3  data_4  data_5  data_6
0       282       1   215.0   169.0     NaN    14.0   147.0
1       148       0   250.0   307.0     NaN   232.0   134.0
2       351       1   191.0   343.0     NaN   189.0     9.0
3        31       0    32.0   327.0     NaN     8.0   201.0
4        33       0   503.0   484.0     NaN    85.0   166.0
5       973       0   651.0   134.0     NaN   128.0     NaN
6       329       0   300.0   186.0     NaN   195.0     NaN
7       271       1   543.0    18.0     NaN     NaN     NaN
8       814       1   544.0   123.0     NaN     NaN     NaN
9       274       1   349.0   209.0     NaN     NaN     NaN
10      425       1     NaN     NaN     NaN     NaN     NaN

Next, (a) we drop a column if every value in it is NaN, then (b) we drop a row if any value in it is NaN:

# un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
# df = df.transpose()

# delete columns with all NaN
df = df.dropna(axis=1, how='all')

# delete rows with 1 or more NaN
df = df.dropna(axis=0, how='any')

print(df)

   y_value  data_1  data_2  data_3  data_5  data_6
0      282       1   215.0   169.0    14.0   147.0
1      148       0   250.0   307.0   232.0   134.0
2      351       1   191.0   343.0   189.0     9.0
3       31       0    32.0   327.0     8.0   201.0
4       33       0   503.0   484.0    85.0   166.0
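
If "mostly empty" columns should go as well, not only the entirely empty ones, dropna() also accepts a `thresh` argument: the minimum number of non-NaN values a column needs in order to be kept. The following is a sketch only; the 50% cut-off and the name `min_fill_fraction` are assumptions chosen for illustration, not something taken from the question:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same values as the initial data frame printed above.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

# Hypothetical cut-off: keep a column only if at least half its cells are filled.
min_fill_fraction = 0.5
min_filled = int(np.ceil(len(df) * min_fill_fraction))  # 11 rows -> 6 cells

# thresh = minimum count of non-NaN values required to keep a column.
trimmed = df.dropna(axis=1, thresh=min_filled)

# Then keep only the rows that are complete in the remaining columns.
complete = trimmed.dropna(axis=0, how="any")
```

With this cut-off data_4 (0 filled cells) and data_6 (5 filled cells) are dropped, and 7 complete rows remain. A lower cut-off keeps more columns at the cost of fewer complete rows, so the fraction is worth confirming with the user.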

See the pandas documentation for dropna().
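
For the regression step the question is heading towards, here is a rough sketch using ordinary least squares through `numpy.linalg.lstsq`. It is not part of the dropna() approach above. The choice of only two predictors is an assumption made for illustration: the cleaned frame has 5 rows, which is too few to determine an intercept plus all five predictors.

```python
import numpy as np
import pandas as pd

# The cleaned frame from above, typed in again so the example runs on its own.
cleaned = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33],
    "data_1": [1, 0, 1, 0, 0],
    "data_2": [215.0, 250.0, 191.0, 32.0, 503.0],
    "data_3": [169.0, 307.0, 343.0, 327.0, 484.0],
    "data_5": [14.0, 232.0, 189.0, 8.0, 85.0],
    "data_6": [147.0, 134.0, 9.0, 201.0, 166.0],
})

# Illustrative subset of predictors (assumption: 5 rows cannot support all 5).
predictors = ["data_1", "data_2"]

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([
    np.ones(len(cleaned)),
    cleaned[predictors].to_numpy(dtype=float),
])
y = cleaned["y_value"].to_numpy(dtype=float)

# Least-squares solution of X @ coef ~= y.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slopes = coef[0], coef[1:]
fitted = X @ coef
```

Here `intercept` and `slopes` hold the fitted coefficients in the order of `predictors`. With more rows of complete data, the full set of predictor columns could be passed in the same way.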
