Check the data format of multiple columns and append the results to one column in Pandas

Question

Given a toy dataset as follows:

   id    room   area           situation
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN

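I need to check three columns, room, area and situation, against the following conditions:

(1) If the room name contains anything other than digits, letters or - (NaN is also considered invalid), return incorrect room name in the check column;

(2) If area is neither a number nor NaN, return area is not numbers and append it to the existing check column;

(3) If situation contains under decoration, return decoration is in the content and append it to the existing check column.

Please note that my real data has other columns to check as well, and I need to append each new check result using "; " as the separator.

How can I get an expected result like this: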

   id    room   area           situation                                                                   check
0   1   A-102  world  under construction                                                     area is not numbers
1   2     NaN     24  under construction                                                     incorrect room name
2   3    B309    NaN                 NaN                                                                     NaN
3   4   C·102     25    under decoration                       incorrect room name; decoration is in the content
4   5  E_1089  hello    under decoration  incorrect room name; area is not numbers; decoration is in the content
5   6      27    NaN          under plan                                                                     NaN
6   7      27    NaN                 NaN                                                                     NaN

My code so far:

Room name check:

df['check'] = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$'), np.nan, 'incorrect room name')

Out:

   id    room   area           situation                check
0   1   A-102  world  under construction                  nan
1   2     NaN     24  under construction                  nan
2   3    B309    NaN                 NaN                  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  incorrect room name
5   6      27    NaN          under plan                  nan
6   7      27    NaN                 NaN                  nan

Area check:

df['check'] = df['check'].where(df.area.str.contains(r'^\d+$', na = True),
                                'area is not a numbers')

Out:

   id    room   area           situation                  check
0   1   A-102  world  under construction  area is not a numbers
1   2     NaN     24  under construction                    nan
2   3    B309    NaN                 NaN                    nan
3   4   C·102     25    under decoration    incorrect room name
4   5  E_1089  hello    under decoration  area is not a numbers
5   6      27    NaN          under plan                    nan
6   7      27    NaN                 NaN                    nan

Situation check:

df['check'] = df['check'].where(df.situation.str.contains('under decoration', na = True),
                                'decoration is in the content')

Out:

   id    room   area           situation                         check
0   1   A-102  world  under construction  decoration is in the content
1   2     NaN     24  under construction  decoration is in the content
2   3    B309    NaN                 NaN                           nan
3   4   C·102     25    under decoration           incorrect room name
4   5  E_1089  hello    under decoration         area is not a numbers
5   6      27    NaN          under plan  decoration is in the content
6   7      27    NaN                 NaN                           nan

Thanks.

Answer 1

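First, rewrite each test with numpy.where so that it yields the message when the check fails and None when it passes. Then zip the arrays and apply a custom function that joins the non-missing values of each row, returning NaN when there are none: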

import numpy as np
import pandas as pd

# Each array holds the message where the check fails and None where it passes.
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$', na = False), None,
             'incorrect room name')
b = np.where(df.area.str.contains(r'^\d+$', na = True), None,
             'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na = False),
             'decoration is in the content', None)

# Join the non-missing messages of a row; return NaN when every check passed.
f = (lambda x: ';'.join(y for y in x if pd.notna(y))
               if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)
   id    room   area           situation  \
0   1   A-102  world  under construction   
1   2     NaN     24  under construction   
2   3    B309    NaN                 NaN   
3   4   C·102     25    under decoration   
4   5  E_1089  hello    under decoration   
5   6      27    NaN          under plan   
6   7      27    NaN                 NaN   

                                               check  
0                              area is not a numbers  
1                                incorrect room name  
2                                                NaN  
3   incorrect room name;decoration is in the content  
4  incorrect room name;area is not a numbers;deco...  
5                                                NaN  
6                                                NaN
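
The question also mentions further columns to check and "; " as the separator. Below is a minimal sketch of one way to generalise the zip-and-join idea above, assuming every rule can be written as a boolean mask that is True where the row fails. The failures dict and the join_messages helper are illustrative names, not part of the original answer, and the toy frame is rebuilt with room and area stored as strings.

```python
import numpy as np
import pandas as pd

# Toy data from the question; room and area are strings so the .str accessor works.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# Illustrative mapping: message -> mask that is True where the row FAILS the rule.
# na=False makes a missing room fail; na=True lets a missing area pass.
failures = {
    "incorrect room name":
        ~df["room"].str.fullmatch(r"[A-Za-z\d-]*", na=False).astype(bool),
    "area is not numbers":
        ~df["area"].str.fullmatch(r"\d+", na=True).astype(bool),
    "decoration is in the content":
        df["situation"].str.contains("under decoration", na=False).astype(bool),
}


def join_messages(row_flags, messages, sep="; "):
    """Join the messages whose flag is set; NaN when no rule was violated."""
    hits = [msg for msg, flag in zip(messages, row_flags) if flag]
    return sep.join(hits) if hits else np.nan


messages = list(failures)
# One boolean column per rule, one row per record.
flags = pd.concat(list(failures.values()), axis=1).to_numpy()
df["check"] = [join_messages(row, messages) for row in flags]
print(df)
```

Checking another column then only needs one more entry in the dict; the joining step stays unchanged.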

Answer 2

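I slightly modified your conditions so that the result is closer to your expected output. Note that, as written, .notnull() on the str.match result only tests for missing values, so a room such as C·102 is not flagged, and a missing area is reported as not numeric: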

a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$').notnull(), pd.NA, 'incorrect room name')
b = np.where(df["area"].str.isnumeric() & df["area"].notnull(), pd.NA, 'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na = False), 'decoration is in the content', pd.NA)

s = (pd.concat([pd.Series(i, index=df.index) for i in (a, b, c)], axis = 1)
       .stack().groupby(level = 0).agg("; ".join))

print(df.assign(check=s))

   id    room   area           situation                                              check
0   1   A-102  world  under construction                              area is not a numbers
1   2     NaN     24  under construction                                incorrect room name
2   3    B309    NaN                 NaN  area is not a numbers; decoration is in the co...
3   4   C·102     25    under decoration                       decoration is in the content
4   5  E_1089  hello    under decoration  area is not a numbers; decoration is in the co...
5   6      27    NaN          under plan                              area is not a numbers
6   7      27    NaN                 NaN  area is not a numbers; decoration is in the co...
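
For comparison, here is a minimal sketch of the same concat / stack / groupby pattern with masks written to follow the three rules in the question literally. The mask names and the message_where helper are illustrative assumptions, and the explicit dropna() is there because stack() does not drop missing values in every pandas version.

```python
import numpy as np
import pandas as pd

# Toy data from the question; room and area are strings so the .str accessor works.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# Boolean masks, True where a rule is violated (names are illustrative).
bad_room = ~df["room"].str.match(r"^[a-zA-Z\d\-]*$", na=False).astype(bool)
bad_area = ~df["area"].str.match(r"^\d+$", na=True).astype(bool)
has_deco = df["situation"].str.contains("under decoration", na=False).astype(bool)


def message_where(mask, text):
    """Series holding `text` where mask is True and NaN elsewhere."""
    return pd.Series(text, index=mask.index, dtype=object).where(mask)


parts = [
    message_where(bad_room, "incorrect room name"),
    message_where(bad_area, "area is not numbers"),
    message_where(has_deco, "decoration is in the content"),
]

# Go to long format, drop the NaN placeholders, then join what is left per row.
s = (pd.concat(parts, axis=1)
       .stack()
       .dropna()
       .groupby(level=0)
       .agg("; ".join))

# Rows without any message are absent from s and become NaN after alignment.
result = df.assign(check=s)
print(result)
```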

Answer 3

You can try this:

import glob
import os

import pandas as pd

# A raw string cannot end with a backslash, so the trailing one is dropped.
os.chdir(r"C:\Users\Rameez PC\Desktop\python data files 2")

extension = 'xlsx'
all_filenames = glob.glob('*.{}'.format(extension))

# combine all files in the list
combined_xlsx1 = pd.concat([pd.read_excel(f) for f in all_filenames])
# export to Excel
combined_xlsx1.to_excel("combined.xlsx", index=False)

3 answers

Answer 1
2, accepted, 2020-11-09 06:53:39

Answer 2
1, 2020-11-09 07:00:23

Answer 3
-1, 2020-11-09 06:28:36
