简体   繁体   中英

standardizing column in pandas with loc

I have a column that appears to have data in 4 different formats, I created a small snippet with an array to illustrate what I am working with

ex_array = np.array(['100X172',
 '78X120',
 '1 ac',
 '76,666',
 '85X175',
 '19,928',
 '14810',
 '3 ac',
 '90X181',
 '38X150',
 '19040',
 '8265',
 '100X125',
 '6000',
 '8,750',
 '.448 ac'])

ex_df = pd.DataFrame(data=ex_array, columns=['ex_col'])

this outputs the following as expected:

    ex_col
0   100X172
1   78X120
2   1 ac
3   76,666
4   85X175
5   19,928
6   14810
7   3 ac
8   90X181
9   38X150
10  19040
11  8265
12  100X125
13  6000
14  8,750
15  .448 ac

The goal is to standardize the column where everything would be in acres and the desired output would be as follows

ex_df['acreage'] =
    acreage
0   .394858
1   .214876
2   1
3   1.76
4   .341483
5   .457484
6   .339991
7   3
8   .373967
9   .130854
10  .437098
11  .189738
12  .284665
13  .137741
14  .200872
15  .448

my thought in pandas was to create 3 boolean columns to handle the different types of data

ex_df['hasX'] = ex_df['ex_col'].str.contains('X')
ex_df['has_ac'] = ex_df['ex_col'].str.contains('ac')
ex_df['has_comma'] = ex_df['ex_col'].str.contains(',')

this outputs as expected

ex_df
ex_col  hasX    has_ac  has_comma
0   100X172     True    False   False
1   78X120  True    False   False
2   1 ac    False   True    False
3   76,666  False   False   True
4   85X175  True    False   False
5   19,928  False   False   True
6   14810   False   False   False
7   3 ac    False   True    False
8   90X181  True    False   False
9   38X150  True    False   False
10  19040   False   False   False
11  8265    False   False   False
12  100X125     True    False   False
13  6000    False   False   False
14  8,750   False   False   True
15  .448 ac     False   True    False

next I attempted multiple loc operations as follows

ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.split()[0]))
ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))

this outputs the following error:

<ipython-input-40-e9eb1eacbacb> in <module>
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
      2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
      3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
      4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-40-e9eb1eacbacb> in <lambda>(x)
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
      2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
      3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
      4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))

ValueError: invalid literal for int() with base 10: '1 ac'

Let's try extract to split your data column and use np.select to map:

data = (ex_df['ex_col'].str.replace(',','')
            .str.extract('([\.\d]+)\s?(ac|X)?([\.\d,]+)?')
       )

data[[0,2]] = data[[0,2]].astype(float)

ex_df['area'] = np.select((data[1].eq('X'), data[1].eq('ac')),
                          (data[0]* data[2]/43560,data[0]), 
                          data[0]/43560 )

Output:

     ex_col      area
0   100X172  0.394858
1    78X120  0.214876
2      1 ac  1.000000
3    76,666  1.760009
4    85X175  0.341483
5    19,928  0.457484
6     14810  0.339991
7      3 ac  3.000000
8    90X181  0.373967
9    38X150  0.130854
10    19040  0.437098
11     8265  0.189738
12  100X125  0.286961
13     6000  0.137741
14    8,750  0.200872
15  .448 ac  0.448000

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM