I have a column that appears to contain data in four different formats. I created a small snippet with an array to illustrate what I am working with:
import numpy as np
import pandas as pd

ex_array = np.array(['100X172',
'78X120',
'1 ac',
'76,666',
'85X175',
'19,928',
'14810',
'3 ac',
'90X181',
'38X150',
'19040',
'8265',
'100X125',
'6000',
'8,750',
'.448 ac'])
ex_df = pd.DataFrame(data=ex_array, columns=['ex_col'])
This outputs the following, as expected:
ex_col
0 100X172
1 78X120
2 1 ac
3 76,666
4 85X175
5 19,928
6 14810
7 3 ac
8 90X181
9 38X150
10 19040
11 8265
12 100X125
13 6000
14 8,750
15 .448 ac
The goal is to standardize the column so that everything is in acres. The desired output would be as follows:
ex_df['acreage'] =
acreage
0 .394858
1 .214876
2 1
3 1.76
4 .341483
5 .457484
6 .339991
7 3
8 .373967
9 .130854
10 .437098
11 .189738
12 .286961
13 .137741
14 .200872
15 .448
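For reference, the desired output above seems to imply these conversions: an entry like 100X172 is a lot dimension in feet, a bare number (with or without a thousands comma) is an area in square feet, and one acre is 43,560 square feet. The unit interpretation is an assumption inferred from the numbers, and the variable names below are only illustrative. A quick check against two rows:

```python
SQFT_PER_ACRE = 43560  # one acre is 43,560 square feet

# Assumption: '100X172' means 100 ft by 172 ft
dims_acres = 100 * 172 / SQFT_PER_ACRE
# Assumption: '76,666' means 76,666 square feet
sqft_acres = 76666 / SQFT_PER_ACRE

print(round(dims_acres, 6))  # 0.394858
print(round(sqft_acres, 6))  # 1.760009
```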
My thought in pandas was to create three boolean columns to handle the different types of data:
ex_df['hasX'] = ex_df['ex_col'].str.contains('X')
ex_df['has_ac'] = ex_df['ex_col'].str.contains('ac')
ex_df['has_comma'] = ex_df['ex_col'].str.contains(',')
This outputs as expected:
ex_df
ex_col hasX has_ac has_comma
0 100X172 True False False
1 78X120 True False False
2 1 ac False True False
3 76,666 False False True
4 85X175 True False False
5 19,928 False False True
6 14810 False False False
7 3 ac False True False
8 90X181 True False False
9 38X150 True False False
10 19040 False False False
11 8265 False False False
12 100X125 True False False
13 6000 False False False
14 8,750 False False True
15 .448 ac False True False
Next, I attempted multiple loc operations as follows:
ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.split()[0]))
ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
This outputs the following error:
<ipython-input-40-e9eb1eacbacb> in <module>
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-40-e9eb1eacbacb> in <lambda>(x)
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
ValueError: invalid literal for int() with base 10: '1 ac'
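The traceback points at the cause: the apply call on the right-hand side runs over every row of ex_col before loc filters anything, so the 'X' lambda is also called on '1 ac'. One possible way to keep the loc approach is to apply each lambda only to the rows its mask selects. The sketch below uses a shortened sample and illustrative names (has_x, is_sqft, SQFT_PER_ACRE); dividing the comma and plain-number rows by 43560 is an assumption based on the desired output, since the original attempt leaves them in square feet.

```python
import pandas as pd

SQFT_PER_ACRE = 43560  # one acre is 43,560 square feet

# Shortened sample covering all four formats
ex_df = pd.DataFrame({'ex_col': ['100X172', '1 ac', '76,666', '14810', '.448 ac']})
col = ex_df['ex_col']

has_x = col.str.contains('X')
has_ac = col.str.contains('ac')
is_sqft = ~has_x & ~has_ac  # plain or comma-separated numbers

ex_df['acreage'] = float('nan')  # start with an empty float column

# Each apply only sees the rows selected by its own mask
ex_df.loc[has_x, 'acreage'] = col[has_x].apply(
    lambda s: int(s.split('X')[0]) * int(s.split('X')[1]) / SQFT_PER_ACRE)
ex_df.loc[has_ac, 'acreage'] = col[has_ac].apply(lambda s: float(s.split()[0]))
ex_df.loc[is_sqft, 'acreage'] = col[is_sqft].apply(
    lambda s: float(s.replace(',', '')) / SQFT_PER_ACRE)

print(ex_df)
```

Because the comma rows and the plain-number rows need the same treatment once the comma is removed, a single mask covers both.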
Let's try extract to split your data column and use np.select to map:
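# Drop thousands separators, then capture: leading number, optional unit ('ac' or 'X'), optional second number
data = (ex_df['ex_col'].str.replace(',', '')
        .str.extract(r'([\.\d]+)\s?(ac|X)?([\.\d,]+)?')
       )
# Columns 0 and 2 hold the numeric parts; convert them from strings to floats
data[[0, 2]] = data[[0, 2]].astype(float)
# 'X' rows: product in sq ft -> acres; 'ac' rows: already acres; otherwise: sq ft -> acres
ex_df['area'] = np.select((data[1].eq('X'), data[1].eq('ac')),
                          (data[0] * data[2] / 43560, data[0]),
                          data[0] / 43560)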
Output:
ex_col area
0 100X172 0.394858
1 78X120 0.214876
2 1 ac 1.000000
3 76,666 1.760009
4 85X175 0.341483
5 19,928 0.457484
6 14810 0.339991
7 3 ac 3.000000
8 90X181 0.373967
9 38X150 0.130854
10 19040 0.437098
11 8265 0.189738
12 100X125 0.286961
13 6000 0.137741
14 8,750 0.200872
15 .448 ac 0.448000
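To see what the pattern in this answer captures, here is a small sketch that runs the same regular expression through Python's re module on a few of the sample values, after removing commas. It only illustrates the three capture groups and is not part of the answer's code; the names pattern, samples, and captures are illustrative.

```python
import re

# Same pattern as in the answer, written as a raw string
pattern = re.compile(r'([\.\d]+)\s?(ac|X)?([\.\d,]+)?')

samples = ['100X172', '.448 ac', '76,666']

# Map each raw value to its (number, unit, second number) groups
captures = {raw: pattern.search(raw.replace(',', '')).groups() for raw in samples}

for raw, groups in captures.items():
    print(raw, groups)
# 100X172 ('100', 'X', '172')
# .448 ac ('.448', 'ac', None)
# 76,666 ('76666', None, None)
```

The middle group is what np.select branches on: 'X' means multiply the two numbers, 'ac' means the value is already in acres, and a missing unit falls through to the default of square feet.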