I have a csv file similar to this but with about 155,000 rows with years from 1910-2010 and 83 different station id's:
station_id year month element 1 2 3 4 5 6
216565 2008 7 SNOW 0TT 0 0 0 0 0
216565 2008 8 SNOW 0 0T 0 0 0 0
216565 2008 9 SNOW 0 0 0 0 0 0
and I want to replace any value that matches the pattern of a number followed by one or two letters (e.g. 0T or 0TT) with NaN.
My desired output then is:
station_id year month element 1 2 3 4 5 6
216565 2008 7 SNOW NaN 0 0 0 0 0
216565 2008 8 SNOW 0 NaN 0 0 0 0
216565 2008 9 SNOW 0 0 0 0 0 0
I have tried to use:
replace = df.replace([r'[0-9][A-Z]'], ['NA'])
replace2 = replace.replace([r'[0-9][A-Z][A-Z]'], ['NA'])
I was hoping that the pattern [0-9][A-Z] would take care of a number followed by one letter, and [0-9][A-Z][A-Z] would replace any cells with two letters, but the file stays exactly the same even though no errors are returned.
Any help would be much appreciated.
You can use the pandas method convert_objects to do this, setting convert_numeric to True:
convert_numeric : if True, attempt to coerce to numbers (including strings); non-convertibles get NaN
>>> df
station_id year month element 1 2 3 4 5 6
0 216565 2008 7 SNOW 0TT 0 0 0 0 0
1 216565 2008 8 SNOW 0 0T 0 0 0 0
2 216565 2008 9 SNOW 0 0 0 0 0 0
>>> df.convert_objects(convert_numeric=True)
station_id year month element 1 2 3 4 5 6
0 216565 2008 7 SNOW NaN 0 0 0 0 0
1 216565 2008 8 SNOW 0 NaN 0 0 0 0
2 216565 2008 9 SNOW 0 0 0 0 0 0
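Note that convert_objects was deprecated and later removed from pandas; on current versions the equivalent is pd.to_numeric with errors='coerce'. A minimal sketch, assuming column names taken from your sample, applying the coercion only to the day columns so the element column is not wiped out:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'station_id': [216565, 216565, 216565],
    'year': [2008, 2008, 2008],
    'month': [7, 8, 9],
    'element': ['SNOW', 'SNOW', 'SNOW'],
    '1': ['0TT', '0', '0'],
    '2': ['0', '0T', '0'],
    '3': ['0', '0', '0'],
    '4': ['0', '0', '0'],
    '5': ['0', '0', '0'],
    '6': ['0', '0', '0'],
})

# Coerce only the day columns; anything non-numeric ('0TT', '0T') becomes NaN.
day_cols = ['1', '2', '3', '4', '5', '6']
df[day_cols] = df[day_cols].apply(pd.to_numeric, errors='coerce')
print(df)
```

Restricting the coercion to day_cols matters: pd.to_numeric on the element column would turn every 'SNOW' into NaN as well.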
If you wish to go the route of using replace, you need to modify your call.
>>> df
station_id year month element 1 2 3 4 5 6
0 216565 2008 7 SNOW 0TT 0 0 0 0 0
1 216565 2008 8 SNOW 0 0T 0 0 0 0
2 216565 2008 9 SNOW 0 0 0 0 0 0
>>> df.replace(value=np.nan, regex=r'[0-9][A-Z]+')
station_id year month element 1 2 3 4 5 6
0 216565 2008 7 SNOW NaN 0 0 0 0 0
1 216565 2008 8 SNOW 0 NaN 0 0 0 0
2 216565 2008 9 SNOW 0 0 0 0 0 0
This also requires that you import numpy (import numpy as np).
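Pulled together, a minimal runnable version of this replace approach (the DataFrame is trimmed to two day columns for brevity; column names are assumptions based on your sample):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'station_id': [216565, 216565, 216565],
    'element': ['SNOW', 'SNOW', 'SNOW'],
    '1': ['0TT', '0', '0'],
    '2': ['0', '0T', '0'],
})

# A digit followed by one or more letters marks a bad cell; because the
# replacement value (np.nan) is not a string, any matching cell is
# replaced wholesale with NaN rather than partially substituted.
cleaned = df.replace(value=np.nan, regex=r'[0-9][A-Z]+')
print(cleaned)
```

Cells that don't match the pattern, such as the plain '0' entries, are left untouched as strings, so you may still want a pd.to_numeric pass afterwards.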
str.replace doesn't do regexes. Use the re module instead (assuming df is a string):
import re
re.sub(r'[0-9][A-Z]+', 'NaN', df)
returns:
station_id year month element 1 2 3 4 5 6
216565 2008 7 SNOW NaN 0 0 0 0 0
216565 2008 8 SNOW 0 NaN 0 0 0 0
216565 2008 9 SNOW 0 0 0 0 0 0
However, you would be better off letting e.g. pandas or np.genfromtxt handle the invalid values automatically.
from re import sub
string = "station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW 0TT 0 0 0 0 0 216565 2008 8 SNOW 0 0T 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0"
string = sub(r'\d[A-Za-z]{1,2}', 'NaN', string)
print(string)
# station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW NaN 0 0 0 0 0 216565 2008 8 SNOW 0 NaN 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0
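Following the suggestion above to let pandas handle the parsing, a sketch that reads the whitespace-delimited data and coerces the bad cells in one pass (io.StringIO stands in for your actual file path):

```python
import io

import numpy as np
import pandas as pd

data = """station_id year month element 1 2 3 4 5 6
216565 2008 7 SNOW 0TT 0 0 0 0 0
216565 2008 8 SNOW 0 0T 0 0 0 0
216565 2008 9 SNOW 0 0 0 0 0 0"""

# sep=r'\s+' handles the space-aligned columns.
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# Coerce the day columns; non-numeric values like '0TT' become NaN.
day_cols = [str(n) for n in range(1, 7)]
df[day_cols] = df[day_cols].apply(pd.to_numeric, errors='coerce')
print(df)
```

This scales to the full 155,000-row file: replace the StringIO object with the path to your CSV and adjust sep if the real file is comma-delimited.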