
Parsing price out of a character string with regex in R

My data looks like this:

L/S Price
$555,000Previous Price: $575,000
$865,000Previous Price: $875,000
$995,000 
$1,325,000Previous Price: $1,459,000

The result I want is this:

555000
865000
995000
1325000

The best regex I could come up with was ([0-9,])+, but it has several problems, such as also matching the number after "Previous Price", which is just noise. I included the comma in the regex so that I can match the entire price, even though I need to remove the comma eventually.

Alternatively, I am thinking that I could select the part I DON'T want with something like ([a-zA-Z]).+ and then remove it, though I'm having trouble implementing this.

Here's a dput:

> dput(mls_res$`L/S Price`[1:4])
c("$555,000Previous Price: $575,000", "$865,000Previous Price: $875,000", 
"$995,000 ", "$1,325,000Previous Price: $1,459,000")

With the stringr library, you can do something like this:

library(stringr)
x <- c('$555,000Previous Price: $575,000', '$865,000Previous Price: $875,000', '$995,000', '$1,325,000Previous Price: $1,459,000')
# Extract the leading "$" with its digits and commas, then strip "$" and "," before converting
as.numeric(gsub('\\$|,', '', str_extract(x, '^\\$[0-9,]*')))
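
For readers without stringr, a rough base R counterpart of the same idea might look like the sketch below. The names x, first_price and prices are illustrative only. One caveat: regmatches() combined with regexpr() silently drops elements that do not match, whereas str_extract() returns NA for them, so the two are not interchangeable on messy input.

```r
# Same four strings as in the stringr example above
x <- c("$555,000Previous Price: $575,000", "$865,000Previous Price: $875,000",
       "$995,000", "$1,325,000Previous Price: $1,459,000")

# regexpr() locates the first match per element; regmatches() pulls it out
first_price <- regmatches(x, regexpr("^\\$[0-9,]*", x))

# Strip the dollar sign and commas, then convert to numeric
prices <- as.numeric(gsub("[$,]", "", first_price))
print(prices)
```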

This approach is simple and needs no packages. It removes the first P and everything after it, then removes all non-digits from what is left, and finally converts the result to numeric.

as.numeric(gsub("\\D", "", sub("P.*", "", s)))
## [1]  555000  865000  995000 1325000

If the last digit may be followed by a letter other than P, then replace P with [[:alpha:]].
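
As a small sketch of that [[:alpha:]] variant: the input below is hypothetical, with the labels "Was:" and "Reduced from" invented purely to show trailing text that does not start with P.

```r
# Hypothetical input: the second and fourth labels are made up for illustration
s <- c("$555,000Previous Price: $575,000", "$865,000Was: $875,000",
       "$995,000 ", "$1,325,000Reduced from $1,459,000")

# Drop everything from the first letter onward, then drop all remaining non-digits
prices <- as.numeric(gsub("\\D", "", sub("[[:alpha:]].*", "", s)))
print(prices)
```

Because sub() leaves a string untouched when it contains no letter, the "$995,000 " element still goes through the second step and loses its dollar sign, comma and trailing space.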

Note: We used this input:

s <- c("$555,000Previous Price: $575,000", "$865,000Previous Price: $875,000", 
       "$995,000 ", "$1,325,000Previous Price: $1,459,000")

We can either use capture groups ((...)) to capture the numeric parts of the string and then replace the whole match with backreferences to the captured groups (str1 here is the character vector from the question's dput):

as.numeric(gsub("^\\D*([0-9]+),*([0-9]+),([0-9]+).*", "\\1\\2\\3", str1))
#[1]  555000  865000  995000 1325000
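
A variation on the capture-group idea, offered only as a sketch: capture the whole leading run of digits and commas in a single group, then remove the commas in a second step. This avoids hard-coding how many comma-separated groups a price has. The names str1, first_price and prices are assumptions for this example.

```r
str1 <- c("$555,000Previous Price: $575,000", "$865,000Previous Price: $875,000",
          "$995,000 ", "$1,325,000Previous Price: $1,459,000")

# Keep only the first price: an optional "$", then one group of digits and commas
first_price <- sub("^\\$?([0-9,]+).*$", "\\1", str1)

# Remove the thousands separators and convert
prices <- as.numeric(gsub(",", "", first_price, fixed = TRUE))
print(prices)
```

A string that does not begin with a price is returned unchanged by sub(), so as.numeric() would then yield NA with a warning for that element.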

Or just match the unwanted characters (the $ and , symbols, plus everything from the first letter onward) and replace them with "".

as.numeric(gsub("[$,]|[[:alpha:]]+.*", "", str1))
#[1]  555000  865000  995000 1325000
