I am trying to extract just the email addresses from a text column in OpenRefine. Some cells contain only the email, but others have the name and email in the format john doe <john@doe.com>. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]:
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group. Since .* greedily matches any zero or more characters other than line break characters, it first consumes the whole string and then backtracks only as far as it has to, so the only character that can land in Group 1 during backtracking is the one right before @.
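The backtracking behaviour can be reproduced outside OpenRefine with a minimal Python sketch. It assumes Python's re module handles these particular constructs the same way as the regex engine behind GREL; the sample string is the one from the question, and the lazy .*? variant is an addition for illustration only, not something proposed in the thread.

```python
import re

sample = "john doe <john@doe.com>"

# Original pattern: the greedy leading .* swallows as much as it can,
# so the capturing group is left with a single character before "@".
greedy = re.match(r".*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*", sample)
greedy_result = greedy.group(1)

# Illustrative variant: a lazy leading .*? gives up characters to the group.
lazy = re.match(r".*?([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*", sample)
lazy_result = lazy.group(1)

print(greedy_result)
print(lazy_result)
```

Comparing the two results shows that the truncation comes from the greedy prefix rather than from the character classes in the group.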
If you can use partial matches, get rid of the .* and use
/[^<\s]+@[^\s>]+/
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >
Python/Jython implementation:
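import re
# Default to an empty string when the cell holds no email
res = ''
# Partial match: find the first email-like token anywhere in the cell
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res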
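There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where the leading .* cannot eat into the email address, since it has to stop before the obligatory <. Note that this pattern only matches cells that contain the <...> form, not cells holding a bare email.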
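A minimal sketch of how the full-match pattern could be combined with the partial one to cover both cell shapes. The function name extract_email and the fall-back order are assumptions made for illustration; inside OpenRefine the same logic would be applied to the cell's value directly.

```python
import re

# Pattern for cells in "name <email>" form (group 1 is the email).
FULL = re.compile(r".*<([^<]+@[^>]+)>.*")
# Partial pattern for cells that hold a bare email.
PARTIAL = re.compile(r"[^<\s]+@[^\s>]+")

def extract_email(value):
    """Return the email found in value, or '' if there is none."""
    m = FULL.match(value)
    if m:
        return m.group(1)
    # No angle brackets: fall back to searching for an email-like token.
    m = PARTIAL.search(value)
    return m.group(0) if m else ""

print(extract_email("john doe <john@doe.com>"))
print(extract_email("john@doe.com"))
```

re.match is used rather than re.fullmatch so that the sketch also runs on Python 2.7, which is what Jython provides.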
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine there is now a value.find() function that can do this, but it will only be officially released in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]
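Indexing the result of re.findall with [0] raises an IndexError when a cell contains no email at all. A minimal sketch of a more defensive variant follows; the wrapper function first_email is hypothetical and only there so the snippet runs outside OpenRefine, and returning None for empty cells is an assumption about the desired behaviour.

```python
import re

def first_email(value):
    """Return the first email-like token in value, or None if there is none."""
    matches = re.findall(r"[^<\s]+@[^\s>]+", value)
    # findall returns an empty list when nothing matches, so guard the index.
    return matches[0] if matches else None

print(first_email("john doe <john@doe.com>"))
print(first_email("no address in this cell"))
```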