
OpenRefine Regex and GREL match error

In OpenRefine, I want to run the regex below on a website's source to find email addresses inside mailto links. My trouble is that when I run value.match, I get this error:

Parsing error at offset 12: Bad regular expression (Unclosed character class near index 10 .*mailto:[^ ^)

I have tested the expression in another environment without value.match and it works.

value.match(/.*mailto:[^/"/']*.com.*/)
isNotNull(value.match(/.*mailto:[^\"\']*.com.*/)) 

As described on our reference page for the match() function, it returns an array of the capture groups in your regex pattern; isNotNull() then outputs true or false depending on whether your value matches that pattern: https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#matchstring-s-regexp-p
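To make the true/false behaviour concrete outside OpenRefine, here is a minimal Python sketch. The helper grel_like_match is hypothetical, not part of OpenRefine; it assumes, as the .* on both ends of the expressions here suggests, that the pattern has to cover the whole cell value, and that a match without capture groups yields an empty array rather than null.

```python
import re

def grel_like_match(value, pattern):
    """Hypothetical stand-in for GREL's match(): the pattern must cover the
    whole string; returns the capture groups as a list, or None if no match."""
    m = re.fullmatch(pattern, value)
    return None if m is None else list(m.groups())

# Same pattern as in the isNotNull() expression above (no capture group).
pattern = r'.*mailto:[^\"\']*.com.*'

with_link = 'Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>'
without_link = 'Blah blah, no link here'

# Emulates isNotNull(value.match(...)): only "did it match?" matters.
has_email = grel_like_match(with_link, pattern) is not None
no_email = grel_like_match(without_link, pattern) is not None
print(has_email, no_email)
```

Because the pattern has no parentheses, the matching case returns an empty list here, which is still "not null", so the check works as a plain filter.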

This is also described here: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions#basic-examples

You can also use get(), as described in the Recipes on our wiki, but it will only work well if you have just one email address per cell. This is because get(), when called without a "from" or "to" number, makes assumptions: it uses the length of the array to determine the last element and returns only that last element, not the first, the third, etc. See: https://github.com/OpenRefine/OpenRefine/wiki/Recipes#find-a-sub-pattern-that-exists-at-the-end-of-a-string

For example:

get(value.match(/.*(mailto:[^\"\']*.com).*/),0)
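The one-address-per-cell limitation can be seen with a small Python sketch run outside OpenRefine. The sample cell text is made up for illustration, and Python's re module stands in for OpenRefine's regex engine. In this stand-in, the leading .* is greedy, so the single capture group ends up holding only the later address; the findall line is a different technique (not from the answer above) shown only for contrast.

```python
import re

# Made-up cell value containing two mailto links.
cell = ('<a href="mailto:a@example.com">A</a> '
        '<a href="mailto:b@example.com">B</a>')

# Whole-string match with one capture group, as in the get(...) expression.
whole = re.fullmatch(r'.*(mailto:[^\"\']*.com).*', cell)
captured = whole.group(1)
print(captured)

# For contrast: scanning for every occurrence instead of matching the
# whole string once (lazy quantifier and escaped dot added for this sketch).
every = re.findall(r'mailto:([^\"\']*?\.com)', cell)
print(every)
```

A whole-string pattern with one group can only ever report one address, which is why cells with several links need a different approach.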

So if you have text like:

Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>

To extract the email address using the match function in OpenRefine you need to use:

value.match(/.*mailto:([^\"\']*.com).*/)

This gives an array containing the email address, which is captured by the capture group. To store the address in an OpenRefine cell, you need to pull the string value out of the array, e.g.:

value.match(/.*mailto:([^\"\']*.com).*/)[0]
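As a rough check outside OpenRefine, the same pattern can be tried in Python against the sample text above; re.fullmatch is used here on the assumption that the pattern has to cover the whole value. The second half is a side note rather than part of the answer: the dot in .com is unescaped, so it matches any character, and the "telecom" input is a made-up example of that.

```python
import re

text = 'Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>'

# Same pattern as above: one capture group around the address.
groups = re.fullmatch(r'.*mailto:([^\"\']*.com).*', text).groups()
email = groups[0]  # first (and only) capture group, like [0] above
print(email)

# Side note: an unescaped "." matches any character, so ".com" also
# accepts "ecom". Escaping it as "\." requires a literal dot.
odd = '<a href="mailto:bob@telecom">x</a>'
loose = re.fullmatch(r'.*mailto:([^\"\']*.com).*', odd)
strict = re.fullmatch(r'.*mailto:([^\"\']*\.com).*', odd)
print(loose is not None, strict is not None)
```

If only genuine .com addresses should match, writing \.com in the expression is the stricter choice.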

The difference between your original expression and this one is that the characters are escaped correctly and there is a capture group, which basically implements the advice from @LukStorms in the comments above.
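The escaping problem can be reproduced in a small Python sketch. It assumes that a /.../ regex literal ends at the first unescaped forward slash, which is consistent with the error in the question quoting only .*mailto:[^ as the pattern. Python's error text differs from the one OpenRefine reports, but the cause is the same kind of unclosed character class.

```python
import re

# Body of the original expression, with "/" where "\" was intended.
original_body = '.*mailto:[^/"/\']*.com.*'

# Assumption: the literal stops at the first "/", leaving this fragment.
truncated = original_body.split('/')[0]
print(truncated)

try:
    re.compile(truncated)
    compiled = True
except re.error as exc:
    # The "[^" character class is never closed.
    compiled = False
    print('regex error:', exc)

# Taken as a plain string, with no "/" delimiters involved, the same
# text is a valid pattern, which fits it working in another environment.
full_ok = re.compile(original_body) is not None
```

With backslashes instead of forward slashes, nothing inside the character class ends the literal early, so the whole pattern reaches the regex engine.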
