OpenRefine Regex and GREL match error

In OpenRefine I want to run the regex below on a website's source to find email addresses that appear in a mailto link. My trouble is that when I run value.match, I get this error:

Parsing error at offset 12: Bad regular expression (Unclosed character class near index 10 .*mailto:[^ ^)

I have tested the expression in another environment, without value.match, and it works there.

value.match(/.*mailto:[^/"/']*.com.*/)
isNotNull(value.match(/.*mailto:[^\"\']*.com.*/)) 

As described on our reference page for the match() function, it returns an array of the capture groups in your regex pattern, and isNotNull() then outputs true or false depending on whether your value matches that pattern: https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#matchstring-s-regexp-p

This is also described here: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions#basic-examples
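To illustrate the isNotNull(value.match(...)) idea outside OpenRefine, here is a minimal Python sketch. The helper grel_match is hypothetical and only approximates GREL's match(): it assumes the whole value has to match the pattern, and that the result is a list of capture groups or null. Python's re module stands in for the Java regex engine that OpenRefine uses, so treat it as a rough model rather than the real implementation.

```python
import re

def grel_match(value, pattern):
    """Hypothetical stand-in for GREL's match(): the whole value must match.

    Returns the list of capture groups on success, or None otherwise.
    """
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

# Same pattern as in the answer; it has no capture groups.
pattern = r'''.*mailto:[^\"\']*.com.*'''

with_link = 'Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>'
without_link = 'Blah blah, no link here'

# An empty list is still "not null", so the check reports a match.
has_email = grel_match(with_link, pattern) is not None
no_email = grel_match(without_link, pattern) is not None
print(has_email, no_email)
```

With no capture group in the pattern, a successful match yields an empty array, which is why the expression is useful as a true/false test but not for extracting the address.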

You can also use get(), as described in the Recipes on our wiki, but it only works well if you have just one email address per cell. That is because the get() function, when called without a from or to number, makes assumptions: it uses the length of the array to determine the last element and returns only that last element, not the first, or third, etc.: https://github.com/OpenRefine/OpenRefine/wiki/Recipes#find-a-sub-pattern-that-exists-at-the-end-of-a-string

For example:

get(value.match(/.*(mailto:[^\"\']*.com).*/),0)
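The one-address-per-cell caveat can be seen in a small Python sketch. It assumes, as above, that match() requires the whole value to match; grel_match is a hypothetical helper and the sample cell text is made up for illustration. Because the leading .* is greedy, a cell with two mailto links yields only the last one.

```python
import re

def grel_match(value, pattern):
    """Hypothetical stand-in for GREL's match(): whole-value match, groups or None."""
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

# Made-up cell content with two mailto links.
cell = '<a href="mailto:a@x.com">A</a> <a href="mailto:b@y.com">B</a>'

# Pattern from the answer: the greedy leading .* swallows the first link.
groups = grel_match(cell, r'''.*(mailto:[^\"\']*.com).*''')
print(groups[0])

# Python-only alternative for comparison: scan for every occurrence.
all_addresses = re.findall(r'''mailto:([^"']*\.com)''', cell)
print(all_addresses)
```

The findall call is only there to show what is being missed; it is not a GREL function.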

So if you have text like:

Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>

To extract the email address using the match function in OpenRefine you need to use:

value.match(/.*mailto:([^\"\']*.com).*/)

This will give an array containing the email address, which is captured using a capture group. To store the email address in an OpenRefine cell you need to get the string value out of the array, for example:

value.match(/.*mailto:([^\"\']*.com).*/)[0]
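The same extraction can be sketched in Python to check what the capture group returns for the sample text. The helper grel_match is hypothetical and assumes match() needs the whole value to match; Python's re module is used in place of OpenRefine's Java regex engine.

```python
import re

def grel_match(value, pattern):
    """Hypothetical stand-in for GREL's match(): whole-value match, groups or None."""
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

text = 'Blah blah <a href="mailto:j.bloggs@example.com">mail me</a>'

# The parentheses sit after "mailto:", so only the address is captured.
groups = grel_match(text, r'''.*mailto:([^\"\']*.com).*''')
email = groups[0]
print(email)
```

Note that the . before com is unescaped in the pattern, so it matches any character; writing \.com would be stricter.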

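The difference between your original expression and this one is that the characters are escaped correctly and there is a capture group, basically implementing the advice from @LukStorms in the comments above. In the original, the unescaped / inside the character class ends the GREL regex literal early, so the pattern OpenRefine tries to compile is just .*mailto:[^, which is the unclosed character class that the error message reports.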
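The error in the question can be reproduced with a rough Python model of how a /.../ regex literal is read. The function grel_regex_body is hypothetical: it assumes the literal ends at the first / that is not preceded by a backslash, which is a simplification of the GREL parser, and Python's re.compile stands in for the Java regex compiler.

```python
import re

def grel_regex_body(expression):
    """Return the text between the first '/' and the next unescaped '/'.

    Simplified, assumed model of how a /.../ regex literal is scanned.
    """
    i = expression.index("/") + 1
    body = []
    while i < len(expression):
        ch = expression[i]
        if ch == "\\" and i + 1 < len(expression):
            body.append(expression[i:i + 2])  # keep the escape pair intact
            i += 2
            continue
        if ch == "/":
            break  # an unescaped slash closes the literal
        body.append(ch)
        i += 1
    return "".join(body)

def compiles(pattern):
    """Report whether the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

broken = grel_regex_body(r'''value.match(/.*mailto:[^/"/']*.com.*/)''')
fixed = grel_regex_body(r'''value.match(/.*mailto:[^\"\']*.com.*/)''')

print(repr(broken), compiles(broken))
print(repr(fixed), compiles(fixed))
```

Under this model the original expression is cut off right after [^, which matches the fragment shown in the error message.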
