
OpenRefine: extracting a number from a text column using regex

I'm trying to parse out a column of data from the OpenFoodFacts dataset that I found via Kaggle. There is an attribute called "serving_size" that contains whatever serving size information is presented on the package for a food item. Most of the time the serving size is expressed in grams (g), but there is often other text as well. I'd like to search through the string, find the number that corresponds to the number of grams, and extract that value into its own field. The value is not always an integer - it might have a decimal part.

I'm new to regular expressions, but it seems like it ought to be possible to search for the "g" character and, if it is preceded by any numeric values, to extract them. I've found some recipes that suggest this is possible, but so far nothing I've tried has worked. The OpenRefine documentation gives the example of extracting decimal data using this regex: /[-+]?[0-9]+(.[0-9]+)?/, but I could not get any variation of it to work in our scenario. I've also tried commands like "value.match(/(. )?(/d+[g]). ?/)". I'm finding that I don't understand how regex is supposed to work - when I tell it "/d" I expect that it will ONLY give me back numeric values, but that does not appear to be the case - it gives back whatever is there regardless of the character type.

Any help would be appreciated.

Here are some example text strings from the data:

serving_size  
 - 113.5g
 - 20g
 - 1 cup (227g)
 - 4 cookies (15g)
 - 13 pieces (39g)
 - 1/4 packet (21g) makes 1/2 cup
 - 0.75 oz (21g)
 - 1 can (12 FL OZ) 355g
 - 15.2 fl oz (450g)
 - 1 can (355mL)
 - 1/4 tsp (1.4g)
 - 10 fl oz 1 bottle.
 - 20 fl oz
 - 1 envelope (21g)
 - 1 tbsp (4.5g)
 - 45.2g
 - 1/2 pack 142.5gms
 - 1 carré de chocolat de 20g
 - 4 biscottes (≈ 35g) Ce paquet contient 8.5 portions de 4
   biscottes.
 - 0.33L
 - 2galettes 10.5g
 - 0.041649313g
 - 1 package (79g)

[Screenshot of attempt]

In OpenRefine GREL (the language used to write the transformations) the 'match' function requires the regular expression to match the entire string in the cell - you can't use a partial match.

The output of the 'match' function is an array of all the capture groups. To get a specific value you have to select it from the array, or convert the array to a string.
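The two points above can be tried outside OpenRefine. The following is only a rough Python sketch, not GREL: the helper name grel_like_match is hypothetical, and re.fullmatch is used as a stand-in for the whole-string requirement described above.

```python
import re

def grel_like_match(value, pattern):
    """Hypothetical stand-in for GREL's match: the pattern must cover the
    whole string, and the result is the list of capture groups (or None)."""
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

# A pattern that only describes part of the cell does not match at all.
partial = grel_like_match("1 cup (227g)", r"(\d+)g")

# Padding both sides with '.*' lets the pattern cover the whole cell.
whole = grel_like_match("1 cup (227g)", r".*?(\d+)g.*")

print(partial)  # None
print(whole)    # ['227']
```

The second result is a list with one entry per capture group, which is why an index is needed to pull out a single value.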

So for example you could try:

value.match(/.*?(\d+\.?\d*)g(ram)?(s)?\b?.*/)[0]

This will match any string that contains a number (with or without a decimal point) directly in front of the letter 'g', or 'gram' or 'grams', and will capture the number as the first member of the resulting array of capture groups. The '\b' is there to look for a following non-word character (e.g. a space or a bracket); note that as written it is optional ('\b?'), so that non-word character is allowed rather than required.
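To see what the expression above does with the sample data, here is a hedged Python approximation. It is a sketch, not OpenRefine itself: the function name grams and the variable names are made up for illustration, re.fullmatch stands in for GREL's whole-string matching, and the optional '\b?' is left out because an optional zero-width boundary does not change what matches.

```python
import re

# Python stand-in for the GREL expression above, without the optional '\b?'.
PATTERN = re.compile(r".*?(\d+\.?\d*)g(ram)?(s)?.*")

def grams(serving_size):
    """Return the captured number as text, or None when nothing matches."""
    m = PATTERN.fullmatch(serving_size)
    return m.group(1) if m else None

samples = [
    "113.5g",
    "1 cup (227g)",
    "1 can (12 FL OZ) 355g",
    "1 can (355mL)",
    "20 fl oz",
    "1/2 pack 142.5gms",
    "0.041649313g",
    "2galettes 10.5g",
]

results = {s: grams(s) for s in samples}
for text, number in results.items():
    print(f"{text!r} -> {number}")
```

In this sketch the strings without a 'g' unit ("1 can (355mL)", "20 fl oz") give no match, and "1/2 pack 142.5gms" gives 142.5. One case to watch is "2galettes 10.5g", which gives 2 rather than 10.5, because the first number that is directly followed by a 'g' wins. Making the word boundary mandatory would be one way to avoid that, at the cost of no longer matching "142.5gms".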

The '?' is needed after the first '.*' to make it lazy, so that the capture group gets the whole number, not just the last digit.
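The difference between the greedy and the lazy prefix can be checked with a small Python sketch (again an approximation using re.fullmatch, with a shortened pattern and variable names chosen only for illustration):

```python
import re

cell = "1 cup (227g)"

# Greedy '.*' takes as much as it can and leaves only one digit for the group.
greedy = re.fullmatch(r".*(\d+\.?\d*)g.*", cell).group(1)

# Lazy '.*?' takes as little as it can, so the group gets the whole number.
lazy = re.fullmatch(r".*?(\d+\.?\d*)g.*", cell).group(1)

print(greedy)  # 7
print(lazy)    # 227
```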
