Converting a string into a list of desired tokens using Python

I have ingredients for thousands of products, for example:

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

I want this ingredient string in the form of a list like the following:

listOfIngredients = ['Beef Stock', 'low lactose cream', 'onion', 'mustard', 'modified maize starch','tomato puree', 'modified potato starch', 'butter sugar', 'salt', 'burnt sugar', 'blackcurrant', 'peppercorns']

So listOfIngredients should contain neither the percentages nor the sub-ingredients that an ingredient itself contains. Regex seems like a good way of doing this, but I am not good at writing regex. Can someone help me write a regex that produces the desired output? Thanks in advance.

You might try two approaches.

The first one is to remove all (...) substrings together with any text that follows them up to the next separating comma. A comma that is immediately followed by a word character (as in 0,4%) is treated as part of that text rather than as a separator.

\s*\([^()]*\)[^,]*(?:,\b[^,]*)*

See the regex demo.

Details:

  • \s* - 0+ whitespaces
  • \([^()]*\) - a (...) substring having no ( and ) inside:
    • \( - a literal (
    • [^()]* - 0+ chars other than ( and ) (a [^...] is a negated character class)
    • \) - a literal )
  • [^,]* - 0+ chars other than ,
  • (?:,\b[^,]*)* - zero or more sequences of:
    • ,\b - a comma that is immediately followed with a letter/digit/underscore
    • [^,]* - 0+ chars other than ,

These matches are removed, and then the ,\s* regex is used to split the string on a comma followed by 0+ whitespaces to get the final result.
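To see what the (?:,\b[^,]*)* part contributes, here is a minimal sketch that runs the pattern with and without it on the last ingredient of the question's string. The variable names (tail, without_group, with_group) are made up for this illustration.

```python
import re

# Last ingredient from the question, ending in a percentage with a decimal comma.
tail = "peppercorns (black, pink, green, all spice, white) 0,4%."

# Without the ,\b group the match stops at the decimal comma.
without_group = re.sub(r'\s*\([^()]*\)[^,]*', '', tail)

# With the group, a comma directly followed by a word character is consumed too.
with_group = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', '', tail)

print(re.split(r',\s*', without_group))  # ['peppercorns', '4%.']
print(re.split(r',\s*', with_group))     # ['peppercorns']
```

Without the group, the leftover "4%." would end up as a bogus list item after the split.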

The second one is based on matching and capturing runs of whitespace-separated words consisting of letters (and _) only, while matching (...) substrings without capturing them, so that they can be discarded.

\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)

See the second regex demo.

Details:

  • \([^()]*\) - a (...) substring having no ( and ) inside
  • | - or
  • ([^\W\d]+(?:\s+[^\W\d]+)*) - Group 1 capturing:
    • [^\W\d]+ - 1+ letters or underscores (you may add _ after \d to exclude underscores)
    • (?:\s+[^\W\d]+)* - 0+ sequences of:
      • \s+ - 1 or more whitespaces
      • [^\W\d]+ - 1+ letters or underscores
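As a rough illustration of the pattern above, the sketch below shows what re.findall returns when only one alternative has a capturing group, and what changes when _ is added to the negated class. The sample string (including the made-up token burnt_sugar) is hypothetical and only serves to show the underscore behaviour.

```python
import re

# Hypothetical sample; "burnt_sugar" is invented to show how underscores are handled.
sample = "salt (0,8%), burnt_sugar, onion"

# Pattern from the answer: underscores count as word characters.
with_underscore = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', sample)

# Variant with _ added after \d: underscores now break a token apart.
no_underscore = re.findall(r'\([^()]*\)|([^\W\d_]+(?:\s+[^\W\d_]+)*)', sample)

# A (...) match leaves Group 1 empty, so findall yields '' for it.
print(with_underscore)  # ['salt', '', 'burnt_sugar', 'onion']
print(no_underscore)    # ['salt', '', 'burnt', 'sugar', 'onion']
```

The empty strings are the (...) matches, which is why they have to be filtered out afterwards.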

Both return the same results for the current string, but you may want to adjust them in the future.
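The two patterns can diverge on other inputs, which is where adjustments may be needed. The sketch below uses a hypothetical ingredient string (not from the question) containing a digit and a hyphen inside ingredient names; the names other, first and second are made up for this example.

```python
import re

# Hypothetical input: a digit ("B12") and a hyphen ("all-spice") inside names.
other = "vitamin B12 (0,1%), all-spice, salt"

# Approach 1: remove (...) and trailing text, then split on commas.
first = re.split(r',\s*', re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', '', other))

# Approach 2: capture letter-only words, skip (...) substrings.
second = [x for x in re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', other) if x]

print(first)   # ['vitamin B12', 'all-spice', 'salt']
print(second)  # ['vitamin B', 'all', 'spice', 'salt']
```

Since the second pattern only accepts letters and underscores, it would need extra characters in its class if names like these should stay whole.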

See the Python demo:

import re

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

# Approach 1: remove (...) substrings and any trailing text, then split on commas
res = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', "", Ingredient)
print(re.split(r',\s*', res))

# Approach 2: capture letter-only word sequences, match (...) without capturing
vals = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', Ingredient)
vals = [x for x in vals if x]  # drop the empty strings produced by (...) matches
print(vals)
