How to split a string based on a reference list and on words at the same time efficiently in Python?

I have a string and a reference list of elements. I want to split the string into another list of elements, taking the reference list into account: substrings that match a reference entry should be kept whole, and everything else should be split into words. For example:

reference_list = ['10', '2 to 3', '1 and 1/2', '1/2', '1/22']  # ... etc.
my_list = "this happened at 10 o'clock and now after 2 to 3 hours has gone..meet 1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012"  # ... etc.

The output should look like this:

out = ["this", "happened", "at", "10", "o'clock", ..., "2 to 3", "hours", ..., "1 and 1/2", "hours", ..., "1/22", "or", "2/12/2012", ...]

I would appreciate any help. Thank you in advance.

Update:

I have tried this:

   import re

   # Build one alternation from the reference entries, with \w+ as a fallback.
   reg = r'\b(%s|\w+)\b' % '|'.join(reference_list)
   print(reg)
   result = []
   for e in re.finditer(reg, my_list):
       result.append(e.group())

   print(result)

This doesn't work.

This is similar to the "split a string and keep the separators" problem.

You could join all of your reference_list strings into one regex alternation and split on that, keeping the matches.

Then, in the resulting list, split every piece that isn't in reference_list on spaces.
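
A minimal Python sketch of that two-step idea follows. The function name split_keep_references is hypothetical, and sorting the entries longest-first and escaping them with re.escape are assumptions added here, not part of the suggestion above.

```python
import re

def split_keep_references(text, reference_list):
    """Split text into words, keeping reference entries intact (hypothetical helper)."""
    # Assumption: try longer entries first so '1 and 1/2' wins over '1/2'.
    alternatives = sorted(reference_list, key=len, reverse=True)
    # A capturing group makes re.split return the separators as well.
    pattern = r'\b(' + '|'.join(re.escape(s) for s in alternatives) + r')\b'
    tokens = []
    for i, piece in enumerate(re.split(pattern, text)):
        if i % 2 == 1:
            tokens.append(piece)          # odd positions hold the captured reference entries
        else:
            tokens.extend(piece.split())  # even positions hold ordinary text
    return tokens

tokens = split_keep_references(
    "this happened at 10 o'clock and now after 2 to 3 hours",
    ['10', '2 to 3', '1 and 1/2', '1/2', '1/22'],
)
print(tokens)
```

Because the ordinary text is split on whitespace only, punctuation stays attached to the neighbouring word (for example "later." keeps its full stop).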

Suppose we have the following data:

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

(I have written the string this way so that it can be viewed without the need for horizontal scrolling.)

The key is to first sort reference_list into a new list new_list such that if new_list[j] is a substring of new_list[i], then i < j (the converse is generally not true). In Ruby this could be done as follows:

new_list = reference_list.sort { |a,b| a.include?(b) ? -1 : 1 }
  #=> ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
  #    "2 to 3", "2"]

I assume the Python code would be similar.
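
One way this might look in Python is sketched below. It does not translate the Ruby comparator; it swaps in a simpler criterion, sorting by length with the longest entries first. That satisfies the same property, because a string that contains a different string is always strictly longer than it.

```python
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

# A string that contains another string is strictly longer than it, so sorting
# by length (longest first) puts every containing string before its substrings.
new_list = sorted(reference_list, key=len, reverse=True)
print(new_list)
# ['2 to 3 to 4', '1 and 1/2', '2 to 3', '1/22', '1/2', '10', '2', '1']
```

The order differs from the Ruby output shown above, but any order with the substring property works for the regular expression built in the next step.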

Next we programmatically construct a regular expression from new_list. Again, this could be done as follows in Ruby, and I assume the Python code would be similar:

/\b(?:#{new_list.join('|')}|[\w'-]+)\b/
  #=> /\b(?:1\/22|1 and 1\/2|1\/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b/
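
A rough Python equivalent of this construction is sketched here, using the order of new_list shown in the Ruby output. The call to re.escape is an addition that the Ruby version does not make; it only matters if an entry contains regex metacharacters.

```python
import re

new_list = ['1/22', '1 and 1/2', '1/2', '2 to 3 to 4', '10', '1', '2 to 3', '2']

# re.escape is an addition here: it guards against entries that contain regex
# metacharacters such as '.' or '+'.
alternation = '|'.join(re.escape(entry) for entry in new_list)

# Reference entries first, then the catch-all for ordinary words.
pattern = re.compile(r"\b(?:" + alternation + r"|[\w'-]+)\b")
```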

If this regular expression is used with re.findall, we obtain the following result:

["this", "happened", "at", "10", "o'clock", "and", "now", "after",
 "2 to 3 to 4", "hours", "has", "gone", "we've", "decided", "to",
 "meet", "on-time", "1 and 1/2", "hours", "later", "Visit", "us",
 "on", "1/22", "or", "2", "12", "2012"]

Python regex demo
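
For reference, the whole approach can be run end to end in Python roughly as follows. This is a sketch under two assumptions: the entries are sorted longest-first instead of with the Ruby comparator, and each entry is passed through re.escape.

```python
import re

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

# Longest entries first, so containing strings precede their substrings.
new_list = sorted(reference_list, key=len, reverse=True)
alternation = '|'.join(re.escape(entry) for entry in new_list)
pattern = r"\b(?:" + alternation + r"|[\w'-]+)\b"

# The group is non-capturing, so findall returns the full matches.
result = re.findall(pattern, my_list)
print(result)
```

Note that "2/12/2012" comes out as three separate tokens, because '/' is not part of the catch-all character class and '2/12/2012' is not in the reference list.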

Before any match has been made, and after each match, findall first attempts to match '1/22' at the current location in the string. If that fails, it attempts to match '1 and 1\/2', and so on. Lastly, if every alternative but the last fails, it attempts to match the catch-all [\w'-]+. I have arbitrarily included an apostrophe (so that "o'clock" is matched) and a hyphen (so that "on-time" is matched). Notice that every match must be preceded and followed by a word boundary (\b).
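
The effect of adding the apostrophe and hyphen to the catch-all can be seen in a small comparison. The sample string is made up for illustration.

```python
import re

sample = "at 10 o'clock, on-time"  # illustrative input, not from the question

# Plain \w+ breaks words at the apostrophe and the hyphen.
plain = re.findall(r"\b\w+\b", sample)
print(plain)     # ['at', '10', 'o', 'clock', 'on', 'time']

# Adding ' and - to the character class keeps such words together.
extended = re.findall(r"\b[\w'-]+\b", sample)
print(extended)  # ['at', '10', "o'clock", 'on-time']
```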

Notice that while '2 to 3 to 4' could be matched by any of 2 to 3 to 4, 2 to 3 and 2, the ordering of the alternatives ensures that the first of these is the match that is made.
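
To see why the ordering matters, the sketch below runs the same kind of pattern with the alternatives in both orders. The helper name tokens and the short test string are made up for illustration.

```python
import re

text = "after 2 to 3 to 4 hours"  # illustrative input

def tokens(alternatives):
    """Tokenize text, trying the alternatives in the order given (hypothetical helper)."""
    pattern = r"\b(?:" + "|".join(map(re.escape, alternatives)) + r"|[\w'-]+)\b"
    return re.findall(pattern, text)

longest_first = tokens(['2 to 3 to 4', '2 to 3', '2'])
print(longest_first)   # ['after', '2 to 3 to 4', 'hours']

# With the shortest entry first, '2' matches immediately and the longer
# entries never get a chance.
shortest_first = tokens(['2', '2 to 3', '2 to 3 to 4'])
print(shortest_first)  # ['after', '2', 'to', '3', 'to', '4', 'hours']
```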
