Non-greedy (lazy) matching using regex?

How do you implement non-greedy matching in Stata using regex? Or does Stata even have this capability?

I want to extract all text that occurs between a hash sign "#" and the first period "." that follows it.

Example code:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#(.*)\.")
list

But in Stata (v.13.1), I can't seem to use the non-greedy quantifier, as in #(.*?)\. The greedy pattern in the code above therefore gives this:

+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff   aaabbbccc.dddeee |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+

But what I want is this:

+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff          aaabbbccc |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+

One alternative to #(.*?)\. is to capture only the non-period characters that follow the hash sign, so the match stops at the first period. That is this pattern:

#([^.]*)

Try this code:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")
list
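
If a newer Stata is available (version 14 or later, not the 13.1 used in the question), the Unicode regex functions ustrregexm() and ustrregexs() use a different engine that accepts lazy quantifiers. A minimal sketch under that assumption, reusing the question's data:

```stata
* Assumption: Stata 14 or later; ustrregexm() is not available in 13.1
clear
set obs 3
generate var1 = "anything#aaabbbccc.dddeee.fff" in 1
replace var1 = "anything#aaabbbccc.dddeee" in 2
replace var1 = "anything#aaabbbccc." in 3

* .*? is lazy, so the capture should stop at the first period after "#"
generate var2 = ustrregexs(1) if ustrregexm(var1, "#(.*?)\.")
list
```

On Stata 13.1 the negated character class remains the way to go.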

Once many programmers have learned about regular expressions, they are reluctant to look elsewhere for string handling, and with good reason.

This is just to point out that for the problem given, and many others too, there is a pedestrian alternative:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")

gen where1 = strpos(var1, "#") + 1 
gen where2 = strpos(var1, ".") 
gen var3 = substr(var1, where1, where2 - where1)  

list


     +-------------------------------------------------------------------------+
     |                          var1        var2   where1   where2        var3 |
     |-------------------------------------------------------------------------|
  1. | anything#aaabbbccc.dddeee.fff   aaabbbccc       10       19   aaabbbccc |
  2. |     anything#aaabbbccc.dddeee   aaabbbccc       10       19   aaabbbccc |
  3. |           anything#aaabbbccc.   aaabbbccc       10       19   aaabbbccc |
     +-------------------------------------------------------------------------+

Find the positions of the start and end of the substring you want, and extract what lies between. This is resolutely lacking in style, but sometimes gets you there faster. Always remember to account for the programmer time spent working out the regular expression you need.
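
One caveat with the positional approach: strpos(var1, ".") returns the first period anywhere in the string, which may lie before the "#", and either delimiter may be missing altogether. A hedged sketch of a more defensive variant follows. The second and third test strings and the variable names start, rest and stop are made up for illustration:

```stata
clear
input str40 var1
"anything#aaabbbccc.dddeee.fff"
"v1.2#aaabbbccc.dddeee"
"no delimiters here"
end

* Position of the hash sign; 0 if there is none
generate start = strpos(var1, "#")

* Everything after the hash sign (a length of . means "to the end")
generate rest = substr(var1, start + 1, .) if start > 0

* First period after the hash sign; 0 if there is none
generate stop = strpos(rest, ".")

* Text between the two delimiters; left empty when either is missing
generate var3 = substr(rest, 1, stop - 1) if stop > 0
list
```

Searching for the period only in the remainder after "#" keeps an earlier period, as in the second test string, from cutting the result short.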
