Regex named groups in R

For all intents and purposes, I am a Python user and use the pandas library on a daily basis. Named capture groups in regex are extremely useful. For example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:

import numpy as np
import pandas as pd
import re

myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'],columns=['origText'])

print("Original dataframe\n------------------")
print(myDF)

# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)

# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)

# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)

myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)

This produces the following output:

Original dataframe
------------------
                           origText
0                 Here is some text
1                  We all love TEXT
2  Where is the TXT or txt textfile
3                   Words and words
4                  Just a few works
5                      See the text
6               both words and text

Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match                
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN

Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text             
1              TEXT             
2  TXT///txt///text             
3                    Word///word
5              text             
6              text         word

Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text             
1                  We all love TEXT              TEXT             
2  Where is the TXT or txt textfile  TXT///txt///text             
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text             
6               both words and text              text         word

I've printed each stage to try to make it easy to follow.
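For readers less familiar with pandas, the same collapse can be sketched with only the standard `re` module. This is an illustrative equivalent, not the code used above: the helper name `collapse_matches` is made up for this example, and rows without any match come back as empty strings rather than NaN.

```python
import re

# Same pattern as above; re.X is dropped because the pattern has no whitespace or comments.
PATTERN = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)

def collapse_matches(text, sep="///"):
    """Return {group name: every match for that group, joined by sep}."""
    found = {name: [] for name in PATTERN.groupindex}
    for m in PATTERN.finditer(text):
        for name, value in m.groupdict().items():
            if value is not None:  # a group that did not take part in this match is None
                found[name].append(value)
    return {name: sep.join(values) for name, values in found.items()}

rows = ['Where is the TXT or txt textfile',
        'both words and text',
        'Just a few works']
for row in rows:
    print(row, '->', collapse_matches(row))
```

Collecting the values per group name before joining avoids the need to strip leading and trailing separators afterwards.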

The question is, can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).

I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:

origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text"                "We all love TEXT"                 "Where is the TXT or txt textfile" "Words and words"                 
[5] "See the text"                     "both words and text"             

myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7

The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is: can I produce an output similar to the one produced by Python, where the specific matches are extracted and listed in new columns of the dataframe that are named using the group names contained in the regex?

Base R does capture the information about the names but it doesn't have a good helper to extract them by name. I wrote a wrapper called regcapturedmatches to help with this. It is not part of base R, so its definition has to be loaded before the code below will run. You can use it like this:

myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl=TRUE, ignore.case=TRUE)
regcapturedmatches(origText, m)

Which returns

     textOcc wordOcc
[1,] "text"  ""     
[2,] "TEXT"  ""     
[3,] "TXT"   ""     
[4,] ""      "Word" 
[5,] ""      ""     
[6,] "text"  ""     
[7,] ""      "word" 
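
Note that `regexpr` reports only the first match in each string, which is why row 3 shows only "TXT" and row 7 only "word"; in base R, `gregexpr` is the function that finds every match. In Python terms, the matrix above corresponds to `re.search` rather than to `str.extractall`. A minimal sketch of that correspondence, using Python purely as a point of comparison with the question's output (the function name `first_match` is hypothetical):

```python
import re

PATTERN = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)

def first_match(text):
    """One value per named group from the first match only; '' where a group is absent."""
    m = PATTERN.search(text)
    if m is None:  # no match anywhere in the string
        return {name: "" for name in PATTERN.groupindex}
    return {name: (value or "") for name, value in m.groupdict().items()}

orig_text = ['Here is some text', 'We all love TEXT',
             'Where is the TXT or txt textfile', 'Words and words',
             'Just a few works', 'See the text', 'both words and text']

for i, text in enumerate(orig_text, start=1):  # 1-based to mirror R's row labels
    print(i, first_match(text))
```

To reach the concatenated output shown in the question, every match per string is needed, not just the first one.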
