How to split a string in R at the first square bracket and the last round bracket?

Question

I am dealing with legal citations, and I want to split each citation into four parts. The general format is: ABC v. DEF [Year] citation data (Authority). So the four parts are ABC v. DEF, the year, the citation data, and the authority. The problem is that the first part (i.e., ABC v. DEF) might contain additional round brackets, while the third part (i.e., the citation data) might contain additional square and/or round brackets. For example, in the following case:

"Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"

The first part is "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation)", the second part is "2013", the third part is "33 taxmann.com 424/60 SOT 118 (URO)", and the last part is "Mum. Trib.". I am unable to come up with the right regex to do this. Can anyone help me with this one?

Answer 1

text <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
pattern <- "(.*?)\\s*\\[(\\d{4})\\]\\s*(.*?)\\s*\\((.*)\\)"

regmatches(text, regexec(pattern, text))
[[1]]
[1] "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
[2] "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation)"                                                        
[3] "2013"                                                                                                                    
[4] "33 taxmann.com 424/60 SOT 118 (URO)"                                                                                     
[5] "Mum. Trib."
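
As a variation on the pattern above, here is a minimal sketch that anchors the match to the whole string and forbids parentheses inside the authority, so that only the final parenthesized group can be captured as the authority. It assumes a four-digit year and an authority without nested parentheses. `perl = TRUE` is used so that the lazy quantifiers follow PCRE backtracking rules. The second citation and the names `citations`, `anchored` and `split_mat` are made up for illustration.

```r
# Sketch under stated assumptions: four-digit year in the first [...] pair,
# authority in the final (...) pair with no parentheses inside it.
citations <- c(
  "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)",
  # Hypothetical citation with square brackets inside the citation data
  "ABC Ltd. (India) v. DEF [2015] 12 XYZ 34 [Supp.] (Del.)"
)

anchored <- "^(.*?)\\s*\\[(\\d{4})\\]\\s*(.*?)\\s*\\(([^()]*)\\)\\s*$"

# One character vector per citation: the full match first, then the four groups
parts <- regmatches(citations, regexec(anchored, citations, perl = TRUE))

# Drop the full match and stack the groups into a character matrix.
# This step assumes every citation matched; unmatched inputs give character(0).
split_mat <- t(sapply(parts, function(p) p[-1]))
colnames(split_mat) <- c("case", "year", "data", "authority")
split_mat
```

Because the pattern ends in `$`, the lazy third group has to grow until only one parenthesized group is left, which is why "(URO)" stays with the citation data.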

If you want a data frame:

dat <- data.frame(citation = character(), year = numeric(), data = character(), Authority = character())
strcapture(pattern, text, dat)
                                                          citation year                                data  Authority
1 Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) 2013 33 taxmann.com 424/60 SOT 118 (URO) Mum. Trib.
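
`strcapture()` is vectorized, so the same call can be applied to a whole column of citations. The sketch below reuses the pattern and prototype from this answer. The second input is a hypothetical malformed string, and the `parsed` column is an illustrative addition, not part of the original answer. Inputs that do not match come back as a row of `NA`, which makes them easy to flag.

```r
pattern <- "(.*?)\\s*\\[(\\d{4})\\]\\s*(.*?)\\s*\\((.*)\\)"
proto <- data.frame(citation = character(), year = numeric(),
                    data = character(), Authority = character())

texts <- c(
  "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)",
  "No year or authority here"  # hypothetical input that cannot match
)

res <- strcapture(pattern, texts, proto)

# Rows that failed to match are all NA; flag them for manual review
res$parsed <- !is.na(res$year)
res
```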

Answer 2

Use extract:

library(tidyr)
data.frame(txt) %>%
  extract(txt,
          into = c("First", "Sec", "Thrd", "Frth"),
          regex = "(.+)\\[(\\d+)\\](.*)\\((.*)\\)")
                                                              First  Sec                                  Thrd       Frth
1 Lubrizol Corporation, USA v. Asstt. DIT (International Taxation)  2013  33 taxmann.com 424/60 SOT 118 (URO)  Mum. Trib.

The regex looks scarier than it is: you simply describe the string in full, wrapping the parts you wish to extract in parentheses (the syntax for capturing groups).

Data:

txt <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
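
One detail worth checking: this regex has no `\\s*` around the brackets, so the first and third groups keep the spaces next to them. The sketch below runs the same regex through base R (with `perl = TRUE`, an assumption made here so that PCRE rules apply, rather than through tidyr) and strips the padding with `trimws()`. The names `groups` and `trimmed` are illustrative.

```r
txt <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
regex <- "(.+)\\[(\\d+)\\](.*)\\((.*)\\)"

# Keep only the four capture groups (drop the full match)
groups <- regmatches(txt, regexec(regex, txt, perl = TRUE))[[1]][-1]
names(groups) <- c("First", "Sec", "Thrd", "Frth")

# Groups 1 and 3 still carry the spaces adjacent to the brackets
nchar(groups) - nchar(trimws(groups))

# Remove leading and trailing whitespace from every group
trimmed <- trimws(groups)
trimmed
```

The greedy `(.*)` in the third group is what pushes the final `\\((.*)\\)` to the last opening parenthesis, so "(URO)" ends up in the citation data and not in the authority.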

2 answers

Answer 1
1 2022-06-09 02:13:03

Answer 2
1 Accepted 2022-06-09 07:23:02
