
Removing strings from a vector

I am working on raw textual data from a scanned catalog.
I only want to keep 2 types of strings:
- beginning with a number (artists' works)
- containing 2 adjacent uppercase letters, including accented ones (artists' names)

I would like an easy way to remove everything else (with TRUE/FALSE?).

My data:

ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.

Expected results:

ÁÀDFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.

The data is imported into R with:

readLines("clipboard")

I am able to identify the lines containing artist names in capital letters with various regexes, e.g.:

[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']

I am able to identify the lines containing artworks with:

^[0-9]+[\s]

Any help would be greatly appreciated.

Just a side note: [:upper:] matches uppercase letters in the current locale (see source). Thus, this solution is good if you work with a single locale:

ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]

See the IDEONE demo.

The regex breakdown:

  • ^ - start of string
  • [[:digit:]]+ - 1 or more digits
  • [[:blank:]] - 1 space or tab
  • | - or
  • [[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.
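
Since the question asks for a TRUE/FALSE approach, here is a minimal sketch of the same pattern used with grepl(), which returns a logical vector instead of indices. The sample vector and the names `lines`, `keep`, `kept` and `dropped` are made up for illustration, and the data is ASCII only because matching accented capitals with [:upper:] depends on the locale.

```r
# Illustrative sample lines (ASCII only; accented capitals depend on the locale).
lines <- c("AB (artist 2)",
           "2 Nulla sollicitudin elit in purus egestas.",
           "B'BDDED (artist 3)",
           "az*u*u (bad string)",
           "A..gdgdgdg (bad string)")

# Same pattern as above: leading digits plus a blank, or two adjacent capitals
# (an apostrophe is allowed in second position).
pattern <- "^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]"

# grepl() gives one TRUE/FALSE per element.
keep <- grepl(pattern, lines)

kept    <- lines[keep]    # lines to keep
dropped <- lines[!keep]   # lines that would be removed
```

Inspecting `dropped` before discarding anything is a cheap way to check that no artist name was filtered out by mistake.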

And here is a way to achieve what you need with a Perl-like regex:

ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)]

The regex matches:

  • ^ - start of string
  • \\d+\\s - 1 or more digits and then a whitespace
  • | - or...
  • \\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.
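
As a small check of the Unicode-property version, the sketch below runs the pattern on a few hand-made strings. The vector `x` and the name `keep_perl` are illustrative only; the accented characters are written as \u escapes so that R marks the strings as UTF-8, which is assumed here to be enough for PCRE to treat them as Unicode whatever the session locale is.

```r
# Illustrative input; \u escapes produce UTF-8 marked strings.
x <- c("\u00c1\u00c0DFDS (artist 1)",          # two accented capitals
       "az*\u00f9*\u00f9 (bad string)",        # lowercase accented letters only
       "7 in commodo enim in laoreet gravida.",
       "A..gdgdgdg (bad string)")

# \p{Lu} is the Unicode "uppercase letter" property (needs perl = TRUE).
keep_perl <- grepl("^\\d+\\s|\\p{Lu}['\\p{Lu}]", x, perl = TRUE)

x[keep_perl]
```

The first and third strings should be kept: the first contains two adjacent uppercase letters, the third starts with a digit followed by a space.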

The output of the sample demo:

[1] "ÁÀDFDS (artist 1)"                                                     
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
[3] "AB (artist 2)"                                                         
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"                                                     
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
[7] "4 Mauris condimentum velit eu consequat feugiat."                      
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
[9] "ÉÈDFSF (artist 4)"                                                     
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."    

To clean up the beginning of strings, you can use:

ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=T)

The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) matches as few leading characters as possible before the first letter or digit (which is placed into a capturing group), and the \\1 backreference in the gsub call then restores the captured letter or digit. Use it before grepping.
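
To see the effect of this cleanup step in isolation, here is a short sketch on made-up strings (`raw` and `clean` are illustrative names). Note that the class [\\P{L}\\D] accepts any character, since every letter is a non-digit and every digit is a non-letter, so it is the lazy quantifier that stops the match at the first letter or digit.

```r
# Illustrative strings with junk in front of the useful content.
raw <- c("...gdgdgdg (bad string)",
         "** 3 Nunc et eros",
         "AB (artist 2)")

# Strip everything before the first letter or digit; \\1 puts that
# first letter or digit back.
clean <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", raw, perl = TRUE)

clean
```

Strings that already start with a letter or a digit, such as the last one, come back unchanged.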

See the IDEONE demo.

You can use grep:

z <- readLines("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
 [1] "AADFDS (artist 1)"                                                     
 [2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
 [3] "AB (artist 2)"                                                         
 [4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
 [5] "BBDDED (artist 3)"                                                     
 [6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
 [7] "4 Mauris condimentum velit eu consequat feugiat."                      
 [8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
 [9] "CCDDFSF (artist 4)"                                                    
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."  

You can use POSIX character classes if you want. However, their interpretation depends on the current locale, and if the locale is not set properly, the behavior of the POSIX classes can change.

I'd recommend turning on Perl regular expressions and using Unicode properties.

x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]

Another interesting way would be to match the accented letters explicitly, avoiding POSIX classes altogether.

r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]

You can view the compiled demo of both regular expressions in use.
