
Regular expressions in Ruby

I've got an external text file which looks like this:

This_ART is_P an_ART example_N.
Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.

Now I want to open this file in Ruby and make an Array with every annotated word. My attempt looks like this:

def get_entries(file)
  return File.open(file).map { |x| x.split(/\W+_[A-Z]+/) }
end

But running it just returns an Array with each sentence as a single member:

[["This_ART is_P an_ART example_N.\n"], ["Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"]]

The punctuation and the escape characters are included. Where is the mistake or what do I have to change to get the correct array?

Try scanning for just the tokens you want, e.g.

return File.read(file).scan(/\w+_[A-Z]+/)

That will give you something like:

["This_ART", "is_P", "an_ART", "example_N", "Thus_KONJ", ...]

If you want the annotation part removed, you could tack on:

.map{ |w| w.gsub(/_[A-Z]+\z/, '') }
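Putting the two steps together, a minimal sketch might look like the following. The helper names annotated_words and plain_words are made up for illustration, and the sample text is hard-coded from the question so the snippet runs on its own; with a real file you would pass File.read(file) instead.

```ruby
# Sample text copied from the question; in practice this would be File.read(file).
TEXT = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# Hypothetical helper: collect every word_TAG token in the text.
def annotated_words(text)
  text.scan(/\w+_[A-Z]+/)
end

# Hypothetical helper: the same tokens with the trailing _TAG stripped.
def plain_words(text)
  annotated_words(text).map { |w| w.sub(/_[A-Z]+\z/, '') }
end

p annotated_words(TEXT)
p plain_words(TEXT)
```

Because scan only returns what the pattern matches, the punctuation and newlines never make it into the result.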

Note that \w matches word characters (letters, digits and underscore) and \W matches non-word characters.

/\W+_[A-Z]+/

matches only if one or more non-word characters come directly before the _, which never happens in your string. Since split finds nothing to split on, it returns each line whole.
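To see the difference for yourself, here is a small sketch using one line from the question. The second string, with a space before the underscore, is contrived purely to show a case where the original pattern would match.

```ruby
line = "This_ART is_P an_ART example_N.\n"
original_pattern = /\W+_[A-Z]+/

# No non-word character ever sits directly before "_" in this line,
# so split finds no separator and returns the line as one element.
no_match = line.split(original_pattern)

# Contrived input: the space before "_" satisfies \W+, so the pattern matches.
contrived = "This _ART".split(original_pattern)

p line.match?(original_pattern)
p no_match
p contrived
```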

I don't know exactly what you're expecting as a result, but try this:

/_[A-Z]+\W*/

Splitting the whole file contents (e.g. from File.read) along this regex gives you

["This", "is", "an", "example", "Thus", "this", "is", "a", "part", "of", "it"]
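As a rough sketch of how this behaves, the snippet below splits the sample text from the question both as a whole and line by line. The text is hard-coded here as an assumption so the example is self-contained; splitting per line, as in the original get_entries, produces one nested array per sentence instead of a flat list.

```ruby
text = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"
separator = /_[A-Z]+\W*/

# Splitting the whole text: the tag plus any following spaces,
# punctuation and newlines act as the separator, giving a flat list.
flat = text.split(separator)

# Splitting each line separately keeps one array per sentence.
per_line = text.each_line.map { |line| line.split(separator) }

p flat
p per_line
```

If you read the file line by line but still want a flat list, flat_map in place of map should do it.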
