I've got an external text file which looks like this:
This_ART is_P an_ART example_N.
Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.
Now I want to open this file in Ruby and make an Array with every annotated word. My attempt looks like this:
def get_entries(file)
return File.open(file).map { |x| x.split(/\W+_[A-Z]+/) }
end
but the execution just returns an Array with each sentence as a member:
[["This_ART is_P an_ART example_N.\n"],["Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"]]
The punctuation and the escape characters are included. Where is the mistake or what do I have to change to get the correct array?
Try scanning for just the tokens you want, e.g.:
return File.read(file).scan(/\w+_[A-Z]+/)
That will give you something like:
["This_ART", "is_P", "an_ART", "example_N", "Thus_KONJ", ...]
If you want the annotation part removed, you could tack on:
.map{ |w| w.gsub(/_[A-Z]+\z/, '') }
Note that \w matches word characters and \W matches non-word characters.
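Putting the two steps together, here is a minimal sketch. It assumes the file contents have already been read into a string (the sample text from the question stands in for it), and the helper names annotated_words and plain_words are made up for illustration:

```ruby
# Hypothetical helper: returns every annotated token, e.g. "This_ART".
def annotated_words(text)
  text.scan(/\w+_[A-Z]+/)
end

# Hypothetical helper: returns only the words, with the "_TAG" suffix removed.
def plain_words(text)
  annotated_words(text).map { |w| w.sub(/_[A-Z]+\z/, '') }
end

# Sample text from the question, standing in for File.read(file).
sample = "This_ART is_P an_ART example_N.\nThus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

p annotated_words(sample).first(4)  # ["This_ART", "is_P", "an_ART", "example_N"]
p plain_words(sample).first(4)      # ["This", "is", "an", "example"]
```

Since scan only returns what the pattern matches, the punctuation and newlines never make it into the result.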
/\W+_[A-Z]+/
matches only if there is a non-word character before the _, which isn't the case in your string.
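A quick way to see this for yourself is to run the original pattern against a couple of short strings. This is only a sketch; split_with_original is a hypothetical helper name wrapping the split from the question:

```ruby
# Hypothetical helper: splits on the pattern used in the question.
def split_with_original(text)
  text.split(/\W+_[A-Z]+/)
end

# No non-word character sits directly before "_ART" or "_P",
# so nothing matches and the string comes back in one piece.
p split_with_original("This_ART is_P")  # ["This_ART is_P"]

# With a space in front of the underscore the pattern does match.
p split_with_original("This _ART")      # ["This"]
```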
I don't know exactly what you're expecting as a result, but try this:
/_[A-Z]+\W*/
Splitting the whole file contents (e.g. the result of File.read(file)) along this regex gives you
["This", "is", "an", "example", "Thus", "this", "is", "a", "part", "of", "it"]
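Whether you get one flat array or one array per sentence depends on what you split. A small sketch, assuming the text is already in a string; words_flat and words_per_line are hypothetical names:

```ruby
# Hypothetical helper: splits the whole text at once, giving one flat array.
def words_flat(text)
  text.split(/_[A-Z]+\W*/)
end

# Hypothetical helper: splits line by line, giving one array per sentence.
def words_per_line(text)
  text.each_line.map { |line| line.split(/_[A-Z]+\W*/) }
end

# Sample text from the question, standing in for the file contents.
sample = "This_ART is_P an_ART example_N.\nThus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

p words_flat(sample).length  # 11
p words_per_line(sample)     # [["This", "is", "an", "example"], ["Thus", "this", "is", "a", "part", "of", "it"]]
```

The trailing \W* swallows the space, the full stop and the newline after each tag, and split drops trailing empty strings, so no empty entries are left at the end.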