
Using gsub to replace string and following n words

I am trying to clean texts from parliamentary protocols. Since the data originate from PDF files, they include footers with the legislative period and page references, like this: "18th legislative period page x of N". Because the 600 protocols differ in their total number of pages, I cannot match one exact expression. Instead, I would like to use the gsub function to delete the beginning of the footer and the next n words.

I tried a number of solutions proposed for similar questions, but could not get any of them to work.

string <- "this is the first page. 18th legislative period page 1 of 44 
this is the second page. 18th legislative period page 2 of 44 and this is 
the third page"

gsub("18th legislative period page", "", string)

I expect the string to read

"this is the first page. this is the second page. and this is the third page."   

Edit: Thank you so much for your time and patience!

You could use

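# remove the fixed footer text plus the variable "x of N" part (\\d+ matches one or more digits)
gsub("18th legislative period page \\d+ of \\d+", "", string)
# or, additionally collapsing each run of two or more whitespace characters (this also removes the newlines '\n' here, since each one follows a space)
gsub('\\s{2,}', ' ', gsub("18th legislative period page \\d+ of \\d+", "", string))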
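
If the words after the footer stem are not always numbers, the "stem plus the next n words" idea from the question can be sketched more generally. The helper name `remove_stem_and_n_words` and its arguments are hypothetical, made up for illustration; the sketch assumes that words are separated by whitespace and that the stem contains no regex metacharacters.

```r
# Same sample text as in the question, written with explicit "\n" line breaks.
string <- paste0(
  "this is the first page. 18th legislative period page 1 of 44 \n",
  "this is the second page. 18th legislative period page 2 of 44 and this is \n",
  "the third page"
)

# Hypothetical helper: drop `stem` and the `n` whitespace-separated words after it.
remove_stem_and_n_words <- function(x, stem, n) {
  # (?:\\s+\\S+){n} matches exactly n words following the stem.
  # Assumption: `stem` is used as a regex as-is, so it must not contain metacharacters.
  pattern <- paste0(stem, "(?:\\s+\\S+){", n, "}")
  cleaned <- gsub(pattern, "", x, perl = TRUE)
  # Collapse every whitespace run (including single newlines) and trim the ends.
  trimws(gsub("\\s+", " ", cleaned))
}

cleaned <- remove_stem_and_n_words(string, "18th legislative period page", 3)
cleaned
# "this is the first page. this is the second page. and this is the third page"
```

Using `\\s+` instead of `\\s{2,}` for the cleanup also replaces a newline that is not next to a space, which makes the result less dependent on how the PDF text was extracted.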
