R searching for specific string patterns (part 2)

I asked a previous question (Searching for specific string pattern), but I have some follow-up questions.

Previously, I thought my file naming convention was only of these formats:

"aaaaa-ttttt-eeee-q4-2015-file"
"aaaaaa-fffff-3333-q2-2012-file"

That is, the quarter, followed by "-", and then the year.

However, upon further investigation, the files have other variations such as:

"aaaaaa-f2q09-bbbbb"
"aaaaaa-f2q2008-bbbbb"
"aaaaaa-f4q-2008-fffff"
"f4q-aaaaa-eeeeee-2008"
"q2-aaaaaaaaa-eeeeeee-2005"
"aaaaaaaa-3q-2008-rrrrrrr"

As before, I would like to extract the year and quarter from all of the above. I'm not sure whether I can write one general piece of code that extracts them all in one go, or whether I have to write several sets of code and run them in waves. I'm not very familiar with the sub function in R, and I would appreciate it if someone could point me to a website with detailed explanations and examples so that I can write my own code to extract this information.

Ultimately, the code should parse all those strings and output something like: year = 2005, quarter = q4, etc.

Try this. It uses regexpr to locate the first match in each string and regmatches to return it. Be aware that it can easily pull out incorrect data: for the quarter, it returns the first instance of a digit 1-4 that is either followed or preceded by a q. If there is any other information that can make these matches more specific, I suggest including it.

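input <- c("aaaaaa-f2q09-bbbbb",
           "aaaaaa-f2q2008-bbbbb",
           "aaaaaa-f4q-2008-fffff",
           "f4q-aaaaa-eeeeee-2008",
           "q2-aaaaaaaaa-eeeeeee-2005",
           "aaaaaaaa-3q-2008-rrrrrrr")

# Quarter: first occurrence of a digit 1-4 directly before or after a "q"
quarter <- regmatches(input, regexpr("[1-4]q|q[1-4]", input))

# Year: first occurrence of "q" + 4 digits, "q" + 2 digits, or 4 bare digits
year <- regmatches(input, regexpr("q\\d{4}|q\\d{2}|\\d{4}", input))
year <- gsub("q", "", year)                   # drop the leading "q", if any
year <- sub("\\b(\\d{2})\\b", "20\\1", year)  # expand two-digit years: "09" -> "2009"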

There are also plenty of issues with the year matching, because three different formats are possible: "q09", "q2008" and "2008". Because regexpr returns only the first match in each string, the q\\d{4} alternative is needed to pick up the q2008 example; without it, q\\d{2} would match "q20" first and cut the year short.
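To make the first-match behaviour concrete, here is a small sketch using one string from the question and one of the earlier-format names. The strict pattern at the end rests on an assumption that is not stated in the question, namely that four-digit years always start with 19 or 20; treat it as one possible tightening rather than a complete fix.

```r
x <- c("aaaaaa-f2q2008-bbbbb",            # from the question
       "aaaaaa-fffff-3333-q2-2012-file")  # earlier format from the question

# Without the q + 4 digits alternative, q + 2 digits matches first
short_hit <- regmatches(x[1], regexpr("q\\d{2}|\\d{4}", x[1]))
# short_hit is "q20"

# The pattern from the answer: fine for x[1], but it grabs "3333" in x[2]
loose_hits <- regmatches(x, regexpr("q\\d{4}|q\\d{2}|\\d{4}", x))
# loose_hits is c("q2008", "3333")

# Assumption (not from the question): four-digit years start with 19 or 20
strict <- "q(19|20)\\d{2}|q\\d{2}|(19|20)\\d{2}"
strict_hits <- regmatches(x, regexpr(strict, x))
# strict_hits is c("q2008", "2012")
```

The leading q still has to be stripped afterwards, exactly as in the gsub step above.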

The sub call replaces a stand-alone two-digit match with "20" followed by the match itself; the \\1 in the replacement refers back to whatever the parenthesized group (\\d{2}) captured.
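A quick way to see the backreference at work is to run the same sub call on a few stand-alone values. The sample vector below is made up for illustration, and prefixing "20" assumes that every two-digit year belongs to the 2000s. The anchored pattern is an alternative I am suggesting for strings that contain nothing but the year; it is not part of the original code.

```r
yy <- c("09", "2008", "05")  # hypothetical values after the "q" has been removed

# \\1 re-inserts the two digits captured by (\\d{2}); "2008" has no
# stand-alone two-digit run, so it is left unchanged
expanded <- sub("\\b(\\d{2})\\b", "20\\1", yy)
# expanded is c("2009", "2008", "2005")

# Alternative when the whole string is the year: anchor with ^ and $
expanded_anchored <- sub("^(\\d{2})$", "20\\1", yy)
# expanded_anchored is c("2009", "2008", "2005")
```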

Test it and leave a comment if you find any mistakes.
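One thing worth checking while testing: regmatches combined with regexpr silently drops elements that have no match, so quarter and year can come back shorter than input and stop lining up with it. Below is a minimal sketch of one way to keep everything aligned and produce the year/quarter output the question asks for. first_match is a hypothetical helper written for this example, and "no-date-here" is a made-up file name added only to show the NA case.

```r
input <- c("aaaaaa-f2q09-bbbbb",
           "aaaaaa-f2q2008-bbbbb",
           "aaaaaa-f4q-2008-fffff",
           "f4q-aaaaa-eeeeee-2008",
           "q2-aaaaaaaaa-eeeeeee-2005",
           "aaaaaaaa-3q-2008-rrrrrrr",
           "no-date-here")  # made-up name with no quarter or year

# Hypothetical helper: first match per element, NA where nothing matches
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

quarter <- first_match("[1-4]q|q[1-4]", input)
quarter <- sub("^([1-4])q$", "q\\1", quarter)  # normalise "2q" to "q2"

year <- first_match("q\\d{4}|q\\d{2}|\\d{4}", input)
year <- gsub("q", "", year)                    # drop the leading "q", if any
year <- sub("^(\\d{2})$", "20\\1", year)       # assumes two-digit years are 20xx

result <- data.frame(file = input,
                     year = as.integer(year),
                     quarter = quarter,
                     stringsAsFactors = FALSE)
print(result)
```

Rows where either column is NA are the file names that need a more specific pattern or a manual look.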
