Parsing out particular text in a big text column in a Dataframe - R

Question

Suppose I have the following data,

data

text
abc/1234&
qwertyabc/5555&
a&sdfghabc/ppp&plksa&
z&xabc/lkjh&poiuw&
lkjqwefasrjabc/855698&plkjdhweb

For example, I want to extract the text between abc/ and the first & that follows it. That is, I want the text between the first occurrence of abc/ and the first occurrence of & after that abc/. How do I extract it?

My output should be as follows,

data

text                                 parsed_out
abc/1234&                               1234 
qwertyabc/5555&                         5555
a&sdfghabc/ppp&plksa&                    ppp
z&xabc/lkjh&poiuw&                      lkjh
lkjqwefasrjabc/855698&plkjdhweb       855698

The following is my attempt:

data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))

data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))

This uses too much memory, since the text file has 8 million rows, and data2 ends up with several columns because the text contains several '&'. Can anybody help me parse the text between these two delimiters into a single column, in an efficient way that does not use too much memory?

x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"

Here, the function should look for http://google.com/ and extract everything up to the first &. The output should be needstobeparsedout.

new_x = "\\" http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3& ""

Why is it not working with this link?

Thanks

Answer 1

I actually wanted to parse out a few parts of the URL; for example, the text between "http://www.google.com/" and the first occurrence of "&".

Use

sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)


The pattern matches:

(optionally add a ^ in front to match the start of string position)
.*? - 0+ chars as few as possible from the start till the first
https?:// - either https:// or http:// followed with
(?:www\\.)? - 1 or 0 (optional) sequence www.
google\\.com/ - literal text google.com/
([^&]+) - 1 or more chars other than & (Capture group 1)
.* - any 0+ chars (up to the end of string).
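
As a quick check, the pattern described above can be run against the two strings from the question. This is only a sketch: new_x is written here without the stray quote characters shown in the question (an assumption about what the intended string was), and perl = TRUE is added as a precaution so that the lazy quantifier and the non-capturing group are handled by the PCRE engine.

```r
# Pattern from the answer, stored once so it can be reused.
pattern <- ".*?https?://(?:www\\.)?google\\.com/([^&]+).*"

# First string from the question.
x <- "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"

# Second string from the question, assumed here to be a plain URL
# without the surrounding quote characters.
new_x <- "http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&"

# The whole string is matched and replaced by capture group 1 only.
res_x <- sub(pattern, "\\1", x, perl = TRUE)
res_new_x <- sub(pattern, "\\1", new_x, perl = TRUE)

print(res_x)      # "needstobeparsedout"
print(res_new_x)  # "search?q=erykah+badu+with+hiatus+kaiyote,+august+3"
```

Because the optional (?:www\\.)? part covers both host spellings, the same call handles the URL with and without www.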

In the replacement pattern, \\1 refers to the subtext captured into Group 1.
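
The same idea may carry over to the abc/ ... & data from the question. The following is a minimal sketch, not a benchmarked solution: the column name parsed_out is taken from the desired output in the question, the pattern is an adaptation of the one above (abc/ in place of the URL prefix), and the NA handling for rows without abc/ is an added assumption. Since sub() is vectorised, it produces a single character vector instead of the wide intermediate data frames built by strsplit().

```r
# Sample data from the question.
data <- data.frame(
  text = c("abc/1234&",
           "qwertyabc/5555&",
           "a&sdfghabc/ppp&plksa&",
           "z&xabc/lkjh&poiuw&",
           "lkjqwefasrjabc/855698&plkjdhweb"),
  stringsAsFactors = FALSE
)

# Lazily skip to the first "abc/", capture everything up to the next "&",
# then discard the rest of the string.
data$parsed_out <- sub("^.*?abc/([^&]*).*$", "\\1", data$text, perl = TRUE)

# sub() returns its input unchanged when the pattern does not match,
# so rows without "abc/" are set to NA explicitly (an assumption about
# the desired behaviour; the question does not cover this case).
data$parsed_out[!grepl("abc/", data$text, fixed = TRUE)] <- NA_character_

print(data)
```

Only one extra column is created, so memory use should grow with the number of rows rather than with the number of '&' characters per row.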

solution1
0 ACCEPTED 2016-08-12 23:16:40