
Parsing out particular text in a big text column in a Dataframe - R

Suppose I have the following data,

data

text
abc/1234&
qwertyabc/5555&
a&sdfghabc/ppp&plksa&
z&xabc/lkjh&poiuw&
lkjqwefasrjabc/855698&plkjdhweb

For example, I want to extract the text between the first occurrence of abc/ and the first occurrence of & that follows it. How can I do that?

My output should be as follows,

data

text                                 parsed_out
abc/1234&                               1234 
qwertyabc/5555&                         5555
a&sdfghabc/ppp&plksa&                    ppp
z&xabc/lkjh&poiuw&                      lkjh
lkjqwefasrjabc/855698&plkjdhweb       855698

This is what I tried:

data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))

data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))

This uses too much memory, since the text file has 8 million rows, and data2 ends up with several columns because each row contains several '&' characters. Can anybody help me extract the text between these two markers into a single column, in a memory-efficient way?

x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"

Here, the function should look for http://google.com/ and extract everything up to the first & that follows. The output should be needstobeparsedout.

new_x = "\\" http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3& ""

Why is it not working with this link?

Thanks

I actually want to parse out a few parts of the URL; for example, the text between "http://www.google.com/" and the first occurrence of "&".

Use

sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)

See the regex demo.

The pattern matches:

  • (optionally add a ^ in front to anchor the match at the start of the string)
  • .*? - 0+ chars, as few as possible, from the start up to the first occurrence of what follows
  • https?:// - either https:// or http://, followed by
  • (?:www\\.)? - 1 or 0 occurrences (i.e. an optional sequence) of www.
  • google\\.com/ - the literal text google.com/
  • ([^&]+) - 1 or more chars other than & (Capture group 1)
  • .* - any 0+ chars (up to the end of the string).
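
As a quick check, the pattern can be run against both strings from the question. This is only a sketch: new_x is a simplified copy of the second URL with the surrounding quote characters dropped, and perl = TRUE is an addition that is not in the call above (it selects the PCRE engine, which is known to support the lazy quantifier and the non-capturing group).

```r
# Pattern from the answer; inside an R string every regex backslash is doubled.
pattern <- ".*?https?://(?:www\\.)?google\\.com/([^&]+).*"

x <- "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"

# Assumption: simplified version of the second URL in the question.
new_x <- "http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&"

# perl = TRUE is an assumption added here, not part of the original call.
res_x     <- sub(pattern, "\\1", x, perl = TRUE)
res_new_x <- sub(pattern, "\\1", new_x, perl = TRUE)

print(res_x)      # "needstobeparsedout"
print(res_new_x)  # "search?q=erykah+badu+with+hiatus+kaiyote,+august+3"
```

The optional (?:www\\.)? group is what lets the same pattern handle the URL with and without the www. prefix.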

In the replacement pattern, \\1 refers to the text captured in Group 1.
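
The same capture-group idea should carry over to the abc/ example from the question. A minimal sketch, assuming the data frame is called data with a character column text as shown there; the abc/ pattern is an adaptation of the one above, and perl = TRUE is again an addition.

```r
# Sample rows copied from the question.
data <- data.frame(
  text = c("abc/1234&",
           "qwertyabc/5555&",
           "a&sdfghabc/ppp&plksa&",
           "z&xabc/lkjh&poiuw&",
           "lkjqwefasrjabc/855698&plkjdhweb"),
  stringsAsFactors = FALSE
)

# .*? lazily skips to the first "abc/"; ([^&]+) captures up to the next "&";
# the trailing .* consumes the rest so only the captured group remains.
data$parsed_out <- sub(".*?abc/([^&]+).*", "\\1", data$text, perl = TRUE)

print(data)
```

Because sub is vectorised over the column, this adds a single character vector and avoids the intermediate multi-column data frames built by strsplit and rbind. Note that sub returns the input unchanged for rows where the pattern does not match, so such rows may need separate handling.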
