简体   繁体   中英

Filtering values from string with regex

I am new to regex expressions.

I am trying to filter out values from 2 columns.

The first column look like this:

_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE

The second column look like this:

_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad&gtmon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139

My goal is to extract glid values from both columns so that I would have 2 new columns Glid from cookie and Gclid from URL

What I have so far:

def get_glid_from_source(pattern, data):

    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)

df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)

I would need to edit the expression so that:

1. Gclid can only start with a letter az or AZ

2. Gclid ends with one of the following - ;, %, & or the end of string

Now after filtering I get values wfrom the second column that have click_id=ZFGe... because the value I need can be either gclid=Value I need or gclid&click_id= Value I need

EDIT

I have a third column which looks like this:

_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...

I am making new column in the pandas dataframe where I add TRUE or FALSE values depending if these conditions are met in the above column:

If the link has:

.com/fire-start.php or .com/go/ or .com/api/

New column would have value as FALSE if the patter is not found in the string 'TRUE' value is passed.

What I tried:

df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')

But that does not seem to work.

Thank you for your help, appreciate it.

You may use

df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')

See the regex demo

The Gclid_from_cookie can be populated using

df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')

See this regex demo

Note that [gclid|click_id] matches any 1 char defined in the character set, a g , c , l , i , d , | , k or _ , not a sequence of chars, hence the non-capturing group (?:...) in my pattern.

The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than & , # , ; , % .

As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not , [^not] matches any char but n , o and t .

Add

import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/'] 
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM