使用正则表达式过滤字符串中的值

Question

I am new to regex expressions.我是正则表达式的新手。

I am trying to filter out values from 2 columns.我正在尝试从 2 列中过滤掉值。

The first column look like this:第一列如下所示：

_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE

The second column look like this:第二列如下所示：

_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad&gtmon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139

My goal is to extract glid values from both columns so that I would have 2 new columns Glid from cookie and Gclid from URL我的目标是从两列中提取 glid 值，以便我有 2 个新列Glid from cookie和Gclid from URL

What I have so far:到目前为止我所拥有的：

def get_glid_from_source(pattern, data):

    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None

df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)

df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)

I would need to edit the expression so that:我需要编辑表达式，以便：

1. Gclid can only start with a letter az or AZ 1. gclid 只能以字母 az 或 AZ 开头

2. Gclid ends with one of the following - ;, %, & or the end of string 2. Gclid 以下列之一结尾 - ;, %, & 或字符串的结尾

Now after filtering I get values wfrom the second column that have click_id=ZFGe... because the value I need can be either gclid=Value I need or gclid&click_id= Value I need现在过滤后我从第二列得到值 w 从具有click_id=ZFGe...因为我需要的值可以是gclid=Value I need或gclid&click_id= Value I need

EDIT编辑

I have a third column which looks like this:我有一个看起来像这样的第三列：

_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...

I am making new column in the pandas dataframe where I add TRUE or FALSE values depending if these conditions are met in the above column:我正在 Pandas 数据框中创建新列，根据上列中是否满足这些条件，在其中添加TRUE或FALSE值：

If the link has:如果链接有：

.com/fire-start.php or .com/go/ or .com/api/ .com/fire-start.php或.com/go/或.com/api/

New column would have value as FALSE if the patter is not found in the string 'TRUE' value is passed.如果在传递的字符串 'TRUE' 值中找不到模式，则新列的值为FALSE 。

What I tried:我试过的：

df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')

But that does not seem to work.但这似乎不起作用。

Thank you for your help, appreciate it.谢谢你的帮助，感激不尽。

Answer 1

You may use您可以使用

df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')

See the regex demo查看正则表达式演示

The Gclid_from_cookie can be populated using Gclid_from_cookie可以使用

df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')

See this regex demo看到这个正则表达式演示

Note that [gclid|click_id] matches any 1 char defined in the character set, a g , c , l , i , d , |请注意， [gclid|click_id]匹配字符集中定义的任何 1 个字符， a g 、 c 、 l 、 i 、 d 、 | , k or _ , not a sequence of chars, hence the non-capturing group (?:...) in my pattern. , k或_ ，不是字符序列，因此是我的模式中的非捕获组(?:...) 。

The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than & , # , ;值模式是[a-zA-Z][^&#]*或[a-zA-Z][^&#;%]* ，这是不言自明的：一个 ASCII 字母和 0 个或多个字符，而不是& , # , ; , % . , % 。

As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not , [^not] matches any char but n , o and t .就问题的更新部分而言，您需要了解否定字符类匹配单个字符，而不是字符序列，您不能“说” [^not]匹配任何文本但not匹配， [^not]匹配除n 、 o和t任何字符。

Add添加

import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/'] 
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))

使用正则表达式过滤字符串中的值

问题描述

1 个解决方案

解决方案1
2 已采纳 2019-11-28 10:35:29

使用正则表达式过滤字符串中的值

问题描述

1 个解决方案

解决方案1 2 已采纳 2019-11-28 10:35:29

解决方案1
2 已采纳 2019-11-28 10:35:29