Extract dictionary value from column in data frame with Vaex

I applied the following command to my dataframe:

df['date_article'] = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')

This created the column 'date_article':

pagePath                                             date_article
'/empresas/2021/10/22/tiendas-no-participan-buen'    {'digit': '/2021/10/22/'}
'/finanzas-personales/2021/10/22/pueden-cobrar-c     {'digit': '/2021/10/22/'}

Now I want to keep only the date in 'date_article'.

Expected output:

pagePath                                             date_article
'/empresas/2021/10/22/tiendas-no-participan-buen'    '/2021/10/22/'
/finanzas-personales/2021/10/22/pueden-cobrar-c      '/2021/10/22/'

I have tried many things, but nothing seems to work.

Thank you in advance for your help.

How about the following:

df['date_article'] = df.apply(lambda x: x['digit'], axis=1)

It appears that extract_regex returns a struct series:

Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).

Parameters

pattern (str) – A regular expression which needs to contain named capture groups, e.g. 'letter' and 'digit' for the regular expression '(?P<letter>[ab])(?P<digit>\d)'.

Returns

an expression containing a struct with field names corresponding to capture group identifiers.

So you will need to extract the field you want out of the struct. I'm not a Vaex expert, but maybe something like:

struct_series = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')
df['date_article'] = struct_series.struct.get('digit')
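To make the shape of the data concrete, here is a minimal sketch in plain Python using the standard `re` module rather than Vaex. It only illustrates the idea that a pattern with a named capture group yields one record per row keyed by the group name ('digit'), and that the fix is to pull that one field out. The helper names `extract_fields` and `get_field` are hypothetical and are not part of any library.

```python
import re

# Same pattern as in the question: a single named capture group called 'digit'.
DATE_PATTERN = re.compile(r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')


def extract_fields(path):
    """Return a dict of named groups, or None when the pattern does not match."""
    match = DATE_PATTERN.search(path)
    return match.groupdict() if match else None


def get_field(record, name):
    """Pull one field out of a record, passing missing records through as None."""
    return None if record is None else record.get(name)


paths = [
    '/empresas/2021/10/22/tiendas-no-participan-buen',
    '/finanzas-personales/2021/10/22/pueden-cobrar-c',
]

# Step 1: one dict per row, analogous to the struct column in the question.
structs = [extract_fields(p) for p in paths]

# Step 2: keep only the 'digit' field, analogous to the expected output.
dates = [get_field(s, 'digit') for s in structs]

print(structs)
print(dates)
```

The two steps mirror the two lines above: the first builds the struct-like values, the second selects a single field by name. Rows where the pattern does not match come through as None in this sketch.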

Use:

import pandas as pd

df = pd.DataFrame({'date_article':[{'digit': '/2021/10/22/'}]})
df['date_article'] = df['date_article'].apply(lambda x: x['digit'])
