
Create a dataframe only selecting rows that match condition

I have a big table in Hive (tens to hundreds of millions of rows) from which I want to select only the rows that match a regex.

Currently I have a small example to try my code on first:

from pyspark.sql import functions as F  # used by the regexp_extract / where calls below

# `spark` is an existing SparkSession (e.g. the one the pyspark shell provides)
columns = ['id', 'column']
vals = [
    (1, "VAL_ID1 BD store"),
    (2, "VAL_ID2 BD store"),
    (3, "VAL_ID3 BD model"),
    (4, "BAD WRONG")
]

df = spark.createDataFrame(vals, columns)

And then I have a tested regex that goes like:

df_regex = df.withColumn(
    'newColumn',
    F.regexp_extract(df['id'], r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)', 1)
)

As I said, this is a test dataframe. In the future it will "look" at a very large table. Is there any way to keep only the rows that match the regex, and thus create a much smaller dataframe?

As it is right now, I am reading every single row and then adding a column with withColumn that is empty for the rows that do not match the regex. That makes sense, but I feel there is a benefit in not reading this dataframe twice if I can avoid it.

You probably want to use where.

df.where(
    F.regexp_extract(df['id'], r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)', 1) != F.lit('')
)
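Note that in PySpark filter is an alias for where, so df.filter(...) with the same condition behaves identically; use whichever reads better to you.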

Actually, I tried your regex and it gives no results. But as long as you understand the principle, I think you can use that solution.
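One possible explanation for the empty result, offered here as an assumption rather than something stated in the original posts, is that the regex is applied to the id column: with the sample data those values are plain digits, so the pattern never matches and every extraction is empty. If the intent is to match against the text column, a minimal sketch of the same filter would be:

from pyspark.sql import functions as F

# Pattern copied from the question; `pattern` and `df_small` are just illustrative names.
pattern = r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)'

# Keep only the rows whose text column yields a non-empty extraction.
df_small = df.where(F.regexp_extract(df['column'], pattern, 1) != F.lit(''))
df_small.show()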


EDIT:

"I feel like there is benefit in not reading this dataframe two times if I can avoid it."

Spark will read your data only when you perform an "action". Transformations are lazy and are therefore only evaluated at the end, so there is no need to worry about Spark reading your data twice (or more).
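As a rough illustration of that point (a sketch under the same column assumption as above, not code from the original answer), the filter and the new column can be chained as lazy transformations, and Spark only scans the source once the final action runs:

pattern = r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)'
matched = F.regexp_extract(df['column'], pattern, 1)

# Lazy transformations: nothing is read from Hive yet.
df_regex = df.where(matched != F.lit('')).withColumn('newColumn', matched)

# The action below triggers a single pass over the source table.
df_regex.show()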
