簡體   English   中英

在PySpark的時間間隔內計算重復值

[英]Count Duplicates Values within a time interval in PySpark

我正在使用以下格式的大數據進行PySpark作業。

ID-1234567  iplong  agent   partner client  country timestamp   category    reference

我需要基於一分鍾時間間隔為每個合作伙伴在第2(iplong), 3(agent), 5(client), 6(country), 9(reference)列中找到重復記錄的平均數量。

我了解我需要

  1. 將記錄分成一分鍾的間隔。
  2. 通過partner映射所有內容
  3. partner分組所有內容
  4. 通過減少總記錄數和減少不同記錄數來減少每個間隔,並求差以獲取重復記錄的數量(還需要定義一個函數來比較兩個記錄,僅將兩個記錄的值分別為2(iplong), 3(agent), 5(client), 6(country), 9(reference)列。)
  5. 將所有間隔中的所有partner及其重復計數加在一起。 並除以其外觀計數。

我了解此過程,但不了解pyspark的確切實現。

有人可以幫我在pyspark執行上述任何步驟pyspark

樣本數據:

9794474 1000460030  Samsung_S5233   dv4gs   dswae   in  2012-03-08 00:00:00 mg  riflql2a0yv8xoa9sq0recx4x
9794471 3386480130  Nokia_C3-00 duq7h   dr75h   py  2012-03-08 00:00:00 co  
9794468 1907980030  Nokia_5233  dv6i3   ds3xq   vn  2012-03-08 00:00:00 es  gp53lqr9njqd6z2ap5d364sip
9794467 1791990020  MAUI    duxto   dvb8g   in  2012-03-08 00:00:00 ad  
9794466 1791000060  Nokia_3110c dusg4   dvb8g   in  2012-03-08 00:00:00 ad  
9794477 1353590020  Blackberry_9300 du6dt   dtr0u   es  2012-03-08 00:00:00 es  h5njsswvxorsau9u8fxh0e9se
9794478 1402290050  NokiaC6-01.3    dusnc   dsgcn   ru  2012-03-08 00:00:00 mc  
9794481 1848749950  Nokia_C3-00 dvry3   dr6sg   th  2012-03-08 00:00:01 mc  oj0rekb51pvirnjuqjt10zn4b

更新:

到目前為止,我已經嘗試將整個數據放入MySQL並從中讀取數據。 但是讀取操作會花費太多時間。

對於mapreduce方法,我嘗試了其他更小的方法。 但是不明白我將如何在代碼中進一步處理它。 因此,無法采用一種方法前進。

clicks_rdd = sc.parallelize(list(clicks_reader)[1:]) 
minwise_clicks = clicks_rdd.groupby(clicks_rdd.index.map(lambda t: t.minute)) # Didn't work
clicks_mapped_publishers = clicks_rdd.map(lambda x : (x.pop(3), x)) # Works fine but need the records divided into minute intervals first.

在這里和那里也嘗試了其他一些東西。 但是沒有什么堅實的。

以下是我原始數據集文件的前25條記錄。

id,iplong,agent,partnerid,cid,cntr,timeat,category,referer
9794476,1071324855,SonyEricsson_K70,dv3va,dsfag,us,2012-03-08 00:00:00.0,ad,
9794474,1000461055,Samsung_S5233,dv4gs,dswae,in,2012-03-08 00:00:00.0,mg,riflql2a0yv8xoa9sq0recx4x
9794471,3386484265,Nokia_C3-00,duq7h,dr75h,py,2012-03-08 00:00:00.0,co,
9794468,1907981997,Nokia_5233,dv6i3,ds3xq,vn,2012-03-08 00:00:00.0,es,gp53lqr9njqd6z2ap5d364sip
9794467,1791989091,MAUI,duxto,dvb8g,in,2012-03-08 00:00:00.0,ad,
9794466,1791002478,Nokia_3110c,dusg4,dvb8g,in,2012-03-08 00:00:00.0,ad,
9794477,1353590316,Blackberry_9300,du6dt,dtr0u,es,2012-03-08 00:00:00.0,es,h5njsswvxorsau9u8fxh0e9se
9794478,1402285217,NokiaC6-01.3,dusnc,dsgcn,ru,2012-03-08 00:00:00.0,mc,
9794481,1848747204,Nokia_C3-00,dvry3,dr6sg,th,2012-03-08 00:00:01.0,mc,oj0rekb51pvirnjuqjt10zn4b
9794482,1893182670,NokiaC2-03,du77a,dr6x2,id,2012-03-08 00:00:01.0,co,r63f8uhijvr2irvka3glwyb38
9794483,1912930086,MAUI,dvwdj,dvb8g,id,2012-03-08 00:00:01.0,ad,
9794485,2098816838,GT-S5360B,dvjtq,dr72e,th,2012-03-08 00:00:01.0,co,
9794486,3309473440,MAUI,dv6i3,ds3k0,za,2012-03-08 00:00:01.0,es,
9794492,702295934,Nokia_9300,dv6i3,dtqrw,ng,2012-03-08 00:00:01.0,es,onbw7na2mi8a62g4p6y3av2qt
9794493,694135362,Nokia_N95,dupgf,dvb8g,sd,2012-03-08 00:00:01.0,ad,hoq05psulkszxm4izlql4g962
9794495,1791428359,Samsung_S8300,dvpo7,dvb8g,in,2012-03-08 00:00:02.0,co,im387req0zp1ucygamhgadgtm
9794496,1783607271,GT-S5570,du56s,dsgq2,in,2012-03-08 00:00:02.0,mc,immfap8948rebeym8ri0vf5cr
9794498,1860189232,Samsung_GT-B3313,du56s,ds22r,in,2012-03-08 00:00:02.0,mc,r81nrzjemr5jrfvjjeoxmdm4y
9794499,1868310973,Nokia_2730c,dv3va,drvnr,au,2012-03-08 00:00:02.0,ad,
9794500,1893182511,Nokia_5233,dv6i7,dr6tn,id,2012-03-08 00:00:02.0,co,tq09jycwii12iul7hzalucue3
9794501,1884230403,Samsung_GT-S3653,dvjil,ds92x,in,2012-03-08 00:00:02.0,mc,h0z1j3bwiverubvwg851e9eon
9794503,1945382244,GT-S5360,dvijt,dsgq2,in,2012-03-08 00:00:02.0,mc,fbbenjzmoe0oc7x4e2080nj8x
9794508,2928534854,Samsung_R310,dunsq,dsg3q,us,2012-03-08 00:00:02.0,ad,kl9j183hop90uwq2p82iidjsb
9794510,3063717709,Samsung_GT-S3653,dvjjf,dr751,in,2012-03-08 00:00:02.0,ad,rpdt9h4kpooxiedeuuxvk6gi5
9794511,3557769762,Samsung_C3050,du53k,dr71b,hr,2012-03-08 00:00:02.0,se,

更新2

樣本輸出。 這是制表符分隔值格式。 您可以將其復制並粘貼到excel中以正確查看。 這里avg_spiky_ReAgCnIpCi是每秒重復的referenceAgentCountryIPClient組合的平均計數。 我對此感興趣。然后我可以進行更改以派生其他功能。

partnerid   status  avg_spiky_ReAgCnIpCi    std_spiky_ReAgCnIpCi    night_avg_spiky_ReAgCnIpCi  night_std_spiky_ReAgCnIpCi  morning_avg_spiky_ReAgCnIpCi    morning_std_spiky_ReAgCnIpCi    afternoon_avg_spiky_ReAgCnIpCi  afternoon_std_spiky_ReAgCnIpCi  evening_avg_spiky_ReAgCnIpCi    evening_std_spiky_ReAgCnIpCi    avg_spiky_ReAgCnIp  std_spiky_ReAgCnIp  avg_spiky_ReAgCn    std_spiky_ReAgCn    avg_spiky_iplong    std_spiky_iplong    avg_spiky_agent std_spiky_agent night_avg_spiky_agent   night_std_spiky_agent   morning_avg_spiky_agent morning_std_spiky_agent afternoon_avg_spiky_agent   afternoon_std_spiky_agent   evening_avg_spiky_agent evening_std_spiky_agent avg_spiky_cid   std_spiky_cid   avg_spiky_cntr  std_spiky_cntr  avg_spiky_referer   std_spiky_referer   night_avg_spiky_referer night_std_spiky_referer morning_avg_spiky_referer   morning_std_spiky_referer   afternoon_avg_spiky_referer afternoon_std_spiky_referer evening_avg_spiky_referer   evening_std_spiky_referer   category_es category_mc category_ad category_co category_se category_mg category_pp category_in category_gd category_ow total_clicks    distinct_iplong distinct_agent  distinct_cid    distinct_cntr   distinct_referer    night_click_percent morning_click_percent   afternoon_click_percent evening_click_percent   night_referer_percent   morning_referer_percent afternoon_referer_percent   evening_referer_percent night_agent_percent morning_agent_percent   afternoon_agent_percent evening_agent_percent   avg_total_clicks    std_total_clicks    avg_distinct_iplong std_distinct_iplong avg_distinct_agent  std_distinct_agent  avg_distinct_cid    std_distinct_cid    avg_distinct_cntr   std_distinct_cntr   avg_distinct_referer    std_distinct_referer    avg_null_agent  std_null_agent  avg_null_referer    std_null_referer    night_avg_null_referer  night_std_null_referer  morning_avg_null_referer    morning_std_null_referer    afternoon_avg_null_referer  afternoon_std_null_referer  evening_avg_null_referer    evening_std_null_referer    first_15_minute_percent second_15_minute_percent    third_15_minute_percent last_15_minute_percent  brand_MAUI_percent  brand_Nokia_percent brand_Generic_percent   brand_Apple_percent brand_Blackberry_percent    brand_Samsung_percent   brand_SonyEricsson_percent  brand_LG_percent    brand_other_percent avg_per_hour_density    std_per_hour_density    cntr_az_percent cntr_id_percent cntr_in_percent cntr_us_percent cntr_ng_percent cntr_tr_percent cntr_ru_percent cntr_th_percent cntr_sg_percent cntr_uk_percent cntr_other_percent
du3nk   0   1.23    8.47    0   0   0   0   0   0   1.23    8.47    1.24    8.48    1.27    8.61    4.14    11.73   8.73    16.06   0   0   0   0   0   0   8.73    16.06   38.18   240.99  60  248 1.8 10.35   0   0   0   0   0   0   1.8 10.35   0   1   0   0   0   0   0   0   0   0   3360    644 250 61  31  1696    0   0   0   1   0   0   0   1   0   0   0   1   3360    0   644 0   250 0   61  0   31  0   1696    0   0   0   598 0   0   0   0   0   0   0   598 0   0.16    0.17    0.33    0.35    0.01    0   0.05    0   0   0   0   0   0   2   0   0   0   0.13    0   0   0   0   0   0.01    0   0.04
du3nq   1   8.38    5.83    0   0   0   0   0   0   8.38    5.83    25.13   9.27    25.13   9.27    188.5   49.5    188.5   49.5    0   0   0   0   0   0   188.5   49.5    53.86   39.03   188.5   49.5    25.13   9.27    0   0   0   0   0   0   25.13   9.27    1   0   0   0   0   0   0   0   0   0   377 1   1   5   1   8   0   0   0   1   0   0   0   1   0   0   0   1   377 0   1   0   1   0   5   0   1   0   8   0   0   0   0   0   0   0   0   0   0   0   0   0   0.09    0.14    0.33    0.44    0   0   0   1   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3op   0   30.43   46.87   0   0   0   0   44.67   59.63   19.75   30.19   35.5    48.84   35.5    48.84   71  52.27   71  52.27   0   0   0   0   134 0   39.5    33.5    13.31   8.24    71  52.27   35.5    48.84   0   0   0   0   67  62  19.75   30.19   0   0   1   0   0   0   0   0   0   0   213 1   1   6   1   1   0   0   0.63    0.37    0   0   1   1   0   0   1   1   213 0   1   0   1   0   6   0   1   0   1   0   0   0   205 0   0   0   0   0   129 0   76  0   0   0.09    0.25    0.66    0   1   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0   1
du3or   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   0   1   0   0   0   0   0   0   0   0   2   2   1   1   1   1   0   0   0.5 0.5 0   0   1   1   0   0   1   1   2   0   2   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0.5 0   0   0.5 0   0   0   0   0   0   1   0   0   2   0   0   1   0   0   0   0   0   0   0   0   0
du3ov   0   1.01    0.11    0   0   0   0   0   0   1.01    0.11    1.01    0.11    1.01    0.11    44  30  29.33   31.63   0   0   0   0   0   0   29.33   31.63   6.29    5.59    44  30  1.02    0.21    0   0   0   0   0   0   1.02    0.21    0   0   0   0   1   0   0   0   0   0   88  1   2   10  1   86  0   0   0   1   0   0   0   1   0   0   0   1   88  0   1   0   2   0   10  0   1   0   86  0   0   0   0   0   0   0   0   0   0   0   0   0   0.84    0   0   0.16    0   0.94    0   0.06    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3ox   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   1   1   1   1   1   1   0   0   0   1   0   0   0   1   0   0   0   1   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1
du3oy   0   1.02    0.12    0   0   0   0   0   0   1.02    0.12    1.02    0.15    1.02    0.15    64.5    31.5    32.25   35.55   0   0   0   0   0   0   32.25   35.55   7.59    6.03    64.5    31.5    1.03    0.28    0   0   0   0   0   0   1.03    0.28    0   0   0   0   1   0   0   0   0   0   129 1   3   12  1   124 0   0   0   1   0   0   0   1   0   0   0   1   129 0   1   0   3   0   12  0   1   0   124 0   0   0   0   0   0   0   0   0   0   0   0   0   0.26    0.58    0.16    0   0   0.95    0   0.04    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
du3oz   1   1   0   0   0   0   0   1   0   0   0   1   0   33  3.35    1.01    0.08    165 0   0   0   0   0   165 0   0   0   27.5    8.18    165 0   33  3.35    0   0   0   0   33  3.35    0   0   1   0   0   0   0   0   0   0   0   0   165 164 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   165 0   164 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0
du3p1   0   1   0   0   0   0   0   1   0   0   0   1   0   18.2    16.11   1.01    0.07    91  80  0   0   0   0   91  80  0   0   15.17   14.82   91  80  18.2    16.11   0   0   0   0   18.2    16.11   0   0   1   0   0   0   0   0   0   0   0   0   182 181 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   182 0   181 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0.06    0   0   0.94    0   1   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   1   0   0
du3r7   0   3.63    1.32    0   0   0   0   0   0   3.63    1.32    29  0   29  0   29  0   29  0   0   0   0   0   0   0   29  0   3.63    1.32    29  0   29  0   0   0   0   0   0   0   29  0   0   0   0   0   1   0   0   0   0   0   29  1   1   8   1   1   0   0   0   1   0   0   0   1   0   0   0   1   29  0   1   0   1   0   8   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0

在里面:

from pyspark import *
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql import functions as f

這等於您的“我的原始數據集文件的前25條記錄”。

df = spark.read.load(path="file:///home/zht/PycharmProjects/test/disk_file", format='csv', sep=',', header=True)

只是為了獲得卓越的結果,這一步可以忽略

df = df.withColumn('iplong', f.substring('iplong', pos=0, len=1)) \
    .withColumn('agent', f.substring('agent', pos=0, len=1)) \
    .withColumn('client', f.substring('client', pos=0, len=2)) \
    .withColumn('partner', f.substring('partner', pos=0, len=2)) \
    .withColumn('timestamp',f.when(f.substring('id', pos=6, len=1) % 2 == 1, '2012-03-08 00:01:00.0').otherwise(df['timestamp']))
df.show()

+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|     id|iplong|agent|partner|client|country|           timestamp|category|           reference|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|9794476|     1|    S|     dv|    ds|     us|2012-03-08 00:01:...|      ad|                null|
|9794474|     1|    S|     dv|    ds|     in|2012-03-08 00:01:...|      mg|riflql2a0yv8xoa9s...|
|9794471|     3|    N|     du|    dr|     py|2012-03-08 00:01:...|      co|                null|
|9794468|     1|    N|     dv|    ds|     vn|2012-03-08 00:00:...|      es|gp53lqr9njqd6z2ap...|
|9794467|     1|    M|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794466|     1|    N|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794477|     1|    B|     du|    dt|     es|2012-03-08 00:01:...|      es|h5njsswvxorsau9u8...|
|9794478|     1|    N|     du|    ds|     ru|2012-03-08 00:01:...|      mc|                null|
|9794481|     1|    N|     dv|    dr|     th|2012-03-08 00:00:...|      mc|oj0rekb51pvirnjuq...|
|9794482|     1|    N|     du|    dr|     id|2012-03-08 00:00:...|      co|r63f8uhijvr2irvka...|
|9794483|     1|    M|     dv|    dv|     id|2012-03-08 00:00:...|      ad|                null|
|9794485|     2|    G|     dv|    dr|     th|2012-03-08 00:00:...|      co|                null|
|9794486|     3|    M|     dv|    ds|     za|2012-03-08 00:00:...|      es|                null|
|9794492|     7|    N|     dv|    dt|     ng|2012-03-08 00:01:...|      es|onbw7na2mi8a62g4p...|
|9794493|     6|    N|     du|    dv|     sd|2012-03-08 00:01:...|      ad|hoq05psulkszxm4iz...|
|9794495|     1|    S|     dv|    dv|     in|2012-03-08 00:01:...|      co|im387req0zp1ucyga...|
|9794496|     1|    G|     du|    ds|     in|2012-03-08 00:01:...|      mc|immfap8948rebeym8...|
|9794498|     1|    S|     du|    ds|     in|2012-03-08 00:01:...|      mc|r81nrzjemr5jrfvjj...|
|9794499|     1|    N|     dv|    dr|     au|2012-03-08 00:01:...|      ad|                null|
|9794500|     1|    N|     dv|    dr|     id|2012-03-08 00:00:...|      co|tq09jycwii12iul7h...|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+

以及關鍵操作:

res = df.groupBy([f.window('timestamp', windowDuration='1 minutes'),'partner', 'iplong', 'agent']).count()
res = res.withColumn('total',f.sum('count').over(Window.partitionBy(["window", "partner"])))
res.show(n=30, truncate=False)

+---------------------------------------------+-------+------+-----+-----+-----+
|window                                       |partner|iplong|agent|count|total|
+---------------------------------------------+-------+------+-----+-----+-----+
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |S    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |6     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |B    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |G    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |S    |1    |7    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |3     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |N    |3    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |2     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |S    |1    |8    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |3     |S    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |7     |N    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |S    |3    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |N    |1    |6    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |2     |S    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |M    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |N    |2    |4    |
+---------------------------------------------+-------+------+-----+-----+-----+

是指每1分鍾,合作伙伴,iplong和代理每1分鍾的記錄數量

總計是指每1分鍾和每個合作伙伴記錄的數量

你是這個意思嗎

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM