
Aggregate list dataframe in pandas using dedicated function

I have the following dataframe in pandas:

import pandas as pd

data = {'ID_1': {0: '10A00', 1: '10B00', 2: '20001', 3: '20001'},
        'ID_2_LIST': {0: [20009, 30006], 1: [20001, 30006],
                      2: [30009, 30006], 3: [20001, 30003]},
        'ID_OCCURRENCY_LIST': {0: [1, 2], 1: [5, 6], 2: [2, 4], 3: [1, 3]}}

# create df
df = pd.DataFrame(data)

|    | ID_1   | ID_2_LIST      | ID_OCCURRENCY_LIST   |
|---:|:-------|:---------------|:---------------------|
|  0 | 10A00  | [20009, 30006] | [1, 2]               |
|  1 | 10B00  | [20001, 30006] | [5, 6]               |
|  2 | 20001  | [30009, 30006] | [2, 4]               |
|  3 | 20001  | [20001, 30003] | [1, 3]               |

I would like to aggregate by the ID_1 field, applying an external function (to identify similar ID_1 values, let's say "similarID(ID1,ID2)", which returns ID1 or ID2 according to some internal rules), regenerate the list of ID2 values, and sum the occurrences for all equal ID2 values.

The outcome should be:

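| INDEX | ID_1  | ID_2_LIST                    | ID_OCCURRENCY_LIST |
|------:|:------|:-----------------------------|:-------------------|
|     0 | 10A00 | [20009, 30006, 20001]        | [1, 8, 5]          |
|     1 | 10B00 | [20001, 30006, 30003, 20001] | [5, 6, 4, 2]       |
|     1 | 20001 | [30009, 30006, 20001, 30003] | [2, 4, 1, 3]       |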

EDIT: The code for the function is the following (s1 = first string, c1 = second string, p1 = similarity percentage, l1 = confidence level; Damerau-Levenshtein is a distance function from the literature):

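def pySimilar(s1, c1, p1, l1):
    # Returns 1 if s1 and c1 are considered similar, 0 otherwise.
    if s1 is None or c1 is None:
        return 0
    # Strings of 5 characters or fewer are never considered similar.
    if len(s1) <= 5 or len(c1) <= 5:
        return 0

    s1 = s1.strip()
    c1 = c1.strip()
    s = s1
    c = c1
    # Identical after the 3-character prefix: similar.
    if s1[3:] == c1[3:]:
        return 1

    # Iterate over half the length of the shorter string.
    if len(s1) >= len(c1):
        ITERATIONLENGTH = len(c1) / 2
    else:
        ITERATIONLENGTH = len(s1) / 2

    # Reject early if a slice of the shorter string (starting after the
    # 3-character prefix) does not occur in the longer one.
    if len(s1) >= len(c1):
        a = int(len(c1) / 2) + 1
        if s1.find(c1[3:a]) < 0:
            return 0
    else:
        b = int(len(s1) / 2) + 1
        if c1.find(s1[3:b]) < 0:
            return 0

    v = []
    CNT = 0
    TMP = 0
    max_res = 0
    search = s1
    while CNT < ITERATIONLENGTH:
        # Similarity percentage from the edit distance, weighted by how much
        # of s1 is still being compared.
        TMP = (100 - (pyDamerauLevenschtein(s[3:], c[3:]) * 100) / (len(c) - 3)) * ((len(search) - 3) / (len(s1) - 3))
        v.append(TMP)
        CNT = CNT + 1
        if TMP > max_res:
            max_res = TMP
        # Drop CNT trailing characters before the next comparison.
        search = s1[0:len(s1) - CNT]
        s = s1[0:len(s1) - CNT]
        c = c1[0:len(c1) - CNT]

    # Compare the average score with p1, allowing a tolerance of l1 percent.
    avg = sum(v) / len(v)
    lower = p1 - (l1 * p1 / 100)
    upper = p1 + (l1 * p1 / 100)
    if (lower <= avg and avg <= upper) or avg >= upper:
        return 1
    else:
        return 0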
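
The helper pyDamerauLevenschtein is not shown in the question. For anyone who wants to experiment with the function above, here is a minimal stand-in sketch: it computes the optimal string alignment distance, a restricted form of Damerau-Levenshtein. The name osa_distance is hypothetical, and whether the asker's helper uses this variant is an assumption.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent characters. Stand-in only: the original helper is not shown.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("ab", "ba"))           # 1
print(osa_distance("kitten", "sitting"))  # 3
```

Setting pyDamerauLevenschtein = osa_distance would be enough to try the similarity function on sample strings.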

I have implemented a function to apply to the dataframe, but it is very slow:

def aggregateListAndOccurrencies(list1, list2):
    # Note: list1 and list2 are consumed (emptied) by this function.
    final = []
    final_cnt = []
    output = []
    while list1:
        elem = list1.pop(0)
        cnt_temp = list2.pop(0)
        i = 0
        # Merge every remaining item similar to elem into elem's count.
        while i < len(list1):
            if pyMATCHSIMILARPN(elem, list1[i], 65, 20) == 1:
                cnt_temp = list2[i] + cnt_temp
                list1.pop(i)
                list2.pop(i)
            else:
                # Advance only when nothing was removed, so no item is skipped.
                i += 1
        final.append(elem)
        final_cnt.append(cnt_temp)

    output.append(final)
    output.append(final_cnt)
    return output

How could I apply this in pandas? Any suggestions?

You can simply group by ID_1 and sum the ID_2_LIST and ID_OCCURRENCY_LIST columns (summing lists concatenates them):

df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': 'sum'})
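
The groupby above treats every distinct ID_1 as its own group. To also group similar ID_1 values, as the question asks, one option is to map each ID_1 to a representative first and group on that. The following is only a sketch: is_similar is a placeholder rule made up for illustration (it is not the asker's function), and representatives is a hypothetical helper.

```python
import pandas as pd

data = {'ID_1': {0: '10A00', 1: '10B00', 2: '20001', 3: '20001'},
        'ID_2_LIST': {0: [20009, 30006], 1: [20001, 30006],
                      2: [30009, 30006], 3: [20001, 30003]},
        'ID_OCCURRENCY_LIST': {0: [1, 2], 1: [5, 6], 2: [2, 4], 3: [1, 3]}}
df = pd.DataFrame(data)


def is_similar(a, b):
    # Placeholder rule (assumption, for illustration only): two IDs are
    # similar when they differ at most in their third character.
    return len(a) == len(b) and a[:2] == b[:2] and a[3:] == b[3:]


def representatives(ids, similar):
    """Map each ID to the first previously seen ID it is similar to."""
    reps = []     # representatives found so far, in order of appearance
    mapping = {}
    for id_ in ids:
        if id_ in mapping:
            continue
        match = next((r for r in reps if similar(r, id_)), None)
        if match is None:
            reps.append(id_)
            match = id_
        mapping[id_] = match
    return mapping


mapping = representatives(df['ID_1'], is_similar)
key = df['ID_1'].map(mapping)  # representative ID for every row

# Group on the representative; 'sum' concatenates the lists in each group.
result = (
    df.groupby(key, sort=False)
      .agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': 'sum'})
      .reset_index()
)
print(result)
```

With the placeholder rule, 10A00 and 10B00 end up in one group. Each new ID is compared only against the representatives found so far, not against every other row.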

If there's a specific function you'd like the groupby to use, you can pass it to .agg as a lambda:

df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': lambda x: sum(x, [])})
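
Concatenating the lists leaves repeated ID2 values in place, while the question also wants the occurrences of equal ID2 values summed. Since .agg handles one column at a time, a function that needs both lists is easier to apply after the grouping. A minimal sketch for exact equality only (no similarity matching), with the hypothetical helper name merge_equal_ids:

```python
def merge_equal_ids(ids, counts):
    """Sum the counts of identical IDs, keeping first-seen order."""
    totals = {}
    for id_, count in zip(ids, counts):
        totals[id_] = totals.get(id_, 0) + count
    return list(totals), list(totals.values())


# Lists as they look after concatenating the first two rows of the example.
ids = [20009, 30006, 20001, 30006]
counts = [1, 2, 5, 6]

merged_ids, merged_counts = merge_equal_ids(ids, counts)
print(merged_ids)     # [20009, 30006, 20001]
print(merged_counts)  # [1, 8, 5]
```

The dictionary makes this a single pass over the lists instead of comparing every pair. On a grouped dataframe g it could be applied row by row, for example with [merge_equal_ids(i, c) for i, c in zip(g['ID_2_LIST'], g['ID_OCCURRENCY_LIST'])].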
