
Python: Split string conditional on elements in other list

So I have a pandas DataFrame with the following structure:

In: df.head(1)
Out:
Individual      Employer                    EmployerState       BranchesState                    BranchesNr
872570          (4210, 7463, 23130, 133752) (MN, GA, NY, AZ)    (MN, AZ, GA, AZ, NY, AZ, AZ)    (0, 1, 0, 1, 0, 1, 0)

What I intend to do is split up the multiple-employer information and create one record per employer-employee pair, like this:

Individual       Employer       EmployerState   BranchesState       BranchesNr
872570           4210           MN              MN, AZ              0, 1
872570           7463           GA              GA, AZ              0, 1
872570           23130          NY              NY, AZ              0, 1
872570           133752         AZ              AZ                  0

Currently, I can do this for the Individual, Employer and EmployerState columns by applying the following code:

import pandas as pd

rows = []  # one output record per individual-employer pair
for _, row in indv_sub.iterrows():

    # If there are multiple employers
    # Example:
    # Individual | Employer      =>         Individual | Employer
    # 123        | (XY, AB)                 123        | XY
    #                                       123        | AB

    if len(str(row['Employer']).split(',')) > 1:
        # split the individual record into as many employers as an individual has
        for m, l in zip(row['Employer'].split(','), row['EmployerState'].split(',')):
            rows.append([row['Individual'],
                         m.replace('(', '').replace(')', '').strip(),
                         l.replace('(', '').replace(')', '').strip(),
                         row['BranchesState']])
    else:
        # just add the single employer
        rows.append([row['Individual'], row['Employer'], row['EmployerState'], row['BranchesState']])

indv_relevant = pd.DataFrame(rows, columns=('Individual', 'Employer', 'EmployerState', 'BranchesState'))
# convert_objects() was removed in pandas 1.0; convert the numeric columns explicitly
for col in ('Individual', 'Employer'):
    indv_relevant[col] = pd.to_numeric(indv_relevant[col], errors='coerce')

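This works fine, but I can't quite split the BranchesState column. I added the BranchesNr field, in which a 0 marks the start of the next employer's branches. So consider the following example: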

 Employer           BranchesState                   BranchesNr
 (MN, GA, NY, AZ)   (MN, AZ, GA, AZ, NY, AZ, AZ)    (0, 1, 0, 1, 0, 1, 0)

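The values start with 0, 1 and are then followed by another 0, which means that all branches up to the second position belong to the first employer.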

list(row['BranchesState'].split(','))[:2] # would be attributable to the first employer

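Next come positions 3 to 4, which are attributable to the second employer, and so on. I'm not quite sure how to implement this nicely. Any ideas or suggestions?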

PS: The fields are strings, not tuples/lists as they may appear. Also, 0,1,0 is just an example; some sequences are 0,1,2,0,1,0,1,2,3,4 and so on.

To include more variation in the data, here is a sample of 10 observations:

{u'BrnchOfLoc_FirmNr': {1490: u'(0,0)', 1498: u'(0,0,1,2)', 1594: u'(0,0)', 1618: u'(0,0,0)', 1632: u'(0,0)', 1633: u'(0,0)', 1687: u'(0,0)', 1738: u'(0,0)', 1783: u'(0,0,1)', 1793: u'(0,0)'},
 u'BrnchOfLoc_state': {1490: u'(CA,CA)', 1498: u'(CA,CA,CA,CA)', 1594: u'(PA,PA)', 1618: u'(CA,CA,CA)', 1632: u'(NY,NY)', 1633: u'(NH,NH)', 1687: u'(FL,FL)', 1738: u'(CA,CA)', 1783: u'(MS,MS,LA)', 1793: u'(NJ,NJ)'},
 u'CrntEmp_orgPK': {1490: u'(13572,144875)', 1498: u'(112059,137743)', 1594: u'(519,162200)', 1618: u'(23131,111532,113269)', 1632: u'(6627,118660)', 1633: u'(6413,131406)', 1687: u'(131587,142133)', 1738: u'(23131,105698)', 1783: u'(159778,160431)', 1793: u'(6413,128859)'},
 u'CrntEmp_state': {1490: u'(CA,CA)', 1498: u'(CA,CA)', 1594: u'(PA,PA)', 1618: u'(NY,CA,CA)', 1632: u'(NY,NY)', 1633: u'(MA,NH)', 1687: u'(FL,FL)', 1738: u'(NY,CA)', 1783: u'(MS,LA)', 1793: u'(MA,NJ)'},
 u'Info_indvlPK': {1490: u'731003', 1498: u'29443', 1594: u'708024', 1618: u'707057', 1632: u'830502', 1633: u'854101', 1687: u'706344', 1738: u'867229', 1783: u'734227', 1793: u'849856'},
 'NumberEmployer': {1490: 2, 1498: 2, 1594: 2, 1618: 3, 1632: 2, 1633: 2, 1687: 2, 1738: 2, 1783: 2, 1793: 2}}
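
A quick sanity check on this sample, offered only as a sketch: if each 0 in BrnchOfLoc_FirmNr really marks the first branch of a new employer, then the number of zeros in that field should equal NumberEmployer. The snippet below checks three of the records shown above; `count_blocks` is a hypothetical helper name, not something from the original post.

```python
import re

# Three records copied from the sample above:
# index -> (BrnchOfLoc_FirmNr, NumberEmployer)
sample = {
    1498: ('(0,0,1,2)', 2),
    1618: ('(0,0,0)', 3),
    1783: ('(0,0,1)', 2),
}

def count_blocks(firm_nr):
    """Count how many times the counter string restarts at 0."""
    return sum(1 for n in re.findall(r'\d+', firm_nr) if int(n) == 0)

for idx, (firm_nr, n_employers) in sample.items():
    # True when the number of 0-restarts matches the employer count
    print(idx, count_blocks(firm_nr) == n_employers)
```

For these three records the counts agree, which is consistent with the "0 starts a new employer" reading, though it does not prove the rule holds for the full data set.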

I think this almost solves your problem, but I am still unclear on the rules for splitting EmployerState. Perhaps you could include additional examples?

import pandas as pd

df = pd.DataFrame(
    {'BranchesNr': ['(0, 1, 0, 1, 0, 1, 0)', 
                    '(0, 1, 0, 1, 0, 1, 0)'],
     'BranchesState': ['(MN, AZ, GA, AZ, NY, AZ, AZ)',
                       '(MN, AZ, GA, AZ, NY, AZ, AZ)'],
     'Employer': ['(4210, 7463, 23130, 133752)',
                  '(4210, 7463, 23130, 133752)'],
     'EmployerState': ['(MN, GA, NY, AZ)',
                       '(MN, GA, NY, AZ)'],
     'Individual': [872570, 872570]})

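# Parse the string fields into lists (raw strings keep the regex escapes intact)
df['Employer'] = df.Employer.str.findall(r'(\d+)')
df['EmployerState'] = df.EmployerState.str.findall(r'(\w+)')
df['BranchesState'] = df.BranchesState.str.findall(r'(\w+)')
# \d+ rather than (0|1)+, since the counters can run past 1 (e.g. 0, 1, 2)
df['BranchesNr'] = df.BranchesNr.str.findall(r'(\d+)')

# Start position of each employer's block of branches: every '0' opens a new block
indices = [[n for n, flag in enumerate(branches) if flag == '0']
           for branches in df.BranchesNr]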

>>> [(row.Individual, row.Employer[n], row.EmployerState[n])
...  for idx, row in df.iterrows()
...  for n in range(len(row.Employer))]


[(872570, '4210', 'MN'),
 (872570, '7463', 'GA'),
 (872570, '23130', 'NY'),
 (872570, '133752', 'AZ'),
 (872570, '4210', 'MN'),
 (872570, '7463', 'GA'),
 (872570, '23130', 'NY'),
 (872570, '133752', 'AZ')]
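
To also split BranchesState per employer, one possible approach is sketched below. It assumes, as the question describes, that a 0 in BranchesNr marks the first branch of the next employer, and that the employers appear in the same order as the branch blocks. The helper names `split_by_resets` and `expand_record` are hypothetical and are not part of the code above; the sketch uses only the standard library so it can be run without pandas.

```python
import re

def split_by_resets(values, counters):
    """Group values per employer; a counter of 0 starts a new group."""
    groups = []
    for value, counter in zip(values, counters):
        if counter == 0 or not groups:
            groups.append([])  # a 0 opens the next employer's block
        groups[-1].append(value)
    return groups

def expand_record(individual, employer, employer_state, branches_state, branches_nr):
    """Turn one raw string record into one tuple per employer."""
    employers = re.findall(r'\d+', employer)
    emp_states = re.findall(r'\w+', employer_state)
    states = re.findall(r'\w+', branches_state)
    counters = [int(c) for c in re.findall(r'\d+', branches_nr)]
    state_groups = split_by_resets(states, counters)
    nr_groups = split_by_resets(counters, counters)
    return [(individual, int(e), s, ', '.join(sg), ', '.join(map(str, ng)))
            for e, s, sg, ng in zip(employers, emp_states, state_groups, nr_groups)]

# The example row from the question (all fields except Individual are strings)
records = expand_record(872570,
                        '(4210, 7463, 23130, 133752)',
                        '(MN, GA, NY, AZ)',
                        '(MN, AZ, GA, AZ, NY, AZ, AZ)',
                        '(0, 1, 0, 1, 0, 1, 0)')
for rec in records:
    print(rec)
```

Applying `expand_record` to every row and passing the collected tuples to `pd.DataFrame(..., columns=['Individual', 'Employer', 'EmployerState', 'BranchesState', 'BranchesNr'])` should give the table requested in the question. Because the grouping keys on the counter returning to 0 rather than on the value 1, it also handles longer sequences such as 0,1,2,0,1.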
