Python: Split a string conditional on elements in another list

Question

I have a pandas DataFrame with the following structure:

In: df.head(1)
Out:
Individual      Employer                    EmployerState       BranchesState                    BranchesNr
872570          (4210, 7463, 23130, 133752) (MN, GA, NY, AZ)    (MN, AZ, GA, AZ, NY, AZ, AZ)    (0, 1, 0, 1, 0, 1, 0)

What I want to do is split up the multi-employer information and create one record per employer-employee pair, like this:

Individual       Employer       EmployerState   BranchesState       BranchesNr
872570           4210           MN              MN, AZ              0, 1
872570           7463           GA              GA, AZ              0, 1
872570           23130          NY              NY, AZ              0, 1
872570           133752         AZ              AZ                  0

So far I can do this for the Individual, Employer and EmployerState columns with the following code:

import pandas as pd

rows = []  # one entry per individual-employer pair
for _, row in indv_sub.iterrows():
    # Split the individual record into as many rows as there are employers.
    # Example:
    # Individual | Employer      =>         Individual | Employer
    # 123        | (XY, AB)                 123        | XY
    #                                       123        | AB
    employers = str(row['Employer']).split(',')
    states = str(row['EmployerState']).split(',')
    for m, l in zip(employers, states):
        # strip the surrounding parentheses and spaces from each piece
        rows.append([row['Individual'], m.strip('() '), l.strip('() '),
                     row['BranchesState']])

indv_relevant = pd.DataFrame(rows, columns=('Individual', 'Employer', 'EmployerState', 'BranchesState'))
# convert_objects() was removed in later pandas versions; pd.to_numeric covers the numeric columns
for col in ('Individual', 'Employer'):
    indv_relevant[col] = pd.to_numeric(indv_relevant[col])

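This works fine, but I can't work out how to split the BranchesState column. I added the BranchesNr field, in which a 0 marks the start of the next employer's branches. Consider the following example: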

 Employer           BranchesState                   BranchesNr
 (MN, GA, NY, AZ)   (MN, AZ, GA, AZ, NY, AZ, AZ)    (0, 1, 0, 1, 0, 1, 0)

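The values start with 0, 1 and are then followed by another 0, which means that all branches up to the second position belong to the first employer.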

list(row['BranchesState'].split(','))[:2] # would be attributable to the first employer

Next come positions 3 to 4, which belong to the second employer, and so on. I'm not sure how to implement this cleanly. Any ideas or suggestions?

PS: The fields are strings, not the tuples/lists they look like. Also, 0,1,0 is just an example; some sequences are 0,1,2,0,1,0,1,2,3,4 and so on.
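To make the PS concrete, here is a minimal sketch of turning one of these string fields into a real list. `parse_field` is a hypothetical helper name used only for illustration, and it assumes the values are always comma-separated inside a single pair of parentheses.

```python
def parse_field(text):
    """Turn a string such as '(MN, AZ, GA)' into ['MN', 'AZ', 'GA']."""
    # Assumption: one pair of surrounding parentheses, items separated by commas.
    inner = text.strip().strip('()')
    return [item.strip() for item in inner.split(',') if item.strip()]


states = parse_field('(MN, AZ, GA, AZ, NY, AZ, AZ)')
numbers = parse_field('(0, 1, 2, 0, 1)')
print(states)
print(numbers)
```

The items stay strings, so the branch numbers would still need `int()` if they are to be compared numerically.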

To show more variation in the data, here is a sample of 10 observations:

{u'BrnchOfLoc_FirmNr': {1490: u'(0, 0)', 1498: u'(0, 0, 1, 2)', 1594: u'(0, 0)', 1618: u'(0, 0, 0)', 1632: u'(0, 0)', 1633: u'(0, 0)', 1687: u'(0, 0)', 1738: u'(0, 0)', 1783: u'(0, 0, 1)', 1793: u'(0, 0)'},
 u'BrnchOfLoc_state': {1490: u'(CA, CA)', 1498: u'(CA, CA, CA, CA)', 1594: u'(PA, PA)', 1618: u'(CA, CA, CA)', 1632: u'(NY, NY)', 1633: u'(NH, NH)', 1687: u'(FL, FL)', 1738: u'(CA, CA)', 1783: u'(MS, MS, LA)', 1793: u'(NJ, NJ)'},
 u'CrntEmp_orgPK': {1490: u'(13572, 144875)', 1498: u'(112059, 137743)', 1594: u'(519, 162200)', 1618: u'(23131, 111532, 113269)', 1632: u'(6627, 118660)', 1633: u'(6413, 131406)', 1687: u'(131587, 142133)', 1738: u'(23131, 105698)', 1783: u'(159778, 160431)', 1793: u'(6413, 128859)'},
 u'CrntEmp_state': {1490: u'(CA, CA)', 1498: u'(CA, CA)', 1594: u'(PA, PA)', 1618: u'(NY, CA, CA)', 1632: u'(NY, NY)', 1633: u'(MA, NH)', 1687: u'(FL, FL)', 1738: u'(NY, CA)', 1783: u'(MS, LA)', 1793: u'(MA, NJ)'},
 u'Info_indvlPK': {1490: u'731003', 1498: u'29443', 1594: u'708024', 1618: u'707057', 1632: u'830502', 1633: u'854101', 1687: u'706344', 1738: u'867229', 1783: u'734227', 1793: u'849856'},
 'NumberEmployer': {1490: 2, 1498: 2, 1594: 2, 1618: 3, 1632: 2, 1633: 2, 1687: 2, 1738: 2, 1783: 2, 1793: 2}}

Answer 1

I think this almost solves your problem, but the rule for splitting EmployerState is still unclear to me. Perhaps you could include some more examples?

df = pd.DataFrame(
    {'BranchesNr': ['(0, 1, 0, 1, 0, 1, 0)', 
                    '(0, 1, 0, 1, 0, 1, 0)'],
     'BranchesState': ['(MN, AZ, GA, AZ, NY, AZ, AZ)',
                       '(MN, AZ, GA, AZ, NY, AZ, AZ)'],
     'Employer': ['(4210, 7463, 23130, 133752)',
                  '(4210, 7463, 23130, 133752)'],
     'EmployerState': ['(MN, GA, NY, AZ)',
                       '(MN, GA, NY, AZ)'],
     'Individual': [872570, 872570]})

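# Parse each string field into a list of tokens (raw strings avoid escape warnings)
df['Employer'] = df.Employer.str.findall(r'\d+')
df['EmployerState'] = df.EmployerState.str.findall(r'\w+')
df['BranchesState'] = df.BranchesState.str.findall(r'\w+')
# \d+ rather than (0|1)+, since the question says branch numbers can exceed 1
df['BranchesNr'] = df.BranchesNr.str.findall(r'\d+')

# Start position of each employer's block of branches: every '0' opens a new block
indices = [[n for n, flag in enumerate(branches) if flag == '0']
           for branches in df.BranchesNr]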

>>> [(row.Individual, row.Employer[n], row.EmployerState[n])
     for idx, row in df.iterrows()
     for n in range(len(row.Employer))]


[(872570, '4210', 'MN'),
 (872570, '7463', 'GA'),
 (872570, '23130', 'NY'),
 (872570, '133752', 'AZ'),
 (872570, '4210', 'MN'),
 (872570, '7463', 'GA'),
 (872570, '23130', 'NY'),
 (872570, '133752', 'AZ')]
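To carry the same idea through to BranchesState and BranchesNr, here is a minimal sketch in plain Python. It assumes, as the question describes, that a 0 in BranchesNr always marks the first branch of the next employer. `split_row` is a hypothetical helper written for this illustration, not a pandas function.

```python
import re


def split_row(individual, employer, employer_state, branches_state, branches_nr):
    """Split one record's string fields into one tuple per employer."""
    employers = re.findall(r'\d+', employer)
    emp_states = re.findall(r'\w+', employer_state)
    br_states = re.findall(r'\w+', branches_state)
    br_nrs = re.findall(r'\d+', branches_nr)

    # Assumption: a branch number of 0 opens the block of the next employer.
    starts = [i for i, nr in enumerate(br_nrs) if nr == '0']
    ends = starts[1:] + [len(br_nrs)]

    # zip() stops at the shortest input, so mismatched counts are silently cut.
    return [
        (individual, emp, state,
         ', '.join(br_states[lo:hi]), ', '.join(br_nrs[lo:hi]))
        for emp, state, lo, hi in zip(employers, emp_states, starts, ends)
    ]


rows = split_row(872570,
                 '(4210, 7463, 23130, 133752)',
                 '(MN, GA, NY, AZ)',
                 '(MN, AZ, GA, AZ, NY, AZ, AZ)',
                 '(0, 1, 0, 1, 0, 1, 0)')
for r in rows:
    print(r)
```

Collecting the tuples for every row of the frame into one list and passing it to `pd.DataFrame` with the five column names should give the layout asked for in the question. If the number of 0 markers can differ from the number of employers, a length check before the `zip` would be safer.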

Solution 1
0 2016-05-12 16:08:33
