Python: Split string conditional on elements in other list
I have a pandas DataFrame with the following structure:
In: df.head(1)
Out:
Individual  Employer                     EmployerState     BranchesState                 BranchesNr
872570      (4210, 7463, 23130, 133752)  (MN, GA, NY, AZ)  (MN, AZ, GA, AZ, NY, AZ, AZ)  (0, 1, 0, 1, 0, 1, 0)
What I intend to do is split up the multiple-employer information and create one record per employer-employee pair, like this:
Individual  Employer  EmployerState  BranchesState  BranchesNr
872570      4210      MN             MN, AZ         0, 1
872570      7463      GA             GA, AZ         0, 1
872570      23130     NY             NY, AZ         0, 1
872570      133752    AZ             AZ             0
Currently, I can do this for the Individual, Employer and EmployerState columns by applying the following code:
import pandas as pd

rows = []  # collects one record per individual-employer pair
for _, row in indv_sub.iterrows():
    # If there are multiple employers
    # Example:
    #   Individual | Employer   =>   Individual | Employer
    #   123        | (XY, AB)        123        | XY
    #                                123        | AB
    if len(str(row['Employer']).split(',')) > 1:
        # split the individual record into as many rows as the individual has employers
        for m, l in zip(row['Employer'].split(','), row['EmployerState'].split(',')):
            rows.append([row['Individual'],
                         m.replace('(', '').replace(')', '').strip(),
                         l.replace('(', '').replace(')', '').strip(),
                         row['BranchesState']])
    else:
        # just add the single employer
        rows.append([row['Individual'], row['Employer'], row['EmployerState'], row['BranchesState']])
indv_relevant = pd.DataFrame(rows, columns=('Individual', 'Employer', 'EmployerState', 'BranchesState'))
# convert_objects(convert_numeric=True) was removed from pandas; pd.to_numeric replaces it
for col in ('Individual', 'Employer'):
    indv_relevant[col] = pd.to_numeric(indv_relevant[col], errors='coerce')
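This works fine, but I can't quite work out how to split the BranchesState column. I added the BranchesNr field, in which a 0 marks the start of the next employer's branches. So consider the following example: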
Employer          BranchesState                 BranchesNr
(MN, GA, NY, AZ)  (MN, AZ, GA, AZ, NY, AZ, AZ)  (0, 1, 0, 1, 0, 1, 0)
The values start with 0, 1 and then return to 0, which means that all branches up to the second position belong to the first employer:
list(row['BranchesState'].split(','))[:2] # would be attributable to the first employer
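Next come positions 3 to 4, which are attributable to the second employer, and so on. I'm not quite sure how to implement this nicely. Any ideas or suggestions?
PS: The fields are strings, not the tuples/lists they look like. Also, 0,1,0 is just an example; some sequences are 0,1,2,0,1,0,1,2,3,4, and so on.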
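To make the grouping rule above concrete, here is a minimal sketch in plain Python. It assumes that a 0 in BranchesNr always opens the next employer's list of branches, as described in the question; the helper name `group_branches` is hypothetical and not part of the original code.

```python
import re

def group_branches(branches_state, branches_nr):
    """Split a branch-state string into one list of states per employer.

    Assumption: a 0 in branches_nr marks the first branch of the next employer.
    """
    states = re.findall(r'\w+', branches_state)
    numbers = [int(n) for n in re.findall(r'\d+', branches_nr)]
    groups = []
    for state, number in zip(states, numbers):
        if number == 0 or not groups:
            groups.append([])  # a 0 opens the next employer's branch list
        groups[-1].append(state)
    return groups

groups = group_branches('(MN, AZ, GA, AZ, NY, AZ, AZ)', '(0, 1, 0, 1, 0, 1, 0)')
print(groups)
# [['MN', 'AZ'], ['GA', 'AZ'], ['NY', 'AZ'], ['AZ']]
```

Because the helper only looks for zeros, it should also cope with longer counters such as 0,1,2,0,1.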
To include more variation in the data, here is a sample of 10 observations:
{u'BrnchOfLoc_FirmNr': {1490: u'(0,0)', 1498: u'(0,0,1,2)', 1594: u'(0,0)', 1618: u'(0,0,0)', 1632: u'(0,0)',
                        1633: u'(0,0)', 1687: u'(0,0)', 1738: u'(0,0)', 1783: u'(0,0,1)', 1793: u'(0,0)'},
 u'BrnchOfLoc_state': {1490: u'(CA,CA)', 1498: u'(CA,CA,CA,CA)', 1594: u'(PA,PA)', 1618: u'(CA,CA,CA)', 1632: u'(NY,NY)',
                       1633: u'(NH,NH)', 1687: u'(FL,FL)', 1738: u'(CA,CA)', 1783: u'(MS,MS,LA)', 1793: u'(NJ,NJ)'},
 u'CrntEmp_orgPK': {1490: u'(13572,144875)', 1498: u'(112059,137743)', 1594: u'(519,162200)',
                    1618: u'(23131,111532,113269)', 1632: u'(6627,118660)', 1633: u'(6413,131406)',
                    1687: u'(131587,142133)', 1738: u'(23131,105698)', 1783: u'(159778,160431)',
                    1793: u'(6413,128859)'},
 u'CrntEmp_state': {1490: u'(CA,CA)', 1498: u'(CA,CA)', 1594: u'(PA,PA)', 1618: u'(NY,CA,CA)', 1632: u'(NY,NY)',
                    1633: u'(MA,NH)', 1687: u'(FL,FL)', 1738: u'(NY,CA)', 1783: u'(MS,LA)', 1793: u'(MA,NJ)'},
 u'Info_indvlPK': {1490: u'731003', 1498: u'29443', 1594: u'708024', 1618: u'707057', 1632: u'830502',
                   1633: u'854101', 1687: u'706344', 1738: u'867229', 1783: u'734227', 1793: u'849856'},
 'NumberEmployer': {1490: 2, 1498: 2, 1594: 2, 1618: 3, 1632: 2, 1633: 2, 1687: 2, 1738: 2, 1783: 2, 1793: 2}}
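I think this gets you most of the way to solving your problem, but the rule for splitting EmployerState is still unclear to me. Perhaps you could include additional examples?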
import pandas as pd

df = pd.DataFrame(
    {'BranchesNr': ['(0, 1, 0, 1, 0, 1, 0)',
                    '(0, 1, 0, 1, 0, 1, 0)'],
     'BranchesState': ['(MN, AZ, GA, AZ, NY, AZ, AZ)',
                       '(MN, AZ, GA, AZ, NY, AZ, AZ)'],
     'Employer': ['(4210, 7463, 23130, 133752)',
                  '(4210, 7463, 23130, 133752)'],
     'EmployerState': ['(MN, GA, NY, AZ)',
                       '(MN, GA, NY, AZ)'],
     'Individual': [872570, 872570]})
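# Parse the tuple-like strings into lists of tokens
df['Employer'] = df.Employer.str.findall(r'(\d+)')
df['EmployerState'] = df.EmployerState.str.findall(r'(\w+)')
df['BranchesState'] = df.BranchesState.str.findall(r'(\w+)')
# \d+ rather than (0|1)+ so that counters above 1 (0, 1, 2, ...) are kept
df['BranchesNr'] = df.BranchesNr.str.findall(r'(\d+)')
# Start position of each employer's branches: a 0 in BranchesNr opens a new employer
indices = [[n for n, flag in enumerate(branches) if flag == '0']
           for branches in df.BranchesNr]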
>>> [(row.Individual, row.Employer[n], row.EmployerState[n])
...  for idx, row in df.iterrows()
...  for n in range(len(row.Employer))]
[(872570, '4210', 'MN'),
(872570, '7463', 'GA'),
(872570, '23130', 'NY'),
(872570, '133752', 'AZ'),
(872570, '4210', 'MN'),
(872570, '7463', 'GA'),
(872570, '23130', 'NY'),
(872570, '133752', 'AZ')]
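The listing above stops at the Individual, Employer and EmployerState columns. One possible way to carry BranchesState and BranchesNr along is sketched below. It rests on the assumption stated in the question that every 0 in BranchesNr opens the next employer's branches, and that there is one 0 per employer; the names `starts`, `ends`, `records` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Individual': [872570],
    'Employer': ['(4210, 7463, 23130, 133752)'],
    'EmployerState': ['(MN, GA, NY, AZ)'],
    'BranchesState': ['(MN, AZ, GA, AZ, NY, AZ, AZ)'],
    'BranchesNr': ['(0, 1, 0, 1, 0, 1, 0)'],
})

# Parse the tuple-like strings into lists of tokens.
for col in ['Employer', 'EmployerState', 'BranchesState', 'BranchesNr']:
    df[col] = df[col].str.findall(r'\w+')

records = []
for _, row in df.iterrows():
    # Assumption: a '0' in BranchesNr marks the first branch of the next employer.
    starts = [i for i, nr in enumerate(row.BranchesNr) if nr == '0']
    ends = starts[1:] + [len(row.BranchesNr)]
    # zip stops at the shortest input, so a mismatch between the number of
    # employers and the number of zeros silently drops the surplus entries.
    for employer, state, start, end in zip(row.Employer, row.EmployerState, starts, ends):
        records.append({
            'Individual': row.Individual,
            'Employer': employer,
            'EmployerState': state,
            'BranchesState': ', '.join(row.BranchesState[start:end]),
            'BranchesNr': ', '.join(row.BranchesNr[start:end]),
        })

result = pd.DataFrame(records)
print(result)
```

For the example row this should yield four records, one per employer, with the branch columns sliced to match the desired output in the question. If the data may contain rows where the counts disagree, an explicit length check before the inner loop would be safer than relying on `zip`.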