
How to split two joined first names into two separate words in Python

I am trying to split misspelled first names. Most of them are joined together. Is there a way to separate two first names that have been run together into two different words?

For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.

I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work because these are Hispanic first names.

I would be really grateful if you could help me develop a function that does this.

As noted in the comments above, not having a list of possible names is a problem. The following is not perfect, but it offers a starting point.

Given an example dataframe like...

    Name
0   sofíagomez
1   isabelladelgado
2   luisvazquez
3   juanhernandez
4   valentinatrujillo
5   camilagutierrez
6   joséramos
7   carlossantana

Code (Python):

import re
from io import StringIO

import pandas as pd
import requests

# longest list of Hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'

# download the table into a frame and clean up the header
page = requests.get(url)
page.raise_for_status()
# wrap the HTML in StringIO; newer pandas versions deprecate passing raw HTML strings
table = pd.read_html(StringIO(page.text.replace('<br />', ' ')))
surnames = table[0]
surnames.columns = surnames.iloc[0]
surnames = surnames[1:]

# move the frame of surnames to a lowercase list
last_names = surnames['Last name / Surname'].dropna().astype(str).str.lower().tolist()

# create a test dataframe of joined first names and last names
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])

# build one regex that matches any known surname at the end of the string;
# re.escape keeps any special characters in a surname literal
lastname = '({})$'.format('|'.join(re.escape(name) for name in last_names))

# create new columns for the matched names
# (when no surname matches, Firstname keeps the whole original string)
df['Firstname'] = df['Name'].str.replace(lastname, '', regex=True).fillna('--not found--')
df['Lastname'] = df['Name'].str.extract(lastname, expand=False).fillna('--not found--')

# output the dataframe
print('\n\n')
print(df)

Outputs:

    Name                Firstname   Lastname
0   sofíagomez          sofía       gomez
1   isabelladelgado     isabella    delgado
2   luisvazquez         luis        vazquez
3   juanhernandez       juan        hernandez
4   valentinatrujillo   valentina   trujillo
5   camilagutierrez     camila      gutierrez
6   joséramos           josé        ramos
7   carlossantana       carlos      santana
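
If one large regex alternation becomes unwieldy with thousands of surnames, the same "known surname at the end of the string" idea can be sketched in plain Python with a set lookup. This is only a minimal sketch: the function name split_joined_name and the tiny surname set are made up for illustration, and a real run would use the full list of names instead.

```python
def split_joined_name(name, surnames):
    """Split name at its longest known suffix; return (name, None) if none matches."""
    # Start at 1 so the first part is never empty; scanning left to right
    # means the longest matching suffix is found first.
    for i in range(1, len(name)):
        if name[i:] in surnames:
            return name[:i], name[i:]
    return name, None


# Hypothetical, hand-picked list; replace with the full surname list.
known_surnames = {'hernandez', 'trujillo', 'gomez', 'santana', 'ana'}

for joined in ['trujillohernandez', 'carlossantana', 'sofíagomez', 'unknownname']:
    print(joined, '->', split_joined_name(joined, known_surnames))
```

On a dataframe column this could be applied with something like df['Name'].map(lambda n: split_joined_name(n, known_surnames)). Preferring the longest suffix is a heuristic, so some names will still be split in the wrong place.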

Further cleanup may be required, but this should split the majority of the names.
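
One likely cleanup step, assuming the surname list is unaccented while the data sometimes contains accented spellings such as hernández, is to compare names with the diacritics stripped while keeping the original characters in the result. The helper names below are hypothetical, and the sketch uses only the standard library:

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'gómez' -> 'gomez' (note that 'ñ' becomes 'n')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_ignoring_accents(name, surnames):
    """Split at the longest suffix whose unaccented, lowercase form is a known surname."""
    # Normalize to NFC so each accented letter is a single character when slicing.
    name = unicodedata.normalize('NFC', name)
    for i in range(1, len(name)):
        if strip_accents(name[i:]).lower() in surnames:
            return name[:i], name[i:]
    return name, None


# Hypothetical unaccented, lowercase surname set.
known_surnames = {'gomez', 'hernandez', 'ramos'}

print(split_ignoring_accents('sofíagómez', known_surnames))
print(split_ignoring_accents('juanhernández', known_surnames))
```

Because only the comparison is accent-insensitive, the returned parts keep whatever spelling the input had.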
