How to use parse from phonenumbers Python library on a pandas data frame?
How can I parse phone numbers from a pandas data frame, ideally using the phonenumbers library?
I am trying to use a Python port of Google's libphonenumber library, https://pypi.org/project/phonenumbers/ .
I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),"US"))
I tried doing this in a loop, but it is terribly slow. It took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(
        str(df.phone_number[i]), str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well, but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than over the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
phonenumbers.parse(str(x.phone_number),
str(x.region_code)),
axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply it to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) that is returned by df.phone_number. (Print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given.)
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing in the other direction, which would give it one column at a time ([phonenumber0, phonenumber1, phonenumber2...]).
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.
I love @katelie's solution, but here's my code. I added a try/except block so that a failure inside phonenumbers is skipped instead of stopping the run. The library cannot handle strings that are too long.
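import phonenumbers as phon

def formatE164(number):
    # Return the number in E.164 format, or None if it cannot be parsed.
    try:
        return phon.format_number(phon.parse(str(number), "NL"), phon.PhoneNumberFormat.E164)
    except phon.NumberParseException:
        return None

df['column'] = df['column'].apply(formatE164)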