How to use parse from phonenumbers Python library on a pandas data frame?

How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?

I am trying to use a port of Google's libphonenumber library on Python, https://pypi.org/project/phonenumbers/ .

I have a data frame with 3 million phone numbers from many countries. One column holds the phone number and another holds the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding region code, but I can't find a way of doing it efficiently.

I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.

df['phone_number_clean'] = df.phone_number.apply(lambda x: 
phonenumbers.parse(str(df.phone_number),str(df.region_code)))

The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.

df['phone_number_clean'] = df.phone_number.apply(lambda x:
 phonenumbers.parse(str(df.phone_number),"US"))

I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:

for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]),str(df.region_code[i]))

Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.

I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.

Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:

df['phone_number_clean'] = df.apply(lambda x: 
                              phonenumbers.parse(str(x.phone_number), 
                                                 str(x.region_code)), 
                              axis='columns')

The differences:

  1. You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (Print the output of df.phone_number to the console: its values, handed over one at a time, are all the information that your lambda function will be given.)

  2. The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...]).

  3. Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows, i.e., x.phone_number and x.region_code.

I love @katelie's solution, but here's my code. I added a try/except block so that a failure inside the phonenumbers call doesn't abort the whole apply; the library cannot handle strings that are too long.

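import phonenumbers as phon

def formatE164(number):
    # Parse with "NL" as the default region and format as E.164
    try:
        return phon.format_number(phon.parse(str(number), "NL"), phon.PhoneNumberFormat.E164)
    except phon.NumberParseException:
        # Unparseable input (for example, a string that is too long) becomes None
        return None

df['column'] = df['column'].apply(formatE164)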
