简体   繁体   中英

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?

Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.

Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?

Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.

testlist.xlsx

| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|

carriers.xlsx

| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os

FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')

FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')


for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?

carriers.xlsx

countryCode Carrier
1246 LIME
1 AT&T
81 Softbank
52 Telmex
966 Zain

script.py

import pandas as pd

FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}

for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]

print(type(code_carrier_map), code_carrier_map)

print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")

Output

$ python3 script.py 
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'

Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.

Code

import phonenumbers

num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")

Output

num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:

>>> df1
         Number
0    8155555555
1   12465555555
2   12135555555
3   96655555555
4  525555555555

>>> df2
   countryCode   Carrier
0         1246      LIME
1            1      AT&T
2           81  Softbank
3           52    Telmex
4          966      Zain

First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.

The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".

>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
              Carrier
countryCode          
1246             LIME
966              Zain
81           Softbank
52             Telmex
1                AT&T

Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index) ) made from the country codes in descending order to extract the longest code, and we map it to the carrier:

(df1.astype(str)['Number']
    .str.extract('^(%s)'%'|'.join(df2.index))[0]
    .map(df2['Carrier'])
)

output:

0    Softbank
1        LIME
2        AT&T
3        Zain
4      Telmex
Name: 0, dtype: object

NB. to add it to the initial dataframe:

df1['carrier'] = (df1.astype(str)['Number']
                     .str.extract('^(%s)'%'|'.join(df2.index))[0]
                     .map(df2['Carrier'])
                 ).to_clipboard(0)

output:

         Number   carrier
0    8155555555  Softbank
1   12465555555      LIME
2   12135555555      AT&T
3   96655555555      Zain
4  525555555555    Telmex

I can think only of an inefficient solution.

First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.

codes = xl_2.sort_values('countryCode', ascending=False)

Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).

def cc2carrier(num):
    matches = codes['countryCode'].apply(lambda x: num.startswith(x))
    if not matches.any(): #Not found
        return np.nan
    return codes.loc[matches.idxmax()]['Carrier']

Now, apply the function to the numbers dataframe:

xl_1['Number'].apply(cc2carrier)
#1    Softbank
#2        LIME
#3        AT&T
#4        Zain
#5      Telmex
#Name: Number, dtype: object
    

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.

Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx

nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]

Then the matching should be fairly easy with dataframe joins.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM