Using Python pandas to identify valid mobile numbers

I receive a daily MIS report with the fields Name, Number and Location. There are about 100 rows per day. First I have to check whether each number has 10 digits; if the Number field has only 1 to 9 digits, I have to remove that entry from the MIS.

Only 10-digit numbers, optionally preceded by +91, are valid. At the moment I remove the invalid numbers in Excel every day, entirely by hand.

Next, I have to send the valid numbers to 2 branches: 50% of the valid numbers go to the 1st branch and 50% to the 2nd branch.

The 1st branch has two people, so the valid entries must again be shared equally between them. For example, if 60 of the 100 rows are valid, the 1st branch gets 30 valid numbers and each of the two people gets 15.

The 2nd branch has three people; it also gets 30 valid numbers, and each of the three gets 10.

Any help would be appreciated.

Here is my code.

import pandas as pd
import numpy as np
df = pd.read_csv('/home/desktop/Desktop/MIS.csv')
df
      Name        Number Location
0   Jayesh        980000     Pune
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
6   Naresh          9000     Pune


df['Number']=df['Number'].astype(str).apply(lambda x: np.where((len(x)<=10)))

Use:

df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})')

Output:

0    False
1     True
2     True
3     True
4     True
5     True
6    False
Name: Number, dtype: bool

Update

Use this boolean Series to filter (the `as_indexer` argument was removed from `str.match` in later pandas releases, so it is omitted here):

df_filtered = df[df['Number'].astype(str).str.match(r'(\+)*(91)*(\d{10})')]


      Name        Number Location
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
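
One caveat with the pattern above: `str.match` anchors only at the start of the string, so a value with extra trailing digits would still pass. Here is a minimal sketch of a stricter check, assuming pandas 1.1 or later (for `Series.str.fullmatch`). The sample values and the `pattern` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values, stored as strings so a leading "+" survives.
numbers = pd.Series(
    ["980000", "9890989090", "+919000000000", "919999999999", "98909890901234"]
)

# Optional "+91" or "91" prefix, then exactly 10 digits, nothing else.
pattern = r"(?:\+?91)?\d{10}"

loose = numbers.str.match(r"(\+)*(91)*(\d{10})")  # anchored at the start only
strict = numbers.str.fullmatch(pattern)           # must match the whole string

print(pd.DataFrame({"number": numbers, "loose": loose, "strict": strict}))
```

The last value has 14 digits: the loose pattern accepts it because its first 10 digits match, while `fullmatch` rejects it.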

It's tempting to convert your numbers to strings and then perform your comparisons. However, this isn't necessary and will typically be inefficient. You can use regular Boolean comparisons on the integers directly. Note that the two masks below keep only 12-digit numbers that start with 91, so plain 10-digit numbers are filtered out as well:

m1 = (np.log10(df['Number']).astype(int) + 1) == 12
m2 = (df['Number'] // 10**10) == 91

df_filtered = df[m1 & m2]

print(df_filtered)

      Name        Number Location
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
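
If plain 10-digit numbers should also count as valid, as the question describes, the same integer-only idea can be extended with range checks. This is a sketch under that assumption, not the answer's original method; the names `ten_digits`, `with_prefix` and `df_valid` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Jayesh", "Ajay", "Manish", "Vikram", "Prakash", "Rakesh", "Naresh"],
    "Number": [980000, 9890989090, 9999999999, 919000000000,
               919999999999, 919999999998, 9000],
    "Location": ["Pune", "Mumbai", "Pune", "Pune", "Mumbai", "Mumbai", "Pune"],
})

num = df["Number"]

# Exactly 10 digits: 1_000_000_000 .. 9_999_999_999 (bounds inclusive)
ten_digits = num.between(10**9, 10**10 - 1)

# 91 followed by exactly 10 digits: 910_000_000_000 .. 919_999_999_999
with_prefix = num.between(91 * 10**10, 92 * 10**10 - 1)

df_valid = df[ten_digits | with_prefix]
print(df_valid)
```

A numeric column cannot hold a literal "+", so this approach only applies when the prefix is stored as the digits 91.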

To assign NaN to values that do not start with 91 and are not exactly 10 digits long:

df['Number'] = df['Number'].astype(str)  # the .str accessor needs string values
df.loc[~df['Number'].str.startswith('91') & (df['Number'].str.len() != 10), 'Number'] = np.nan
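
Marking the invalid entries as NaN covers only part of what the question asks, since the rows still have to be removed. A minimal sketch of that follow-up step, assuming the column already holds strings and using a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small made-up frame with Number stored as strings.
df = pd.DataFrame({
    "Name": ["Jayesh", "Ajay", "Vikram"],
    "Number": ["980000", "9890989090", "919000000000"],
})

# Same rule as above: not starting with 91 and not exactly 10 characters long.
invalid = ~df["Number"].str.startswith("91") & (df["Number"].str.len() != 10)
df.loc[invalid, "Number"] = np.nan

# Drop the rows whose Number was marked as missing.
df_clean = df.dropna(subset=["Number"]).reset_index(drop=True)
print(df_clean)
```

This rule keeps any value that starts with 91, whatever its length, so it is looser than the regex-based answers.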

If your data looks like the example, the following should meet your requirement.

DataFrame:

>>> df
      Name        Number Location
0   Jayesh        980000     Pune
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai
6   Naresh          9000     Pune

Result:

Using str.match:

>>> df[df.Number.astype(str).str.match(r'^(\d{10}|\d{12})$')]
      Name        Number Location
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai

Or, more loosely (this pattern also accepts 11-digit numbers):

>>> df[df.Number.astype(str).str.match(r'^[0-9]{10,12}$')]
      Name        Number Location
1     Ajay    9890989090   Mumbai
2   Manish    9999999999     Pune
3   Vikram  919000000000     Pune
4  Prakash  919999999999   Mumbai
5   Rakesh  919999999998   Mumbai

I suggest using the following regex pattern:

^\+91\d{10}$|^91\d{10}$|^\d{10}$

This assumes there are no spaces or brackets in your Number column. The pattern ensures the digit part is exactly 10 digits long (no more, no less) and allows it to be preceded by either +91 or 91.

To build a filtered DataFrame (in a raw string the backslashes must be single, otherwise `\\+` matches literal backslashes):

dff = df[df['Number'].astype(str).str.match(r'^\+91\d{10}$|^91\d{10}$|^\d{10}$')]
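
For the second part of the question (sharing the valid rows between two branches, then between the people in each branch), here is a minimal sketch. The helper `split_evenly`, the `branches` mapping and the 60-row stand-in frame are all hypothetical; in practice `df_valid` would be the filtered frame from the step above.

```python
import numpy as np
import pandas as pd

def split_evenly(frame, n_parts):
    """Split frame into n_parts consecutive chunks whose sizes differ by at most one row."""
    positions = np.array_split(np.arange(len(frame)), n_parts)
    return [frame.iloc[p] for p in positions]

# Stand-in for the filtered frame: 60 made-up valid numbers.
df_valid = pd.DataFrame({"Number": [9000000000 + i for i in range(60)]})

# Hypothetical layout: branch name -> number of people in that branch.
branches = {"branch_1": 2, "branch_2": 3}

assignments = {}
for (branch, n_people), branch_rows in zip(
    branches.items(), split_evenly(df_valid, len(branches))
):
    # Share this branch's rows between its people.
    for person, rows in enumerate(split_evenly(branch_rows, n_people), start=1):
        assignments[(branch, person)] = rows

for key, rows in assignments.items():
    print(key, len(rows))
```

With 60 valid rows this gives 15 and 15 in the first branch and 10, 10 and 10 in the second, matching the example in the question. When the count does not divide evenly, the earlier chunks receive one extra row each.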
