简体   繁体   中英

Python Regex (Pattern Matching) phone number start with area code search

I tried to write logic to extract the phone number which starts with certain area codes (first 3 digits of the phone number) in a given data file. My code is:

import re

data = "This is the sample test data. 2247279133 224dfa7279133 dhana 5107279133 subha 123456789 "

pattern = re.compile(r"((224|510)\d{7})")
matches = pattern.findall(data)

for match in matches:
    print (match[0])

I am getting the expected output as below:

2247279133

5107279133

Though I am getting the expected output I would like to know below things:

  1. Is this an efficient way?
  2. Is it possible to pass the list of area codes as an array variable instead of hardcoding ( 224|510 )?
  3. what is the recommended way to search these kinds of phone numbers over large file of 10 GB?

First an observation concerning your regular expression. As you currently have specified your regular expression, you would match 2247279133 in the following string:

This is sample test data: 1232247279133987 224dfa7279133 etc.

Is that what you would want? Perhaps you might require that the phone number be separated by white space on either side if not at the start or end of the string:

pattern = re.compile(r"(?:\A|\s)((224|510)\d{7})(?=\Z|\s)")

Now to answer your questions:

  1. My understanding is that the regular expression engine in Python relies on backtracking and is therefore not the most efficient method of finding your phone numbers in theory . You could do better if you hand-coded a finite state automaton to recognize the specific phone numbers you are looking for. But this may not gain you anything because this would be running as interpreted Python code versus the regular expression engine that is running as C-language code. Also, your regular expression is not overly complicated requiring a lot of backtracking. Finally, if you wanted to handle an array of area codes provided at run time, you would need to build the FSA at run time. This algorithm to do this is fairly complicated and probably a bigger task than you would want to take on. The bottom line is that what you are doing is probably the most efficient approach you can take in practice .

  2. Yes. The string you pass to re.compile can be created at run time in part by calling '|'.join(area_codes) , where area_code s is a list of area codes.

  3. You should investigate using the Unix/Linux grep utility as a first-step filter (grep unfortunately outputs the entire line containing the pattern you are looking for) if you are not satisfied with the Python-only performance.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM