简体   繁体   中英

Extracting data by regex and writing to CSV, Python glob (pandas?)

I have a large list of varyingly dirty CSVs containing phone numbers in various formats. What I want is to comb through all of them and export to a single-column CSV of cleaned phone numbers in a simple format. So far, I have pieced together something to work, though it has some issues: (partial revision further below)

import csv
import re
import glob
import string

with open('phonelist.csv', 'wb') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = glob.glob('CSVs\*.csv')
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for column in row:
                    s1 = column.strip()
                    if re.match(r'\b\d\d\d\d\d\d\d\d\d\d\b', s1):
                        if s1 not in seen:
                            seen.add(s1)
                            output.append(s1)
                    elif re.search(r'\b\(\d\d\d\) \d\d\d-\d\d\d\d\b', s1):
                        s2 = filter(lambda x: x in string.digits, s1)
                        if s2 not in seen:
                            seen.add(s2)
                            output.append(s2)
    for val in output:
        out_writer.writerow([val])

I'm putting this together with no formal python knowledge, just piecing things I've gleaned on this site. Any advice regarding pythonic stylization or utilizing the pandas library for shortcuts would all be welcome.

First issue: What's the simplest way to filter to just the matched values? IE, I may get 9815556667 John Smith , but I just want the number.

Second issue: This takes forever. I assume it's the lambda part. Is there a faster or more efficient method?

Third issue: How do I glob *.csv in the directory of the program and the CSVs directory (as written)?

I know that's several questions at once, but I got myself halfway there. Any additional pointers are appreciated.


For examples, requested, this isn't from a file (these are multi-gigabyte files), but here's what I'm looking for:

John Smith, (981) 991-0987, 9987765543 extension 541, 671 Maple St 98402
(998) 222-0011, 13949811123, Foo baR Us, 2567 Appleberry Lane
office, www.somewebpage.com, City Group, Anchorage AK
9281239812
(345) 666-7777

Should become:

9819910987
9987765543
9982220011
3949811123
3456667777

(I forgot that I need to drop a leading 1 from 11-digit numbers, too)


EDIT: I've changed my current code to incorporate Shahram's advice, so now, from for column in row above, I have, instead of above:

for column in row:
    s1 = column.strip()
    result = re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)? ?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1) or re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
    if result:
        tempStr = result.group(1)
        for ch in ['(', ')', '-', ' ']:
            tempStr = tempStr.replace(ch, '')
        if tempStr not in seen:
            seen.add(tempStr)
            output.append(tempStr)

This seems to work for my purposes, but I still don't know how to glob the current directory and subdirectory, and I still don't know if my code has issues that I'm unaware of because of my hodge-podge-ing. Also, in my larger directory, this is taking forever - as in, about a gig of CSVs is timing out for me (by my hand) at around 20 minutes. I don't know if it's hitting a snag, but judging by the speed at which python normally chews through any number of CSVs, it feels like I'm missing something.

About your first question, You can use the below regular expression to capture different types of Phone Numbers:

  result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
  if result:
    if result.group(1) not in seen:
       seen.add(result.group(1))
       output.append(result.group(1))

About your second question: You may want to look at the replace function. So the above code can be changed to:

  result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
  if result:
    if result.group(1) not in seen:
       tempStr = result.group(1)
       tempStr.replace('-','')
       seen.add(tempStr)
       output.append(tempStr)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM