Splitting address strings on Sequential-Number strings (2nd, 8th, 3rd, first, etc.)

Question

I've been tasked with standardizing some address information. Toward that goal, I'm breaking the address string into granular values (our address schema is very similar to Google's format).

Progress so far:
I'm using PHP, and am currently breaking out Bldg, Suite, Room #, etc. info.
It was all going great until I encountered floors.
For the most part, the floor info is represented as "Floor 10" or "Floor 86". Nice and easy.
For everything up to that point, I can simply split the string on a keyword ("room", "floor", etc.).

The problem:
But then I noticed something in my test dataset. There are some cases where the floor is represented more like "2nd Floor".
This made me realize that I need to prepare for a whole slew of variations for the FLOOR info.
There are options like "3rd Floor", "22nd floor", and "1ST FLOOR". Then what about spelled-out variants such as "Twelfth Floor"?
Man!! This can become a mess pretty quickly.

My Goal:
I'm hoping someone knows of a library or something that already solves this problem.
In reality, though, I'd be more than happy with some good suggestions/guidance on how one might elegantly handle splitting the strings on such diverse criteria (taking care to avoid false positives such as "3rd St").

Answer 1

First of all, you need an exhaustive list of all possible input formats, and you need to decide how deep you'd like to go. If you treat spelled-out variants as invalid, you can apply simple regular expressions to capture the number and detect the token (room, floor, ...).
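
As a rough sketch of that regex approach: the function name extractNumericFloor and the pattern below are illustrative assumptions, not a tested production solution, and the type declarations assume PHP 7.1 or later. The pattern only fires when the literal token "floor" is present, which is what keeps "3rd St" from matching.

```php
<?php
// Hypothetical helper: pull a numeric floor out of an address fragment,
// accepting both "Floor 10" and "2nd Floor" styles.
function extractNumericFloor(string $address): ?int
{
  // Either "<digits><optional ordinal suffix> floor" or "floor <digits>".
  $pattern = '/\b(?:(\d+)(?:st|nd|rd|th)?\s+floor|floor\s+(\d+))\b/i';

  if (preg_match($pattern, $address, $m)) {
    // Group 1 is filled for "2nd Floor"; otherwise it is empty and
    // group 2 holds the digits from "Floor 10".
    return (int) ($m[1] !== '' ? $m[1] : $m[2]);
  }
  return null; // no numeric floor found
}

var_dump(extractNumericFloor('100 3rd St, 22nd floor'));
var_dump(extractNumericFloor('45 3rd St'));
```

Spelled-out variants such as "Twelfth Floor" return null here, so they can be routed to a separate list for manual review.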

Answer 2

I would start by reading up on regex in PHP. For example:

$floorarray = preg_split("/\sfloor\s/i", $floorstring);

Other useful functions are preg_grep, preg_match, etc.
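
To give a feel for how these functions behave, here is a small sketch. The sample strings and variable names are made up for illustration. It uses \b word boundaries instead of requiring whitespace on both sides of "floor", on the assumption that "floor" may sit at the very start or end of the string.

```php
<?php
// Invented sample input.
$address = '100 Main St 2nd Floor Suite 5';

// Split around the word "floor", swallowing the surrounding whitespace.
// PREG_SPLIT_NO_EMPTY drops empty pieces when "floor" starts or ends the string.
$parts = preg_split('/\s*\bfloor\b\s*/i', $address, -1, PREG_SPLIT_NO_EMPTY);
// $parts[0] is the text before "Floor", $parts[1] the text after it.

// preg_grep keeps only the array entries that match a pattern
// (original keys are preserved, hence array_values below).
$candidates = array('2nd Floor', '3rd St', 'Floor 10');
$withFloor  = array_values(preg_grep('/\bfloor\b/i', $candidates));

print_r($parts);
print_r($withFloor);
```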

Edit: added a more complete solution.

This solution takes as input a string describing the floor. It can be in various formats, such as:

Floor 102
Floor One-hundred two
Floor One hundred and two
One-hundred second floor
102nd floor
102ND FLOOR
etc

Until I can look at an example input file, I am just guessing from your post that this will be adequate.

<?php

$errorLog = 'error-log.txt'; // a file to catalog bad entries with bad floors

// These are a few example inputs
$addressArray = array('Fifty-second Floor', 'somefloor', '54th floor', '52qd floor',
  'forty forty second floor', 'five nineteen hundredth floor', 'floor fifty-sixth second ninth');

foreach ($addressArray as $id => $address) {
  $floor = parseFloor($id, $address);
  if ( empty($floor) ) {
    error_log('Entry '.$id.' is invalid: '.$address."\n", 3, $errorLog);
  } else {
    echo 'Entry '.$id.' is on floor '.$floor."\n";
  }
}

function parseFloor($id, $address)
{
  $floorString = implode(preg_split('/(^|\s)floor($|\s)/i', $address));

  // digits, at most one ordinal suffix, optional surrounding whitespace
  if ( preg_match('/^\s*(\d+)(st|nd|rd|th)?\s*$/i', $floorString, $matchArray) ) {
    // floorString contained a valid numerical floor
    $floor = $matchArray[1];
  } elseif ( ($floor = word2num($floorString)) != FALSE ) { // note assignment op not comparison
    // floorString contained a valid english ordinal for a floor
    ; // No need to do anything
  } else {
    // floorString did not contain a properly formed floor
    $floor = FALSE;
  }
  return $floor;
}

function word2num( $inputString )
{
  $cards = array('zero',
    'one',    'two',    'three',    'four',     'five',    'six',     'seven',     'eight',    'nine',     'ten',
    'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty');
  $cards[30] = 'thirty';  $cards[40] = 'forty';  $cards[50] = 'fifty'; $cards[60] = 'sixty';
  $cards[70] = 'seventy'; $cards[80] = 'eighty'; $cards[90] = 'ninety'; $cards[100] = 'hundred';
  $ords  = array('zeroth',
    'first',    'second',  'third',      'fourth',     'fifth',     'sixth',     'seventh',     'eighth',     'ninth',      'tenth',
    'eleventh', 'twelfth', 'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth', 'eighteenth', 'nineteenth', 'twentieth');
  $ords[30] = 'thirtieth';  $ords[40] = 'fortieth';  $ords[50] = 'fiftieth';  $ords[60] =  'sixtieth';
  $ords[70] = 'seventieth'; $ords[80] = 'eightieth'; $ords[90] = 'ninetieth'; $ords[100] = 'hundredth';

  // break the string at any whitespace, dash, comma, or the word 'and'
  // (the hyphen sits last in the character class so it is a literal, not a range)
  $words = preg_split( '/([\s,-](?!and\s)|\sand\s)/i', $inputString );

  $sum = 0;
  foreach ($words as $word) {
    $word = strtolower($word);
    $value = array_search($word, $ords); // try the ordinal words
    if (!$value) { $value = array_search($word, $cards); } // try the cardinal words
    if (!$value) {
      // if $value is still false (or 0, for 'zero'/'zeroth'), it's not a usable number word: fail and exit
      return FALSE;
    }
    if ($value == 100) { $sum *= 100; }
    else { $sum += $value; }
  }

  return $sum;
}
?>

In the general case, parsing words into numbers is not easy. The best thread that I could find that discusses this is here. It is not nearly as easy as the inverse problem of converting numbers into words. My solution only works for numbers <2000, and it is not resilient against spelling mistakes at all. It also liberally interprets poorly formed constructs rather than tossing an error. For example:

forty forty second = 82
five nineteen hundredth = 2400
fifty-sixth second ninth = 67

If you have a lot of inputs and most of them are well formed, throwing errors for spelling mistakes is not really a big deal because you can manually correct the short list of problem entries. Silently accepting bad input, however, could be a real problem depending on your application. Just something to think about when deciding if it is worth it to make the conversion code more robust.
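
If silently accepting bad input is a concern, one option is a stricter parser that tracks where it is in the phrase. The following is only a sketch under my own assumptions: strictWord2num is a hypothetical name, it covers 1 to 999 only, it allows an ordinal word only in the final position, and its type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical stricter alternative to word2num(): returns null when malformed.
function strictWord2num(string $input): ?int
{
  $small = array(1 => 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
    'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
    'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen');
  $tens = array(20 => 'twenty', 30 => 'thirty', 40 => 'forty', 50 => 'fifty',
    60 => 'sixty', 70 => 'seventy', 80 => 'eighty', 90 => 'ninety',
    100 => 'hundred');
  $smallOrd = array(1 => 'first', 'second', 'third', 'fourth', 'fifth',
    'sixth', 'seventh', 'eighth', 'ninth', 'tenth', 'eleventh', 'twelfth',
    'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth',
    'eighteenth', 'nineteenth');
  $tensOrd = array(20 => 'twentieth', 30 => 'thirtieth', 40 => 'fortieth',
    50 => 'fiftieth', 60 => 'sixtieth', 70 => 'seventieth',
    80 => 'eightieth', 90 => 'ninetieth', 100 => 'hundredth');

  // word => value lookup tables
  $cardinals = array_flip($small + $tens);
  $ordinals  = array_flip($smallOrd + $tensOrd);

  // "hundred and two" is accepted; "and" anywhere else is an unknown word.
  $text  = preg_replace('/\bhundred\s+and\s+/', 'hundred ', strtolower($input));
  $words = preg_split('/[\s,-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
  $last  = count($words) - 1;

  $total = 0;
  // 0 = start, 1 = after a leading 1-9, 2 = after "hundred",
  // 3 = after a tens word, 4 = phrase complete
  $stage = 0;
  foreach ($words as $i => $word) {
    if (isset($cardinals[$word])) {
      $value = $cardinals[$word];
    } elseif ($i === $last && isset($ordinals[$word])) {
      $value = $ordinals[$word]; // an ordinal may only end the phrase
    } else {
      return null; // unknown word, or ordinal in the middle
    }

    if ($value === 100) {
      if ($stage !== 1) { return null; } // "hundred" needs a leading 1-9
      $total *= 100;
      $stage = 2;
    } elseif ($value >= 10) {
      // ten..nineteen and tens words may only start the phrase or follow "hundred"
      if ($stage !== 0 && $stage !== 2) { return null; }
      $total += $value;
      $stage = ($value >= 20) ? 3 : 4;
    } else {
      // 1-9 may start the phrase, or follow "hundred" or a tens word
      if ($stage === 1 || $stage === 4) { return null; }
      $total += $value;
      $stage = ($stage === 0) ? 1 : 4;
    }
  }
  return $total > 0 ? $total : null;
}
```

With this version, "forty forty second", "five nineteen hundredth", and "fifty-sixth second ninth" are rejected instead of being mapped to 82, 2400, and 67, while "Fifty-second" and "One hundred and two" still parse. The trade-off is that informal but common forms such as "nineteen hundred" are rejected too.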
