Extract address from string

Question

Let's say I have this string:

<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>

or this string:

<div>sky being blue? in the world is true? 024 Brea Mall  Brea, California 92821 jackfroast nipping on the firehead</div>

How would I go about extracting the address from one of these strings? This would involve some sort of Regex, right?

I've tried looking online for a solution using JavaScript or PHP, but to no avail. And no other post here on Stack Overflow (as far as I know) provides a solution that uses jQuery and/or JavaScript and/or PHP. (The closest is Parse usable Street Address, City, State, Zip from a string, which DOESN'T have any code in the thread about extracting a postal code from a string.)

Can somebody point me in the right direction? How would I go about accomplishing this in jQuery or JavaScript or PHP?

Answer 1

Tried this on twelve different strings that were similar to yours and it worked just fine:

function str_to_address($context) { 

    $context_parts = array_reverse(explode(" ", $context)); 
    $zipKey = ""; 
    foreach($context_parts as $key=>$str) { 
        if(strlen($str)===5 && is_numeric($str)) { 
            $zipKey = $key;
            break; 
        }
    }

    // no zip found? bail out, array_slice() needs an integer offset
    if(!is_int($zipKey)) return ""; 

    $context_parts_cleaned = array_slice($context_parts, $zipKey); 
    $context_parts_normalized = array_reverse($context_parts_cleaned); 
    $houseNumberKey = ""; 
    foreach($context_parts_normalized as $key=>$str) { 
        if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) { 
            $houseNumberKey = $key;
            break; 
        }
    }

    $address_parts = array_slice($context_parts_normalized, $houseNumberKey);
    $string = implode(' ', $address_parts);
    return $string;
}

This assumes a house number of two to five digits. It also assumes that the zip code isn't in the "expanded" form (e.g. 12345-6789). However, this can easily be modified to fit that format (regex would be a good option here, something like (\\d{5}-\\d{4})).
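For readers working in JavaScript, here is a minimal sketch of a zip check that accepts both the plain and the "expanded" form. The name findZip is made up for illustration, and the pattern only checks the shape of the token, not whether the zip actually exists. Note that a five-digit house number appearing earlier in the text would be returned first.

```javascript
// Matches a 5-digit zip, optionally followed by a "-1234" extension.
// The \b anchors stop the pattern from matching inside longer digit runs
// such as unformatted phone numbers.
const ZIP_PATTERN = /\b\d{5}(?:-\d{4})?\b/;

// Hypothetical helper: returns the first zip-shaped token, or null.
function findZip(text) {
  const match = text.match(ZIP_PATTERN);
  return match ? match[0] : null;
}

console.log(findZip("btw 8240 E. Marblehead Way 92808 is also"));
```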

But using regex to parse user-inputted data as a whole... Not a good idea here, because we just don't know what a user is going to input; as far as one can assume, there was no validation.

Walking through the code and logic, starting with creating the array from the context and grabbing the zip:

// split the context (for example, a sentence) into an array, 
// so we can loop through it. 
// we reverse the array, as we're going to grab the zip first. 
// why? we KNOW the zip is 5 characters long*.
$context_parts = array_reverse(explode(" ", $context));  

// we're going to store the array index of the zip code for later use 
$zipKey = ""; 

// foreach iterates over an array, 
// in this case it's like doing... 
// for each value of $context_parts ($str), and each index ($key)
foreach($context_parts as $key=>$str) { 

    // if $str is 5 chars long, and numeric... 
    // an incredibly lazy check for a zip code...
    if(strlen($str)===5 && is_numeric($str)) {  
        $zipKey = $key;

        // we have what we want, so we can leave the loop with break
        break; 
    }
}

Do some tidying so we have a better array to grab the house number from:

// no zip found? bail out, array_slice() needs an integer offset
if(!is_int($zipKey)) return ""; 

// remove junk from $context_parts, since we don't 
// need stuff after the zip
$context_parts_cleaned = array_slice($context_parts, $zipKey); 

// since the house number comes first, let's go back to the start
$context_parts_normalized = array_reverse($context_parts_cleaned);

And then let's grab the house number, using the same basic logic as we did for the zip code:

$houseNumberKey = ""; 
foreach($context_parts_normalized as $key=>$str) { 
    if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) { 
        $houseNumberKey = $key;
        break; 
    }
}

// we probably have the parts we need for the address.
// let's do some more cleaning 
$address_parts = array_slice($context_parts_normalized, $houseNumberKey);

// and build the string again, from the address
$string = implode(' ', $address_parts);

// and return the string
return $string;
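Since the question also asks about JavaScript, here is a rough port of the same idea. It is a sketch, not a line-by-line translation: strToAddress is a made-up name, it splits on any whitespace (so double spaces collapse), it only accepts tokens made purely of digits, and it returns null when no zip or no house number is found instead of falling back to the zip alone.

```javascript
// Hypothetical JavaScript port of str_to_address(); same heuristics, same limits.
function strToAddress(context) {
  // Split on whitespace and drop empty pieces.
  const parts = context.split(/\s+/).filter(Boolean);
  const isDigits = (s) => /^\d+$/.test(s);

  // Walk backwards: the last 5-digit token is assumed to be the zip.
  let zipIndex = -1;
  for (let i = parts.length - 1; i >= 0; i--) {
    if (parts[i].length === 5 && isDigits(parts[i])) {
      zipIndex = i;
      break;
    }
  }
  if (zipIndex === -1) return null;

  // Walk forwards: the first 2-5 digit token before the zip is assumed
  // to be the house number.
  let houseIndex = -1;
  for (let i = 0; i < zipIndex; i++) {
    if (parts[i].length > 1 && parts[i].length < 6 && isDigits(parts[i])) {
      houseIndex = i;
      break;
    }
  }
  if (houseIndex === -1) return null;

  // Everything from the house number through the zip is the address.
  return parts.slice(houseIndex, zipIndex + 1).join(" ");
}

console.log(strToAddress("<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>"));
```

Like the PHP version, this will pick up any unrelated 2-5 digit number that appears before the real house number.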

Answer 2

Regular expressions are used to test against patterns. You need to know what pattern you're looking for. From the two examples you provided, I would look for a number, then some text, ending with a five-digit number.

All the addresses would have to be in this format. You can't magically just extract addresses from a string.
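A minimal sketch of that pattern in JavaScript, assuming the house number has one to five digits and the address sits on a single line (the name extractAddress and the exact pattern are illustrative choices, not something from the question):

```javascript
// A number of 1-5 digits, whitespace, as little text as possible,
// then a standalone five-digit number.
const ADDRESS_PATTERN = /\b\d{1,5}\s.*?\b\d{5}\b/;

// Hypothetical helper: returns the first matching span, or null.
function extractAddress(text) {
  const match = text.match(ADDRESS_PATTERN);
  return match ? match[0] : null;
}

console.log(extractAddress("<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>"));
```

Because the leading number is capped at five digits and must be followed by whitespace, a long unformatted phone number in front of the address is skipped. Any text that does not follow the number-text-zip shape will simply not match.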

Answer 3

If all your addresses start and end with numbers, you can use this regular expression to extract the data you need:

/[0-9].+[0-9]/gi

JavaScript example:

"<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>".match(/[0-9].+[0-9]/gi) // ["8240 E. Marblehead Way 92808"]
"<div>sky being blue? in the world is true? 024 Brea Mall  Brea, California 92821 jackfroast nipping on the firehead</div>".match(/[0-9].+[0-9]/gi) // ["024 Brea Mall  Brea, California 92821"]

For the new example, which contains a phone number, you can use:

/[0-9].*[0-9]/gi

JavaScript example:

"john doe 7143138656 is 8240 e marblehead way 92808".match(/[0-9].*[0-9]/gi) // ["7143138656 is 8240 e marblehead way 92808"]

But this will only help if you have one match per line. If you really need a powerful address matcher, you will need to go further and build a more powerful analysis.

You can begin by searching the text for target keywords, then filter the paragraph, and then strip out the info you are looking for.

It's not an easy question, but it can be done. You can use more than one regexp for some matches, but if the addresses don't follow a pattern, regexp will be useless, and at that point you will need to change your approach.
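As a small illustration of using more than one regexp, here is a two-pass sketch for the phone number example. It assumes that any run of seven or more digits is a phone number rather than part of an address; that threshold and the function name are assumptions made for this example, and formatted numbers such as 714-313-8656 would not be caught.

```javascript
// Hypothetical two-pass helper.
function extractIgnoringLongNumbers(text) {
  // Pass 1: blank out digit runs of 7 or more (assumed to be phone numbers).
  const withoutPhones = text.replace(/\d{7,}/g, " ");
  // Pass 2: the digit-to-digit pattern from above.
  const match = withoutPhones.match(/[0-9].+[0-9]/);
  return match ? match[0] : null;
}

console.log(extractIgnoringLongNumbers("john doe 7143138656 is 8240 e marblehead way 92808"));
```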

Answer 4

It is a common "mistake" to try and parse everything with regular expressions because they are convenient. However, regular expressions are not the answer to everything. In this case it doesn't look like you are looking for regular patterns in text, but rather "natural" expressions someone would write as if they were talking to you. These natural expressions won't necessarily follow any consistent pattern at all. Some people put apt numbers first and then the building number, some people leave out the city and skip to the zip code, some people might put city, state, country THEN zip. It just won't be possible to enumerate every possible regex pattern that someone could cook up for an address.

For natural language addresses I would forget regex address detection and move towards a stateful parsing algorithm.

I would start by reading the text from left to right (at least in English), one word at a time. At each word you would do one logical test: "could this word be the start of an address?" I would suppose this is a number, either a building number or an apt/unit/box number (so "Box XXX", "PO BOX XXX", "PO XXX", "Unit XXX", "#XXX", or any number less than 6 digits in length). While I don't know this to be factually true, I've never seen a North American building number 7 digits in length, which is the minimum for a phone number. So I suspect you could sort out phone numbers vs. building numbers fairly easily. This "start of address" test could be a set of regex matches, but we're not matching the whole address, just testing for words or phrases that start an address. I'd probably even say it'd be simpler without regex matching.
Once you've detected the start of an address, you create an "address parsing state object" (some class you use to hold the address as you continue parsing, keeping track of what you have so far and what you expect next). Now you can continue stepping through the sentence and adding to your parser state object. Following a building number, I'd probably expect a street name or a directional indicator (N, E, W, S, NE, NW, SE, SW). If neither of those comes next, stop your address parsing, assume an invalid or incomplete address, and keep looking for new start-of-address words. Otherwise add the street name and/or directional indicators to your parse tree and keep going!
Anything following a street name could be infinitely variable. Some users may just stop at building number and street name (assuming their local city/region/country). Otherwise you are probably looking for either a city name or a postal code/zip code. If found, add it to your address parsing state object; if not, assume an incomplete address (fill with user default location info?) or an invalid address (ignore and continue looking for another start of address?).
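To make the idea concrete, here is a heavily simplified sketch of such a stateful parser. Everything in it is an assumption made for illustration: the tiny word lists, the cap of four words per section, the rule that a street ends at a known street-type word, and the fact that only plain building numbers (not "PO Box" or "Unit") start an address.

```javascript
// Tiny, hypothetical word lists; a real parser would need much larger ones.
const DIRECTIONS = new Set(["n", "e", "w", "s", "ne", "nw", "se", "sw"]);
const STREET_TYPES = new Set(["st", "street", "ave", "avenue", "way", "rd", "road", "blvd", "dr", "drive", "mall"]);
const MAX_WORDS = 4; // arbitrary cap on street-name and city/state words

const normalize = (word) => word.toLowerCase().replace(/[^a-z0-9]/g, "");
const isBuildingNumber = (t) => /^\d{1,5}$/.test(t); // fewer than 6 digits
const isZip = (t) => /^\d{5}$/.test(t);
const isWord = (t) => /^[a-z]+$/.test(t);

function parseAddresses(text) {
  const words = text.split(/\s+/).filter(Boolean);
  const found = [];
  let state = null; // the "address parsing state object"
  let i = 0;

  while (i < words.length) {
    const word = words[i];
    const token = normalize(word);

    if (state === null) {
      // Start-of-address test: only a building number is recognized here.
      if (isBuildingNumber(token)) {
        state = { number: token, direction: null, street: [], locality: [], zip: null, expecting: "street" };
      }
      i++;
    } else if (state.expecting === "street") {
      if (STREET_TYPES.has(token)) {
        state.street.push(word);
        state.expecting = "locality";
        i++;
      } else if (DIRECTIONS.has(token) && state.direction === null && state.street.length === 0) {
        state.direction = word;
        i++;
      } else if (isWord(token) && state.street.length < MAX_WORDS) {
        state.street.push(word);
        i++;
      } else {
        state = null; // not an address after all; re-test this word as a new start
      }
    } else if (isZip(token)) {
      state.zip = token;
      found.push(state); // complete address
      state = null;
      i++;
    } else if (isWord(token) && state.locality.length < MAX_WORDS) {
      state.locality.push(word); // unverified city/state words
      i++;
    } else {
      found.push(state); // incomplete: street found, but no zip
      state = null;      // re-test this word as a new start
    }
  }
  // Text ended after a street but before a zip: keep it as incomplete.
  if (state !== null && state.expecting === "locality") found.push(state);
  return found;
}

console.log(parseAddresses("john doe 7143138656 is 8240 e marblehead way 92808"));
```

The state object carries an expecting field, so each word is only tested against what may legally come next. Extending the parser means adding states and word lists rather than growing one large regex.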

Ultimately this approach could be one fairly simple JavaScript method with maybe a couple hundred lines of code (I'm not a PHP guy, but I assume it'd be similar). If you were to try and enumerate every possible regex pattern someone could construct an address with, you'd have hundreds of those alone, and it'd still be unreliable! (It would probably be slow too, if you are trying to match hundreds of regex patterns.)

Answer 5

My thinking is that you should have something to tell your code that 'from here to here is an address and the rest is simple text'. For that, either make an array of addresses or keep the addresses in a database, so that you can compare them with your inserted values.

Answer 6

I've had the best luck using the Google Geocode API. It takes away the difficulty of trying to think of every possible way an address string may be input.

I recently had to extract parts of an address from a single string for a real estate website, and I found that the best option was to use the Google Geocode API. It allowed me to get Street, City, State, Zip, Latitude, Longitude, and more for every address entered.

I found a great guide on getting set up with the Google Geocode API (PHP) here: http://www.andrew-kirkpatrick.com/2011/10/google-geocoding-api-with-php/

The best part: it even works with names of places. So a search for 'UCLA' or 'Apple Headquarters' will give you all the parts of an address that you might need.
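For a JavaScript flavour of the same idea, here is a rough sketch of calling the Geocoding web service and picking the address parts out of the JSON response. The function names are made up for this example. It assumes you have an API key, that a global fetch is available (modern browsers or a recent Node.js), and that the response has the status / results / address_components shape described in Google's documentation; check the current docs before relying on it.

```javascript
// Endpoint of the Google Geocoding web service (JSON output).
const GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json";

// Builds the request URL; URLSearchParams takes care of the encoding.
function buildGeocodeUrl(address, apiKey) {
  const params = new URLSearchParams({ address, key: apiKey });
  return `${GEOCODE_URL}?${params}`;
}

// Picks a few fields out of an already-parsed response body.
// Returns null when the service reports no usable result.
function extractParts(response) {
  if (response.status !== "OK" || response.results.length === 0) return null;
  const best = response.results[0];
  const find = (type) => {
    const component = best.address_components.find((c) => c.types.includes(type));
    return component ? component.long_name : null;
  };
  return {
    streetNumber: find("street_number"),
    street: find("route"),
    city: find("locality"),
    state: find("administrative_area_level_1"),
    zip: find("postal_code"),
    lat: best.geometry.location.lat,
    lng: best.geometry.location.lng,
  };
}

// Hypothetical wrapper: performs the request and returns the extracted parts.
async function geocode(address, apiKey) {
  const res = await fetch(buildGeocodeUrl(address, apiKey));
  return extractParts(await res.json());
}
```

Calling geocode("8240 E. Marblehead Way 92808", YOUR_KEY) would return a promise for an object with the fields above, or null when nothing was found. The surrounding junk text still has to be trimmed off before the request, so this complements the extraction approaches above rather than replacing them.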

6 answers

solution1
21 ACCEPTED 2013-01-02 00:16:40

solution2
2 2012-12-30 00:15:15

solution3
2 2012-12-30 13:34:44

solution4
1 2013-01-01 20:16:44

solution5
0 2013-01-01 10:41:00

solution6
0 2013-07-17 04:48:02
