
How to extract all the address information from text?

Using Nutch, I crawled a URL, scraped the data, and dumped the output as text. From that text I want to extract only the address information. How can I do this?

If I am not mistaken, a regex alone won't be enough in this case; I would need a regex followed by some code logic.

Can anyone help me solve this problem?

Thanks in advance.

Pastebin URL for the sample text: http://pastebin.com/n8Eftp2K

Sample text:

Recno:: 0
URL:: http://hiltongardeninn3.hilton.com/en/hotels/alabama/hilton-garden-inn-auburn-opelika-AUOAPGI/offers/index.htm

ParseText::
Hotels Auburn, AL - Hilton Garden Inn Auburn Opelika Deals Skip to Content My Reservations My Reservations View Promotions Sign In Join       /   My Account Sign Out Show Sign In Form     View/change a specific reservation: Your Confirmation # Last Name Find   Find - OR - Sign in to view all reservations in your account Existing Travel Reservations: Hotel + Air + Car Reservation Air Itinerary Car Rental Details Close Your Next Stay: See all     Digital Key Offered View/Edit Go To My Account Page Update your password regularly to keep your account safe. Your Last Stay: See all     View Receipt Book Again Join Hilton HHonors™ Upgrade your account and earn points at over 3,600 hotels in 82 countries around the world. Join HHonors       |   |                                                  Close Search GI Skip brand navigation Find a Hotel Offers Meetings and Events About Hilton Garden Inn skip form Where are you going? City, airport, address, attraction, or hotel Arrival You are now focused on a datepicker field. Press the down arrow to enter the calendar table. Once focused on the table, press left or right to navigate days. Press up or down to navigate between weeks. Enter to select. Escape to close datepicker. Your arrival date must be within the next year.   Departure You are now focused on a datepicker field. Press the down arrow to enter the calendar table. Once focused on the table, press left or right to navigate days. Press up or down to navigate between weeks. Enter to select. Escape to close datepicker. Your departure date must be within 4 months after your arrival date.   
Use flexible dates Use HHonors Points Rooms Adults (18+) Children Rooms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26+ Adults in Room 1 1 2 3 4 Children in Room 1 0 1 2 3 4 Room 2 : Adults in Room 2 1 2 3 4 Children in Room 2 0 1 2 3 4 Room 3 : Adults in Room 3 1 2 3 4 Children in Room 3 0 1 2 3 4 Room 4 : Adults in Room 4 1 2 3 4 Children in Room 4 0 1 2 3 4 Room 5 : Adults in Room 5 1 2 3 4 Children in Room 5 0 1 2 3 4 Room 6 : Adults in Room 6 1 2 3 4 Children in Room 6 0 1 2 3 4 Room 7 : Adults in Room 7 1 2 3 4 Children in Room 7 0 1 2 3 4 Room 8 : Adults in Room 8 1 2 3 4 Children in Room 8 0 1 2 3 4 Room 9 : Adults in Room 9 1 2 3 4 Children in Room 9 0 1 2 3 4 *Best Price Guarantee | More Options Less Options Add special rate codes (AAA, AARP, etc) Find it   Find it Check Rooms & Rates Promotion/Offer code: Group code: Corporate account: Travel agent AAA rate * AARP rate * Senior rate * Government / Military rates * * ID required at check-in Close Close close tab panel Offers Discover award-winning service, thoughtful amenities and great deals at HGI hotels. View all Offers Earn 2X Points Book early and save Bed N Breakfast Deal End of tab panel close tab panel Meetings & Events Host your next meeting or special event in one of our newly renovated hotels. Meetings & Events Meetings Weddings Planning Tools Small Meeting Packages End of tab panel close tab panel About Hilton Garden Inn We’re here to help you be successful with great service and complimentary amenities. Learn More about HGI Locations Search more than 640 Hilton Garden Inn hotels worldwide to find the right one for your next trip. See our locations New Hotels See our new hotels and find out where we’re scheduled to open soon. 
See all new hotels End of tab panel   menu_item_property_offers AUOAPGI Hilton Garden Inn Auburn/Opelika 2555 Hilton Garden Drive , Auburn , Alabama , 36830 , USA TEL: +1-334-502-3500 FAX: +1-334-502-3572 Skip secondary navigation Hotel Home Hotel Details Amenities & Services Maps & Directions Rooms & Suites Plan an Event Special Offers Dining Things To Do   Not what you're looking for? Find Nearby Hotels Our Promise We promise to do whatever it takes to ensure you're satisfied, or you don't pay. You can count on us. GUARANTEED. The Hilton Garden Inn Promise reflects our focus on hospitality and integrity. We are committed to providing an excellent hotel experience for every guest, every time. Not what you're looking for? Find Nearby Hotels HHonors Reward Category: 7   Find a Special Offer Arrival   Departure   Find Offers Find Offers Hotel Information Check-in: 3:00 pm Check-out: 12:00 pm Smoking: Non-Smoking A fee of up to 250 USD will be assessed for smoking in a non-smoking room. Please ask the Front Desk for locations of designated outdoor smoking areas. Parking: Self parking: (Complimentary) Valet: Not Available Pets: Service animals allowed: Yes Pets allowed: No Hotel Policies Where we are Find where we are located View the Maps & Directions Page Share Print Special Offers Sort by:   Brand Book Date Compare offer Premium Wi-Fi Compare this offer , You can compare up to 4 offers Premium Wi-Fi Boost your speed with Premium Internet access <

Code:

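import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Loose candidate pattern: optional "No"/"P.O. Box" prefix, digits, a few words, then a 2-5 digit number.
private static final Pattern regexPreciseData = Pattern.compile("(?:(no|NO|No|(?:(p|P)?\\.?\\s?(o|O)?\\.?\\s?((B|b)(o|O)(x|X))))?(?:\\s?\\.?\\#?\\/?\\:?\\-?\\s?)[0-9]\\/?\\:?)+(?:\\,?\\s?\\w+\\s?){2,5}(?:\\s?\\w+\\s?)\\-?\\s?(\\d{2,5})", Pattern.CASE_INSENSITIVE);

utils.getText(sb, root); // extract the page text into sb
text = sb.toString();
sb.setLength(0); // reuse the buffer to collect the matches
Matcher addressPattern = regexPreciseData.matcher(text);
while (addressPattern.find()) {
    // group() returns the matched substring, i.e. text.substring(start(), end())
    sb.append(addressPattern.group()).append('\n'); // newline keeps matches from running together
}
text = sb.toString();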

As the code above shows, I match the text against the regex pattern and keep only the address information. However, my regex also matches some irrelevant data. Several people have suggested that a regex alone can't be accurate enough to extract address information from the text.

Use an Information Extraction (IE) library or framework to detect addresses. There has been a lot of work on this, and it will generally work better than hand-writing regular expressions over raw text.

You could, for instance, leverage GATE, which comes with ANNIE, a simple IE pipeline that can extract addresses. One way of doing this would be to use Behemoth and either run GATE within it or export to the GATE format so that you can run GATE separately. You could also piggyback on the code in the GATE module for Behemoth and write a custom parser for Nutch so that the extraction is done within Nutch.
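To illustrate why such a pipeline tends to beat a bare regex, here is a minimal sketch in plain Java (it does not use GATE's API). It mimics, very roughly, the idea of combining a lookup list (a gazetteer) with a pattern rule: a loose pattern proposes candidates, and a candidate is kept only if its state is in the list. The class name `AddressSketch`, the tiny `STATES` set, and the assumed "street , city , state , ZIP" layout are all hypothetical and chosen to fit the sample text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressSketch {
    // Hypothetical mini-gazetteer; a real pipeline would load complete lists.
    private static final Set<String> STATES = Set.of("Alabama", "Georgia", "Florida");

    // Candidate shape (an assumption): street number, 1-5 street words,
    // then comma-separated city, state and a 5-digit ZIP code.
    private static final Pattern CANDIDATE = Pattern.compile(
        "(\\d{1,6}(?:\\s+[A-Za-z.]+){1,5})\\s*,\\s*([A-Za-z .]+?)\\s*,\\s*([A-Za-z ]+?)\\s*,\\s*(\\d{5})\\b");

    public static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Keep a candidate only if the gazetteer recognises the state.
            if (STATES.contains(m.group(3).trim())) {
                out.add(String.join(", ",
                    m.group(1).trim(), m.group(2).trim(), m.group(3).trim(), m.group(4)));
            }
        }
        return out;
    }
}
```

Calling `AddressSketch.extract(text)` on the sample text should pick out the hotel's street address while skipping runs such as "Rooms 1 2 3 4", because those never reach a known state name. A real IE pipeline does the same kind of filtering with far richer lists and rules.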

Other NLP resources can do this too, such as UIMA. Again, this is a well-known problem in the Natural Language Processing field, and you don't need to reinvent the wheel.

You should also look at schema.org and consider writing a custom ParseFilter for Nutch, so that pages annotated with microdata are handled directly within Nutch.
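As a rough sketch of the microdata idea, the code below walks a DOM tree and collects the `itemprop` values nested under a schema.org `PostalAddress` item. It uses only the JDK's DOM classes; the Nutch wiring (registering the logic as a parse filter plugin) is omitted. The class name `MicrodataAddressSketch` and the `parse` helper are hypothetical, and the demo assumes well-formed XHTML, which real crawled pages often are not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MicrodataAddressSketch {

    /** Collects itemprop -> text for elements nested under a PostalAddress item. */
    public static Map<String, String> extract(Node root) {
        Map<String, String> props = new LinkedHashMap<>();
        walk(root, false, props);
        return props;
    }

    private static void walk(Node node, boolean insideAddress, Map<String, String> props) {
        if (node instanceof Element) {
            Element el = (Element) node;
            // Record properties only once we are below a PostalAddress element.
            if (insideAddress && el.hasAttribute("itemprop")) {
                props.put(el.getAttribute("itemprop"), el.getTextContent().trim());
            }
            // getAttribute returns "" when the attribute is absent.
            if (el.getAttribute("itemtype").endsWith("/PostalAddress")) {
                insideAddress = true;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), insideAddress, props);
        }
    }

    /** Demo helper only: parses a well-formed XHTML snippet into a DOM. */
    public static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For a page that marks up its address with `streetAddress`, `addressLocality`, `addressRegion` and `postalCode`, `extract` returns those four entries in document order. Whether a given site carries such markup has to be checked page by page; when it does, this is considerably more reliable than pattern matching on flattened text.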

Hope this helps.
