
How to extract all the address information from text?

I crawled a URL with Nutch, scraped the data, and dumped the output as text. From that text I want to extract only the address information. How can I do this?

If I am not mistaken, a regex alone won't be enough in this case; I would need a regex followed by some code logic.

Can anyone help me solve this problem?

Thanks in advance.

Pastebin URL for the sample text: http://pastebin.com/n8Eftp2K

Sample text:

Recno:: 0
URL:: http://hiltongardeninn3.hilton.com/en/hotels/alabama/hilton-garden-inn-auburn-opelika-AUOAPGI/offers/index.htm

ParseText::
Hotels Auburn, AL - Hilton Garden Inn Auburn Opelika Deals Skip to Content My Reservations My Reservations View Promotions Sign In Join       /   My Account Sign Out Show Sign In Form     View/change a specific reservation: Your Confirmation # Last Name Find   Find - OR - Sign in to view all reservations in your account Existing Travel Reservations: Hotel + Air + Car Reservation Air Itinerary Car Rental Details Close Your Next Stay: See all     Digital Key Offered View/Edit Go To My Account Page Update your password regularly to keep your account safe. Your Last Stay: See all     View Receipt Book Again Join Hilton HHonors™ Upgrade your account and earn points at over 3,600 hotels in 82 countries around the world. Join HHonors       |   |                                                  Close Search GI Skip brand navigation Find a Hotel Offers Meetings and Events About Hilton Garden Inn skip form Where are you going? City, airport, address, attraction, or hotel Arrival You are now focused on a datepicker field. Press the down arrow to enter the calendar table. Once focused on the table, press left or right to navigate days. Press up or down to navigate between weeks. Enter to select. Escape to close datepicker. Your arrival date must be within the next year.   Departure You are now focused on a datepicker field. Press the down arrow to enter the calendar table. Once focused on the table, press left or right to navigate days. Press up or down to navigate between weeks. Enter to select. Escape to close datepicker. Your departure date must be within 4 months after your arrival date.   
Use flexible dates Use HHonors Points Rooms Adults (18+) Children Rooms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26+ Adults in Room 1 1 2 3 4 Children in Room 1 0 1 2 3 4 Room 2 : Adults in Room 2 1 2 3 4 Children in Room 2 0 1 2 3 4 Room 3 : Adults in Room 3 1 2 3 4 Children in Room 3 0 1 2 3 4 Room 4 : Adults in Room 4 1 2 3 4 Children in Room 4 0 1 2 3 4 Room 5 : Adults in Room 5 1 2 3 4 Children in Room 5 0 1 2 3 4 Room 6 : Adults in Room 6 1 2 3 4 Children in Room 6 0 1 2 3 4 Room 7 : Adults in Room 7 1 2 3 4 Children in Room 7 0 1 2 3 4 Room 8 : Adults in Room 8 1 2 3 4 Children in Room 8 0 1 2 3 4 Room 9 : Adults in Room 9 1 2 3 4 Children in Room 9 0 1 2 3 4 *Best Price Guarantee | More Options Less Options Add special rate codes (AAA, AARP, etc) Find it   Find it Check Rooms & Rates Promotion/Offer code: Group code: Corporate account: Travel agent AAA rate * AARP rate * Senior rate * Government / Military rates * * ID required at check-in Close Close close tab panel Offers Discover award-winning service, thoughtful amenities and great deals at HGI hotels. View all Offers Earn 2X Points Book early and save Bed N Breakfast Deal End of tab panel close tab panel Meetings & Events Host your next meeting or special event in one of our newly renovated hotels. Meetings & Events Meetings Weddings Planning Tools Small Meeting Packages End of tab panel close tab panel About Hilton Garden Inn We’re here to help you be successful with great service and complimentary amenities. Learn More about HGI Locations Search more than 640 Hilton Garden Inn hotels worldwide to find the right one for your next trip. See our locations New Hotels See our new hotels and find out where we’re scheduled to open soon. 
See all new hotels End of tab panel   menu_item_property_offers AUOAPGI Hilton Garden Inn Auburn/Opelika 2555 Hilton Garden Drive , Auburn , Alabama , 36830 , USA TEL: +1-334-502-3500 FAX: +1-334-502-3572 Skip secondary navigation Hotel Home Hotel Details Amenities & Services Maps & Directions Rooms & Suites Plan an Event Special Offers Dining Things To Do   Not what you're looking for? Find Nearby Hotels Our Promise We promise to do whatever it takes to ensure you're satisfied, or you don't pay. You can count on us. GUARANTEED. The Hilton Garden Inn Promise reflects our focus on hospitality and integrity. We are committed to providing an excellent hotel experience for every guest, every time. Not what you're looking for? Find Nearby Hotels HHonors Reward Category: 7   Find a Special Offer Arrival   Departure   Find Offers Find Offers Hotel Information Check-in: 3:00 pm Check-out: 12:00 pm Smoking: Non-Smoking A fee of up to 250 USD will be assessed for smoking in a non-smoking room. Please ask the Front Desk for locations of designated outdoor smoking areas. Parking: Self parking: (Complimentary) Valet: Not Available Pets: Service animals allowed: Yes Pets allowed: No Hotel Policies Where we are Find where we are located View the Maps & Directions Page Share Print Special Offers Sort by:   Brand Book Date Compare offer Premium Wi-Fi Compare this offer , You can compare up to 4 offers Premium Wi-Fi Boost your speed with Premium Internet access <

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Loose address pattern: optional "No"/"P.O. Box" prefix, digits, a few words, then a 2-5 digit code.
private static Pattern regexPreciseData = Pattern.compile("(?:(no|NO|No|(?:(p|P)?\\.?\\s?(o|O)?\\.?\\s?((B|b)(o|O)(x|X))))?(?:\\s?\\.?\\#?\\/?\\:?\\-?\\s?)[0-9]\\/?\\:?)+(?:\\,?\\s?\\w+\\s?){2,5}(?:\\s?\\w+\\s?)\\-?\\s?(\\d{2,5})", Pattern.CASE_INSENSITIVE);

utils.getText(sb, root); // extract the page text into sb
text = sb.toString();
sb.setLength(0); // reuse the buffer to collect the matches
Matcher addressPattern = regexPreciseData.matcher(text);
while (addressPattern.find()) {
    // group() returns the same span as text.substring(start(), end())
    sb.append(addressPattern.group());
}
text = sb.toString();

As the code above shows, I match the text against the regex pattern and keep only the address information. However, my regex also matches some irrelevant data. Many people have suggested that a regex alone cannot be accurate enough to extract the address information from the text.
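One way to read "a regex followed by some code logic" is to let a regex propose candidates and then reject any candidate that fails a simple check. The sketch below assumes comma-separated US-style addresses like the one in the sample text; the class name, the pattern, and the tiny state whitelist are all hypothetical and would need tuning against real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressCandidateFilter {

    // Hypothetical, deliberately tiny whitelist; a real one would cover every state.
    private static final Set<String> STATES = Set.of("alabama", "al", "georgia", "ga");

    // Candidate shape (an assumption): street number, 1-5 street words,
    // city, state, 5-digit ZIP, separated by commas with optional spaces.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b(\\d{1,6}(?:\\s+[A-Za-z.]+){1,5})\\s*,\\s*([A-Za-z ]+?)\\s*,"
        + "\\s*([A-Za-z ]+?)\\s*,\\s*(\\d{5})\\b");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Code logic after the regex: keep the candidate only if the
            // third field is a known state name or abbreviation.
            String state = m.group(3).trim();
            if (STATES.contains(state.toLowerCase())) {
                result.add(m.group(1).trim() + ", " + m.group(2).trim()
                    + ", " + state + " " + m.group(4));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hilton Garden Inn Auburn/Opelika 2555 Hilton Garden Drive , "
            + "Auburn , Alabama , 36830 , USA TEL: +1-334-502-3500";
        for (String address : extract(text)) {
            System.out.println(address);
        }
    }
}
```

The validation step is where most of the precision comes from: the pattern can stay loose because anything whose state field is not on the list is dropped.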

Use an information extraction (IE) library or framework to detect addresses. A great deal of work has gone into this, and it will almost always be more reliable than writing regular expressions over raw text.

You could, for instance, use GATE, which comes with ANNIE, a simple IE pipeline that can extract addresses. One way to do this is to use Behemoth and either run GATE within it or export to the GATE format so that you can run GATE separately. You could also reuse the code in Behemoth's GATE module and write a custom parser for Nutch so that the extraction is done within Nutch.

Other NLP resources can do this too, for example UIMA. Again, this is a well-known task in the natural language processing field, and you don't need to reinvent the wheel.

You should also have a look at schema.org and write a custom ParseFilter for Nutch so that pages annotated with microdata are handled directly within Nutch.

Hope this helps.
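As a rough illustration of the microdata idea, the sketch below walks a DOM tree, finds the first element typed as a schema.org PostalAddress, and collects its itemprop values. It uses only the JDK's XML parser on a well-formed snippet; the class and method names are hypothetical, the markup is an invented example (the sample page is not known to carry microdata), and the Nutch plugin wiring is not shown. Since readAddress takes a plain org.w3c.dom.Node, the same walk should apply to whatever DOM node a Nutch parse filter receives.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class MicrodataAddressReader {

    // Depth-first search for the first element typed as a schema.org PostalAddress.
    static Element findAddressScope(Node node) {
        if (node instanceof Element) {
            Element el = (Element) node;
            if (el.getAttribute("itemtype").endsWith("/PostalAddress")) {
                return el;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element found = findAddressScope(children.item(i));
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Records itemprop name -> trimmed text for a node and its descendants.
    static void collectProps(Node node, Map<String, String> out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            if (el.hasAttribute("itemprop")) {
                out.put(el.getAttribute("itemprop"), el.getTextContent().trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectProps(children.item(i), out);
        }
    }

    // Returns the address properties found under root, or an empty map.
    static Map<String, String> readAddress(Node root) {
        Map<String, String> props = new LinkedHashMap<>();
        Element scope = findAddressScope(root);
        if (scope != null) {
            // Start below the scope so its own itemprop="address" is not recorded.
            NodeList children = scope.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                collectProps(children.item(i), props);
            }
        }
        return props;
    }

    // Convenience wrapper for this sketch: parses a well-formed XHTML snippet.
    static Map<String, String> readAddressFromXhtml(String xhtml) {
        try {
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return readAddress(DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new ByteArrayInputStream(bytes)));
        } catch (Exception e) {
            throw new IllegalArgumentException("snippet is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented markup for illustration only.
        String page = "<div itemscope='' itemtype='http://schema.org/Hotel'>"
            + "<span itemprop='name'>Hilton Garden Inn Auburn/Opelika</span>"
            + "<div itemprop='address' itemscope='' itemtype='http://schema.org/PostalAddress'>"
            + "<span itemprop='streetAddress'>2555 Hilton Garden Drive</span>"
            + "<span itemprop='addressLocality'>Auburn</span>"
            + "<span itemprop='addressRegion'>Alabama</span>"
            + "<span itemprop='postalCode'>36830</span>"
            + "</div></div>";
        System.out.println(readAddressFromXhtml(page));
    }
}
```

When a page does carry this markup, no pattern matching is needed at all: the property names (streetAddress, addressLocality, addressRegion, postalCode) already say which piece is which.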

