
Parse location, person name, and date from a string with NLTK

I have many strings like the following:

  1. ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
  2. KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
  3. ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

I am using NLTK and want to strip the dateline part and recognize the date, location, and person name.

With POS tagging I can find the parts of speech, but I need to identify the location, date, and person name. How can I do that?

Update:

Note: I don't want to make another HTTP request. I need to parse the strings in my own code. Using a library is fine.

Update:

I tried ne_chunk, but with no luck.

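import nltk

def pchunk(t):
    # Tokenize, POS-tag, then run NLTK's default named-entity chunker.
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)

# txts is a list of those 3 sentences (truncated exactly as shown above).
txts = [
    "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
    "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
    "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
]

for t in txts:
    print(t)
    pchunk(t)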

The output is as follows:

ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab

(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))

KARACHI, July 24 -- Police claimed to have arrested several suspects in separate

(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)

ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)

Look closely: KARACHI is recognized correctly as a GPE, but Sri Lanka is recognized as a PERSON, and ISLAMABAD is left as a plain NNP instead of being marked as a GPE.

If using an API instead of your own code is acceptable for your requirements, this is something the Wit API can easily do for you.


Wit will also resolve date/time tokens into normalized dates.

To get started you just have to provide a few examples.
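Since the question rules out an extra HTTP request, the simple "Month day" form seen in the sample datelines (for example "July 24") could also be normalized locally. The following is only a minimal sketch using the standard library, not how Wit works: the function name `parse_dateline_date` and its `year` parameter are assumptions made for illustration, because the sample datelines carry no year.

```python
from datetime import datetime


def parse_dateline_date(text, year):
    """Parse a 'Month day' string such as 'July 24' into a date.

    `year` must be supplied by the caller (an assumption: the sample
    datelines contain no year). Returns None if the text is not in the
    expected form. Month names are matched in the current locale, which
    is English unless the program has changed it.
    """
    try:
        # Appending the year lets strptime validate the day properly
        # (for example February 29 in a leap year).
        return datetime.strptime(f"{text.strip()} {year}", "%B %d %Y").date()
    except ValueError:
        return None


print(parse_dateline_date("July 24", 2013))
print(parse_dateline_date("Sri Lanka", 2013))
```

Anything that does not look like a month followed by a day, such as "Sri Lanka", simply yields `None`, so the same dateline fragment can be tried as a date first and treated as a place otherwise.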

Yahoo has a PlaceFinder API that should help with identifying places. It looks like the places are always at the start, so it could be worth taking the first couple of words and sending them to the API until it hits a limit:

http://developer.yahoo.com/boss/geo/
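One reading of this suggestion can be sketched as follows: build candidate place strings from the first few words, longest first, and stop at the first one a lookup accepts. Both helper names and the `lookup` callable are hypothetical. `lookup` stands in for whatever place service or local gazetteer is used, and no real API call is shown here.

```python
def leading_word_candidates(text, max_words=3):
    """Return the first 1..max_words words as candidate strings, longest first."""
    # Strip dateline punctuation; tokens such as "--" become empty and are dropped.
    words = [w.strip(",:;-") for w in text.split()]
    words = [w for w in words if w][:max_words]
    return [" ".join(words[:n]) for n in range(len(words), 0, -1)]


def first_match(candidates, lookup):
    """Return (candidate, result) for the first candidate that lookup accepts.

    `lookup` is a hypothetical callable that returns None for unknown places.
    """
    for candidate in candidates:
        result = lookup(candidate)
        if result is not None:
            return candidate, result
    return None


# A tiny in-memory gazetteer stands in for the real place lookup.
known_places = {"ALUM KULAM": "place", "KARACHI": "place"}
candidates = leading_word_candidates("ALUM KULAM, Sri Lanka -- As gray-bellied clouds")
print(candidates)
print(first_match(candidates, known_places.get))
```

Trying the longest candidate first avoids accepting "ALUM" when "ALUM KULAM" is the actual place name, at the cost of a few extra lookups.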

It may also be worth using the dreaded regex to identify runs of capitals; see "Regular expression for checking if capital letters are found consecutively in a string?"
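As a rough illustration of the regex idea, the sketch below splits off a dateline. It assumes the shape seen in the three samples: an all-caps place name at the start, an optional comma-separated fragment (a date or a country), and then ":" or "--". The pattern and the name `split_dateline` are assumptions for illustration, and other dateline styles would need a different pattern.

```python
import re

# Assumed dateline shape, based only on the three sample strings.
DATELINE_RE = re.compile(
    r"""^(?P<place>[A-Z][A-Z .]*[A-Z])   # run of capitals, may contain spaces
        (?:,\s*(?P<extra>[^:]+?))?       # optional fragment after a comma
        \s*(?::|--)\s*                   # separator: colon or double hyphen
        (?P<body>.*)$                    # the rest of the sentence
    """,
    re.VERBOSE | re.DOTALL,
)


def split_dateline(text):
    """Return (place, extra, body); place and extra are None if nothing matches."""
    m = DATELINE_RE.match(text)
    if m is None:
        return None, None, text
    return m.group("place"), m.group("extra"), m.group("body")


samples = [
    "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
    "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
    "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
]
for s in samples:
    print(split_dateline(s))
```

The returned body no longer contains the dateline, so it can be passed to `ne_chunk` on its own, while the place and the extra fragment are handled separately.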

Good luck!


 