
Parsing String to Dates - Java

This is the problem:

I have some .csv files with travel information, in which the dates appear as free-text strings (one line per travel):

  • "All Mondays from January-May and October-December. All days from June To September"
  • "All Fridays from February to June"
  • "Monday, Friday and Saturday and Sunday from 10 January to 30 April"
  • "from 01 of November to 30 April. All days except fridays from 2 to 24 of november and sunday from 2 to 30 of december"
  • "All sundays from 02 december to 28 april"
  • "5, 12, 20 of march, 11, 18 of april, 2, 16, 30 of may, 6, 13, 27 june"
  • "All saturdays from February to June, and from September to December"
  • "1 to 17 of december, 1 to 31 of january"
  • "All mondays from February to november"

I must parse the strings into dates and keep them in an array for each travel.

The problem is that I don't know how to do it. Even my university teachers told me that they don't know how to do it :S. I can't find or create a pattern using http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

After parsing them I have to search for all travels between two dates.

But how? How can I parse them? Is it even possible?

You're in the domain of NLP (Natural Language Processing), and what is possible or impossible is fuzzy in this domain. From a quick Google search, I've found that the Natty Date Parser might be useful for you.

For more theoretical background on NLP, you might be interested in the Natural Language Processing course from Stanford University on Coursera (at the moment the course is not open for enrolment, but the lectures are available for free).

You can also use a set of strict regular expressions, each of which matches only one of your possible cases, and apply them in order from the most restrictive to the most relaxed.
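As a rough illustration of that ordering idea, here is a minimal sketch in Java. The class name `RuleCascade`, the rule names and the four patterns are all made up for this example and cover only a few of the sample strings; it only classifies a phrase and does not build any dates yet.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: tries strict patterns first, a loose catch-all last.
public class RuleCascade {
    private static final String MONTH =
        "(?:january|february|march|april|may|june|july|august|september|october|november|december)";
    private static final String WEEKDAY =
        "(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)";

    // LinkedHashMap keeps insertion order, so rules are tried in the order added.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Strict rules: each one must match the whole phrase.
        add("WEEKDAY_MONTH_RANGE", "all " + WEEKDAY + "s from " + MONTH + " to " + MONTH);
        add("WEEKDAY_DATE_RANGE",
            "all " + WEEKDAY + "s from \\d{1,2} " + MONTH + " to \\d{1,2} " + MONTH);
        add("DAY_RANGE_IN_MONTH", "\\d{1,2} to \\d{1,2} of " + MONTH);
        // Relaxed rule, tried last: anything that contains "from ... to ...".
        add("GENERIC_FROM_TO", ".*\\bfrom\\b.+\\bto\\b.+");
    }

    private static void add(String name, String regex) {
        RULES.put(name, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Returns the name of the first rule that matches the whole phrase. */
    public static String classify(String text) {
        String trimmed = text.trim();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(trimmed).matches()) {
                return rule.getKey();
            }
        }
        return "UNMATCHED"; // log these lines and write a new rule for them
    }
}
```

With these assumed rules, "All Fridays from February to June" is caught by the first strict rule, while "All saturdays from February to June, and from September to December" falls through to the relaxed one. Splitting a line on "." or "," before classifying each piece would probably be a separate step.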

The first thing I would define to attack your problem is what you expect as the output of your method, since in some cases it's a single date, in some cases an interval, and in others multiple intervals.
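One possible shape for that output, sketched under some assumptions: it uses `java.time` (Java 8 or later, not the Java 6 API linked in the question), the names `TravelSchedule`, `Rule`, `datesBetween` and `runsBetween` are invented here, and the year has to be supplied by the caller because the strings don't contain one. A single date is just an interval whose start and end are equal, and a travel holds a list of such intervals.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical output model: a travel is a list of weekday-filtered intervals.
public class TravelSchedule {

    /** One inclusive interval, restricted to the given weekdays. */
    static final class Rule {
        final LocalDate start;
        final LocalDate end;
        final Set<DayOfWeek> weekdays;

        Rule(LocalDate start, LocalDate end, Set<DayOfWeek> weekdays) {
            if (end.isBefore(start)) {
                throw new IllegalArgumentException("end is before start");
            }
            this.start = start;
            this.end = end;
            this.weekdays = weekdays;
        }

        boolean covers(LocalDate day) {
            return !day.isBefore(start) && !day.isAfter(end)
                && weekdays.contains(day.getDayOfWeek());
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public TravelSchedule add(LocalDate start, LocalDate end, Set<DayOfWeek> weekdays) {
        rules.add(new Rule(start, end, weekdays));
        return this;
    }

    /** All days in [from, to] on which at least one rule applies. */
    public List<LocalDate> datesBetween(LocalDate from, LocalDate to) {
        List<LocalDate> result = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            for (Rule rule : rules) {
                if (rule.covers(day)) {
                    result.add(day);
                    break; // count each day once even if several rules match
                }
            }
        }
        return result;
    }

    /** The "search travels between two dates" check from the question. */
    public boolean runsBetween(LocalDate from, LocalDate to) {
        return !datesBetween(from, to).isEmpty();
    }
}
```

For example, "All Fridays from February to June" for an assumed year 2013 would become `new TravelSchedule().add(LocalDate.of(2013, 2, 1), LocalDate.of(2013, 6, 30), EnumSet.of(DayOfWeek.FRIDAY))`. Storing rules instead of an expanded array of dates keeps the "except" cases open for later, although those would need an extra exclusion list that this sketch does not have.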

This requires Natural Language Processing (NLP); see Wikipedia for an account: http://en.wikipedia.org/wiki/Natural_language_processing

Your problem as stated is very hard. There are many ways of representing a single date, and your examples include ranges of dates and formulae for generating dates. It sounds as if you have a limited subset of language - frequent use of "all", "from", etc.

If you are in control of the language (i.e. the strings are written by humans who comply with your documentation) then you have a chance of formalising it (although it will take a lot of work, possibly months). If you are not in charge of it, then every time a new phrase appears you will have to add it to the specs.

I suggest you go through the file and look for stock phrases such as "All [weekdayname]s [from | between | until | before]" or "in [January | February ...]". Then substitute placeholders for these in the phrases. If you find this covers all the cases, you may be able to extract the particular phrases. But if you have anaphora like "next Tuesday" it will be much harder.
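A quick way to check how far a handful of stock phrases gets you is to replace the variable parts with placeholders and count the distinct templates that remain. The sketch below does that; the class name `PhraseTemplates` and the placeholder tokens `WEEKDAY`, `MONTH` and `DAY` are made up for this example, and it will wrongly treat the verb "may" as a month.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical survey tool: reduces each line to a template and counts templates.
public class PhraseTemplates {

    /** Replaces weekday names, month names and day numbers with placeholders. */
    static String toTemplate(String line) {
        String s = line.toLowerCase(Locale.ROOT).trim();
        // Weekdays first (singular or plural), then months, then 1-2 digit numbers.
        s = s.replaceAll(
            "\\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)s?\\b",
            "WEEKDAY");
        s = s.replaceAll(
            "\\b(?:january|february|march|april|may|june|july|august"
                + "|september|october|november|december)\\b",
            "MONTH");
        s = s.replaceAll("\\b\\d{1,2}\\b", "DAY");
        return s;
    }

    /** Counts how many lines share each template, in first-seen order. */
    static Map<String, Integer> countTemplates(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            counts.merge(toTemplate(line), 1, Integer::sum);
        }
        return counts;
    }
}
```

Both "All Fridays from February to June" and "All mondays from February to november" reduce to the template "all WEEKDAY from MONTH to MONTH". If the whole file collapses into a small number of templates, each one can get its own parsing rule; a long tail of templates that occur only once suggests the language is not under control.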


 