
Parsing sentences in Scala using parser combinators

How can I effectively parse statements like the one below, without cluttering the code too much? Keywords/separators are placed within [].

Manager, Delhi [for] The Company Pvt Ltd. [from] Jan, 2009 [to] Jan, 2012.

The person's name, the company name, and the date range are to be extracted from the text using parser combinators (the expected output is shown at the bottom).

Below is the code I have written for this:



    case class CompanyWithMonthDateRange(company:String, position:String, dateRange:List[MonthYear])

    case class MonthYear(month:String, year:Int)

    object CompanyParser1 extends RegexParsers {
      override type Elem = Char
      override def skipWhitespace = false
      def keywords: Parser[String] = "for" | "in" | "with" | "at" | "from" | "pvt" | "ltd" | "company" | "co" | "limited" | "inc" | "corporation" | "jan" |
        "feb" | "mar" | "apr" | "may" | "jun" | "jul" | "aug" | "sep" | "oct" | "nov" | "dec" | "to" | "till" | "until" | "upto"

      val date = ("""\d\d\d\d""".r | """\d\d""".r)
      val integer     = ("""(0|[1-9]\d*)""".r) ^^ { _.toInt }
      val comma = ("""\,""".r)
      val quote = ("""[\'\"]+""".r)
      val underscore  = ("""\_""".r)
      val dot = ("""\.""".r)
      val space = ("""\s+""".r) ^^ {case _ => ""}
      val colon = (""":""".r)
      val ampersand = ("""(\&|and)""".r)
      val hyphen = ("""\-""".r)
      val brackets = ("""[\(\)]+""".r)
      val newline = ("""[\n\r]""".r)
      val months = ("""(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)""".r)
      val toTillUntil = ("""(to|till|until|upto)""".r)
      val asWord = ("""(as)""".r)
      val fromWord = ("""from""".r)
      val forWithAt = ("""(in|for|with|at)""".r)
      val companyExt = ("""(pvt|ltd|company|co|limited|inc|corporation)""".r)
      val alphabets = not(keywords)~"""[a-zA-Z]+""".r
      val name = not(keywords)~"""[a-zA-Z][a-zA-Z\,\-\'\&\(\)]+\s+""".r

      def possibleCompanyExts = companyExt <~ (dot *)  ^^ {_.toString.trim}
      def alphabetsExt = ((alphabets ~ ((quote | ampersand | hyphen | brackets | underscore | comma) *) <~ (space *))+) ^^ { case a => a.toString.trim}
      def companyNameExt = (alphabetsExt <~ (space *) <~ (possibleCompanyExts+)) ^^ {_.toString
      }
      def companyName = alphabetsExt *
      def entityName = (alphabetsExt+) ^^ {case l => l.map(s => s.trim).mkString(" ")}
      def dateWithEndingChars = date <~ ((comma | quote | dot | newline) *) <~ (space *) ^^ {_.toInt}
      def monthWithEndingChars = months <~ ((comma | quote | dot | newline) *) <~ (space *) ^^ { _.toString}
      def monthWithDate = monthWithEndingChars ~ dateWithEndingChars ^^ { case a~b => MonthYear(a,b)}
      def monthDateRange = monthWithDate ~ (space *) ~ toTillUntil ~ (space *) ~ monthWithDate ^^ { case a~s1~b~s2~c => List(a,c)}
      def companyWithMonthDateRange = (companyNameExt ~ (space *) ~ monthDateRange) ^^ {
        case a~b~c => CompanyWithMonthDateRange(company = a, dateRange = c, position = "")
      }
      def positionWithCompanyWithMonthDateRange = ((name+) ~ (space *) ~ forWithAt ~ (space *) ~ companyWithMonthDateRange) ^^ {
        case a~s1~b~s2~c => c.copy(position = a.mkString(","))
      }

      def apply(input: String) = {
        parseAll(positionWithCompanyWithMonthDateRange, input) match {
          case Success(lup, _) => println(lup)
          case x => println(x)
        }
      }
    }

The output should be something like:



    CompanyWithMonthDateRange(List(((()~Company)~List()), ((()~fd)~List()), ((()~India)~List('))),(()~Manager, ),(()~Delhi ),List(MonthYear(mar,2010), MonthYear(jul,2012)))

Also, how can I remove the unwanted "~" appearing in the text above?

Thanks, Pawan

I'm not trying to write a complete solution to your real problem here, just to parse the sentence into the data structure you've provided. I'm not sure whether it helps, so take it as a reference.

In your CompanyWithMonthDateRange, I didn't see where to put the extracted name, so I'll leave it out; it should be trivial to add.

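    import scala.util.parsing.combinator.RegexParsers

    // Reuses MonthYear and CompanyWithMonthDateRange from the question.
    object CompParser extends RegexParsers {
      val For = "[for]"
      val From = "[from]"
      val To = "[to]"
      val Keyword = For | From | To
      // Lazily matches the text after a line start or ']' up to the next '['
      // or up to a full stop followed by line break(s).
      val Def = """(?m)(?<=^|\]).*?(?=\[|(\.\s*[\n\r]+))""".r
      // Any single non-newline character; here it consumes the closing full stop.
      val End = """.""".r
      // One "text [keyword] text" chunk, tagged by its keyword.
      val Construct = opt(Def) ~ Keyword ~ Def ^^ {
        case p ~ `For` ~ s => {
          // Text before [for] is "position, name"; text after it is the company.
          val arr = p.getOrElse("").split(",")
          val t2 = if (arr.length == 2) arr(0) -> arr(1) else ("", "")
          ("pos&com", (t2._1, s.toString))
        }
        case _ ~ `From` ~ s => {
          // Text after [from] is "month, year".
          val arr = s split ","
          val t2 = if (arr.length == 2) arr(0) -> arr(1) else ("", "")
          ("from", (t2._1, t2._2))
        }
        case _ ~ `To` ~ s => {
          // Text after [to] is "month, year".
          val arr = s split ","
          val t2 = if (arr.length == 2) arr(0) -> arr(1) else ("", "")
          ("to", (t2._1, t2._2))
        }
      }
      // A statement is complete only when all three chunks are present.
      val Statement = rep(Construct) ^^ (Map() ++ _) ^^ { m =>
        if (m.size == 3) {
          val from = new MonthYear(m.get("from").head._1, m.get("from").head._2.trim.toInt)
          val to = new MonthYear(m.get("to").head._1, m.get("to").head._2.trim.toInt)
          val pos = m.get("pos&com").head._1
          val com = m.get("pos&com").head._2
          Some(CompanyWithMonthDateRange(com, pos, List(from, to)))
        } else None
      }

      val Statements = rep(Statement <~ End)

      def apply(in: String) = {
        parseAll(Statements, in) match {
          case Success(r, i) => println(r)
          case failure => println(failure) // print the failure instead of silently returning it
        }
      }
    }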

The parser stops at line breaks. Here's a test for it:

    object TestP extends App {
      val inStr1 = """
        Manager, Delhi [for] The Company Pvt Ltd. [from] Jan, 2009 [to] Jan, 2012.
       """
      val inStr2 = """
        Manager, Delhi [for] The Company Pvt Ltd. [from] Jan, 2009 [to] Jan, 2012.
        Employee, Kate [for] The Company Pvt Ltd. [from] Feb, 2010 [to] Jun, 2012.
        HR, Jane       [for] The Company Pvt Ltd. [from] May, 2010 [to] July, 2012.
        """
      CompParser(inStr1)
      CompParser(inStr2)
    }

The output is as follows. For inStr1:

    List(Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Manager,List(MonthYear(Jan,2009), MonthYear(Jan,2012)))))

inStr2:

    List(Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Manager,List(MonthYear(Jan,2009), MonthYear(Jan,2012)))), Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Employee,List(MonthYear(Feb,2010), MonthYear(Jun,2012)))), Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,HR,List(MonthYear(May,2010), MonthYear(July,2012)))))
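As for the unwanted "~" in the question's output: sequencing two parsers with `~` yields a value of the `~` case class, whose `toString` prints as `(a~b)`. Calling `.toString` on such a result (as `alphabetsExt` does, and as `a.mkString(",")` does implicitly on the `name` results) bakes the tildes into the string. Destructuring with `case a ~ b =>`, or discarding parts with `~>` and `<~`, avoids this. Below is a minimal sketch of that idea, not a drop-in replacement for either parser above. It assumes the bracketed keywords of the sample sentence and the scala-parser-combinators module on the classpath; `TildeFreeParser`, `text`, `company`, `year`, `statement` and `parse` are hypothetical names, and the second comma-separated field ("Delhi") is simply dropped, as in the answer's output.

```scala
import scala.util.parsing.combinator.RegexParsers

// Same shapes as the case classes in the question.
case class MonthYear(month: String, year: Int)
case class CompanyWithMonthDateRange(company: String, position: String, dateRange: List[MonthYear])

// Hypothetical sketch; relies on the default whitespace skipping of RegexParsers.
object TildeFreeParser extends RegexParsers {
  // Free text up to the next comma or '[' keyword, trimmed.
  val text: Parser[String] = """[^,\[]+""".r ^^ (_.trim)
  // Everything up to the next '[' keyword, trimmed.
  val company: Parser[String] = """[^\[]+""".r ^^ (_.trim)
  val year: Parser[Int] = """\d{4}""".r ^^ (_.toInt)

  // "Jan, 2009": `<~` drops the comma, `case m ~ y` unpacks the pair.
  val monthYear: Parser[MonthYear] =
    ("""[A-Za-z]+""".r <~ ",") ~ year ^^ { case m ~ y => MonthYear(m, y) }

  // Keywords and the closing full stop are matched but never kept.
  val statement: Parser[CompanyWithMonthDateRange] =
    (text <~ "," <~ text) ~ ("[for]" ~> company) ~ ("[from]" ~> monthYear) ~ ("[to]" ~> monthYear) <~ "." ^^ {
      case position ~ comp ~ from ~ to => CompanyWithMonthDateRange(comp, position, List(from, to))
    }

  // Parses zero or more statements; Left carries the parser's error message.
  def parse(in: String): Either[String, List[CompanyWithMonthDateRange]] =
    parseAll(rep(statement), in) match {
      case Success(result, _) => Right(result)
      case failure: NoSuccess => Left(failure.msg)
    }
}
```

Calling `TildeFreeParser.parse` on the sample sentence should give a `Right` holding one `CompanyWithMonthDateRange` with company "The Company Pvt Ltd.", position "Manager" and the two `MonthYear` values, with no `~` in the printed result. Since Scala 2.11 the combinators ship as the separate scala-parser-combinators module, so it has to be added as a dependency.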
