Thursday, January 17, 2013

Simple Scala Text Recognition Patterns

So many times I have wanted to extract email addresses, links, specific HTML tags, simple chunks of code, you name it, from plain text, HTML content, source code, and so on.

But every time I ended up hunting for the right regular expressions, tweaking them, throwing them into my code, and forgetting everything the day after. The next time, I would start all over again.

I think this library can preserve all that effort, so bear with me and see how Scala's features, such as extractors, regular expressions, and manifests, can make your life much easier.


Let's assume that we have some text-based content (txt, HTML, etc.) and that our goals are the following:

  • We want to easily extract email addresses, links, images, etc. from that content
  • We want a decent, common data format for whatever we extract
  • We want to easily add new patterns to the existing text recognition patterns, so that we can recognize more kinds of items in our text-based content

The madeforall extractor library is a Scala library meant to help us reach these goals.

Here is a simple description of how to use and extend this library:

We create a RecognizableItemsExtractor object and pass it a list of RecognitionPattern objects as a parameter.

Any recognition pattern object should:
- implement the unapply extraction method (that is, be a Scala extractor)
- mix in the RecognitionPattern trait
 
For instance, to support email recognition, we could do the following:

case class Email(user: String, domain: String) extends Recognizable {
  override def toString() = {
    user + "@" + domain
  }
}
object EmailRecognitionPattern extends RecognitionPattern {
  // The extraction method
  def unapply(str: String): Option[(String, String)] = {
    val parts = str split "@"
    // Wrap the pair in an explicit tuple instead of relying on auto-tupling
    if (parts.length == 2) Some((parts(0), parts(1))) else None
  }
  // email regex
  val regexPattern = """(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r

  def recognize(value: String): Option[Recognizable] = {
    value match {
      case EmailRecognitionPattern(user, domain) =>
        Some(Email(user, domain))
      case _ =>
        None
    }
  }
}
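
To see the two pieces working together outside the library, here is a minimal, stand-alone sketch. The EmailParts object and its findAll helper are hypothetical names made up for this illustration and are not part of madeforall. The regex finds candidate addresses in a larger text, and the extractor then splits each candidate inside a pattern match.

```scala
// Hypothetical stand-alone extractor, for illustration only.
// It is not the library's EmailRecognitionPattern.
object EmailParts {
  // Same regex as in the post: finds candidate addresses inside larger text.
  val regexPattern = """(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r

  // Extractor: lets a "user@domain" string be destructured in a pattern match.
  def unapply(str: String): Option[(String, String)] = {
    val parts = str.split("@")
    if (parts.length == 2) Some((parts(0), parts(1))) else None
  }

  // Find every candidate in the text, then split each one with the extractor.
  def findAll(text: String): List[(String, String)] =
    regexPattern.findAllIn(text).toList.collect {
      case EmailParts(user, domain) => (user, domain)
    }
}
```

Calling EmailParts.findAll on a sentence that mentions two addresses should give back two (user, domain) pairs, in the order they appear in the text.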

Then, to extract all the emails from our text we could do the following:

val textToAnalyze: String = "<<some content>>"
val recog = RecognizableItemsExtractor(List(EmailRecognitionPattern))
val emailList: List[Recognizable] = recog.analyzeText(textToAnalyze)
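
The post does not show how analyzeText works internally, so the following is only a guess at the general idea, not the library's code. Everything inside ExtractorSketch (the trait shapes, EmailPattern, SimpleItemsExtractor) is a hypothetical stand-in. It assumes that each pattern exposes its regex and a recognize method, and that the extractor simply runs every regex over the text and keeps whatever gets recognized.

```scala
import scala.util.matching.Regex

// Hypothetical minimal versions of the library's types, for illustration only.
object ExtractorSketch {
  trait Recognizable

  trait RecognitionPattern {
    def regexPattern: Regex
    def recognize(value: String): Option[Recognizable]
  }

  case class Email(user: String, domain: String) extends Recognizable

  object EmailPattern extends RecognitionPattern {
    val regexPattern: Regex =
      """(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r

    // Split a candidate on "@" and keep it only if it has exactly two parts.
    def recognize(value: String): Option[Recognizable] =
      value.split("@") match {
        case Array(user, domain) => Some(Email(user, domain))
        case _                   => None
      }
  }

  // Assumed behaviour: run every pattern's regex over the text and keep
  // whatever each pattern recognizes among the candidates.
  case class SimpleItemsExtractor(patterns: List[RecognitionPattern]) {
    def analyzeText(text: String): List[Recognizable] =
      for {
        pattern   <- patterns
        candidate <- pattern.regexPattern.findAllIn(text).toList
        item      <- pattern.recognize(candidate)
      } yield item
  }
}
```

With this shape, supporting a new kind of item only means writing one more object that mixes in the pattern trait and adding it to the list.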

If we want to extract both emails and links, we can create a RecognizableItemsExtractor with more recognition patterns.

val emailAndLinkRecog = RecognizableItemsExtractor(List(EmailRecognitionPattern, LinkRecognitionPattern))
val emailAndLinkList: List[Recognizable] = emailAndLinkRecog.analyzeText(textToAnalyze)

The emailAndLinkList above contains both emails and links.
To keep only the emails, we can use the following filterByType function.

val onlyEmailList = recog.filterByType[Email](emailAndLinkList)

Here is the definition of the RecognizableItemsExtractor's filterByType function:

  def filterByType[T <: Recognizable](t: List[Recognizable])(implicit mf: Manifest[T]) =
    t.filter(e => mf.erasure.isInstance(e))

It is based on Scala's manifest feature: the compiler supplies an implicit Manifest[T] that carries the class of T to the runtime.
Scala is subject to type erasure just like Java, so without the manifest the type parameter T would not be known when the filter runs.
The creators of Scala gave us manifests to work around this JVM limitation.
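
Since Scala 2.10, Manifest.erasure is deprecated in favour of runtimeClass, and ClassTag is the usual tool for this job. Below is a sketch of the same filter written with a ClassTag. It is an alternative I am suggesting, not the library's definition, and the types inside FilterSketch are hypothetical stand-ins. A side benefit is that the result is typed as List[T] rather than List[Recognizable].

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins for the library's types, for illustration only.
object FilterSketch {
  trait Recognizable
  case class Email(user: String, domain: String) extends Recognizable
  case class Link(url: String) extends Recognizable

  // The ClassTag context bound carries T's runtime class, so the type
  // pattern `item: T` is actually checked at runtime despite erasure.
  def filterByType[T <: Recognizable : ClassTag](items: List[Recognizable]): List[T] =
    items.collect { case item: T => item }
}
```

A call such as filterByType[Email](mixedList) then returns a List[Email], so no cast is needed to reach fields like user and domain.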

If you want to use or contribute to this project, here is the link to its GitHub repository:

https://github.com/ncaralicea/madeforall





      
