la.scala._
capWord Competition

This competition was run on #scala. The task is to write a short def capWord that checks whether a given string is a capitalized word: an uppercase first letter followed by lowercase letters. Some entries follow.

@paulp I:

def capWord(s: String) = (s.tail map (_.isLowerCase))
.reduceLeft(_ && _) && s.head.isUpperCase

— and he notes: of course that fails on “P”. but maybe that’s not a word either. more of a letter.

@paulp II:

def capWord(s: String) = !s.isEmpty && 
(s.head isUpperCase) && (s.tail forall (_.isLowerCase))

— and he notes: I won’t deal with null out of spite.
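To see why entry I trips over “P” while entry II does not, here is a minimal sketch of both side by side. The object and method names are made up for illustration, and the sketch assumes a recent Scala version, where the character checks are spelled isUpper and isLower.

```scala
object CapWordEdgeCase {
  // Entry I: reduceLeft has nothing to reduce when the tail is empty,
  // so a one-letter word such as "P" throws instead of returning a Boolean.
  def capWordReduce(s: String): Boolean =
    (s.tail map (_.isLower)).reduceLeft(_ && _) && s.head.isUpper

  // Entry II: forall is vacuously true on an empty tail, and the
  // emptiness check guards the call to head.
  def capWordForall(s: String): Boolean =
    !s.isEmpty && s.head.isUpper && (s.tail forall (_.isLower))
}
```

With these definitions, capWordForall accepts both "P" and "Paul", while capWordReduce("P") fails with an exception.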

@RShulz:

def capWord(word: String): Boolean = 
{ word.count(_.isUpperCase) == 1 && word(0).isUpperCase }

— and he notes: (Count exists in my mind.)

@dcsobral I, the winning entry (no head or tail, and I like it):

def capWord(w: String) = w.toList match {
  case h :: t if ('A' to 'Z' contains h) && 
    (t forall ('a' to 'z' contains _)) => true
  case _ => false
}

@dcsobral II, the fold entry:

def capWord(w: String) = w.tail.foldLeft(w.head isUpperCase)(
  (flag, l) => flag && l.isLowerCase )

@dcsobral III, fanciful entry:

def capWord(w: String) = (
    (
    w
    map (_ isUpperCase)
    map (if (_) 1 else 0)
    zipWithIndex
    )
  map Function.tupled((a,b) => a * (b+1))
  reduceLeft (_+_)
  ) == 1

@SethTisue, mindboggling entry:

def capWord(s: String) = 
(Math.log(s.map(c => 
    if(c.isUpperCase) 1 else 0).mkString.toDouble) 
/ Math.log(10)).toString.endsWith(".0")
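The trick here is that a capitalized word maps to a digit string like "1000", which is a power of ten, so its base-10 logarithm is a whole number. A caveat: computed as a quotient of natural logs, the result can land just below an integer (for 1000 it typically comes out as 2.9999999999999996), so four-letter words may be rejected. A hedged variant, assuming a recent Scala version with math.log10 and isUpper; the names are made up for illustration:

```scala
object Log10Sketch {
  // Same digit-string trick as the entry, but using math.log10 directly
  // instead of dividing two natural logarithms.
  // Like the original, this throws NumberFormatException on an empty string.
  def capWordLog10(s: String): Boolean = {
    val digits = s.map(c => if (c.isUpper) '1' else '0') // "Word" becomes "1000"
    math.log10(digits.toDouble).toString.endsWith(".0")
  }
}
```

It stays a joke entry: leading zeros vanish in toDouble, so the single uppercase letter does not have to come first.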

@paulp III, the mindblowingboggling entry:

def capWord(s: String) = s.head.isUpperCase && 
(s.tail map (c => 
    io.Source.fromURL(new java.net.URL("http://www.simplyscala.com/interp?code='%s'.isLowerCase"
.format(c.toChar.toString))).mkString) 
forall (_ contains "true"))

— and he notes: I could use threads to speed that up a little.

Salmon Run and other NLP Toolkits

There are several important NLP toolkits in Java, and soon in Scala, as well as blogs about them. Here are some.

Salmon Run Excellent soup-to-nuts usage guides to many JVM technologies.

LingPipe We’re using it for N-gram models.

OpenNLP My UPenn PhD classmate Tom Morton’s framework.

DRMacIver It’s not a project, it’s a person! Awesome tools.

Bzip2

I needed to open bz2 files the same way as zip ones. Older bzip2 streams for Java have a quirk: you have to manually swallow the first two characters of the underlying file stream, which are BZ, before you can give it to the bzip2 stream. The modern wrapper from Apache Commons relieves you of this tedious step.

Apache Commons Compress

Value Serialization

So I needed to serialize a bunch of Scala Maps, and it’s supersimple: just tack a

@serializable

on top of the declaration, be it a class, or just a value! And then the usual Java serialization applies!
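As a rough illustration of “the usual Java serialization”, here is a minimal round-trip sketch for an immutable Scala Map. The helper names toBytes and fromBytes are made up for this example. As far as I know, later Scala versions deprecated the annotation in favor of extending Serializable, which case classes and the standard immutable collections already do.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

object SerializationSketch {
  // Write any serializable object to a byte array with plain Java serialization.
  def toBytes(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Read the object back; the caller states the expected type (unchecked cast).
  def fromBytes[A](bytes: Array[Byte]): A = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

Usage would look like fromBytes[Map[String, Int]](toBytes(counts)) for some counts map; swapping the byte array streams for file streams gives on-disk storage.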

Command-line options

DRMacIver has a super-lovely li’l class, optional.Application, from which you can derive your main object, with parameters for main being extracted from the cmdline.

Graph Storage

When representing data as a graph, it’s important to have an efficient platform for storing and traversing it. Some good JVM ones I found:

JGraphT

Neo4j

Any RDF store can be used for triples, that is, edges labeled with a weight or relationship, e.g.

AllegroGraph RDF Store

Other RDF stores, such as the one underlying DBLP++.

Autovivification, anyone?

I’m storing a bunch of counts for twits in a map of maps, like

import scala.collection.mutable.{Map=>UMap}

class Repliers {
  type RepCount = UMap[UserID,(TwitCount,TwitCount)]
  var reps: UMap[UserID, RepCount] = UMap.empty

  // TODO here goes our favorite non-autovivification,
  // a question is, whether this can be made into auto-
  def addTwit(twit: Twit): Unit = {
    twit.reply match {
      case None =>
      case Some(reply) =>
        val uid = twit.uid
        val ruid = reply.replyUser
        val tinc = if (reply.replyTwit.isEmpty) 0 else 1
        val u = reps.get(uid)
           u match {
             case Some(repcount) => repcount.get(ruid) match {
               case Some((numUser,numTwit)) => {
                 val x = reps(uid)(ruid)
                 reps(uid)(ruid) = (x._1 + 1, x._2 + tinc)
               }
               case _ => reps(uid)(ruid) = (1, tinc)
             }
             case _ => reps(uid) = UMap(ruid->(1, tinc))
           }
    }
  }
// ...
}

How can we auto-vivify the maps with a more laconic upsert (insert or update)?
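One possible answer, as a sketch rather than the definitive one: mutable maps have getOrElseUpdate, which creates the inner map on first access, and getOrElse supplies a zero pair for a missing replier. UserID and TwitCount are assumed to be Int aliases here, and bump is a hypothetical stand-in for the body of addTwit.

```scala
import scala.collection.mutable.{Map => UMap}

object AutoVivify {
  type UserID = Int     // assumption: the real alias is not shown in the post
  type TwitCount = Int  // assumption, likewise

  class Repliers {
    type RepCount = UMap[UserID, (TwitCount, TwitCount)]
    val reps: UMap[UserID, RepCount] = UMap.empty

    // Upsert: vivify the inner map if needed, then bump both counters.
    def bump(uid: UserID, ruid: UserID, tinc: TwitCount): Unit = {
      val inner = reps.getOrElseUpdate(uid, UMap.empty)
      val (numUser, numTwit) = inner.getOrElse(ruid, (0, 0))
      inner(ruid) = (numUser + 1, numTwit + tinc)
    }
  }
}
```

The nested pattern match collapses into three lines, at the cost of one extra lookup in the inner map.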

Using Berkeley DB DPL for Implementations of Abstract Scala Classes

I have a class that serves as a storage spec and may be implemented via PostgreSQL, Berkeley DB, db4o, jdbm, etc. Annotating it directly is not possible, as it’s already published and backend-agnostic. So, for BDB, we create a mirror class with Java types and fill it from the original Scala class. This lets us use nullable values instead of Options to simplify indexing, and serialize types Berkeley DB knows about, such as Date, instead of Joda’s DateTime. The mirror class has constructors that take the Scala original and fill in the annotated BDB Entity fields, applying conversions such as Long=>java.lang.Long, and toScalaStuff functions that convert back to the Scala types with the reverse conversions. Here’s how one class is mirrored in BDB:

The spec:

case class ReplyTwit (
  tid: TwitID,
  replyTwit: Option[TwitID],
  replyUser: UserID
  )

BDB body:

@Entity
class ReplyTwitBDB {
  @PrimaryKey
  var tid: java.lang.Long = null
  @SecondaryKey{val relate=MANY_TO_ONE}
  var replyTwit: java.lang.Long = null
  @SecondaryKey{val relate=MANY_TO_ONE}
  var replyUser: java.lang.Integer = null

  def this(
    _tid: TwitID,
    _replyTwit: Option[TwitID],
    _replyUser: UserID
    ) = { this()
      tid = _tid
      replyTwit = _replyTwit match {
        case Some(x) => x
        case _ => null
      }
      replyUser = _replyUser
    }
  def this(t: ReplyTwit) = this(t.tid,t.replyTwit,t.replyUser)
  def toReplyTwit: ReplyTwit = {
    val replyTwitLongOpt = replyTwit match {
      case null => None
      case x => Some(x.longValue)
    }
    ReplyTwit(tid.longValue,replyTwitLongOpt,replyUser.intValue)
  }

  override def toString:String = {
    var s = "Reply tid:%d ru:%d" format (tid,replyUser)
    if (replyTwit != null) s += " rt:"+replyTwit
    s
  }
}

It may be a bit verbose, but it works, and is used in db-bdb.scala in tfitter. I might also implement the original BDB Bind API with my own serialization, but then I’d have to manage secondary indices manually as databases. The DPL also allows me to evolve the class layouts with versioning and narrowing/widening between them.
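Some of the verbosity comes from the Option-to-null matches. A shorter spelling, sketched here with hypothetical helper names (NullableBridge, toNullable, toOption), uses orNull in one direction and Option(...) in the other:

```scala
object NullableBridge {
  // Option[Long] -> nullable java.lang.Long, for the BDB-facing field.
  def toNullable(o: Option[Long]): java.lang.Long =
    o.map(x => java.lang.Long.valueOf(x)).orNull

  // Nullable java.lang.Long -> Option[Long], for the Scala-facing spec;
  // Option(null) is None.
  def toOption(l: java.lang.Long): Option[Long] =
    Option(l).map(_.longValue)
}
```

Both directions behave like the explicit matches in the mirror class: None becomes null and null becomes None.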

Back from the Concrete into the Abstract

Once I had implemented a PostgreSQL interface for my TwitterDB abstract class, I did another backend with Berkeley DB, Java Edition. While doing it, I found that one method was almost identical, so I decided to push it back into the abstract class. But how do I construct backend-specific objects there? One way is to pass their constructors in as function parameters. So here’s the result, in the abstract class:

case class SubParams (
  makeTwit:  TwitID => TwitDB, 
  makeUser:  UserID => UserDB,
  txnBegin:     () => Unit,
  txnCommit:    () => Unit,
  txnRollback:  () => Unit
)


// curry make/txn params
def insertUserTwitCurry(subParams: SubParams)(ut: UserTwit)
    : Unit = {
    import System.err

    val SubParams(makeTwit,makeUser,txnBegin,txnCommit,
        txnRollback) = subParams

    val UserTwit(user,twit) = ut
    val uid = user.uid
    val tid = twit.tid
    try {

      val t = makeTwit(tid) // TwitPG(tid)

      // the txn hooks are ()=>Unit values, so they only run when applied with ()
      txnBegin()
      t put twit // will cause exception if present and rollback
      val u = makeUser(uid) // UserPG(uid)
      u.updateUserForTwit(ut)
      txnCommit()
    } catch {
      case e: Throwable => {
        err.println(e)
        err.println("ROLLBACK uid="+uid+" tid="+tid)
        txnRollback()
      }
    }
}

I needed to pass several functions to a method as parameters, so I created a case class, SubParams, to hold them. I pass case class constructors, such as TwitPG, directly as functions there, and it works.

Interestingly, a def can take such a def blah: Unit action as a by-name parameter (x: => Unit), but a case class can’t have by-name parameters, so to group the actions together in a case class they have to be typed as ()=>Unit.
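A small sketch of that difference, with made-up names (runTwice, Hooks, withHooks). The by-name version re-evaluates its argument on every reference, while a ()=>Unit value does nothing until it is applied with parentheses.

```scala
object ByNameSketch {
  // A def may take a by-name parameter: the block runs each time
  // `action` is referenced in the body.
  def runTwice(action: => Unit): Unit = { action; action }

  // case class Hooks(begin: => Unit)  // rejected by the compiler:
  // case class parameters are vals, and vals may not be call-by-name.
  final case class Hooks(begin: () => Unit, commit: () => Unit)

  // Function values must be applied explicitly; a bare `h.begin`
  // would only name the function without running it.
  def withHooks(h: Hooks)(body: => Unit): Unit = {
    h.begin()
    body
    h.commit()
  }
}
```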

Here’s how I implement my abstract function in the PostgreSQL and Berkeley DB backends:

// PostgreSQL needs no explicit begin here, so we pass a do-nothing ()=>Unit
val subParams = SubParams(TwitPG,UserPG,()=>(),conn.commit _,
    conn.rollback _)
def insertUserTwitB = insertUserTwitCurry(subParams)(_)

// Berkeley DB
val subParams = SubParams(TwitBDB,UserBDB,txnBegin _,
    txnCommit _,txnRollback _)
def insertUserTwitB = insertUserTwitCurry(subParams)(_)
sneaking closing actions into an iterator

I created an iterator to wrap a Berkeley DB cursor. One thing with iterating through it is closing the cursor when we reach the end of the iteration. So I simply stuck the closing into the hasNext call-through to the underlying Java iterator!

  class TwIteratorBDB extends TwIterator {
    val cursor = twitPrimaryIndex.entities
    val javaIter = cursor.iterator
    // TODO a way to stick closing actions into our iterator,
    // are there any official or better ways?
    def hasNext: Boolean = { val has = javaIter.hasNext
      if (!has) cursor.close
      has
    }
    def next: Twit = javaIter.next.toTwit
  }
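The same idea can be factored out into a reusable wrapper. This is only a sketch with a hypothetical name, ClosingIterator, and not an official mechanism; it also guards against running the closing action twice when hasNext is called again after exhaustion.

```scala
object ClosingIteratorSketch {
  // Wraps any iterator and runs `close` once, the first time
  // hasNext reports that the underlying iterator is exhausted.
  final class ClosingIterator[A](underlying: Iterator[A], close: () => Unit)
      extends Iterator[A] {
    private var closed = false

    def hasNext: Boolean = {
      val has = !closed && underlying.hasNext
      if (!has && !closed) { closed = true; close() }
      has
    }

    def next(): A = underlying.next()
  }
}
```

One caveat of this approach in general: if the consumer abandons the iteration early, hasNext never returns false and the cursor stays open, so an explicit close is still needed for that case.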