
Why Yieldbot chose Cascalog over Pig for Hadoop processing

This is a guest post by Soren Macbeth, Chief Data Hacker at Yieldbot. Yieldbot captures and organizes the realtime intent existing in web publishers and makes it available to advertisers so they can match offers and ads at the exact moment consumers are most open to receiving relevant marketing. Previously, Soren was co-founder of StockTwits, an open, community-powered idea and information service for investors. Soren is active on Twitter at @sorenmacbeth.

At Yieldbot, we do a ton of batch processing of analytics data on Hadoop. As a small startup, speed is of the utmost importance, especially when it comes to iterating on our data processing codebase. Because of that need for speed, we initially chose Pig for writing our MapReduce jobs. After a few months of getting our hands dirty with Pig, we decided to switch to Cascalog. We have been extremely happy with that decision.

Why we initially selected Pig

Pig uses a custom scripting language, Pig Latin, which makes it a very attractive choice for rapid development. The built-in shell, Grunt, allows interactive development and debugging, which is hugely useful when developing algorithms iteratively. Another important factor was that Pig is included with Cloudera's Hadoop distribution, which we use. Finally, Pig is used in production at leading companies such as LinkedIn and Twitter.

Problems with Pig

As we began implementing our algorithms in Pig, we ran into a variety of issues that weren't obvious at first. Most of them slowed down development, and most of them centered on Pig Latin. Designing a programming language from scratch is a very difficult task. Pig Latin does many things very well, especially the different types of joins, filtering, and grouping. However, once you move beyond the basics that Pig Latin covers, you find yourself writing Java code. A lot of Java code. Loading custom data, storing custom data, transforming data, and filtering data all involve writing code in Java, packaging it as a jar, and loading that jar in your Pig Latin scripts. Jumping back and forth between editing, compiling, and packaging your Java code and running your Pig scripts leads to long development cycles. Debugging becomes especially challenging because bugs can occur in Java land, in Pig Latin land, or in both. Testing also suffers from this split between Java and Pig Latin. Java certainly has many unit testing packages, but until the most recent version of Pig, no comparable facility existed for Pig scripts.

Enter Cascalog

Faced with these issues, and after a brief period of experimentation, we decided to make the leap and migrate all of our data processing over to Cascalog. The learning curve was steep, as none of us were very familiar with Clojure. However, because Clojure is a very interesting, general-purpose language, we felt comfortable investing the time to learn it. Now, after several months of using Cascalog and Clojure in production, we could not be happier with our decision.

Speed

What Cascalog gives us is the ability to implement and iterate on our data processing tasks with extreme speed. True iterative development is possible thanks to the Clojure REPL and Clojure's Java interop. Rather than constantly jumping back and forth between Pig Latin land and Java land, we can access, write, run, and test everything without leaving the REPL. Because Cascalog code is just Clojure code, you immediately gain all the benefits of Clojure when doing your data processing. Another benefit is easy deployment to a production cluster. Clojure tools such as Leiningen make it easy to create an 'uberjar' with everything bundled in, including Clojure itself, which makes running on your Hadoop cluster a snap. A final benefit is that writing unit tests becomes a simple matter. Cascalog and Clojure come with great built-in support for writing unit tests.

Conclusion

If you are new to Clojure, as we were, the learning curve to start using Cascalog is definitely steep. However, the time invested in learning Clojure and Cascalog has paid many dividends for us at Yieldbot. As a very small team, the speed at which we can write and iterate on our code is of the utmost importance. I know the folks at BackType share this sentiment, and one needs to look no further than Cascalog for evidence. The bottom line is that Cascalog allows us to move faster than we could with Pig. Lastly, we'd like to thank the entire BackType team for releasing Cascalog into the wild.
