“Traackr allows us to get back on offense…”

Monday, March 19th, 2012

A client of ours recently told us:

“When I use general monitoring tools, I feel like I’m constantly on the defensive.  But when I use Traackr, I feel like I am playing offense again.”  

I thought this was a really interesting, insightful statement.  And I think it provides an important hint at the future for social media work.

 

To date, MONITORING has been at the heart of any work within social media.  It is a key task – maybe THE key task for most people working on the marketing/PR side of social.  There are two real reasons for this:

1.  THE VOLUME OF SOCIAL MEDIA DATA IS STAGGERING

Part of the issue is the sheer volume of content created and shared on social media.  It absolutely dwarfs the media activity of the pre-SM world (yes, young Traackr employees, there was a time when Facebook didn’t exist).   I remember when a HUGE weekly clip book contained a couple hundred articles.  Today, 10,000 social media mentions would be a slow week for some of the bigger brands (for some, this would be a slow DAY).  So, naturally, the amount of time and effort needed to monitor the vast activity on the social web is much greater than that which was needed in the pre-SM world.  At the same time…

2.  UNTIL RECENTLY, SM TOOLS HAVE FOCUSED EXCLUSIVELY ON MONITORING

The first wave of tools to address this social media market was built to solve this first, basic need – MONITORING the content of the social web.  The tools were essentially ‘Clipping Services 2.0,’ built to help PR/Marketers more efficiently access mentions of their products/brands across the social web.  Fair enough.  And not for nothing – these companies also did a very nice job of inciting fear among their potential clients – “Missing a single post can destroy your business!” they SHOUTED, leading communication pros to believe that they had to track/pay attention to every single piece of content on the web.

These two factors, together, helped to put MONITORING at the center of social media work.

The problem is that MONITORING only tells you about the things that have happened in the past.  It gives you great  access to YESTERDAY’S news, but the value of yesterday’s news is fairly limited.  The only thing you can really do with yesterday’s news is to…react to it.  Which puts you on defense.

So….like our client from earlier said, the more time you spend MONITORING (and with general monitoring tools), the more time you will spend playing defense.

IN PR, YOU WANT TO BE ON THE OFFENSIVE

I know what you’re going to say…What’s wrong with defense?  Doesn’t defense win championships??  Maybe in the NFL, but not in PR.  In PR, you want to be on the offensive.

This has been true since the beginning of PR.  In fact, one would argue that this is why PR was created.  PR was created to give you full control of your (or your client’s) message.  What it is, how it’s said, who hears it and when…basically everything.  In PR, the goal is to always be on the offensive.

WHAT’S THE SOLUTION?

So…the issue remains.  If MONITORING social media is the key activity for any social media initiative, but this activity forces you to spend most of your time on defense….how do you get back on offense?

In my humble (and completely biased) opinion, the key is stepping away from general monitoring (not completely, per se, but definitely a few steps away) and finding a solution that delivers useful, actionable data.  To us at Traackr, the most important and most actionable piece of data you can derive from social media is WHO.  Information about people – the right people – is incredibly actionable.  With this information, you can do what our clients are doing every day – creating the right messaging & engagement strategies, engaging, inviting, collaborating, tracking, measuring, reporting, and succeeding.

In short — general social monitoring gives you access to YESTERDAY’s news.  Knowing the right people offers you the opportunity to become TOMORROW’s news.

Getting back on Offense.  Because, that’s what wins championships in this game  :)

DS

 

Tapping Into the Value of Influencer Data

Wednesday, February 15th, 2012

Traackr was one of the very early market-makers in the influencer space and yet we never gave in to the hype of the influencer conversation by promising a silver bullet that would solve influencer communication for communication professionals. We actually never understood those who did, whether vendors, agencies or brands, because by making themselves or others believe that this was easy, they tremendously hurt their chances of succeeding (who’d want to run a marathon without being told how hard it is??!).

We have always said that finding, analyzing, monitoring and engaging influencers is HARD WORK. It can yield amazing results as many of our clients have proven time and time again, but it takes effort, craft, and, yes, the right toolset. We have never sold Traackr as the Holy Grail to our clients; our promise has been to be the best platform on the market to accompany their efforts towards high-yield high-return communication programs with influencers.

Another critical element of our approach from very early on was that we never believed that there would ever be “one social media platform to rule them all.” Instead of trying to build a platform that did a little bit of everything from social media monitoring, sentiment analysis, influencer search, to social CRM, etc., we focused on our craft, influencer search, and made it easy for our partners and customers to integrate our influencer data with a solution that catered to their business.

This approach has paid off in more ways than we could have hoped. For one thing, it gave us a sense of focus that enabled us to leap ahead of everyone else in our core expertise of influencer discovery (to my knowledge, we have come out ahead in every single due diligence exercise conducted by prospective customers). But maybe even more importantly, this approach has enabled our partners and customers to be creative in ways we never expected. They wanted to exploit the value of the data generated by Traackr, so they combined it with their own data and revealed new pockets of value that they could identify and build on.

In this context, last week Engage121, a leading social media platform with a focus on point-of-sale, announced its integration with Traackr. We have never focused specifically on retail use cases for our search technology, but we now have a partner who is bringing all the pieces together for that market. In all likelihood, they will be able to unleash value from our data for this market in better ways than we could. I’m sure that the partnership and their input will in turn help us build smarter influencer searches.

In the months to come, don’t be surprised to see more announcements like this one coming out of Traackr, as we’re expanding our reach and leveraging the expertise of business partners to make Traackr a part of many new use cases that create value from influencer and expert search. So stay tuned…

Traackr’s migration from HBase to MongoDB

Wednesday, February 8th, 2012

Let’s get one thing out of the way before we start: this post is not an attempt to disparage HBase. HBase is an extremely powerful tool; applied appropriately and skillfully under the right scenarios, it can move mountains. This post is about the evolution of Traackr’s data storage needs and how MongoDB ended up satisfying them. It’s also a tip of the hat to the MongoDB team and 10gen and the tremendous work they have done.

Back in late 2009 / early 2010, Traackr was designing the foundations of its search engine and hunting for an appropriate datastore to back it up. Some of the requirements were:

  • Built-in support for storing terabytes of text: that meant that we shouldn’t have to use or modify the software in an unconventional fashion beyond its original design to get it to store and retrieve the quantities of data we wanted.
  • Flexible schema: Traackr deals with heterogeneous data sources from the web, constantly discovering new content and new properties that characterize that content. The database had to allow us to model those properties across all stored content without requiring extensive schema migrations that would take the system offline.
  • Ability to batch process the data: Traackr’s scoring algorithms take into account statistical measurements derived from our entire active data set. Those computations need to be run at least once a week to account for the continuous growth and shifts in data samples. We therefore needed a system that allowed performing computations on the whole data set in a “reasonable” amount of time. We didn’t have an exact number for “reasonable” but our goal was to keep those processes running under 6 hours early on Saturdays, leaving enough time for other batch jobs that depended on these refreshed computations to run during the remainder of the weekend.

Some of the contenders in our product selection matrix were:

  • Traditional enterprise packages such as Oracle: ruled out because they were way out of our budget.
  • MySQL: our content sizes vary from 140-character tweets to multi-page articles, and using one-size-fits-all BLOBs would be a tremendous waste of space. Granted, storage is cheap, and one could design the schema to split content into different tables according to size, but that would add needless complexity when other solutions didn’t require it. Also, the flexible schema requirement would not be fulfilled, as every column added or modified in a multi-million row table would require a time-consuming migration. There are ways to mitigate this by creating attribute tables, but those tables become very large on their own (think a dozen attributes per post times millions of posts). And while MySQL can be sharded to split large tables and data sets, it requires the code to be cognizant of it, while other solutions don’t. So we passed not only on MySQL, but also on most other open-source RDBMS solutions for the same reasons. Enter NoSQL.
  • Cassandra: It fit the bill in terms of schema flexibility and storage capacity. The model of having the same deployable for each cluster node was very enticing; it made for much easier setup and maintenance for a small team like ours. But in late 2009 / early 2010, it still lacked batch processing options like MapReduce (those were introduced later on in 2010). It also seemed that there had been some period of inactivity around the project after its initial 2008 release, so at the time, we were uncertain about its future adoption.
  • MongoDB: it was still new at the time, so we had concerns about its stability and adoption. The document-based schema flexibility looked great but auto-sharding was still not available (it came out mid-2010) and there were no out-of-the-box options for batch processing. So we made a note of it and kept looking.
  • Riak: it was a serious contender for us; most of our requirements were being met and it presented the same promise of ease of use and deployment as Cassandra did. The team was an impressive bunch from Akamai that really seemed to know what they were doing. To top it all off, they were local to Boston and startup-friendly. Despite all of this, we ended up shying away due to questions of adoption. It was still too young a project for us.
  • HBase: back then, it was one of the most polished solutions with quite a bit of traction. The requirements were all there: ability to grow with large data sets, flexible schema, built-in batch processing with MapReduce, healthy community for support. We had our pick. It also provided “out-of-the-box” secondary indexing through a contrib package that came with the source. This allowed us to avoid writing our own app layer indexing code, or so we thought. Those secondary indexes ended up being a lot more critical in the longer run than we originally anticipated.
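
The attribute-table workaround mentioned under MySQL can be sketched roughly as follows. This is only an illustration, not Traackr's schema: sqlite3 stands in for MySQL so the example runs on its own, and the table names, column names and sample attributes are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE post_attribute (
        post_id INTEGER REFERENCES post(id),
        name    TEXT,
        value   TEXT,
        PRIMARY KEY (post_id, name)
    );
""")

def add_post(post_id, body, attributes):
    """Store a post plus one attribute row per property."""
    conn.execute("INSERT INTO post (id, body) VALUES (?, ?)", (post_id, body))
    conn.executemany(
        "INSERT INTO post_attribute (post_id, name, value) VALUES (?, ?, ?)",
        [(post_id, name, str(value)) for name, value in attributes.items()],
    )

# A newly discovered property needs no ALTER TABLE, only another row.
add_post(1, "short tweet", {"lang": "en", "retweets": 3})
add_post(2, "a much longer article body", {"lang": "fr", "word_count": 1200, "author": "x"})

post_rows = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
attr_rows = conn.execute("SELECT COUNT(*) FROM post_attribute").fetchone()[0]
print(post_rows, attr_rows)
```

The schema stays flexible, but the attribute table grows by one row per property per post, which is where the "dozen attributes per post times millions of posts" concern comes from.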

So we picked HBase and started running with it. We had to deal with the learning curve of its setup and its various components and configurations. This took up most of my time, which unfortunately detracted from working on features. But we eventually got there and were able to get it humming (most of the time). Our weekend batch scoring requirements were met as expected: we were able to re-score our entire database in less than 30 minutes. We even contributed back to the code base (HBASE-2438 and HBASE-2426). Things looked good.

Then came the upgrade from HBase 0.20.x to 0.89.x/0.90. The code base was changing fast and we wanted to keep up with the latest speed and stability fixes. But there was one problem: the secondary indexing contrib packages were moved out of the main code base and as a result, our HBASE-2426 customizations were becoming stale. This also signaled that indexing was about to fall even further behind in priority instead of making it into the core source. Bad news for us since we depended on it; we had no choice but to keep our customizations up to speed. We eventually ended up dropping the contrib packages altogether and completely rewrote our secondary indexes using a more generic approach to avoid running on unsupported third-party code. Even then, we still knew that app layer indexing was going to be slower and more brittle than any DB layer solution. This became even more apparent when we needed to evolve our domain model.
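
To make the brittleness of app layer indexing concrete, here is a minimal sketch of the general pattern. It is not HBase code and not Traackr's implementation; the class, the field names and the in-memory dictionaries standing in for tables are all assumptions made for illustration.

```python
class IndexedStore:
    """Toy key-value store with a secondary index maintained by application code."""

    def __init__(self):
        self.rows = {}        # primary "table": row key -> record
        self.by_author = {}   # index "table": author -> set of row keys

    def put(self, key, record):
        old = self.rows.get(key)
        # Step 1: remove the stale index entry if the author changed.
        if old is not None and old["author"] != record["author"]:
            self.by_author[old["author"]].discard(key)
        # Step 2: write the primary row.
        self.rows[key] = record
        # Step 3: write the index entry. A failure between steps 2 and 3
        # leaves the row unreachable through the index.
        self.by_author.setdefault(record["author"], set()).add(key)

    def find_by_author(self, author):
        return [self.rows[k] for k in sorted(self.by_author.get(author, ()))]


store = IndexedStore()
store.put("post-1", {"author": "alice", "title": "first"})
store.put("post-2", {"author": "bob", "title": "second"})
store.put("post-1", {"author": "bob", "title": "first, reattributed"})
```

Every write becomes several writes that the application has to keep consistent itself, and every new indexed field means more of this code. A datastore with built-in indexes takes that bookkeeping off the application.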

The basis of our data is built on 3 core entities: influencers, the channels on which they post content and the posts themselves. We had originally de-normalized the relationship between influencers and their affiliated channels, admittedly to be more in line with how our NoSQL datastore was intended to be used. While we knew that a given channel could be shared across multiple authors, we chose to repeat some channel data across influencers to simplify runtime random access. While this decision simplified queries, it ended up putting more strain and complexity on our content attribution logic. With more complexity come more bugs and those started rearing their ugly heads in the form of some mis-attributed content. The issue was compounded when we cranked our content tracking up a notch and introduced our daily monitor feature in mid-2011. To add to this challenge, we had also discovered after months in production that we needed a better approach for modeling the details of the relationship between an influencer and a channel as each influencer interacted with a given channel in their own way. So after a year and a half of running on our original assumptions it was becoming clear that our model needed revisiting.
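
A rough sketch of why repeating channel data across influencers complicates attribution is shown here. The document shape, field names and URLs are hypothetical and only illustrate the trade-off; they are not our actual model.

```python
# De-normalized shape: each influencer carries its own copy of the channel data.
influencers = {
    "alice": {"channels": [{"url": "http://example.com/blog", "title": "Team Blog"}]},
    "bob":   {"channels": [{"url": "http://example.com/blog", "title": "Team Blog"}]},
}

def rename_channel(url, new_title):
    """Update every copy of a shared channel; returns how many copies were touched."""
    touched = 0
    for influencer in influencers.values():
        for channel in influencer["channels"]:
            if channel["url"] == url:
                channel["title"] = new_title
                touched += 1
    return touched

def owners_of(url):
    """Attribution has to scan every influencer to find who shares a channel."""
    return sorted(
        name
        for name, influencer in influencers.items()
        if any(c["url"] == url for c in influencer["channels"])
    )

copies = rename_channel("http://example.com/blog", "Engineering Blog")
```

Reading one influencer with its channels is a single lookup, which is what made the layout attractive. The cost is that any change to a shared channel must reach every copy, and a missed copy shows up later as inconsistent or mis-attributed content.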

All of this was happening around the time we created an opening for a big-data engineer. The 2011 Santa Clara Hadoop Summit was also being held. We ended up attending the conference hoping to meet some talent. We even showed up at the HBase Contributor meetup where we mingled with some of the great minds behind the scenes. The trip was both exciting and revealing. The two major takeaways for us were:

  • Big-data engineers came at a premium: good luck competing with some of the big pocketbooks in the valley.
  • Experienced HBase engineers were even rarer and the good ones were all strategically positioned around firms in the valley, so we would be better off either hiring on the East Coast or developing the capabilities in-house.

For a shop our size, there is a limit to how many of your resources you can take away from feature development to dedicate to infrastructural concerns. We had already spent a lot of dev time on the datastore infrastructure and our impending model changes were about to call for some more. Adding this up with the seemingly slim odds of us attracting an experienced HBase developer, it became apparent that continuing down the HBase route would be akin to trying to fit a square peg in a round hole. It was time to move on and look for a more appropriate solution. The tool was just not built for what we were trying to do and we could not afford to try to get it there. So we went back to re-evaluating NoSQL solutions. This time, solid support for secondary indexes was added to the requirements. The round two contenders were:

  • Neo4j: a very powerful graph database capable of efficiently traversing complex relationships. It was more than we needed and, because it was primarily designed to be used in an embedded JVM mode, we would have had to make some significant changes to our system to integrate it.
  • MongoDB: it had matured by leaps and bounds since the last time we looked at it, with increased adoption from many shops and great support from 10gen. It came with advanced indexing out-of-the-box as well as some batch processing options (http://www.mongodb.org/display/DOCS/MapReduce, https://github.com/mongodb/mongo-hadoop). To top it all off, it was a breeze to use, well documented and fit into our existing code base very nicely.
  • Cassandra: it had matured as well and now had support for secondary indexes but those seemed more restrictive than Mongo’s and Mongo still had an edge over it in terms of developer friendliness.
  • Riak: still a strong contender, with support for secondary indexes since release 1.0, but it still lacked the traction that MongoDB had.

This time, the choice was much more straightforward. Many of the solutions had come a long way toward meeting our needs, so we were able to make a selection not only based on specs but also based on ease of adoption. MongoDB was hands down the most approachable solution for us.

Having worked with it now, it’s no wonder that MongoDB is currently enjoying such growth. While the migration from HBase took us about three months, the integration with MongoDB itself was achieved in just a couple of weeks since we already had a DAO layer abstracted from the rest of our applications. The rest of the time was spent tweaking our new model and rewriting our content acquisition and attribution services. And at every step of that refactoring, we found that MongoDB was making things easier for us:

  • Normalizing channels and influencers became a straightforward exercise as we were now able to model the associations as influencer collection sub-documents.
  • Creating indexes for fast random access queries no longer required specialized code. We still had to be mindful of the memory implications but the implementation was much cleaner and easier to maintain.
  • Ad hoc queries and reports became easier to write: there was no longer a need for a Java developer to write MapReduce code to extract the data in a usable form. The plethora of admin UIs meant that our own product manager could use his JavaScript skills to whip out reports using the powerful out-of-the-box query facilities.
  • Basic things like backups became a breeze again. While HDFS has built-in redundancy within a cluster, replicating data in a backup cluster is still advisable but becomes expensive from a hardware and network point of view. For the longest time, our approach consisted of exporting the HBase tables to S3 on a regular basis, which took a lot of time and was far from guaranteeing data consistency on restore. With MongoDB, all our data currently fits on a single instance with a hot replica that we can switch to if the master goes down, and a third backup machine whose EC2 EBS drive we snapshot on a regular basis after freezing its XFS file system to ensure data consistency. While such a setup is of course possible with HDFS/HBase, support for it comes out-of-the-box with MongoDB and it can be done more affordably with a lot less hardware.
  • The documentation was fantastic. Leaps and bounds ahead of the other solutions we evaluated (although the Riak folks are doing a great job as well), it was very well organized and the Disqus comment integration meant that the dev community could easily pitch in if there were specific gaps.
  • The speed was really impressive. We found that we were able to replace our weekly influencer re-scoring MapReduce jobs with straight MongoDB cursor iterators and still get the final results faster than before. Our data may of course outgrow this approach at some point but we are confident that we will be able to adapt using the available Hadoop integration if we need to.
  • The community was within reach and the feedback was consistent. We attended Mongo Boston and we were pleased to see that the size of the crowd confirmed what we were hearing and reading about the adoption of MongoDB. The sessions were super informative with great tips about how to optimize one’s setup and what the major gotchas are. What’s more, the suggestions and advice were consistent with what we were reading online from separate sources, which was a refreshing change from some of the blind trial and error experience we had with our previous setup.
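
As a rough sketch of two of the points above, the snippet below models influencer-to-channel associations as sub-documents and re-scores by plain iteration instead of a MapReduce job. Everything in it is an assumption made for illustration: the field names, the sample data and especially the scoring formula, which is a placeholder and not our actual algorithm. A generator stands in for the database cursor so the example runs without a server; with a real driver such as PyMongo, the cursor returned by `collection.find()` can be iterated in the same way.

```python
def iter_influencers():
    """Stand-in for a database cursor: yields one influencer document at a time."""
    yield {"_id": "alice", "channels": [
        # Sub-documents hold the per-influencer details of each channel link.
        {"channel_id": "blog-1", "role": "author", "posts": 40},
        {"channel_id": "twitter-a", "role": "owner", "posts": 160},
    ]}
    yield {"_id": "bob", "channels": [
        {"channel_id": "blog-1", "role": "guest", "posts": 100},
    ]}

def post_count(doc):
    return sum(link["posts"] for link in doc["channels"])

def rescore():
    """Two passes over the cursor: one for a global statistic, one for scores."""
    # Pass 1: a data-set-wide measurement (here, simply the mean post count).
    total = count = 0
    for doc in iter_influencers():
        total += post_count(doc)
        count += 1
    mean = total / count
    # Pass 2: score each influencer relative to that measurement.
    return {doc["_id"]: post_count(doc) / mean for doc in iter_influencers()}

scores = rescore()
```

The shared channel ("blog-1") is referenced by id from both influencers, while the details that differ per influencer live in the sub-document. The re-scoring loop only ever holds one document in memory at a time, which is why straight iteration can stay practical until the data set grows well beyond a single instance.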

Looking forward, we think that we have finally nailed the solution that fits our needs for the foreseeable future. We no longer feel that we are fighting our datastore every step of the way. On the contrary, all our developers have given nothing but positive feedback on their MongoDB experience thus far and report being a lot more productive with it. This is a testament to the thoughtfulness the MongoDB engineering crew and 10gen have put into the solution and we are looking forward to working with them in the months and years to come.
