Guest post: Why you should track page views with MongoDB

This is a guest post written by Max Gutman, Senior Software Engineer at Eventbrite. Max will occasionally be contributing posts to the new “Tech Corner” section of the blog.

Very few companies track page views in-house these days. Whether they need visibility into web traffic or want to evaluate the effectiveness of marketing campaigns, more companies are deploying third-party solutions. And why not? With a flexible API, you can query for the data you need and hand off all the dirty work to a proven service.

About four months ago, Eventbrite needed a new solution for effectively tracking page views. Our existing architecture had been set up to track page views for each eventholder by incrementing a row on a large and quickly growing MySQL table. A spike in traffic to an individual event’s page would set off a Nagios alert saying, “Hey, your site is slow because the database can’t keep up with constantly updating a single row.” Our site was slow, and having to respond to those alerts was not fun.

Clearly, tracking page views under these conditions was neither desirable nor scalable. So what were our options? They came down to:

  • Google Analytics
  • Shard the MySQL table
  • ETL process
  • MongoDB (NoSQL)

We eventually chose MongoDB. Here are the main reasons the other alternatives were dismissed:

  • Google Analytics
    • Time-consuming to set up and test
    • Migrating existing page-view data is tricky
    • Not real time
  • Shard the MySQL table
    • Requires downtime to make schema changes
    • Introduces routing at the code level
    • Would still be prone to row locks if we outgrew the number of shards
  • ETL process (aka, write to log file and have a process that aggregates and periodically writes to the database)
    • No data integrity
    • Not real time
    • Requires management of log files over multiple web servers
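
To make the ETL option concrete, here is a minimal sketch of the aggregation step in Python. The log format ("<timestamp> <event_id>" per line), the function names, and the in-memory dictionary standing in for the database are all assumptions for illustration; this is not the process we evaluated, only the general shape of one.

```python
from collections import Counter


def aggregate_page_views(log_lines):
    """Count page views per event ID from raw log lines.

    Assumes a hypothetical log format of "<timestamp> <event_id>" per line.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip lines that do not match the assumed format
        counts[int(parts[1])] += 1
    return counts


def flush_to_database(counts, totals):
    """Stand-in for the periodic database write: merge batch counts into totals."""
    for event_id, views in counts.items():
        totals[event_id] = totals.get(event_id, 0) + views
    return totals


# Sample batch, as if collected from one web server's log file.
batch = [
    "2010-06-01T12:00:00 42",
    "2010-06-01T12:00:01 42",
    "2010-06-01T12:00:02 7",
    "corrupted line without an id",
]
totals = flush_to_database(aggregate_page_views(batch), {})
```

Even in this toy version the drawbacks show up: the malformed line is silently dropped, and the totals are only as fresh as the last flush.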

MongoDB for me was a clear choice because it’s fast, non-locking, easy to implement and grows with you. When I attended the NoSQL conference in Boston and PyCon in Atlanta, MongoDB was talked about with as much excitement as Steve Jobs announcing the next iPhone. Sites with giant data stores and a large percentage of database writes were gloating over how “blazing fast” MongoDB was at handling their requests. A deeper overview of why MongoDB is fast and great for page view tracking is already provided here: http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics

The entire process, from getting started with MongoDB to implementing it, rolling it out, and migrating our existing data, took less than a weekend. It was a rare experience to witness an architectural rewrite executed so smoothly.

The few steps that we took to set up MongoDB with Python were as follows:

1)   Install Mongo: http://www.mongodb.org/display/DOCS/Ubuntu+and+Debian+packages

2)   Install mongoengine:
#:easy_install -U mongoengine

3)   Add collection and indexes:
#:./mongo
#:>use sitestats
#:>db.eventclick.ensureIndex({event_id: 1, sdate: 1});

4)   Write code:

from mongoengine import *
from datetime import datetime

def connection(network_timeout=0.8):
    """ connection manager """
    try:
        connect('sitestats', network_timeout=network_timeout)
    except ConnectionError:
        return False
    return True

class EventClick(Document):

    event_id = IntField(required=True)
    # Pass callables so the defaults are evaluated per document, not once at import time.
    sdate = StringField(default=lambda: datetime.now().strftime("%y-%m-%d"))
    created = DateTimeField(default=datetime.now)
    views = IntField(default=0)

    def bump(self):
        EventClick.objects(event_id=self.event_id, sdate=self.sdate).update(upsert=True, inc__views=1, set__created=self.created)
        return {'event_id': self.event_id}

    def reload(self):
        new_event_click_instance = EventClick.objects(event_id=self.event_id, sdate=self.sdate).first()
        return new_event_click_instance or self

class EventClickQuerySet():

    def __init__(self, event_id, sdate=None):
        self.event_id = event_id
        self.sdate = sdate or datetime.now().strftime("%y-%m-%d")
        self.response = {'event_id':event_id}

    def count_all_views(self):
        try:
            pageviews = int(EventClick.objects(event_id=self.event_id).sum('views'))
        except Exception:
            # Fall back to zero if the aggregate cannot be computed.
            pageviews = 0
        self.response['pageviews'] = pageviews
        return self.response

    def fetch_today(self):
        event_click_objects = EventClick.objects(event_id=self.event_id, sdate=self.sdate).first()
        self.response['event_click_objects'] = event_click_objects
        return self.response

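    def fetch_all(self, date1=None, date2=None):
        # Only apply the date bounds that were actually supplied.
        filters = {'event_id': self.event_id}
        if date1 is not None:
            filters['created__gte'] = date1
        if date2 is not None:
            filters['created__lte'] = date2
        event_click_objects = EventClick.objects(**filters).order_by('+created')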
        self.response['event_click_objects'] = event_click_objects
        return self.response

    def reset_today(self):
        EventClick.objects(event_id=self.event_id, sdate=self.sdate).delete()
        return self.response

    def reset_all(self):
        EventClick.objects(event_id=self.event_id).delete()
        return self.response
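
The core idea in the code above is one counter document per event per day, created on first view and incremented afterwards. Here is a rough sketch of that bucketing logic in plain Python, with no database involved. The InMemoryClickStore class and its method names are hypothetical stand-ins for the eventclick collection, and a dictionary update is not atomic the way a server-side increment is, so treat this as a model of the data layout only.

```python
from datetime import date


class InMemoryClickStore:
    """Hypothetical stand-in for the eventclick collection, for illustration only."""

    def __init__(self):
        # One counter per (event_id, sdate) pair, mirroring the compound index.
        self._views = {}

    def bump(self, event_id, day):
        """Create the day's bucket if it is missing, then increment it."""
        key = (event_id, day.strftime("%y-%m-%d"))
        self._views[key] = self._views.get(key, 0) + 1
        return self._views[key]

    def count_all_views(self, event_id):
        """Sum every daily bucket for one event."""
        return sum(v for (eid, _), v in self._views.items() if eid == event_id)

    def views_on(self, event_id, day):
        """Return the view count for one event on one day (0 if no bucket exists)."""
        return self._views.get((event_id, day.strftime("%y-%m-%d")), 0)


store = InMemoryClickStore()
store.bump(1001, date(2010, 6, 1))
store.bump(1001, date(2010, 6, 1))
store.bump(1001, date(2010, 6, 2))
store.bump(2002, date(2010, 6, 2))
```

Because each day gets its own bucket, a daily total is a single lookup and an all-time total is a sum over at most one document per day, rather than a scan of every individual page view.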

One of the great advantages of choosing MongoDB is that you will have a flexible solution that can be used for many other applications, such as caching or context searching.  Use cases can be found here:  http://www.mongodb.org/display/DOCS/Use+Cases

In conclusion, if your page-view tracking requires real-time analytics, non-locking updates, and fast reads and writes, give MongoDB a go. It’s a great product and has a solid community backing it up. We love it here at Eventbrite.