A planet of blogs from our members...

Joe Gregorio: Piccolo Now Sports LaTeX Support

In the first of two enhancements to piccolo, LaTeX support has been added. Just put the LaTeX in a latex-pic element and it will be converted to a PNG, with the alt text set to the LaTeX code.
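Piccolo's converter itself isn't shown here, but a latex-to-PNG pipeline like this is typically built on the standard latex and dvipng command-line tools. The following is only a sketch; the template, function names, and command flags are assumptions, not piccolo's actual implementation:

```python
import pathlib
import subprocess
import tempfile

# Minimal standalone document wrapping the snippet from the latex-pic element.
TEMPLATE = r"""\documentclass{article}
\pagestyle{empty}
\begin{document}
$%s$
\end{document}
"""

def conversion_commands(tex_path, png_path):
    """The two external commands such a pipeline would run."""
    return [
        ["latex", "-interaction=nonstopmode",
         "-output-directory", str(tex_path.parent), str(tex_path)],
        ["dvipng", "-T", "tight", "-o", str(png_path),
         str(tex_path.with_suffix(".dvi"))],
    ]

def latex_to_png(snippet, png_path):
    """Render a LaTeX snippet (e.g. 'E = mc^2') to a PNG file."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = pathlib.Path(tmp) / "snippet.tex"
        tex_path.write_text(TEMPLATE % snippet)
        for cmd in conversion_commands(tex_path, pathlib.Path(png_path)):
            subprocess.run(cmd, check=True)
```

The generated img tag would then carry the original snippet as its alt text.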

<latex-pic>E = mc^2</latex-pic>

Transforms into:

E = mc^2

This is mostly driven by me trying to learn Geometric Algebra and wanting to talk about things like:

e_{23} = e_2 \wedge e_3

Next up is adding inline Julia.

Frank Wierzbicki: Jython 2.7.1 beta3 released!

On behalf of the Jython development team, I'm pleased to announce that the third beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include improvements in zlib and pip support.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:
To see all of the files available including checksums, go to the maven query for Jython and navigate to the appropriate distribution and version.

Jeff Trawick: Your Errata Submission for ...

My phone screen flashed earlier this a.m. with the reception of e-mails indicating that a couple of fixes for typographical errors which I submitted some time ago for Fluent Python have been accepted by the author. I was motivated to submit them because of the near-perfection of the book, beautiful in concept and in implementation; it was my opportunity to help maintain a wonderful work. Just as importantly, the publisher made it easy.

The contribution of a quotation mark and a two-letter English word to a 740-page book is hardly remarkable. It is instead what is now an ordinary task made easy by the Internet — the same Internet that contributed immensely to the creation of the book in the first place: from the development and popularity of the book's subject matter, to the tools used to create it, to a publisher's ability to interact with more authors, to the electronic marketing and commerce that resulted in my purchase.

This is not unlike the creation of the software we all use daily. A large amount of it has the mark of a company but it relies to a tremendous extent on open source software, easily obtained, usually easy to contribute to, and truly ubiquitous even in so-called proprietary software with which we interact. It works because of countless contributions big and small from developers all over the world, using collaboration methods made easy by the Internet. Ease of collaboration enables contributions which have further eased collaboration (not to mention the rest of electronic life), and the software industry is built on the result.

It is time to close the door on what has been called open source strategy, as the use of open source software and the need for strategies for appropriate consumption of the software and interaction with the communities has invaded even the darkest corridors of proprietary software development and become business as usual. All software projects are a fusion of open source and custom-built components, whether or not everyone involved acknowledges it.

I look forward to a refresh of my electronic copy of Fluent Python with the latest corrections. But since submitting those fixes to the book text, I've collaborated with a handful of open source projects in the Django ecosystem for the first time and seen most of my small contributions there accepted; I am still watching some of those for feedback from the project maintainer or for inclusion in a new release. Those contributions were an important day-job activity, enabling features that our customer requested which didn't quite fit into the existing application.

Possible upcoming book contribution — convince the Two Scoops of Django authors to rework their claim that you can't pass environment variables to Django apps running with Apache httpd :) I'm sure they think that Apache+Django implies mod_wsgi, and I guess it is not fun to pass through OS-level environment variables in that configuration. My 2 cents on that matter: Deploying Python Applications with httpd (PDF)

Caktus Group: Writing Unit Tests for Django Migrations

Testing in a Django project ensures the latest version of a project is as bug-free as possible. But when deploying, you're dealing with multiple versions of the project, and migrations are how the database transitions between them.

The test runner is extremely helpful in its creation and cleanup of a test database for our test suite. In this temporary test database, all of the project's migrations are run before our tests. This means our tests are running the latest version of the schema and are unable to verify the behavior of those very migrations because the tests cannot set up data before the migrations run or assert conditions about them.

We can teach our tests to run against those migrations with just a bit of work. This is especially helpful for migrations that are going to include significant alterations to existing data.

The Django test runner begins each run by creating a new database and running all migrations in it. This ensures that every test is running against the current schema the project expects, but we'll need to work around this setup in order to test those migrations. To accomplish this, we'll need to have the test runner step back in the migration chain just for the tests against them.

Ultimately, we're going to try to write tests against migrations that look like this:

class TagsTestCase(TestMigrations):

    migrate_from = '0009_previous_migration'
    migrate_to = '0010_migration_being_tested'

    def setUpBeforeMigration(self, apps):
        BlogPost = apps.get_model('blog', 'Post')
        self.post_id = BlogPost.objects.create(
            title="A test post with tags",
            body="",
            tags="tag1 tag2",
        ).id

    def test_tags_migrated(self):
        BlogPost = self.apps.get_model('blog', 'Post')
        post = BlogPost.objects.get(id=self.post_id)

        self.assertEqual(post.tags.count(), 2)
        self.assertEqual(post.tags.all()[0].name, "tag1")
        self.assertEqual(post.tags.all()[1].name, "tag2")

Before explaining how to make this work, we'll break down how this test is actually written.

We're inheriting from TestMigrations, a TestCase helper (written below) that makes testing migrations possible, and defining two attributes on this class that configure the migrations we want to test between. migrate_from is the last migration we expect to have run on the machines we deploy to, and migrate_to is the new migration we're testing before deploying.

class TagsTestCase(TestMigrations):

    migrate_from = '0009_previous_migration'
    migrate_to = '0010_migration_being_tested'

Because our test is about a migration, and data-modifying migrations in particular, we want to do some setup before the migration in question (0010_migration_being_tested) runs. An extra setup method is defined to do that kind of data setup after 0009_previous_migration has run but before 0010_migration_being_tested.

def setUpBeforeMigration(self, apps):
    BlogPost = apps.get_model('blog', 'Post')
    self.post_id = BlogPost.objects.create(
        title="A test post with tags",
        body="",
        tags="tag1 tag2",
    ).id

Once our test runs this setup, we expect the final 0010_migration_being_tested migration to be run. At that time, one or more test_*() methods we define can do the sort of assertions tests would normally do. In this case, we're making sure data was converted to the new schema correctly.

def test_tags_migrated(self):
    BlogPost = self.apps.get_model('blog', 'Post')
    post = BlogPost.objects.get(id=self.post_id)

    self.assertEqual(post.tags.count(), 2)
    self.assertEqual(post.tags.all()[0].name, "tag1")
    self.assertEqual(post.tags.all()[1].name, "tag2")

Here we've fetched a copy of this Post model's after-migration version and confirmed the value we set up in setUpBeforeMigration() was converted to the new structure.

Now, let's look at the TestMigrations base class that makes this possible. First, here are the pieces from Django we'll need to import to build our migration-aware test cases.

from django.apps import apps
from django.test import TransactionTestCase
from django.db.migrations.executor import MigrationExecutor
from django.db import connection

We'll be extending the TransactionTestCase class. In order to control migration running, we'll use MigrationExecutor, which needs the database connection to operate on. Migrations are tied pretty intrinsically to Django applications, so we'll be using django.apps.apps and, in particular, get_containing_app_config() to identify the current app our tests are running in.

class TestMigrations(TransactionTestCase):

    @property
    def app(self):
        return apps.get_containing_app_config(type(self).__module__).name

    migrate_from = None
    migrate_to = None

We're starting with a few necessary properties.

  • app is a dynamic property that'll look up and return the name of the current app.
  • migrate_to will be defined on our own test case subclass as the name of the migration we're testing.
  • migrate_from is the migration we want to set up test data in, usually the latest migration that's currently been deployed in the project.
def setUp(self):
    assert self.migrate_from and self.migrate_to, \
        "TestCase '{}' must define migrate_from and migrate_to properties".format(type(self).__name__)
    self.migrate_from = [(self.app, self.migrate_from)]
    self.migrate_to = [(self.app, self.migrate_to)]
    executor = MigrationExecutor(connection)
    old_apps = executor.loader.project_state(self.migrate_from).apps

After asserting that the test case class defines both migrate_to and migrate_from migrations, we use the internal MigrationExecutor utility to get a state of the applications as of the older of the two migrations.

We'll use old_apps in our setUpBeforeMigration() to work with old versions of the models from this app. First, we'll run our migrations backwards to return to this original migration and then call the setUpBeforeMigration() method.

# Reverse to the original migration
executor.migrate(self.migrate_from)

# Set up test data in the old schema
self.setUpBeforeMigration(old_apps)

Now that we've set up the old state, we simply run the migrations forward again. If the migrations are correct, they should update any test data we created. Of course, we're validating that in our actual tests.

# Run the migration to test
executor = MigrationExecutor(connection)
executor.loader.build_graph()  # reload the graph after migrating backwards
executor.migrate(self.migrate_to)

And finally, we store a current version of the app configuration that our tests can access, and define a no-op setUpBeforeMigration() for subclasses to override.

    self.apps = executor.loader.project_state(self.migrate_to).apps

def setUpBeforeMigration(self, apps):
    pass

Here's a complete version:

from django.apps import apps
from django.test import TransactionTestCase
from django.db.migrations.executor import MigrationExecutor
from django.db import connection

class TestMigrations(TransactionTestCase):

    @property
    def app(self):
        return apps.get_containing_app_config(type(self).__module__).name

    migrate_from = None
    migrate_to = None

    def setUp(self):
        assert self.migrate_from and self.migrate_to, \
            "TestCase '{}' must define migrate_from and migrate_to properties".format(type(self).__name__)
        self.migrate_from = [(self.app, self.migrate_from)]
        self.migrate_to = [(self.app, self.migrate_to)]
        executor = MigrationExecutor(connection)
        old_apps = executor.loader.project_state(self.migrate_from).apps

        # Reverse to the original migration
        executor.migrate(self.migrate_from)

        # Set up test data in the old schema
        self.setUpBeforeMigration(old_apps)

        # Run the migration to test
        executor = MigrationExecutor(connection)
        executor.loader.build_graph()  # reload the graph after migrating backwards
        executor.migrate(self.migrate_to)

        self.apps = executor.loader.project_state(self.migrate_to).apps

    def setUpBeforeMigration(self, apps):
        pass

class TagsTestCase(TestMigrations):

    migrate_from = '0009_previous_migration'
    migrate_to = '0010_migration_being_tested'

    def setUpBeforeMigration(self, apps):
        BlogPost = apps.get_model('blog', 'Post')
        self.post_id = BlogPost.objects.create(
            title="A test post with tags",
            body="",
            tags="tag1 tag2",
        ).id

    def test_tags_migrated(self):
        BlogPost = self.apps.get_model('blog', 'Post')
        post = BlogPost.objects.get(id=self.post_id)

        self.assertEqual(post.tags.count(), 2)
        self.assertEqual(post.tags.all()[0].name, "tag1")
        self.assertEqual(post.tags.all()[1].name, "tag2")

Caktus Group: ShipIt Day Recap: Q1 2016

Last Friday, the Cakti set aside regular client projects for our quarterly ShipIt Day, a chance for personal development and independent projects. People work individually or in groups to flex their creativity, tackle interesting problems, or expand their personal knowledge. This quarter’s ShipIt Day saw everything from cat animations to improvements on our Taylor Swift lyric generator app. Read about the various ShipIt Day projects for Q1 of 2016 below.

Neil Ashton worked with the open source DemocracyOS, a platform for collaborative decision making. The platform is used around the world to increase participation in democratic systems. Neil took a look at the basic web app stack for the platform and submitted several pull requests to fix bugs in that system. All of his pull requests have since been approved, making Neil an official contributor to DemocracyOS!

Calvin Spealman went into ShipIt Day with the intention of building a story generator. However, after running into a roadblock late Thursday evening, he transitioned instead to work on frontend tooling upgrades and related documentation in Caktus’ Django project template. In the meantime, Vinod upgraded one of Calvin’s projects to the latest version of Margarita. This enabled Calvin to demonstrate the new frontend tooling upgrades, while allowing Vinod to test his upgrade pathway documentation.

Like Calvin, Colin Copeland made frontend changes to the Django project template and integrated those changes into an existing project.

[Image: cat animation still]

Inspired by the work of Rachel Nabors, Erin Mullaney made a CSS3 animation out of one of her own drawings. By cycling through sprites of different images, she made an animation of a cat blinking while a mouse walked by. You can see the full animation here.

[Image: Taylor Swift lyric generator]

Mark Lavin, Rebecca Conley, and Dmitriy Chukhin built on their work from a previous ShipIt Day, The Taylor Swift Song Generator. The team cleaned up some of the code and fixed the test suite, using Codecov for code test coverage and OpBeat for exception and performance monitoring. Mark also created Twitter preview cards for tweeting particularly choice TayTay lyrics generated by the app while Rebecca and Dmitriy enabled user-creation of a song title that would then become the first line of the chorus, making the app more interactive. By the end of the day the team was working on word visualization and cleaning up their data set, which are the main goals for the next chance the team has to work on the app.

Dan Poirier, Scott Morningstar, and Jeff Bradberry continued their work from a previous ShipIt Day as well, with the goal of getting the project template to deploy on Ansible in order to move away from Salt. They improved the usage of variables, added Supervisor, nginx, and gunicorn, pushed source code to a deployment target without going through github, and updated documentation. Though the team couldn’t get deployment to a virtual machine to work, they are incredibly close and hope to have a deploy by the next ShipIt Day!

Hunter MacDermut got into using es2015 with Gulp, Browserify, and Babel, building a task list that would auto-organize by priority. This to-do list app would be organized by project and task, and further taps on each item would increase that item’s priority value. Though Hunter didn’t have time to finish the sorting feature, in its current state, it is a functional to-do list app that relies on localStorage getting and setting. The repo for the app can be found here.

Though she didn’t technically participate in ShipIt Day, NC Nwoko did break from her usual routine and went to the Regional AIDS Interfaith Network (RAIN) in Charlotte, which helps youth get involved in their treatment. She helped train caseworkers in the use of our Epic Allies app.

[Image: Durham School Navigator]

Victor Rocha continued his work on Code for Durham’s School Navigator app, building a new school profile to be tested in the community. David Ray added features and improvements to the mobile app version of the School Navigator, including geolocation functionality, a clear input button to make subsequent address searches more efficient, and technical debt cleanup, changing some code that was in triplicate to an angular directive.

Finally, Tobias McNulty worked on refactoring and cleaning up the deployment code for a client project. The project was based on an old Fabric-based deployment that Caktus used to use. He cleaned up the Fabric file, making it client agnostic, pulled out the configuration into a yml file, and merged the changes back into FabulAWS, the parent project. The next step will be to break these down into smaller files for autoscaling purposes. Meanwhile, Karen Tracey reviewed Tobias’ work.

Tim Hopper: Mentions of John Cook on Github

People mention John Cook's blog a lot in Github repos.

I scraped the Github search pages to try to figure out which of his pages are most mentioned. His post Accurately computing running variance gets many more mentions than any other post. It provides C++ code for Knuth's algorithm for computing the mean, sample variance, and standard deviation for a stream of data.
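The post itself provides C++ code; as a point of comparison, the same single-pass algorithm (often attributed to Welford, and given by Knuth) can be rendered in a few lines of Python. This is a sketch of the well-known algorithm, not a transcription of Cook's code:

```python
class RunningStats:
    """Single-pass mean/variance in the style of Knuth (TAOCP vol. 2)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # note: uses the *updated* mean

    def variance(self):
        # Sample variance; numerically stable even for large, similar values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def std_dev(self):
        return self.variance() ** 0.5
```

Pushing the values 1 through 4 gives a mean of 2.5 and a sample variance of 5/3.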

Here are the top 12 pages from his site most linked on Github:

  1. Accurately computing running variance (377 mentions)
  2. Stand-alone code for numerical computing (58 mentions)
  3. A Bayesian view of Amazon Resellers (52 mentions)
  4. johndcook.com/blog (47 mentions)
  5. Computing the distance between two locations on Earth from coordinates (44 mentions)
  6. Math.h in POSIX, ISO, and Visual Studio (38 mentions)
  7. Three algorithms for converting color to grayscale (26 mentions)
  8. johndcook.com (21 mentions)
  9. Computing skewness and kurtosis in one pass (20 mentions)
  10. What’s so hard about finding a hypotenuse? (19 mentions)
  11. Random number generation in C++ (19 mentions)
  12. R language for programmers (19 mentions)

Tim Hopper: Quotes from Former Professors

Tim Hopper: Tweet Your Moon

I created a Twitter account that tweets a moon emoji each evening[1] representing the current phase of the moon.
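The bot's actual code lives on Github, but the core phase-to-emoji calculation can be sketched by counting days from a known new moon. The reference date and the simple mean-cycle approximation below are my assumptions, accurate to within a day or so, not the bot's real implementation:

```python
import datetime

# Eight phase emoji, new moon through waning crescent.
PHASES = "🌑🌒🌓🌔🌕🌖🌗🌘"
SYNODIC_MONTH = 29.53058867  # mean length of a lunar cycle, in days
KNOWN_NEW_MOON = datetime.datetime(2000, 1, 6, 18, 14)  # a reference new moon (UTC)

def moon_emoji(when):
    """Return the emoji for the moon's approximate phase at a UTC datetime."""
    days = (when - KNOWN_NEW_MOON).total_seconds() / 86400
    fraction = (days % SYNODIC_MONTH) / SYNODIC_MONTH  # 0.0 = new, 0.5 = full
    return PHASES[round(fraction * 8) % 8]
```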

I've wanted to do this for a while, but Joel Grus' posts about AWS Lambda inspired me to make it happen.

The code for this is on Github.

I also created a Github project that provides a template for creating Lambda-powered Twitter bots in Python.

  1. Evening in the eastern United States. 

Caktus Group: The Trials, Tribulations, and Triumphs of Choosing an M&E Platform

republished with permission from ICTWorks.org

At the recent MERL Tech conference, Tania Lee (Caktus Group), Tom Walker (Engine Room), Laura Walker McDonald (SIMLab), and Lynnae Day (Oxfam America) led a session called, “The Trials, Tribulations, and Triumphs of Choosing an M&E Platform.” They’ve written up their reflections and learning from that session focusing on project design, tool design/research, and getting things off the ground, whether that means finding external help or building a custom solution.

Choosing a tool: where to start?

Many organizations come to a procurement process with a favourite platform in mind, either from prior experience or because they’ve heard compelling marketing or stories. Starting from the tool means that you are not being true to what really matters – your needs and those of your users.

SIMLab and Caktus Group use an Agile methodology that prioritizes a general sense of direction and strong knowledge of the user, and uses that to get to a prototype that actual users can test as early as possible (whether you’re building a tool or configuring an existing one). Getting feedback on a system’s capabilities while you’re developing it lets you respond as your idea meets reality.

Understand your direction of travel, through workshops with your team, potential users and other stakeholders

  • Create a theory of change for the tool and understand it thoroughly.
  • Write ‘user personas’ – a concise, clear picture of the people you are building for, including their habits, levels of technical skill and working environment.
  • Write or draw a process map that shows exactly what information is being exchanged, where it comes from, where it goes and how. This helps you to see potential blockers or impractical elements of your plan.
  • From the process map and user personas, develop user stories (individual tasks that the system should be able to perform). Prioritize those, and develop the first few.
  • As soon as you have something you can test, show a prototype system to real users.
  • When you can, use it ‘live’ in a pilot and continue to build and release new versions as you go.

Things to consider while going through this process:

  • Project timeframes: is this system constrained by a particular project timeframe, or does it need to be sustainable beyond the end of your project? Is it an internal operational system – in which case there may be no end date for its use?
  • What level of security do you need the system to have? What kind of data will it handle, and what are the legal requirements in the country (or countries) where it will be used?
  • What level of user support and documentation do you need?
  • Will users need training?
  • What level of system analytics are you hoping for?
  • How tolerant of risk are you/your organization? For example, do you need to choose an established vendor or product that you know will still be there five years from now?


The session acknowledged some of the challenges with this kind of work:

  • In large organizations, working on internal projects requires consultation with a wide range of stakeholders.
  • Project managers are often trying to bridge gaps in knowledge (and even language) between technical and non-technical people. They may themselves not be technologists.
  • Donor funding may not allow the kind of flexibility inherent in an Agile development process. The proposal process requires you to have done much of the design work – or to presuppose its outcome – before you get the funds to support it.
  • During the development process, the environment and needs themselves may change.
  • No product or system is ever truly ‘finished’.
  • Systems development is rarely funded well enough to make something robust that has all non-functional requirements, and allows for maintenance and updates.

What’s already out there?

If you don’t already have a tool in mind, the next step is to conduct a market analysis. A market analysis helps you determine what tools are available and decide whether to use or customize an existing tool, or develop your own custom solution.

There are several common challenges organizations face in their search for M&E tools, including:

  • Finding a tool that’s an exact fit for all of your needs, particularly for larger projects; you’re likely to have to pick and choose.
  • Finding all of the relevant options quickly and easily, or…
  • …when you find a host of similar tools, quickly and easily understanding their specific differences and value-adds.
  • Developing user needs and tool requirements in an often-changing context (internal or external).

There are also several common approaches organizations take to navigating these challenges. Some of these include:

  • Staff who are familiar with existing tools, or who have experience with a specific tool that fits the requirements, can do some simple testing and make recommendations.
  • Talking to peer organizations to see what tools they’ve used (or not used, and why) for similar processes or projects.
  • Hiring consultants or working with IT departments to support the search, selection, and contracting.
  • Referencing some of the tool guides that exist (like NetHope or Kopernik), but without relying completely on them for decision-making, as they’re not always up to date and may not have all of the information you need.

Finally, there are some key questions to ask when assessing your final pool of solutions, besides whether they fit the bulk of your minimum user requirements, including:

  • How will the user interact with the system? What kind of software/hardware requirements and training will your users need to get the most out of their experience?
  • What kind of tech support is available – end user support? customization support? break-fixes? anything at all? And how much will varying levels of support cost?
  • What are the up-front costs (for software and hardware)? What are the on-going costs at the current project size? What are the on-going costs at scale (whatever the size of your “scale” is)?
  • If you’re planning to scale, how well does the tool adapt to different environments? For example, does it work offline and online? Does the interface support localization in the language that you need? Does it have the capacity to integrate with other tools?

You’ve done the groundwork: what next?

Once you’ve taken the time to understand your needs and look into what tools are already out there, you might find you still need features that an off-the-shelf tool can’t provide. If you don’t have the necessary skills within your organisation, you might need to consider getting technical support from outside your organisation, or building a custom solution.

Finding the right external help, if you need it

During the discussion, many people said that they found it difficult to know where to find good sources of external support. This seems to be a struggle in many contexts: it was also a common theme in the engine room’s research project (supported by Making All Voices Count) into how transparency and accountability initiatives in Kenya and South Africa choose tools.

Sometimes, this is just a question of networks and knowing who to ask. The discussion threw up a collection of useful places to start (such as Pelican Initiative) for this, and we’re currently collecting more. But there are some things that you can do to help pick the right support provider.

Choose criteria that make sense for your organisation. In the discussion, a lot of people said that the most important factor determining whether they could work well with a technical provider was how well the provider understood their organisation’s long-term goals. Providers of technology tools can sometimes focus on the tool itself rather than the organisation and its goals. For a relationship to be successful, there needs to be a clear understanding of how a tool fits into an organisation.

Be clear with your needs

This makes it critically important to present what you need to support providers in a clear, comprehensive way. Whether this takes the form of a formal request for proposals, a short brief or a specification document, communicating your needs to people outside your organisation is actually a good way of narrowing down which features are really essential – and which you can do without.

Writing this in language that technical providers can understand often takes a lot of time and effort. You may already have a lot of the raw material for a good brief if you’ve already documented your needs and got a good idea of existing products, including elements like user personas and user stories (see the Understanding your needs blog, above).

At this point, several members of the discussion said that they had found it helpful to work with a consultant who could translate their needs into technical specifications – which they could then distribute to actual providers. It’s worth looking at existing templates (like this one by Aspiration) and guides on the request-for-proposal process (like this one by TechSoup).

Different organisations will have different criteria for making a selection. Some might want knowledge of a particular country context, experience in a particular sector, or a more general sense of how the organisation is likely to grow and change over time. Others found it more helpful when providers were physically located in the same city as them, or available to answer queries at specific times. Several people in the discussions mentioned times when they hadn’t considered factors like these, and came to regret it later.

Building a custom solution

Building a custom solution, in this context, means building a tool (or integrating a set of tools) that meet needs that no other tools in the marketplace can. Deciding to build a custom tool may require weighing tradeoffs: short-term needs vs. long-term flexibility, specialized usability features vs. time to market, or unique program needs vs. organizational standards.

One of the most often repeated challenges identified by session participants was securing the necessary funding to properly build a custom solution. It can be difficult to convince donors to pay for technical resourcing of a software development project and/or maintenance and support of the system.

Sometimes, organizations will find themselves cobbling together a few different grants to fund an initial prototype of a system. This approach can help to generate additional organizational support, as more programs have a stake in the custom system’s success. It’s also important to consider how and where a custom solution may fit into the organization’s overall strategy – if there is alignment, additional support across various teams could also help to ensure the project’s success.

Custom solutions require custom skills

Managing the custom solution project requires technical skillsets and a good understanding of the software development cycle. Without someone to line up necessary resources, make day-to-day decisions, and provide strategic guidance and technical oversight, a project can easily lose direction and the end result may not meet your needs. It may be helpful to collaborate with other teams within your organization, or to hire a vendor or consultant who can provide project management services.

Session participants also recommended building custom solutions using an iterative or “Agile” approach; this means that smaller sets of functions are tested at various points in the process to collect more useful and direct feedback. An Agile software development approach can help to identify the highest priority features so that a team can spend their time and energy working on what’s important. If this is your organization’s desired approach, it’s important to work with a team that has the right experience working in this way.

Some very helpful guidelines can be found in the Principles for Digital Development, which describes principles for building ICT tools and is a great way to approach building any custom solution.

Caktus Group: Modified Preorder Tree Traversal in Django

Hierarchical data are everywhere, from product catalogs to blog post comments. A classic example is the tree of life, where kingdoms are subdivided into a hierarchy of phylum and class down to genus and species. What if you wish to store this data in a database table, which is inherently flat? Databases do not natively store hierarchies, so you need to work around that.

MPTT, or modified preorder tree traversal, is an efficient way to store hierarchical data in a flat structure. It is an alternative to the adjacency list model, which becomes inefficient when you need to retrieve whole subtrees, since that requires recursive queries. This post will cover the MPTT approach, examine its tradeoffs, then explore django-mptt, a package for adding MPTT to your Django models.

MPTT Structure

The MPTT approach adds a lft and a rght attribute to a model, which lets you easily determine parent-child relationships. See the example tree with lft and rght values (GlobalCorp, for example, has an lft value of 1 and a rght value of 20). The dotted line in the figure shows the preorder path taken to assign the lft and rght values within the tree.

The model attributes allow you to do SQL queries such as the one below to get all USACorp subsidiaries:

SELECT * FROM example_tree WHERE lft BETWEEN 2 AND 10;
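The same selection can be sketched in plain Python. Below, subtree() mirrors the general form of the query above (WHERE lft BETWEEN root.lft AND root.rght). GlobalCorp's and AsiaCorp's lft/rght values come from the test assertions later in the post; the subsidiary names and values are illustrative assumptions:

```python
# Flat list of rows standing in for the example_tree table.
nodes = [
    {"name": "GlobalCorp", "lft": 1,  "rght": 20},
    {"name": "USACorp",    "lft": 2,  "rght": 9},
    {"name": "Company A",  "lft": 3,  "rght": 4},
    {"name": "Company B",  "lft": 5,  "rght": 6},
    {"name": "Company C",  "lft": 7,  "rght": 8},
    {"name": "AsiaCorp",   "lft": 10, "rght": 19},
    {"name": "Company D",  "lft": 11, "rght": 12},
    {"name": "Company E",  "lft": 13, "rght": 14},
    {"name": "Company F",  "lft": 15, "rght": 16},
    {"name": "Company G",  "lft": 17, "rght": 18},
]

def subtree(nodes, root_name):
    """Pure-Python version of:
    SELECT * FROM example_tree WHERE lft BETWEEN root.lft AND root.rght;
    """
    root = next(n for n in nodes if n["name"] == root_name)
    return [n for n in nodes if root["lft"] <= n["lft"] <= root["rght"]]

usacorp_names = [n["name"] for n in subtree(nodes, "USACorp")]
```

A single range comparison finds a node and all of its descendants, with no recursion over parent pointers.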

I've used MPTT in two of my major projects at Caktus. A couple of model patterns that suggest MPTT might be a good fit are:

  1. Many class names describing the same basic content type in an arbitrary hierarchy.
  2. Awkward containment structures where you are creating a bunch of linked models that are basically different words for the same model.

MPTT Tradeoffs

The MPTT approach is beneficial in that, with only a couple of fields, you can determine an entire tree structure. Because of this economy, retrieval operations are efficient: you can very quickly query a tree to determine its relationships. The tradeoff is that inserts and moves are slow. If the structure of the tree is constantly changing, MPTT is not a good option, because each change needs to update many or all of the records. Such updates also take a whole-table lock, which prevents any other updates to the affected table. This is obviously less than ideal if you have a heavily updated database.
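To make the write cost concrete, here is a rough pure-Python sketch of the bookkeeping a nested-set insert must do (my own illustration, not django-mptt's actual implementation):

```python
def insert_last_child(nodes, parent, name):
    """Insert `name` as the last child of `parent` and return how many
    existing rows had to be rewritten. Every ancestor, plus every node
    to the right of the insertion point, must shift its lft and/or
    rght by 2 -- which is why a single MPTT insert can touch much of
    the table.
    """
    pos = parent["rght"]  # the new node slots in just before parent's rght
    touched = 0
    for node in nodes:
        changed = False
        if node["lft"] >= pos:
            node["lft"] += 2
            changed = True
        if node["rght"] >= pos:
            node["rght"] += 2
            changed = True
        touched += changed
    nodes.append({"name": name, "lft": pos, "rght": pos + 1})
    return touched

tree = [
    {"name": "root", "lft": 1, "rght": 6},
    {"name": "a",    "lft": 2, "rght": 3},
    {"name": "b",    "lft": 4, "rght": 5},
]
rewritten = insert_last_child(tree, tree[0], "c")
```

Inserting low in a large tree forces every ancestor and everything to its right to shift, so the number of rewritten rows grows with the size of the tree, not the size of the change.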


The django-mptt project is a convenient way to incorporate MPTT into Django. It provides a base model, MPTTModel, which implements the following tree fields for you:

  • level (indicating how deep in the tree a node is)
  • lft
  • rght
  • tree_id (to identify tree membership for any given instance)

It also provides the following convenience methods which abstract these tree fields and help you to manage the tree:

  • get_ancestors(ascending=False, include_self=False)
  • get_children()
  • get_descendants(include_self=False)
  • get_descendant_count()
  • get_family()
  • get_next_sibling()
  • get_previous_sibling()
  • get_root()
  • get_siblings(include_self=False)
  • insert_at(target, position='first-child', save=False)
  • is_child_node()
  • is_leaf_node()
  • is_root_node()
  • move_to(target, position='first-child')

Of these methods, get_root() and insert_at() are particularly helpful. Manually modifying lft and rght values is not a good idea, and insert_at() is a safe way to update the tree. I use get_root() all the time to double check or even short circuit child values. For example, if you have five product types, you could have five trees and specify all the product related information at the root of each tree. Then any child node could simply ask for its root’s values:

product = Products.objects.get(id=123)
product_info = product.get_root().product_info  # hypothetical root-level attribute

Example Class

from django.db import models
from mptt.models import MPTTModel, TreeForeignKey

class Company(MPTTModel):
    name = models.CharField(max_length=255)
    # null/blank allow root companies with no parent
    # (the usual django-mptt pattern)
    parent = TreeForeignKey('self', null=True, blank=True,
                            related_name='children')
    is_global_ultimate = models.NullBooleanField()
    is_domestic_ultimate = models.NullBooleanField()

Using the tree

This example method finds the domestic headquarters for any given company in a tree.

def get_domestic_ultimate(self):
    """
    If current company is flagged as domestic ultimate, return self.
    Otherwise iterate through ancestors and look for the
    first domestic ultimate.
    """
    if self.is_domestic_ultimate:
        return self
    mytree = self.get_ancestors(ascending=True, include_self=False)
    for comp in mytree:
        if comp.is_domestic_ultimate:
            return comp
    return None
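The same walk-up-the-ancestors logic can be exercised outside Django with a throwaway stand-in class. FakeCompany below is purely illustrative; it only mimics the two pieces of the model API the method uses:

```python
class FakeCompany:
    """A throwaway stand-in (not a Django model) that mimics the two
    pieces of the Company API that get_domestic_ultimate() relies on."""

    def __init__(self, name, parent=None, is_domestic_ultimate=False):
        self.name = name
        self.parent = parent
        self.is_domestic_ultimate = is_domestic_ultimate

    def get_ancestors(self, ascending=True, include_self=False):
        # Walk parent pointers; django-mptt derives the same list
        # from lft/rght values instead.
        out = [self] if include_self else []
        node = self.parent
        while node is not None:
            out.append(node)
            node = node.parent
        return out if ascending else list(reversed(out))

    def get_domestic_ultimate(self):
        if self.is_domestic_ultimate:
            return self
        for comp in self.get_ancestors(ascending=True):
            if comp.is_domestic_ultimate:
                return comp
        return None


globalcorp = FakeCompany("GlobalCorp")
usacorp = FakeCompany("USACorp", parent=globalcorp, is_domestic_ultimate=True)
companya = FakeCompany("Company A", parent=usacorp)
```

Here companya.get_domestic_ultimate() walks up to USACorp, the first ancestor flagged as the domestic ultimate.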

Setting up a Test Tree

class CompanyTestCase(TestCase):
    def setUp(self):
        self.globalcorp = factories.CompanyFactory(name="GlobalCorp",
                                                   is_global_ultimate=True)
        self.usacorp = factories.CompanyFactory(parent=self.globalcorp,
                                                name="USACorp",
                                                is_domestic_ultimate=True)
        self.companya = factories.CompanyFactory(parent=self.usacorp,
                                                 name="Company A")
        self.companyb = factories.CompanyFactory(parent=self.usacorp,
                                                 name="Company B")
        # ... plus AsiaCorp, its subsidiaries, and a third USACorp
        # child, to match the descendant counts asserted in the tests

Testing the Tree

def test_tree_parameters(self):
    self.assertEqual(self.globalcorp.lft, 1)
    self.assertEqual(self.globalcorp.rght, 20)
    self.assertEqual(self.globalcorp.level, 0)
    self.assertEqual(self.asiacorp.lft, 10)
    self.assertEqual(self.asiacorp.rght, 19)
    self.assertEqual(self.asiacorp.level, 1)
    self.assertEqual(self.globalcorp.get_descendant_count(), 9)
    self.assertEqual(self.usacorp.get_descendant_count(), 3)
    self.assertEqual(self.asiacorp.get_descendant_count(), 4)


We find MPTT very useful here at Caktus and have even created django-treenav, a product that integrates with django-mptt: an extensible, hierarchical, and pluggable navigation system for Django sites.

One Last Gotcha

There is an incompatibility between django-mptt and GIS in PostgreSQL. If you are using django-mptt and CREATE EXTENSION postgis;, you can't use MPTTMeta attributes in your MPTTModels:

class MPTTMeta:
    level_attr = 'mptt_level'

Adding these meta options won't fail in any obvious way; django-mptt will simply, and cheerfully, report an incorrect tree structure.

Tim HopperEverything a Stranger Said About My Height in 2015

Caktus GroupWhat We Open Sourced in 2015: A New Year's Retrospective

This year we had the pleasure of building a number of unique solutions for several organizations. In addition, we had the support of these clients to open source the tools we built. By open sourcing our work, we enable others to use, replicate, and even improve upon the tools we’ve created.

With the year coming to a close, we thought we would take a moment to reflect on the projects we’ve had the opportunity to open source this year.

Commodity Tracking System, International Rescue Committee (IRC)

On behalf of the IRC, Caktus developed significant upgrades to a previous iteration of the Commodity Tracking System (CTS). The FedEx-style system enables IRC employees to reliably track the shipment of humanitarian aid to Syrian refugees across Jordan, Turkey, and Syria. To help handle the increasing volume of shipments the IRC oversees, we created a cohesive web-based application that links information from an array of technologies to track and verify delivery of aid shipments. The enhanced application enables the IRC to quickly create and customize large shipments, generate custom barcodes for each shipment for scanning purposes, and pinpoint these shipments through a mobile data collection system. Finally, the system enables the IRC to easily report the delivery status of a donor’s aid donations, thereby providing the accountability and transparency donors desire.

Service Info, International Rescue Committee (IRC)

In addition, we worked with the IRC on another web-based app called Service Info, which helps Syrian refugees in Lebanon to identify, locate, and utilize the various aid services available to them. Aid providers can self-register on the site to promote their various services and users of the site can search for these services on an easily navigable interface. More importantly, the platform provides a space for users to comment on and rate these services, the hope being that such feedback will, in turn, improve the quality of service. In addition, Service Info will continue to offer a localized network of vetted, rated and reliable service organizations even after larger non-governmental organizations like the IRC may have to move their operations elsewhere.

Smart Elect, Libya’s High National Elections Commission (HNEC)

After building an SMS voter registration app—the first of its kind—that enabled 1.5 million Libyans to register to vote across two national elections, we had the privilege of further working with the Libyan High National Elections Commission (HNEC) and the United Nations Support Mission to Libya to open source the SmartElect platform. The tools in this elections management platform range from SMS voter registration to bulk voter alerts to call center support software.

Ultimate Tic Tac Toe, Internal ShipIt Day Project

Caktus’ Ultimate Tic Tac Toe game started out as an internal ShipIt Day project built in AngularJS by developers Victor Rocha and Calvin Spealman with initial design help from Wray Bowling. Several iterations later it is a fully fledged game with design by Trevor Ray, AI components by Jeff Bradberry that have received recognition from Hacker News, and challenging gameplay that captivated visitors with an interactive touch screen at our booth at both PyCon and DjangoCon.

django-treenav and django-scribbler, Contributions from our Open Source Fellow

Inspired by the Django Software Foundation’s fellowship as well as the Two Day Manifesto, we launched our own Open Source Fellowship this year. As part of that program, Ben Phillips worked on significant improvements and upgrades to the Caktus-built django-treenav, an extensible, hierarchical, and pluggable navigation system for Django sites, as well as django-scribbler, an application for editing and managing snippets of text and HTML on the front-end through custom template tags.

We are thrilled to have had the opportunity to work on such a wide range of projects this year, and even more so to have been able to open source the code for some of our work, especially those projects in the international development sector. We’re truly excited to see what 2016 will bring, and to continue our efforts to support and extend the influence and potential of open source tech solutions.

Og MacielEnd of Year - 2015

Review of 2015

Another year has gone by and I guess it is time to review the things I set out to do and grade myself on how well (or poorly) I fared. Here are some of my goals for 2015:

Read 70 Books

Grade: PASS

Even though I had a very, very busy year at work, with many releases of Red Hat Satellite 5 and Red Hat Satellite 6 shipped to our customers, I managed to surpass my goal of reading 70 books, finishing the year with a whopping 79 books read! You can see the books I read here: Year in Books

This year I also spent a good chunk of my time looking at old, used books, and my personal book collection increased considerably. At one point I had so many piles of books lying around the house that I had to buy 4 new book cases to store them. At first I wanted to have them custom made, but the estimates I got from 3-4 different people were way out of my budget. In the end I went with 4 Billy Bookcases from Ikea, which cost me about 10 times less!

If you want to see what I'm reading or want to recommend a book which you think I might enjoy reading, please feel free to add me on GoodReads.

Caktus GroupReflecting on My Time as Caktus' Open Source Fellow

My name is Ben Phillips and I am Caktus' Open Source Fellow. As my fellowship comes to a close, I wanted to reflect on my time at Caktus and to share my experience and some of what I've learned here. First, however, I should probably share how I ended up here in the first place.

Initial Commit

Six months ago I was a biologist. That had been the plan all along, after all. It had taken me quite a while. I'd had to work as a barista, stream surveyor, machinist/welder, and mouse husbandry technician and to get a bachelor's degree in biology, but I eventually achieved my goal and found a job as a research technician in a molecular/cell biology lab. Unfortunately, as is often the case with the plans of mice and men (both of which, as a former biologist and mouse technician, I am qualified to speak about), there was a small problem. It slowly became clear to me that this was not what I wanted to do with my life. The research was interesting, but the work was not. I was at something of a loss until one day when, in the middle of a particularly tedious task, I began to wonder if I could figure out how to write some kind of program to help make my task a bit easier. Though it turned out that integrating a computer vision library into a program to track the movements of mice spinning in circles was a bit too ambitious for a first project, I had found something interesting and challenging that I really enjoyed. A few months later, I quit my job to enroll in The Iron Yard, a local code school, and soon found myself with the technical skills of a junior Python developer. Now what? The answer, of course, was Caktus.

Getting Started

I found out about Caktus when I attended the local Python developer meetup they host every month in the Caktus Tech Space and began to learn about them as a company and as people. I was immediately attracted to both the exciting work they do and their focus on giving back to the local and global community. After graduating from code school, I quickly applied to their fall internship position and after one of the friendliest interviews I've ever had and an anxious week or two of waiting I found myself joining the Caktus team as their new Open Source Fellow. The aim of my internship was for me to make contributions to open source projects I found interesting or useful, thereby giving back to that community, which is so vital to Caktus and to developers in general. Where does one start, however, with such a huge breadth of possibilities? For me, the journey began with Django-Treenav.


Treenav is a project created by some of the Caktus team that is designed as “an extensible, hierarchical, and pluggable navigation system for Django sites”. As a fairly well-established, older project, Treenav provided an ideal starting point for me. Most of the work required was updating and removing deprecations throughout the project. With some direction from my incredibly helpful coworkers, I got into a flow of finding and updating deprecated areas of the codebase which, as a relatively simple but very important task, allowed me to quickly add value to the project and gain confidence in my abilities as a developer. It also helped teach me the fundamentals of contributing to a project: opening and responding to issues, making discrete and focused pull requests, having code reviews, and maintaining good communication. A string of updated deprecations and even a couple of bug fixes later, I was able to help publish the 1.0 release of Treenav, which was a huge milestone for me. I had helped contribute to a tool that was available for people to use in projects all over the world. So cool.


My next project was another Caktus original: Django Scribbler. Scribbler is an application for editing and management of snippets of text and HTML on the front-end through a few custom template tags. I started out working on this project much the same as I did with Treenav, updating deprecations and the like. But with the guidance of my mentors and coworkers, I began to tackle more challenging work, such as fixing bugs and even starting to add some features, like support for multiple template engines.

Additionally, I took my first real dive into testing. I had gotten some experience with test driven development and writing tests in code school, but had never worked within an established test suite before nor had any experience with tools like Selenium, which was a necessity for such a JavaScript-heavy project. By far the largest task I undertook working on Scribbler, however, was revamping its module-loading/package management system. The project relies on a number of JavaScript libraries and when I started working on Scribbler these libraries were all managed with RequireJS. While I was working on updating some of the JS dependencies, we decided to migrate to using Browserify for that purpose instead. Throughout the process of this migration, I learned not only how to convert from RequireJS and an AMD style loading system to Browserify and its Node-style bundling, but I learned why that was a good decision for the project, observing the sorts of discussions and considerations that go into making such design choices for a project. RequireJS and Browserify accomplish much the same thing in very different ways and we had to decide which fit best with our project from not only a technical standpoint but also from that of a design philosophy. In the end, my work on the conversion had helped streamline the entire package management and build process for the project and, just as importantly, I understood why it was better.


Everything I had worked on up until this point, though open source, had originated from Caktus, and an important purpose of my internship was to give back to the open source community at large—a purpose I strongly believe in. As such, while I was wrapping up work on the Browserify transition and a few smaller features for Scribbler, I began to look around for other projects I could contribute to. It was a daunting prospect and, to be perfectly honest, one that I was initially unsure I was ready for. However, as I began to canvass issues on projects and communicate with other developers and get positive feedback from my coworkers, I realized I had a lot more to offer than I had previously thought. There were issues I had experience with, discussions I could weigh in on, and bugs I could dig in on, figure out, and fix. The codebases were unfamiliar and often large and imposing, but I found I could understand a lot as I started to look over them. Before long, I had pull requests out on a number of projects, including some on Django itself. I often had to heavily refactor or completely rewrite my contributions based on comments I received, but instead of being discouraged I found myself actually enjoying learning how to adjust my code based on feedback and improving what I had written until it was something I could be proud of. There is a great sense of satisfaction that comes from that feeling of partnership one gets working with other developers on an open source project, just as much as there is in actually having content you wrote merged into a project and knowing you've helped make a project better for everyone who uses it. As of the writing of this article, I have had the privilege of having contributions merged into several projects, including Django Rest Framework and Django, an accomplishment that would not have been possible for me even a few short months ago.

Release Notes

My time here at Caktus has been brief, but so incredibly rewarding. With the help of my mentors and coworkers, I have made huge leaps in both my technical skills and my confidence in my ability to add value to projects. Beyond even those technical skills, by working with and watching the people here at Caktus I have begun to learn some of the even more important, if less tangible, developer skills. I have learned not to just fix problems I come across, but to dig in and understand them first, and to ask questions quickly when I'm stumped. I have also seen the importance of approaching development not just as writing code but as crafting projects, building a whole rather than a series of parts. These are the sorts of skills that cannot be easily taught, but the people working here at Caktus have been exemplary teachers and examples. You would be hard-pressed to find a more knowledgeable, helpful, and friendly group of people. At no point in my internship did I ever feel like an intern. Everyone always treated me with respect as a true developer and member of the team, and I will always appreciate that. As my internship comes to a close and I look towards my next steps, I know that I will always look back on my time at Caktus with fondness and gratitude. I have learned so much and am a stronger developer than when I started. And I have gained a love for open source that I doubt I will ever lose. I have developed new skills, new tools, new confidence, and new friends. So if you are considering an internship at Caktus, I cannot recommend it enough. You'll never regret it.

Caktus GroupCaktus Participates in Tree of Bikes at American Tobacco Campus

This year, our neighbors at American Tobacco Campus (ATC) hosted Durham’s first ever Tree of Bikes event to help collect and distribute new bikes to children in need in the local community. The tree is a 25-foot tall sculpture made entirely of new children’s bicycles donated by individuals and local businesses in the Triangle. The bikes are then distributed to children living in the Cornwallis Road affordable housing community managed by the Durham Housing Authority.

Having heard about the drive in a tweet from ATC, Caktus employees were incredibly excited to participate. We all chipped in to buy a beautiful and sturdy little bike from our next door neighbors at Bullseye Bicycle. The great team at Bullseye made sure the bike was ship-shape for immediate riding! In addition, they were kind enough to provide a voucher for free tuneups and care on the bike for one year after purchase.

In conjunction with Habitat for Humanity and Happy Roots Entertainment, the bikes were collected at ATC throughout the month of November. The organizers reached their goal of collecting 120 bikes for the tree, which was lit in a ceremony held on December 4th. Though Happy Roots has been distributing bikes to children in affordable housing communities for the past 7 years, the number collected for Tree of Bikes will enable Happy Roots to provide bikes to ten times as many children as in the past.

Tree of Bikes is on display at ATC until December 18th, when they will be distributed to their new owners. Be sure to check out the tree while you can!

Caktus GroupCaktus CTO Colin Copeland Helps Launch Open Data Policing Website

Today, at Caktus headquarters, Caktus CTO and co-founder Colin Copeland will stand at a press conference along with activists, police representatives, and elected officials to announce the launch of OpenDataPolicingNC.com. The first site of its kind, OpenDataPolicingNC.com draws on public records to publish up-to-date stop, search, and use-of-force data—broken down by race and ethnicity—for every police department and officer in the state of North Carolina. The volunteer effort, led by the Southern Coalition for Social Justice (SCSJ) with technical leadership from Colin, includes approximately 20 million anonymized data points from 15 years of NC traffic stop data.

Colin’s development team for the project included fellow Durham residents, data scientist Andy Shapiro and software engineer Dylan Young. Combined, the team holds over 29 years of software development experience. Working outside their normal, full-time jobs, Colin, Andy, and Dylan worked iteratively to build, test, and open source the new site. In addition, staff attorney of SCSJ, Ian Mance used his understanding of the needs of police agencies, lawyers, and government officials to ensure a user-friendly site.

“Traffic stops are the most common way citizens interact with police officers,” said Ian. “This site enables anyone who engages with these issues—whether they be police chiefs, courts, lawyers, or policymakers—to ground their conversation in the facts.” Today’s launch is the exciting culmination of a long period of collaborative work between the site’s development team, SCSJ, and the Fayetteville PD. This past summer, SCSJ presented a beta version of OpenDataPolicingNC.com to the White House, the initial step toward an eventual partnership with the Fayetteville PD and its Chief, Harold Medlock, members of the White House’s Police Data Initiative. “The future of data-driven justice lies in collaboration between the tech community, nonprofits, and government services.”

That collaboration doesn’t end with the site’s launch. The Fayetteville PD and Chief Medlock will continue working with the OpenDataPolicingNC team to build new features for the site as well as promote increased police transparency. In addition, the team hopes to replicate OpenDataPolicingNC.com for other states.

Gary PosterElixir: Erlang records and the Erlsom XML library

I wanted to jot down some notes about Elixir, because I'm learning it, and because some of the pieces I assembled for my most recent code exercise were hard to find across the web. Hopefully it will help someone else, or at least be something I can refer to again in the future. I was playing around implementing the last exercise from Chapter 13 of Dave Thomas' Programming Elixir book: get the

Tim HopperI Love Twitter

Caktus GroupCyber Monday: 50% off Django book and videos

Are you looking for a good gift for a current or future Django developer? Check out Caktus technical director Mark Lavin's work for O'Reilly:

Both are now 50% off at the O'Reilly site with the discount code CYBER15.

Tim HopperMy Python Environment Workflow with Conda

Many new Python programmers rely on their system install of Python to run their scripts. There are several good reasons to stop using the system Python. First, it's probably an old version of Python. Second, if you install 3rd party packages with pip, every package is installed into the same globally accessible directory. While this may sound convenient, it causes problems if you (1) install different packages with the same name, (2) need to use different versions of the same package, or (3) upgrade your operating system (OS X will delete all the packages you have installed).

For many years, best practice for Python developers was to use virtualenv to create a sandboxed environment for each project. If you use virtualenv, each project you work on can have its own version of Python with its own 3rd party packages (hopefully specified in a requirements.txt file). In my experience, getting started with virtualenv is cumbersome and confusing; to this day, I have to look up the command to create a Python 3 virtualenv.1

In 2015, I have almost exclusively used Python installations provided through Continuum Analytics's Conda/Anaconda platform. I have also switched from using virtualenvs to using conda environments, and I am loving it.

Before explaining my workflow, here's a quick glossary of the similarly-named products that Continuum offers.

  1. conda: "Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software."2 A conda install provides a whole suite of command line tools for installing and managing packages and environments. Because conda works for any software, it can even install different versions of Python (unlike pip).
  2. Anaconda: "Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 300 of the most popular Python packages for science, math, engineering, and data analysis." It is available across platforms and installable through a binary.
  3. Anaconda Cloud: Also known as Anaconda.org and formerly known as Binstar, "Anaconda Cloud is a package management service where you can host software packages of all kinds." Anaconda Cloud is a package repository analogous to PyPI. Packages are installed via the conda command line tool instead of pip. By default, the conda install command installs packages from a curated collection of packages (a superset of those in Anaconda). Continuum allows users to host their own packages on Anaconda Cloud; these packages can also be installed through conda install using the -c flag with the username.

Conda, Anaconda, and Anaconda Cloud are distinct but interrelated tools; keeping them straight can be hard, but is helpful.

Conda (the package manager) can be installed in two ways: through the Miniconda installer or through the Anaconda installer. Both install the package manager, but the latter also installs the 300+ packages for scientific Python. (Installing Anaconda is equivalent to installing Miniconda and then running conda install anaconda.)

Conda Environment Files

It has become standard for pip users to create a requirements.txt file for specifying dependencies for a particular project. Often, a developer working on a project will (1) create and activate a virtual environment, then (2) run pip install -r requirements.txt to build an isolated development environment with the needed packages.

Conda provides an analogous (but more powerful) file: environment.yml.3

A simple environment.yml file might look like this:

name: numpy-env
dependencies:
- python=3
- numpy

If you are in a directory containing this file, you can run $ conda env create to create a Conda environment named numpy-env that runs Python 3 and has numpy installed4. Run $ source activate numpy-env to activate this environment. Once activated, running $ python will run Python 3 from your environment instead of the globally installed Python for your system. Moreover, you will be able to import numpy but not any of the 3rd party packages installed globally.

environment.yml can also install packages via pip with this syntax:

name: pip-env
dependencies:
- python
- pip
- pip:
    - pypi-package-name

I see environment.yml files as a positive development from requirements.txt files for several reasons. Foremost, they allow you to specify the version of Python you want to use. At PyData NYC 2015, many presenters provided their code in Github repositories without specifying anywhere whether they were using Python 2 or 3. Because I included an environment.yml file with mine, attendees could see exactly what version I was using and quickly install it with conda env create. I also like being able to specify the name of the environment in the file; this is particularly helpful when working with others. Finally, because conda can install from PyPI via pip, environment.yml files provide no less functionality than a requirements.txt file provides.

My Python Environment Workflow

Lately, whenever I am working on a new project (however big or small), I follow these steps:

  1. Create a project folder in the ~/repos/ directory on my computer.
  2. Create an environment.yml file in the directory. Typically the environment name will be the same as the folder name. At minimum, it will specify the version of Python I want to use; it will often include anaconda as a dependency.5
  3. Create the conda environment with $ conda env create.
  4. Activate the conda environment with $ source activate ENV_NAME.
  5. Create a .env file containing the line source activate ENV_NAME. Because I have autoenv installed, this file will be run every time I navigate to the project folder in the Terminal. Therefore, my conda environment will be activated as soon as I navigate to the folder.
  6. Run $ git init to make the folder a Git repository. I then run $ git add environment.yml && git commit -m 'initial commit' to add the YAML file to the repository.
  7. If I want to push the repository to GitHub, I use $ git create using GitHub's hub commands. I then push the master branch with $ git push -u origin master.

As I add dependencies to my project, I try to be sure I add them to my environment.yml file.
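Step 2 of this workflow can be partially automated. As a sketch, here is a hypothetical helper in the spirit of the conda-env-file command mentioned in the footnotes (the function name and output are my own illustration, not part of conda):

```python
import os

def starter_environment_yml(project_dir, python_version="3"):
    """Return text for a minimal environment.yml whose environment
    name matches the project directory name (hypothetical helper)."""
    name = os.path.basename(os.path.normpath(project_dir))
    lines = [
        "name: {}".format(name),
        "dependencies:",
        "- python={}".format(python_version),
    ]
    return "\n".join(lines) + "\n"

yml_text = starter_environment_yml("/home/me/repos/my-project")
```

Writing yml_text to environment.yml in the project folder gives conda env create everything it needs to build an environment named after the directory.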

A major benefit of all this is how easily reproducible a development environment becomes. If a colleague or conference attendee wants to run my code, they can setup the dependencies (including Python version) by (1) cloning the repository, (2) running $ conda env create, (3) running $ source activate ENV_NAME. It's easy enough for me to drop those instructions and further instructions for running the code in a README file. If I'm feeling especially helpful, I'll create a Makefile or Fabfile to encapsulate commands for core functionality of the code.

An even larger benefit is that I can return to a project after days, months, or years and quickly start developing without first having to hunt for print statements to figure out whether I was using Python 2 or 3.

I've come to love environment.yml files, and I think you might too.

  1. virtualenv also provides no help in actually managing Python versions. You have to install each version yourself and then tell virtualenv to use it. 

  2. From the conda docs

  3. Though there is currently a pull request for adding requirements.txt support to conda: https://github.com/conda/conda-env/pull/172

  4. Numpy will be installed from a binary from Anaconda Cloud, not built from source. 

  5. I created a bash command conda-env-file to automatically create an environment.yml file named after the current directory. 

Tim HopperSequential Minimal Optimization Algorithm for Support Vector Machines

In my nonlinear optimization class in grad school at North Carolina State University, I wrote a paper on the famed SMO algorithm for support vector machines. In particular, I derive the Lagrangian dual of the classic formulation of the SVM optimization model and show how it can be solved using the stochastic gradient descent algorithm.

You can find the paper here.

Caktus GroupWhat human-centered design can do for international development

Cross-posted with Creative Associates International. Written by Gina Assaf (Creative Associates International) and Tania Lee (Caktus Group). Image courtesy of Creative Associates International.

A human-centered design process is a critical step that is often overlooked when making important decisions in technology for development.

In the development space, funding usually comes from a donor or an agency, while the services or products it pays for go elsewhere: to local communities in need around the world. The indirect link between donors and beneficiaries can lead to a lack of accountability, as Dave Algoso explains in his blog on Reboot.org.

But practicing human-centered design, or user-centered design, which puts the needs, wants and limitations of users first in each stage of the design process, can help ensure that the services and products actually meet the needs of the end users—the communities we seek to empower and serve.

For example, funding a software product that requires high bandwidth connectivity when the population it is meant to serve has minimal Internet connectivity will not benefit that community. Similarly, a mobile banking tool utilizing lots of text may not be appropriate for low-literacy communities.

Not practicing human-centered design could mean wasting funding on initiatives that simply aren’t effective and losing collaborative opportunities for designing better solutions.

At this year’s UXDC conference, organized by the DC Chapter of the User Experience Professionals Association, technologists, development practitioners and design professionals gathered to discuss “Lessons and Challenges in Implementing User Centered Design and UX Practices in the International Development Sector.” Human-centered design is gaining more and more traction in the international development realm.

Is international development ready for human-centered design?

International development funding mechanisms are mainly donor-driven, and convincing donors to invest in user research, user experience design or iterative prototyping (all important tenets of human-centered design) can be a tough sell. Funding models in development are not ideally set up for iteration, a key component for human-centered design.

Development funds are often taxpayer-funded and somewhat restricted in how they can be spent. Investing in a process that may require more upfront work and iteration may be considered wasteful.

The Request for Proposal process is set up in a way that assumes we have enough information to make decisions when pitching a proposal, and the turnaround timelines are very short. This is challenging for proposing and implementing technology innovations since it leaves little room for a human-centered design process, which requires more upfront thought, needs assessment and iteration.

To add to the hurdles, there is a lot of confusion between the vocabularies of the user experience community and the development and humanitarian communities. The two groups speak different languages. For example, humanitarian and development communities use terms like “access,” “capture,” “accountability,” “dignity” and “livelihoods.” User experience communities use terms such as “co-design,” “user-centered,” “journey mapping” and “need-finding.”

But there is room for finding common ground, if not a common language.

Some parts of the human-centered design process are already implemented (to some degree) in development but described and worded differently.

For example, the “capture” phase in development is when “need-finding” occurs in design. “Need-finding” is an essential part of human-centered design, where we understand the needs, limitations and concerns of the target users. The “capture” phase in development performs a similar task, but could be further enhanced by making the beneficiary more central to this research and by making use of some of the design research methods used in the human-centered design process.

Importance of participatory design

Despite the challenges, there are many opportunities to promote and improve on the use of human-centered design practices within development.

For example, participatory design attempts to actively involve all stakeholders (e.g. employees, partners, customers, citizens, and end users) in the design process to help ensure the result meets their needs and is usable. This directly relates to the first principle of the Principles for Digital Development: “Design with the user.” A participatory design process builds collaboration across various internal teams that often work in silos: programs, technology, grants, procurement, and more.

However, “design with the user” isn’t always easy.

Within development and humanitarian spaces, often the user or the person most affected could be a part of an extremely vulnerable group or hard to reach, and working directly with extremely vulnerable groups is not always possible or advised. In addition, funding restrictions may inhibit participatory design. For example, some U.S. government agencies won’t allow more than 10 external persons to participate in projects due to funding restrictions.

Also, there are multiple groups of stakeholders in development: community members, community leaders, development staff, development staff managers, donors, and more. Balancing and prioritizing needs is a challenge.

Bridging the development-user experience divide

There is a lack of understanding across user experience professionals and international development communities—a fact confirmed by panelists and participants at the UXDC session. But it is not an unbridgeable gap. More conversations and discussions around this topic are needed and panels, like the one at UXDC, help bridge this gap.

While there are many obstacles to implementing the entire human-centered design process and methods in international development projects, there are positive trends—including the emergence of development labs and innovation hubs with an emphasis on capacity building for teams and human-centered design training. There also seems to be receptiveness and an eagerness among local innovators in developing countries to move forward with human-centered design.

For example, Creative Associates International worked in Honduras using a human-centered design approach, and performed research in partnership with the local field staff, to understand how the computer lab could better serve the wants and need of the community.

Creative trained the local Honduran staff and performed user interviews together, synthesized the results and brainstormed solutions together. Creative is now in the process of testing and iterating some of the proposed solutions. The team there is very positive and enthusiastic about this process, and they are leading it forward and are seeing how the work they are doing can make a big impact.

Caktus Group offers similar services to clients and recently conducted a discovery workshop in Turkey with a humanitarian NGO serving local refugee communities. The discovery workshop was conducted with roughly 15 client staff and utilized a series of brainstorming, mapping, outlining, crowdsourcing and voting activities to develop terms of success, user personas, and journey maps, and to define a minimum viable product.

Caktus has since been working with this client to co-design and iteratively build a custom web application that fits their unique needs.

Human-centered design principles are already being utilized in many development programs, but there are many more opportunities for improvement. Better metrics and further discussion and collaboration of human-centered design practice in development can help to address the challenges, demonstrate value and achieve lasting results in people’s lives.

*Gina Assaf is a Product and User Experience Manager with the Technology for Development team at Creative Associates International. Tania Lee is a SMS & Web Product Strategist at Caktus Group.

For more information about the UXDC community or to join the human-centered design working group, please contact Gina Assaf, ginaa@creativedc.com; Tania Lee, tlee@caktusgroup.com; or Ayan Kishore, ayank@creativedc.com.*

Tim HopperKnitting

True story: I'm a closet knitter. I don't have much time for it these days, but it helped keep me sane in grad school. Here are some things I've made over the years.






Caktus GroupInitial Data in Django

I've struggled to find an ideal way to load initial data for Django projects. By “initial data,” I'm referring to the kind of data that you need on a new system for it to be functional, but could change later. These are largely lists of possible choices, such as time zones, countries, or crayon colors.

Here are my requirements:

  • Fairly simple to run on initial server deploy, initial development environment setup, and when starting a test run.
  • Does not risk overwriting changes that are made to records in the live database after they're initially created.
  • Not too hard to update from the current live data, so that future deploys, new environments, etc., get the latest data.
  • Copes well as models evolve, because they will
  • Well supported by Django
  • Not too great a performance impact on testing

Here are some of the approaches that I've tried.


Fixtures

Fixtures are how Django used to recommend loading initial data.


Pros:

  • It's fairly easy to update the fixtures as the "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata.
  • Fixtures don't slow down test startup if they're not named "initial.XXXX" or you're using a recent version of Django, because they don't get loaded automatically.
  • Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.


Cons:

  • fatal - If a fixture is loaded again, it overwrites any changed data in the database with the original values
  • Discouraged by current Django documentation
  • Hard to keep valid when models evolve. The right way would be every time a model changes, you update the fixtures from the current data, then create a fresh temporary database without applying the new migrations, load the current fixtures, apply the new migrations, and make a fresh dump of the initial data. But that’s a lot of work, and hard to remember to do every time models change.
  • Data is not automatically available during tests, and since our system won't run correctly without some of this data, you have to arrange to load or create it at test setup.
  • Not loaded automatically, so:
    • When setting up new development environments, you must document it and it’s still easily overlooked, or else get a developer to run some script that includes it
    • For automated deploys, not safe to run on every deploy. Probably the only safe approach is to run manually after the first deploy.

Summary: rejected due to risk of data loss, inconvenience during development, and negative recommendation from Django documentation.

Fixture hack

I played around with a modified loaddata command that checked (using natural keys) if a record in the fixture was already in the database and did not overwrite any data if the record had previously been loaded.

This means it's safer to add to scripts and automatic deploys.
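The core of the hack can be sketched in plain Python (real code would work with Django's serializer objects; here plain dicts stand in for deserialized fixture records, and the function name is hypothetical):

```python
def records_to_load(fixture_records, existing_natural_keys):
    """Return only fixture records whose natural key is not already
    in the database, so previously loaded data is never overwritten."""
    return [
        record for record in fixture_records
        if record["natural_key"] not in existing_natural_keys
    ]


fixture = [
    {"natural_key": ("US/Eastern",), "fields": {"offset": -5}},
    {"natural_key": ("UTC",), "fields": {"offset": 0}},
]
# ("UTC",) already exists (and may have been edited), so it is skipped.
new_records = records_to_load(fixture, {("UTC",)})
```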


Pros:

  • Fairly easy to update as "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata
  • Fixtures don't slow down test startup if they're not named "initial.XXXX" or you're using a recent version of Django, because they don't get loaded automatically
  • Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.
  • Can add to env setup scripts and automated deploys safely


Cons:

  • Hard to keep valid when models evolve
  • Data is not automatically available during tests
  • Not loaded automatically, so when setting up new development environments, you must document it and it’s still easily overlooked, or else get a developer to run some script that includes it

Summary: rejected; it mitigates one problem with fixtures, but all the others remain.

Post-migrate signal

Something else I experimented with was running code to create the new records in a post-migrate signal, even though the docs warn against data modification in that signal.


Pros:

  • Runs automatically each time migrations are run, so will automatically get run during most automated deploys
  • Runs automatically when tests are setting up the test database, so all tests have the data available - but is part of the initial database, so we don't have the overhead of loading initial data during every test's setUp.


Cons:

  • fatal - Runs every time migrations are run, even reverse migrations - so it runs when tables are in the wrong state and breaks development when you might be migrating forward and back
  • If it fails, the whole migration fails, so you can't just ignore a failure even if you didn't care about creating the initial data that time
  • Slows down database creation when running tests, unless you use --keepdb

Summary: rejected; not a valid way to load initial data.

In a migration

Add a migration that creates the initial records.


Pros:

  • This is what the Django documentation currently recommends
  • Runs automatically
  • The migration only runs when the database schema matches what it was when you wrote it, so it won't break as models evolve
  • You can write it to ignore records that already exist, so it won't overwrite later changes in the database


Cons:

  • fatal in some cases - migrations don't use the actual model class, so models with custom behavior (like MPTTModel) won't get created correctly. You might be able to find workarounds for this on a case-by-case basis.
  • Slows down database creation when running tests, unless you use --keepdb
  • Harder than fixtures to update as the initial data evolves. Options:
    • Go back and edit the original migration - but then it won't run on existing databases and they won't get the new records
    • Add a new migration that adds the whole updated initial data set, then go back and comment out the code in the previous initial data migration since there's no point running it twice on new database setup
    • Add yet another migration for just the new data - probably the simplest in terms of updating the migrations, but it'll be harder to extract just the new data from the current database than to just extract the whole dataset again. Also, you don't preserve any edits that might have been made over time to older records.

Summary: best option so far. It has some drawbacks, but not as bad as the other options.


Conclusion

The best approach in most cases is probably to load initial data in a migration, as the Django documentation recommends. It's not perfect, but it avoids some of the fatal flaws of other approaches. And the new (in Django 1.8) --keepdb option helps ameliorate the slow test startup.

I'm still curious if there are other approaches that I haven't considered, though.

Tim HopperMy First Publication

When I worked at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Medical Internet Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.1

Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. ... We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).

  1. I have no idea what "Infoveillance" means. 

Caktus GroupThe Long, Hard Haul to Uncovering a Single, Tiny CSS Property Bug in IE10

There’s a very small but devastatingly crash-inducing bug in Internet Explorer 10. Watch out for setting a background-color to inherit on any pseudo element (like ::before and ::after), because this will crash IE completely every single time.

#element::after {
    background-color: inherit; /* CRASH! */
}

Rather than dedicate an entire post to a single CSS property, let’s trace the work that was required to track this down. The payoff is obviously important (a supported browser crashing is a big problem!) but it can feel frustrating when a big issue comes down to something so seemingly tiny. It is important to remember the real measure of the bug is not the number of lines involved in fixing it (one, in our case) but the amount of effort it took to track down. Don’t let the small fix (just give it a color directly) betray the hard work you put into understanding the issue.

First, how did we get here, anyway? A crashing browser is something you hope doesn’t go unnoticed for long.

It is pretty common to try to avoid running full regression tests across a site, both early on and throughout the project. When you have a dozen supported browsers on the low end, the most efficient workflow tends to be getting a lot of feature work and layout under your belt early, and then attacking each browser’s quirks after testing what you’ve built across them. If you try to test the entire supported set of features for every small changeset, especially on a pre-launch product, you’re going to be repeating a lot of work and sinking a lot of hours you could better spend getting more of the initial work wrapped up.

So a couple of days of iteration had been done with new feature work and a few bug fixes, and then we found this crash that perplexed us. At that point, it could have been any one of a dozen commits across a number of components. We had expected some tweaks would be needed to bring Safari and Internet Explorer in line, but the outright crash certainly took us by surprise. We began working to narrow down the cause. We ruled out SauceLabs, the excellent browser testing service, by reproducing the crash locally on our own setup. IE 11 didn’t crash, and most of the site was just fine except this one page, so we knew it had to be a pretty specific issue.

We had enough to start tracking it down. Being able to reproduce it reliably and only on one browser meant we could really focus any tests we did, so we got started right away.

Having confirmed this is reproducible, how do we start to narrow this down when we aren’t sure when it happened and the crashing nature of the bug precludes us from looking into what’s going on?

While the problem happened on a single page, that wasn’t enough to go on to tell us where it was happening. We had not even discovered that it was a CSS issue at this point, so basically anything was suspect.

We turned to the ever useful Git Bisect command, which every developer should have in their toolbelt. Git Bisect, if you haven’t had an opportunity to learn about it yet, helps you locate the specific commit in your history that caused a bug. You start by telling it the last known commit where you can confirm the bug did not exist and the latest commit (usually the head of your branch or master) where the bug does exist. Git begins narrowing it down, basically stepping halfway between the known good and bad commits and at each step asking you “What about this commit, is it good or bad?” At the end of the process, which usually takes less than half a dozen steps, you’ll have a single commit where the problem first appears. This is great when you have a problem that went undetected for a while and you need a smaller target to aim for.

The diff in question is CSS only, so now we know the problem is in our CSS! This really helped to narrow it down, both to one aspect of our changes (neither markup nor Javascript seemed to be at fault) and in the number of changes to look closer at (it was not a very large diff). It was, however, much more surprising than a strange Javascript bug might have been. Who ever heard of CSS crashing a browser, after all? The first thought was a parser bug, and we pored over the diff looking for errant curly braces or other oddities that might have gone unnoticed and caused this crash. Everything looked flawless, from a syntax perspective, so we were left with the conclusion that it was some actual style causing the failure, rather than a parser issue.

As both a sanity check and a confirmation of our findings so far, we cleared the entire CSS file where these changes were isolated to and pulled it back up in Internet Explorer 10. This confirmed our findings when a very ugly, but non-crash-inducing, version of the page was rendered. We now had clear confirmation that something in this simple CSS file was the culprit, and a quick look at the short commit we had narrowed down as the time when the problem was introduced gave us no good candidates.

There were a few color changes. A bit more padding on one element and a slimmer margin on another. Heck, several of the changes were just fixing indentation in a few places! Nothing stood out.

One by one we commented out the rulesets that had been changed in the diff until we found one that only caused the problem when it was present:

#cssmenu::after {
    position: absolute;
    top: 0;
    left: -50%;
    width: 150vw;
    height: 100%;
    content: "";
    background-color: inherit;
    z-index: -1;
}

This is a standard trick to create a wider background padding around a section that needed to expand beyond the content well. It is just a simple pseudo element with all the necessary properties. These pseudo elements require the position, top, left, width, height, and content properties. This leaves the final two properties, background-color and z-index, as likely causes.

Finding that background-color was the answer was only a matter of commenting each one out in turn to test, revealing that the combination from the start of the article was our problem all along:

#cssmenu::after {
    background-color: inherit;
}

We had the problem and immediately knew the solution was easy. The color being inherited was just “white” and we had only used “inherit” to remove a minor redundancy. Explicitly stating the color here fixed the problem immediately.

But while the fix took only seconds, the bug itself had taken the two of us nearly an hour to track down. We had to know more about what had caused us so much headache, and followed up with a few experiments.

We had a theory, bolstered by a false-positive test, that this affected ::after but not ::before elements, but quickly learned it affects both varieties of pseudo element.

We dug around as much as we could without sinking time, but could not find any mention of this problem in the usual places people complain about things (StackOverflow, MSDN forums).

If you’ve found this post because you googled the terms “css after background-color inherit crash” then I hope you can breathe a sigh of relief and move on with your day.

For the reader following this story for the journey and not the solution, remember this the next time you feel betrayed by a long debug session that reveals what seems like a small issue: you did not waste your time. The measure of a bug should not be in lines of code or the time to fix the problem once you understand it, but in the effort it requires to uncover, understand, and correct. If we let ourselves be deceived by false measurements of how big or difficult a bug really is, focusing too much on the very final and sometimes minor step of fixing it, we devalue the very important and difficult work of uncovering complicated problems. And that is where real skill and experience lie, not in the few lines it might take to correct the problem.

Joe GregorioHalloween Dragon

We do large displays for Halloween, usually involving converting the garage into a haunted house. They're major undertakings, with four or more actors, so we only do them every other year. I also try to roll a "build" into every Halloween. For example, 6 years ago I built this as part of the haunted house:

And two years ago the same skeleton got taken out of the jail and laid out on a table, this time animated with pneumatics. Neat tip on this one: sprinkler valves for watering systems work just as well for controlling air.

This year Lynne requested that we do something less intensive. Now I agreed that the haunted houses were a lot of work, but I still wanted to do something big, or at least give the appearance of something big. That's when I hit on the idea of doing a dragon in the garage, but only opening the garage door a bit, so you could only see the nose and the tail of the dragon. Let the kids' imaginations fill in the rest of the beast.

The head was paper-mache, with a hot-glued cardboard frame, and two nostrils that could be hooked up to a fog machine. Here are photos of the build. And here's a shot of the fog test:

Next came the build for the articulated tail, with a wooden frame and bicycle brake cabling for the controls. Here are some photos from the build. And here's the tail in action, first w/o the scales, and then with the scales.

And finally putting it all together, with a "Beware of Dragon" sign, and a scattering of bones across the driveway, and the Dragon Sound Effects Panel I posted yesterday:

Now it doesn't look too scary during the day, but on Halloween night, only lit by a strobe light, with the candy bowl between the nose and the tail, and the proper "story" it was quite effective. Here's Lynne's description:

The kids were told the dragon was sleeping (cue - snoring noise) but could smell the kids when they got close (cue - sniffing noise) and "probably" couldn't get out of the garage, was "probably" full from all the kids he ate earlier, and "probably" wouldn't eat them if they were careful. They had to walk to candy bowl between head and tail. As they did the dragon awoke, started to move, and then huge banging and movement of garage door as he tried to get to them. They would run and scream and then bravely try again! It was precious.

I think we had over 100 people see the dragon, some coming back for multiple attempts. It was a blast to run and just as fun to build, as the three youngest kids all participated in the build. I can't wait for 2017, I'm thinking a giant pneumatic octopus...

Joe GregorioDragon Sound Effects Web Audio API

A sound effects board with the sounds of a dragon built using the Web Audio API. Because Halloween.

Chains: [Source]

Roar: [Source]

Sniff: [Source]

Snore: [Source]


Caktus GroupOpen Sourcing SmartElect: Libya's SMS Voter Registration System

We are proud to say that, with the Libyan High National Elections Commission (HNEC) and consultative support from the United Nations Support Mission to Libya, we have open sourced their elections management platform today under a permissive Apache 2.0 license. Open sourcing means other governments and organizations can freely adopt and adapt the elections tools, which cover nine functional areas. The tools range from SMS voter registration, the first of its kind, to bulk alerts to voters and call center support software. You can learn more at our brand new SmartElect homepage. This is the culmination of two years of work, so we’re incredibly excited to share SmartElect with the rest of the world.

With an established partner network, others interested in adopting or adapting this code can easily find experienced partners who've worked on this project before. The network consists of those of us that worked on the original code base and can provide commercial support.

How SmartElect got its start

Initially, the building of SmartElect was entirely driven by the Libyan High National Elections Commission’s need for a voter registration alternative. The interim Libyan government and HNEC faced extraordinary challenges in access to elections: challenging terrain, nomadic populations, and civil unrest. They turned to text message voter registration as a low-cost and more accessible alternative to reach more people safely.

Caktus joined the team at the start of HNEC’s efforts in Fall 2013. We then took over as a primary contractor after the February 2014 elections and open sourcing was immediately one of HNEC’s priorities for us. This thrilled us since we have long believed in the importance of open source to building affordable tools for others. We’ve been steadily working on open sourcing the project since then while simultaneously continuing to develop and strengthen the system for Libya.

SmartElect took shape as we discussed amongst ourselves the best way to grow the open source community around it.

What can SmartElect be used for

We hope to see SmartElect used by other organizations and governments. While the system used for Libya was highly customized to support Libya's electoral process, the SmartElect features have been generalized for open source use and must be adapted to each unique electoral process. Since each government is different, customization will be necessary. SmartElect includes the following adaptable components for governments to select from:

  • Voter registration and polling center management
  • SMS voter registration
  • Citizen call center support
  • Voter roll generation (a truly tricky task that involves literal tons of printed rolls)
  • Dashboards that show the public real-time registration data
  • Bulk messaging to registered voters
  • Voter registration verification tools
  • Text message reporting tools for poll workers
  • Quick vote tabulations to improve electoral integrity (built by Ona.io)

How we open sourced SmartElect

To open source Libya’s system, we examined two years of work—approximately 8,482 team hours since Fall 2013 and 83,095 lines of code. (We had some fun crunching the numbers and the code in length is equivalent to 1,179 sedans lined up end to end.) Parts of the system were built with open sourcing in mind, but there was much untangling to do between Libya’s system and what could be released securely and usefully. Our team of developers worked incredibly hard with our partners at HNEC to ensure the open source code met the highest security and usability standards. We also had a third party security firm audit both Libya’s system and SmartElect for potential technical vulnerabilities. We’re happy to say they gave us a clean bill of health.

The technology behind SmartElect

SmartElect is built using Django, the open source framework we specialize in and RapidSMS, a Django-based framework we helped UNICEF build especially for SMS tools. Django’s tagline, “the Web framework for perfectionists with deadlines,” is a true reflection of what this project needed. In the original development, there were elections deadlines looming and the weight of knowing we needed to build something quickly and well to support a new democracy.

Who was involved in building SmartElect

A system as complex and robust as SmartElect had many key players. Reboot, a consulting firm, first brought the technical experts together so we could partner with HNEC, and we owe them great thanks for making that connection during the initial phase of the Libya project. As lead on the project, we subcontracted with Elliott Wilkes, later a principal and co-founder of Forsa Consulting, for in-country coordination and project management (an invaluable role). Together with our developers, Praekelt Foundation, Ona Data, and the MIS Department all provided critical components of Libya’s system. At HNEC’s behest, Caktus took the lead on open sourcing the code base.

What our hopes are for SmartElect

We’re looking forward to seeing new iterations of SmartElect take place as other developers build upon HNEC’s work. We’ve created a developer’s mailing list and a partner network to support both developers and organizations interested in adopting or adapting SmartElect. We’re excited to see what new versions of the software will take shape.

Read our official press release

Lead Developer Vinod Kurup makes the SmartElect repo public.

SmartElect Team

Frank WierzbickiJython 2.7.1 beta2 released!

On behalf of the Jython development team, I'm pleased to announce that the second beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include better unit testing under Windows and the restoration of the os.getpid() function.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at Maven Central. There are three main distributions. In order of popularity:
To see all of the files available, including checksums, go to the Maven query for Jython and navigate to the appropriate distribution and version.

Caktus GroupIdentifying Racial Bias in Policing with a Data-driven App

Recently, Caktus co-founder Colin Copeland spoke at the 2015 Code for America Summit about the creation of a web app that analyzes North Carolina traffic stop data to identify racial bias. The website allows both police departments and community members to visualize a dataset of more than 18 million stops statewide. Colin spoke with Ian Mance, the originator of the app idea and staff attorney with the Southern Coalition for Social Justice. Together with fellow community members Andy Shapiro and Dylan Young, they used Django, an open source web framework, to make policing data more accessible.

Tim HopperTweets I'm Proud Of


On nonparametric Bayesian methods:

On code:

On key pressers:

Tim HopperWrapping Up on Nonparametric Bayes

Today is my last day at Qadium.

I've been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of nonparametric latent Dirichlet allocation.

Here is an IPython notebook I prepared showing how that library can be used to analyze the topics discussed in Econtalk, my favorite economics podcast.

I also published a notebook deriving the collapsed Gibbs sampler for Bayesian mixture models, showing a toy example in Python, and showing how the model can be extended to the nonparametric case (with a Dirichlet process prior).

Next week, I am joining the data science team at Distil Networks.

Caktus GroupA Peek Behind the Scenes at Caktus

Over the years, Caktus has had the opportunity to employ a wonderfully talented staff from a wide range of backgrounds. This summer we decided we would conduct a survey of Caktus staff. This infographic is the result of that study, and allows a little peek into the Cakti way of life. Conclusions: cats are great and coffee keeps you sharp!

Caktus GroupShipIt Day ReCap: Q4 2015

Members of one team joined forces with local meetup Code for Durham to help with the alpha launch of the School Navigator App. Using publicly available data, the School Navigator allows users to geolocate nearby Durham schools and view information like performance ratings. The team included Code for Durham co-founder Colin Copeland, who upgraded the Django template for the latest version of Salt. Erin Mullaney helped expand a feature denoting different school zones on the map, using Angular for the first time to do so. She also worked on a query change to more closely match the rules of districting in the map’s display. Victor Rocha developed various bug fixes and merged pull requests. In the meantime, David Ray put his new Ionic skills to the test by building a mobile app version of the School Navigator, now available from the Google Play store. (David’s starting ShipIt Day project was working through an Ionic tutorial to create a Reddit viewing app with pull refresh and infinite scrolling.)

Mark Lavin, Rebecca Conley, Dmitriy Chukhin, and Ben Phillips built a Taylor Swift song generator that uses her lyrics to create new songs. Not only can you create new songs out of beloved Swift lyrics, but you can save your favorites to reference later. The app was built using a library called Markovify to create a Markov chain model. This model analyzes a body of text to determine probabilities of word occurrence. This then allows the app to generate next words based on these probabilities. The team even designed a cheeky 404 error page. Readers, I now encourage you to break the internet.

Rob Lineberger also had music on the brain. He developed the A-Ching app (based on the ancient Chinese I-Ching) to dispense advice based on ABBA lyrics. To achieve this, Rob wrote a management command that used Beautiful Soup to parse an open source lyrics site. He then sorted the data by verses and programmed the app to call up a random verse. Users can ask a question and the app will provide them with an answer in the form of the inimitable wisdom of the 1970s Swedish pop group. Having just finished coaching the Django Girls RDU workshop, Rob followed the Django Girls Tutorial to deploy his site, in order to further familiarize himself with the tutorial and to better coach our next Django Girls RDU workshop.

Dan Poirier, Jeff Bradberry, and Scott Morningstar fleshed out Tequila, our Ansible alternative to Salt. Dan worked on getting the Caktus Group Django project template to deploy through Tequila in place of Salt. Now our developers can choose to deploy with either Salt or Ansible. Though this transition is still a work in progress and in its testing phases, the team made a good step forward.

Rebecca Muraya spent her day importing her expense tracking data from Google Sheets to Django in an attempt to allow more complex analysis across multiple years. She hit a lot of bumps along the road but made significant headway in transferring the data.

Karen Tracey looked into using Django’s new HStoreField for a client project involving surveys. In these surveys, information pulled from the roster for the individual survey-taker can be used to personalize the survey, and over time various fixed roster fields have been added to access different information about the survey-taker. Each of these additions has required code changes, many for fields that are used only rarely. Though she didn’t get beyond reworking the roster import code, she was heartened to find that the experiment worked and could enable our client to add extra roster fields at their own discretion.

Tobias McNulty spent the day testing a base operating system and Postgres database upgrade for a client project. He found the base image could be switched from Ubuntu 12.04 to 14.04 with few other changes. The Postgres upgrade was more difficult to automate because cluster file formats change between major releases. Rather than upgrade the cluster directly, he found it easier to use the project’s existing feature to backup the database from Postgres 9.1 and restore it to a new server running Postgres 9.4. When load testing the performance before and after the Postgres upgrade, he found little to no difference for this app.

Our Project Management team, Ben Riesling, Sarah Gray, Daryl Riethof, and Katie Twilley, reevaluated their weekly reporting methods and developed a series of questions to help them create a new format for these reports.

Though Hunter MacDermut and Calvin Spealman were both out of the office for the day, they couldn’t help but participate in ShipIt Day from afar. Hunter explored the vast world of geolocation. Using the browser-based HTML geolocation API, he built a simple compass. Usable on any device with built-in GPS, you can open it on your smartphone to see the compass in action. The source code can be found on GitHub. Calvin attempted prototyping a CSS framework based on Flexbox rules, with mixed results.

Nicole Foster and Liza Chabot were able to break away from their duties for a few hours to get some reading done. Nicole read a few articles from The Harvard Business Review’s 10 Must-Reads on Managing People. Liza read ”Hello World!”: The History of Programming and started on Charles Petzold’s Code: The Hidden Language of Computer Hardware and Software.

Caktus GroupDjango Girls Workshop Recap

This past Saturday we had the pleasure of holding a Django Girls RDU workshop at Caktus in downtown Durham, NC. We hosted 22 students and 10 coaches for a free, all-day coding workshop. Aimed at increasing diversity in tech and encouraging women to gain the skills they need to fall in love with coding, Django Girls is a global nonprofit that provides the framework for these workshops. In regular Django Girls style, the day was part party, part coding marathon, and every student left having deployed her very own website!

We were incredibly lucky to have great partners from the local Django and Python community supporting our work that day, including PyLadies RDU, Durham Women in Tech, Code for Durham, and Code for Raleigh. Coaches donated their time and their weekends and attendees came from all walks of life (even as far as Atlanta, Georgia!) to spend their Saturdays teaching and learning to code.

Working in groups of 3 with one coach per group, attendees worked their way through a tutorial covering the development of a blog application and deploying it to Python Anywhere. “It was such a warm and open environment,” remarked Viola, a workshop attendee. “It was very invigorating being part of it.” Viola, who hadn’t made a web page since 1999, was one of many on a vast spectrum of coding experience to attend the workshop.

“It was... immensely satisfying to finish the day with a completed, attractive website that even had some rudimentary security. I loved feeling as if I had worked on a puzzle all day and managed to solve it,” wrote another attendee.

We are thrilled to have had the opportunity to support this movement and can’t wait to help coordinate another workshop in the future. Keep a sharp eye out for more events from Django Girls RDU!

Og MacielBooks - September 2015

Jeff Trawickcmake Windows nghttp2

This is just a web-search-findable note that I have an initial implementation of nghttp2 cmake-based build support for Windows (just the core library), at https://github.com/trawick/nghttp2-minimal-cmake. See the note on future plans for issues I hope to resolve soon before I submit it to the nghttp2 project.

Caktus GroupColin Copeland to Speak on Police Data and Racial Bias at Code for America Summit

This Thursday, Colin Copeland, CTO and Caktus Group Co-founder, will be co-presenting “Case Study from North Carolina: Identifying Racial Bias in Policing Practices” during the prestigious 2015 Code for America Summit in Oakland, CA. This invite-only event brings together technologists, activists, and officials ranging from mayors to White House staff to discuss technology’s role in civic participation.

Colin, as volunteer co-founder and co-captain of Code for Durham, will be discussing how he and fellow developers transformed over 20 million data points and twelve years of NC police data into a website that can help identify race-based policing practices. His co-presenter will be civil rights lawyer Ian Mance of Southern Coalition for Social Justice, who originated the idea and guided the site’s development. Last year, Ian's work with a local coalition of Durham activists received front-page attention in the New York Times for using a data-driven approach to combat racial profiling.

For more about Code for America’s role in the open data and civic technology movement, visit codeforamerica.org.

Tim HopperA Joke

Caktus GroupIntroduction to Monte Carlo Tree Search

For DjangoCon 2015, Jeff Bradberry created an A.I. for our booth game, Ultimate Tic Tac Toe. Reprinted here from jeffbradberry.com is his explanation of the Monte Carlo Tree Search used to build the A.I.

The subject of game AI generally begins with so-called perfect information games. These are turn-based games where the players have no information hidden from each other and there is no element of chance in the game mechanics (such as by rolling dice or drawing cards from a shuffled deck). Tic Tac Toe, Connect 4, Checkers, Reversi, Chess, and Go are all games of this type. Because everything in this type of game is fully determined, a tree can, in theory, be constructed that contains all possible outcomes, and a value assigned corresponding to a win or a loss for one of the players. Finding the best possible play, then, is a matter of doing a search on the tree, with the method of choice at each level alternating between picking the maximum value and picking the minimum value, matching the different players' conflicting goals, as the search proceeds down the tree. This algorithm is called Minimax.
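The tree search described above can be sketched in a few lines. This is a minimal, game-agnostic sketch (the helper names `moves`, `apply_move`, and `score` are mine, not from the article), demonstrated on a toy take-1-or-2-stones game:

```python
def minimax(state, is_max, moves, apply_move, score):
    """Exhaustively search the game tree, alternating between
    maximizing and minimizing the final score."""
    legal = moves(state)
    if not legal:  # terminal position: just score it
        return score(state)
    results = [minimax(apply_move(state, m), not is_max, moves, apply_move, score)
               for m in legal]
    return max(results) if is_max else min(results)

# Toy game: take 1 or 2 stones from a pile; taking the last stone wins.
# State is (stones_left, player_to_move); player 0 is the maximizer.
moves = lambda s: [m for m in (1, 2) if m <= s[0]]
apply_move = lambda s, m: (s[0] - m, 1 - s[1])
score = lambda s: 1 if s[1] == 1 else -1  # +1 if player 0 took the last stone

print(minimax((3, 0), True, moves, apply_move, score))  # -1: a 3-stone pile loses
```

Depth cutoffs, evaluation functions, and alpha-beta pruning all bolt onto this same recursion, which is why the branching factor matters so much: every extra legal move per turn multiplies the number of recursive calls.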

The problem with Minimax, though, is that it can take an impractical amount of time to do a full search of the game tree. This is particularly true for games with a high branching factor, or high average number of available moves per turn. This is because the basic version of Minimax needs to search all of the nodes in the tree to find the optimal solution, and the number of nodes in the tree that must be checked grows exponentially with the branching factor. There are methods of mitigating this problem, such as searching only to a limited number of moves ahead (or ply) and then using an evaluation function to estimate the value of the position, or by pruning branches to be searched if they are unlikely to be worthwhile. Many of these techniques, though, require encoding domain knowledge about the game, which may be difficult to gather or formulate. And while such methods have produced Chess programs capable of defeating grandmasters, similar success in Go has been elusive, particularly for programs playing on the full 19x19 board.

However, there is a game AI technique that does do well for games with a high branching factor and has come to dominate the field of Go playing programs. It is easy to create a basic implementation of this algorithm that will give good results for games with a smaller branching factor, and relatively simple adaptations can build on it and improve it for games like Chess or Go. It can be configured to stop after any desired amount of time, with longer times resulting in stronger game play. Since it doesn't necessarily require game-specific knowledge, it can be used for general game playing. It may even be adaptable to games that incorporate randomness in the rules. This technique is called Monte Carlo Tree Search. In this article I will describe how Monte Carlo Tree Search (MCTS) works, specifically a variant called Upper Confidence bound applied to Trees (UCT), and then will show you how to build a basic implementation in Python.

Imagine, if you will, that you are faced with a row of slot machines, each with different payout probabilities and amounts. As a rational person (if you are going to play them at all), you would prefer to use a strategy that will allow you to maximize your net gain. But how can you do that? For whatever reason, there is no one nearby, so you can't watch someone else play for a while to gain information about which is the best machine. Clearly, your strategy is going to have to balance playing all of the machines to gather that information yourself, with concentrating your plays on the observed best machine. One strategy, called Upper Confidence Bound 1 (UCB1), does this by constructing a statistical confidence interval for each machine:

    x̄_i ± √(2 ln n / n_i)

where:

  • x̄_i: the mean payout for machine i
  • n_i: the number of plays of machine i
  • n: the total number of plays

Then, your strategy is to pick the machine with the highest upper bound each time. As you do so, the observed mean value for that machine will shift and its confidence interval will become narrower, but all of the other machines' intervals will widen. Eventually, one of the other machines will have an upper bound that exceeds that of your current one, and you will switch to that one. This strategy has the property that your regret, the difference between what you would have won by playing solely on the actual best slot machine and your expected winnings under the strategy that you do use, grows only as O(ln n). This is the same big-O growth rate as the theoretical best for this problem (referred to as the multi-armed bandit problem), and has the additional benefit of being easy to calculate.
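As a concrete illustration of the strategy just described, here is a small sketch of UCB1 arm selection. The function name and the `stats` layout are my own, not from the article:

```python
from math import log, sqrt

def ucb1_choice(stats, total_plays):
    """Pick the arm with the highest UCB1 upper bound,
    x_bar_i + sqrt(2 ln n / n_i).  `stats` maps each arm to a
    (wins, plays) pair; every arm must have been tried at least
    once before UCB1 applies."""
    def bound(arm):
        wins, plays = stats[arm]
        return wins / float(plays) + sqrt(2.0 * log(total_plays) / plays)
    return max(stats, key=bound)

# Arm 'b' has a worse observed mean (50% vs 60%) but has been played
# far less, so its wider confidence interval wins out.
print(ucb1_choice({'a': (6, 10), 'b': (1, 2)}, 12))  # b
```

Once 'b' has been played enough times for its interval to narrow, the choice swings back to whichever arm actually pays out best, which is exactly the behavior that bounds the regret.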

And here's how Monte Carlo comes in. In a standard Monte Carlo process, a large number of random simulations are run, in this case, from the board position that you want to find the best move for. Statistics are kept for each possible move from this starting state, and then the move with the best overall results is returned. The downside to this method, though, is that for any given turn in the simulation, there may be many possible moves, but only one or two that are good. If a random move is chosen each turn, it becomes extremely unlikely that the simulation will hit upon the best path forward. So, UCT has been proposed as an enhancement. The idea is this: any given board position can be considered a multi-armed bandit problem, if statistics are available for all of the positions that are only one move away. So instead of doing many purely random simulations, UCT works by doing many multi-phase playouts.

The first phase, selection, lasts while you have the statistics necessary to treat each position you reach as a multi-armed bandit problem. The move to use, then, would be chosen by the UCB1 algorithm instead of randomly, and applied to obtain the next position to be considered. Selection would then proceed until you reach a position where not all of the child positions have statistics recorded.

Selection. Here the positions and moves selected by the UCB1 algorithm at each step are marked in bold. Note that a number of playouts have already been run to accumulate the statistics shown. Each circle contains the number of wins / number of times played.

The second phase, expansion, occurs when you can no longer apply UCB1. An unvisited child position is randomly chosen, and a new record node is added to the tree of statistics.

Expansion. The position marked 1/1 at the bottom of the tree has no further statistics records under it, so we choose a random move and add a new record for it (bold), initialized to 0/0.

After expansion occurs, the remainder of the playout is in phase 3, simulation. This is done as a typical Monte Carlo simulation, either purely random or with some simple weighting heuristics if a light playout is desired, or by using some computationally expensive heuristics and evaluations for a heavy playout. For games with a lower branching factor, a light playout can give good results.

Simulation. Once the new record is added, the Monte Carlo simulation begins, here depicted with a dashed arrow. Moves in the simulation may be completely random, or may use calculations to weight the randomness in favor of moves that may be better.

Finally, the fourth phase is the update or back-propagation phase. This occurs when the playout reaches the end of the game. All of the positions visited during this playout have their play count incremented, and if the player for that position won the playout, the win count is also incremented.

Back-Propagation. After the simulation reaches an end, all of the records in the path taken are updated. Each has its play count incremented by one, and each that matches the winner has its win count incremented by one, here shown by the bolded numbers.

This algorithm may be configured to stop after any desired length of time, or on some other condition. As more and more playouts are run, the tree of statistics grows in memory and the move that will finally be chosen will converge towards the actual optimal play, though that may take a very long time, depending on the game.

For more details about the mathematics of UCB1 and UCT, see Finite-time Analysis of the Multiarmed Bandit Problem and Bandit based Monte-Carlo Planning.

Now let's see some code. To separate concerns, we're going to need a Board class, whose purpose is to encapsulate the rules of a game and which will care nothing about the AI, and a MonteCarlo class, which will only care about the AI algorithm and will query into the Board object in order to obtain information about the game. Let's assume a Board class supporting this interface:

class Board(object):
    def start(self):
        """Returns a representation of the starting state of the game."""

    def current_player(self, state):
        """Takes the game state and returns the current player's
        number."""

    def next_state(self, state, play):
        """Takes the game state and the move to be applied.
        Returns the new game state."""

    def legal_plays(self, state_history):
        """Takes a sequence of game states representing the full
        game history, and returns the full list of moves that
        are legal plays for the current player."""

    def winner(self, state_history):
        """Takes a sequence of game states representing the full
        game history.  If the game is now won, return the player
        number.  If the game is still ongoing, return zero.  If
        the game is tied, return a different distinct value, e.g. -1."""

For the purposes of this article I'm not going to flesh this part out any further, but for example code you can find one of my implementations on GitHub. However, it is important to note that we will require that the state data structure is hashable and that equivalent states hash to the same value. I personally use flat tuples as my state data structures.
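To make the interface concrete, here is a deliberately tiny Board for the game of Nim (the game choice and class name are mine, not the author's GitHub implementation). It follows the rules above: flat, hashable tuple states and a winner() that returns zero while play continues:

```python
class NimBoard(object):
    """A toy Board: players 1 and 2 alternately remove 1-3 stones
    from a pile; whoever takes the last stone wins.  States are flat
    tuples (stones_left, player_to_move), so they are hashable."""

    def start(self, stones=15):
        return (stones, 1)

    def current_player(self, state):
        return state[1]

    def next_state(self, state, play):
        stones, player = state
        return (stones - play, 2 if player == 1 else 1)

    def legal_plays(self, state_history):
        stones, _ = state_history[-1]
        return [n for n in (1, 2, 3) if n <= stones]

    def winner(self, state_history):
        stones, player_to_move = state_history[-1]
        if stones > 0:
            return 0  # game still in progress
        # The last mover took the final stone and wins.
        return 2 if player_to_move == 1 else 1
```

Note that only the last state in the history matters for this game; games with repetition rules (like Chess) are exactly why the interface passes the full history.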

The AI class we will be constructing will support this interface:

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        """Takes an instance of a Board and optionally some keyword
        arguments.  Initializes the list of game states and the
        statistics tables."""

    def update(self, state):
        """Takes a game state, and appends it to the history."""

    def get_play(self):
        """Causes the AI to calculate the best move from the
        current game state and return it."""

    def run_simulation(self):
        """Plays out a "random" game from the current position,
        then updates the statistics tables with the result."""

Let's begin with the initialization and bookkeeping. The board object is what the AI will be using to obtain information about where the game is going and what the AI is allowed to do, so we need to store it. Additionally, we need to keep track of the state data as we get it.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        self.board = board
        self.states = []

    def update(self, state):
        self.states.append(state)

The UCT algorithm relies on playing out multiple games from the current state, so let's add that next.

import datetime

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        seconds = kwargs.get('time', 30)
        self.calculation_time = datetime.timedelta(seconds=seconds)

    # ...

    def get_play(self):
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()

Here we've defined a configuration option for the amount of time to spend on a calculation, and get_play will repeatedly call run_simulation until that amount of time has passed. This code won't do anything particularly useful yet, because we still haven't defined run_simulation, so let's do that now.

# ...
from random import choice

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.max_moves = kwargs.get('max_moves', 100)

    # ...

    def run_simulation(self):
        states_copy = self.states[:]
        state = states_copy[-1]

        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            winner = self.board.winner(states_copy)
            if winner:
                break

This adds the beginnings of the run_simulation method, which for now chooses a random move from the set of legal moves each turn until the end of the game (UCB1-based selection comes later). We have also introduced a configuration option for limiting the number of moves forward that the AI will play.

You may notice at this point that we are making a copy of self.states and adding new states to it, instead of adding directly to self.states. This is because self.states is the authoritative record of what has happened so far in the game, and we don't want to mess it up with these speculative moves from the simulations.

Now we need to start keeping statistics on the game states that the AI hits during each run of run_simulation. The AI should pick the first unknown game state it reaches to add to the tables.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.wins = {}
        self.plays = {}

    # ...

    def run_simulation(self):
        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in self.plays:
                expand = False
                self.plays[(player, state)] = 0
                self.wins[(player, state)] = 0

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in self.plays:
                continue
            self.plays[(player, state)] += 1
            if player == winner:
                self.wins[(player, state)] += 1

Here we've added two dictionaries to the AI, wins and plays, which will contain the counts for every game state that is being tracked. The run_simulation method now checks whether the current state is the first new one it has encountered this call and, if so, adds it to both plays and wins, setting both counts to zero. The method also adds every game state that it passes through to a set, and at the end updates plays and wins for each state in that set that is present in the tables. We are now ready to base the AI's final decision on these statistics.

from __future__ import division
# ...

class MonteCarlo(object):
    # ...

    def get_play(self):
        self.max_depth = 0
        state = self.states[-1]
        player = self.board.current_player(state)
        legal = self.board.legal_plays(self.states[:])

        # Bail out early if there is no real choice to be made.
        if not legal:
            return
        if len(legal) == 1:
            return legal[0]

        games = 0
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()
            games += 1

        moves_states = [(p, self.board.next_state(state, p)) for p in legal]

        # Display the number of calls of `run_simulation` and the
        # time elapsed.
        print games, datetime.datetime.utcnow() - begin

        # Pick the move with the highest percentage of wins.
        percent_wins, move = max(
            (self.wins.get((player, S), 0) /
             self.plays.get((player, S), 1),
             p)
            for p, S in moves_states
        )

        # Display the stats for each possible play.
        for x in sorted(
            ((100 * self.wins.get((player, S), 0) /
              self.plays.get((player, S), 1),
              self.wins.get((player, S), 0),
              self.plays.get((player, S), 0), p)
             for p, S in moves_states),
            reverse=True
        ):
            print "{3}: {0:.2f}% ({1} / {2})".format(*x)

        print "Maximum depth searched:", self.max_depth

        return move

We have added three things in this step. First, we allow get_play to return early if there are no choices or only one choice to make. Next, we've added output of some debugging information, including the statistics for the possible moves this turn and an attribute that will keep track of the maximum depth searched in the selection phase of the playouts. Finally, we've added code that picks out the move with the highest win percentage out of the possible moves, and returns it.

But we are not quite finished yet. Currently, our AI is using pure randomness for its playouts. We need to implement UCB1 for positions where the legal plays are all in the stats tables, so the next trial play is based on that information.

# ...
from math import log, sqrt

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.C = kwargs.get('C', 1.4)

    # ...

    def run_simulation(self):
        # A bit of an optimization here, so we have a local
        # variable lookup instead of an attribute access each loop.
        plays, wins = self.plays, self.wins

        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(1, self.max_moves + 1):
            legal = self.board.legal_plays(states_copy)
            moves_states = [(p, self.board.next_state(state, p)) for p in legal]

            if all(plays.get((player, S)) for p, S in moves_states):
                # If we have stats on all of the legal moves here, use them.
                log_total = log(
                    sum(plays[(player, S)] for p, S in moves_states))
                value, move, state = max(
                    ((wins[(player, S)] / plays[(player, S)]) +
                     self.C * sqrt(log_total / plays[(player, S)]), p, S)
                    for p, S in moves_states
                )
            else:
                # Otherwise, just make an arbitrary decision.
                move, state = choice(moves_states)

            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in plays:
                expand = False
                plays[(player, state)] = 0
                wins[(player, state)] = 0
                if t > self.max_depth:
                    self.max_depth = t

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in plays:
                continue
            plays[(player, state)] += 1
            if player == winner:
                wins[(player, state)] += 1

The main addition here is the check to see if all of the results of the legal moves are in the plays dictionary. If they aren't available, it defaults to the original random choice. But if the statistics are all available, the move with the highest value according to the confidence interval formula is chosen. This formula adds together two parts. The first part is just the win ratio, but the second part is a term that grows slowly as a particular move remains neglected. Eventually, if a node with a poor win rate is neglected long enough, it will begin to be chosen again. This term can be tweaked using the configuration parameter C added to __init__ above. Larger values of C will encourage more exploration of the possibilities, and smaller values will cause the AI to prefer concentrating on known good moves. Also note that the self.max_depth attribute from the previous code block is now updated when a new node is added and its depth exceeds the previous self.max_depth.
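As a self-contained sketch of that selection rule (the helper name `ucb1_select` is my own, not part of the article's class; it assumes every move already has at least one recorded play):

```python
from math import log, sqrt

def ucb1_select(stats, C=1.4):
    """Pick the move maximizing win_rate + C * sqrt(ln(total plays) / plays).

    `stats` maps each move to a (wins, plays) tuple; assumes plays > 0 for all.
    """
    log_total = log(sum(plays for _, plays in stats.values()))

    def score(move):
        wins, plays = stats[move]
        return wins / plays + C * sqrt(log_total / plays)

    return max(stats, key=score)
```

With the default `C`, a rarely tried move can outscore a move with a better win rate; with `C=0` the rule degenerates to pure exploitation.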

So there we have it. If there are no mistakes, you should now have an AI that will make reasonable decisions for a variety of board games. I've left a suitable implementation of Board as an exercise for the reader, but one thing I've left out here is a way of actually allowing a user to play against the AI. A toy framework for this can be found at jbradberry/boardgame-socketserver and jbradberry/boardgame-socketplayer.
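To make the exercise concrete, here is one possible minimal `Board` for a trivial Nim variant (my own illustration, not the article's code; the method names follow the calls made in `run_simulation`, and `start` is assumed from earlier in the series):

```python
class NimBoard(object):
    """Take 1-3 stones per turn; the player who takes the last stone wins."""

    def start(self):
        return (10, 1)  # (stones remaining, player to move)

    def current_player(self, state):
        return state[1]

    def legal_plays(self, state_history):
        stones, _ = state_history[-1]
        return [n for n in (1, 2, 3) if n <= stones]

    def next_state(self, state, play):
        stones, player = state
        return (stones - play, 2 if player == 1 else 1)

    def winner(self, state_history):
        stones, player_to_move = state_history[-1]
        if stones == 0:
            # The previous mover took the last stone and wins.
            return 2 if player_to_move == 1 else 1
        return 0  # game still in progress
```

States are immutable tuples, which matters because the AI uses them as dictionary keys in `plays` and `wins`.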

This version that we've just built uses light playouts. Next time, we'll explore improving our AI by using heavy playouts, by training some evaluation functions using machine learning techniques and hooking in the results.

UPDATE: The diagrams have been corrected to more accurately reflect the possible node values.

Caktus GroupPyCon 2016: Behind the Design

Having helped to design an award-winning event site for last year’s PyCon in Montreal, we are thrilled to collaborate again with the Python Software Foundation (PSF) on this year’s site for PyCon 2016.

PyCon 2016 will be held in Portland, OR, and the PSF knew they wanted to convey the distinctive mood and look of that city with the 2016 website. Working collaboratively with the PSF, designer Trevor Ray drew on everything from the unique architecture of the city's craftsman-style bungalow houses and the surrounding mountainous landscape to the cool color scheme of the Pacific Northwest region. The team developed a site that leads the user on a journey. As he or she scrolls, the user is brought further into the city, from the low, rolling, misty, forest-topped mountains on the outskirts, into the very heart of its neighborhoods.

Trevor also redesigned the PyCon logo for 2016, giving it a peaked shape, hand-lettered typography (a reference to the thriving craft community of Portland), and a classic, balanced layout. Ultimately, our team and PSF worked together to achieve a site that we hope is welcoming and functional.

We’re excited for this year’s PyCon, and we hope the site excites attendees as well. Only 249 days till PyCon!

Dave O'ConnorMySQL SQLAlchemy OS X Reference

Whenever I learn something new, I enjoy putting together a reference sheet of whatever it is I am learning. In regards to Python, this process typically involves working in the Python interpreter and diligently recording commands that worked as intended. Today I want to share the reference sheet I put together while exploring MySQL and [SQLAlchemy](http://www.sqlalchemy.org/) on OS X.

Tim HopperData Science and Agriculture

I'm excited to see tools developed for the web being applied to offline domains like agriculture and health. I posed a question on Twitter yesterday:

I got a number of replies. Companies in this space (not all hiring for DS) include:

  • The Climate Corporation: Known for their high profile acquisition by Monsanto, Climate Corp "combines hyper-local weather monitoring, agronomic modeling, and high-resolution weather simulations" "that help farmers improve profitability by making better informed operating and financing decisions". They are based in San Francisco.
  • FarmLogs: A YCombinator-backed startup in Ann Arbor, MI, FarmLogs is combining satellite maps, soil surveys, GPS data, and more to "expose critical operational data insights" to row crop farmers.
  • Dairymaster: In Ireland, Dairymaster is hiring data scientists for their business of dairy equipment manufacturing.
  • aWhere: aWhere "delivers agricultural intelligence into the hands of farmers, commercial growers, and policymakers everywhere" by collecting detailed weather data from around the world. They are based in Denver.
  • pulsepod: This small startup is building hardware to help farmers "know when to plant, fertilize, irrigate, and harvest to achieve quality and yield goals using data from your own field". They are based in Princeton, NJ.
  • Granular: "Granular, a new kind of farm management software and analytics platform, helps you improve efficiency, profit and yield so you are ready to farm more acres." They are based in San Francisco.
  • PrecisionHawk: Based in my home of Raleigh, NC, PrecisionHawk is "an information delivery company that combines unmanned aerial systems, cutting-edge artificial intelligence and remote sensing technologies to improve business operations and day-to-day decision making."
  • mavrx: "We use aerial imagery to provide actionable insights to the global agriculture industry." They are based in San Francisco.
  • AgSmarts: Memphis-based AgSmarts "is a Precision Agriculture company that automates existing agricultural irrigation systems with our universal retrofit solution to optimize crop yield, save water and minimize input costs via mesh networks of IP-enabled controllers & environmental sensors."

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

In an earlier post, I derived the necessary equations to sample table assignments when applying the "Posterior sampling in the Chinese restaurant franchise" algorithm to the case of latent Dirichlet allocation. I complete the derivation of necessary equations here.

We need to sample $k_{jt}$ (the dish/topic for table $t$ in restaurant $j$):

\[\displaystyle p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt}\cdot f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k \text{ previously used,}}\\ \gamma\cdot f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

where $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$ is the "conditional density of ${\bf x}_{jt}$ given all data items associated with mixture component $k$ leaving out ${\bf x}_{jt}$" (Teh, et al). (${\bf x}_{jt}$ is every customer in restaurant $j$ seated at table $t$). $m_{\cdot k}^{-jt}$ is the number of tables (in all franchises) serving dish $k$ when we remove table $jt$.

This requires $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$; this is different from Equation (30), though they look quite similar.

\[ \begin{align} f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} f(x_{ji} \,|\, \phi_k) \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \end{align} \]

The denominator is

\[ \begin{align} \text{denominator}&= \int\left[ \prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}} \prod_w \phi_{kw}^{\beta-1} \right] d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}+\beta-1} \right] d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( \sum_w n_{kw}^{-jt}+\beta \right) } \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) } \end{align} \]

The numerator is

\[ \begin{align} \text{numerator} &=\int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int \prod_{w} \phi_{kw}^{ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta - 1 } d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( \sum_w \left[ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right] \right) }\\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

This gives us a closed form version of this conditional distribution:

\[ \begin{align} f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) &= \displaystyle\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) } \frac{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

We also need the conditional density when $k$ is a new dish: $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$. Shuyo provides it without derivation:

\[ \begin{align} f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\int \left[ \prod_{x_{ji}\in \mathbf{x_{jt}}} f(x_{ji} \,|\, \phi) \right] h(\phi)d\phi_k \\ &=\int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}}^{\left(\beta+1\right)-1} \cdot \prod_{x_{jt}\not\in \mathbf{x_{jt}}} \phi_{x_{jt}}^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \prod_{x_{ji}\in \mathbf{x_{jt}}} \Gamma(\beta+1) \prod_{x_{ji}\not\in \mathbf{x_{jt}}}\Gamma(\beta) }{ \Gamma(V\beta+\sum_{x_{ji}\in \mathbf{x_{jt}}} 1) }\\ &=\frac{ \Gamma(V\beta) \prod_v \Gamma(\beta+n_{\cdot v}^{jt}) }{ \Gamma(V\beta+n_{\cdot\cdot}^{jt}) \prod_v \Gamma(\beta) } \end{align} \]

Given these equations for $f_{k}^{-{\bf x}_{jt}}({\bf x}_{jt})$ and $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$, we can draw samples from $p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt})$ by enumeration over topics. Combined with sampling from table assignment described here, we now have a complete Gibbs sampler for the Posterior sampling in the Chinese restaurant franchise in Teh, et al.

Dave O'ConnorWeb Scraping with Scrapy

In this post we will use the fast and powerful web scraper [Scrapy](http://scrapy.org/) to comb [theonion.com](http://www.theonion.com/) for horoscopes. Web scraping is one of the many tools in the Python tool belt and can be incredibly helpful in retrieving information from a website that does not provide an API endpoint. Unfortunately web pages do change and as such web

Dave O'ConnorBuilding a Simple REST API With Django Rest Framework

Today we will focus on building a simple REST API using the [Django Rest Framework](http://www.django-rest-framework.org/). RESTful services give access, typically over HTTP, to resources containing a wide variety of data and/or media. One of the great benefits of the Django Rest Framework is that it is able to get something tangible up and running very quickly. It integrates

Dave O'ConnorLet's Build a Web App: Part 4 - Model, Template, View

Before we jump into designing the front-end of our website, we should configure Django to render content in the browser. Django is a [MTV framework](https://docs.djangoproject.com/en/1.8/faq/general/), short for Model, Template, View. Remember, we are trying to enable an end user to query PostgreSQL and visualize access log data. We already have our Entry model so now we need a

Dave O'ConnorLet's Build a Web App: Part 5 - Bootstrap

In this article we will dive into Bootstrap and begin working on a front end for querying the data in our Entry models. [Bootstrap](http://getbootstrap.com/) is a popular front-end framework for quickly and cleanly styling a website. Created by the folks at Twitter, it uses a combination of HTML, CSS, and Javascript to standardize views across mobile, tablet, and

Dave O'ConnorExploring Python 3

In this article we will discuss Python 3 as it relates to Python 2 and explore some of the more obvious differences between the two versions. Python 3 was released on December 3rd, 2008 and has been a point of contention within the Python community. Unlike previous iterations of Python, this version broke backwards compatibility and effectively erased

Frank WierzbickiJython 2.7.1 beta1 released!

On behalf of the Jython development team, I'm pleased to announce that the first beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include:
  • Import systems fixes for relative imports and some circular imports.
  • -m command now executes scripts from inside a jar file.
  • bytearray matches cpython's behavior better.
Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:
To see all of the files available including checksums, go to the maven query for Jython and navigate to the appropriate distribution and version.

Tim HopperDistribution of Number of Tables in Chinese Restaurant Process

The Chinese Restaurant Process (related to the Dirichlet process) often comes up in Bayesian nonparametrics. Some related MCMC algorithms require drawing samples from the distribution of the number of tables created by a Chinese restaurant process with a given concentration parameter after a given number of patrons are seated. Of course, you could sample from this distribution by simulating the CRP and counting tables; however, Antoniak provided a closed form of the probability mass function. Here is a helpful introduction to the "Antoniak equation".
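As an illustration, simulating the CRP and counting tables takes only a few lines of pure Python (a sketch; `crp_table_count` is my own helper name):

```python
import random

def crp_table_count(n, alpha, rng=random):
    """Seat n customers by the Chinese restaurant process with concentration
    alpha; return the number of occupied tables."""
    tables = []  # tables[t] = number of customers at table t
    for i in range(n):
        # Customer i+1 joins table t with prob tables[t] / (i + alpha),
        # or starts a new table with prob alpha / (i + alpha).
        r = rng.uniform(0, i + alpha)
        if r < alpha or not tables:
            tables.append(1)
        else:
            r -= alpha
            for t, size in enumerate(tables):
                if r < size:
                    tables[t] += 1
                    break
                r -= size
            else:
                tables[-1] += 1  # guard against floating-point round-off
    return len(tables)
```

For `alpha = 1`, the expected number of tables after `n` customers is the harmonic number $H_n$ (about 5.2 for `n = 100`), which a quick simulation reproduces.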

I wrote some quick and dirty Python code to sample from the Antoniak distribution. It's based on Matlab code by Yee Whye Teh.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Yee Whye Teh et al's 2005 paper Hierarchical Dirichlet Processes describes a nonparametric prior for grouped clustering problems. For example, the HDP helps in generalizing the latent Dirichlet allocation model to the case the number of topics in the data are discovered by the inference algorithm instead of being specified as a parameter of the model.

The authors describe three MCMC-based algorithms for fitting HDP based models. Unfortunately, the algorithms are described somewhat vaguely and in general terms. A fair bit of mathematical leg work is required before the HDP algorithms can be applied to the specific case of nonparametric latent Dirichlet allocation.

Here are some notes I've compiled in my effort to understand these algorithms.

HDP-LDA Generative Model

The generative model for Hierarchical Dirichlet Process Latent Dirichlet Allocation is as follows:

\[ \begin{align} H & \sim \text{Dirichlet}(\beta) \\ G_0 \,|\, \gamma, H & \sim \text{DP}(\gamma, H) \\ G_j \,|\, \alpha_0, G_0 & \sim \text{DP}(\alpha_0, G_0) \\ \theta_{ji} \,|\, G_j & \sim G_j \\ x_{ij} \,|\, \theta_{ji} & \sim \text{Categorical}(\theta_{ji}) \\ \end{align} \]

  • $H$ is a Dirichlet distribution whose dimension is the size of the vocabulary, i.e. it is a distribution over the (uncountable) set of term distributions (topics).
  • $G_0$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • For each document $j$, $G_j$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • $\theta_{ji}$ is a categorical distribution over terms, i.e. a topic.
  • $x_{ij}$ is a term.

To see code for sampling from this generative model, see this post.

Chinese Restaurant Franchise Approach

Instead of the above Dirichlet process model, we can think of an identical "Chinese Restaurant Franchise" model.

Each $\theta_{ji}$ is a customer in restaurant $j$. Each customer is sitting at a table, and each table has multiple customers.

There is a global menu of $K$ dishes that the restaurants serve, $\phi_1,\ldots,\phi_K$.

Some other definitions:

  • $\psi_{jt}$ is the dish served at table $t$ in restaurant $j$; i.e. each $\psi_{jt}$ corresponds to some $\phi_k$.
  • $t_{ji}$ is the index of the $\psi_{jt}$ associated with $\theta_{ji}$.
  • $k_{jt}$ is the index of $\phi_k$ associated with $\psi_{jt}$.

Customer $i$ in restaurant $j$ sits at table $t_{ji}$ while table $t$ in restaurant $j$ serves dish $k_{jt}$.

There are two arrays of count variables we will want to track:

  • $n_{jtk}$ is the number of customers in restaurant $j$ at table $t$ eating dish $k$.
  • $m_{jk}$ is the number of tables in restaurant $j$ serving dish $k$ (multiple tables may serve the same dish).

To summarize:

$x_{ij}$ are observed data (words). We assume $x_{ij}\sim F(\theta_{ij})$. Further, we assume $\theta_{ji}$ is associated with table $t_{ji}$, that is $\theta_{ji}=\psi_{jt_{ji}}$. Further, we assume the topic for table $j$ is indexed by $k_{jt}$, i.e. $\psi_{jt}=\phi_{k_{jt}}$. Thus, if we know $t_{ji}$ (the table assignment for $x_{ij}$) and $k_{jt}$ (the dish assignment for table $t$) for all $i, j, t$, we could determine the remaining parameters by sampling.
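A minimal sketch of the count bookkeeping (my own illustration, not code from the paper) might keep both count arrays as dictionaries and update them as customers are seated and unseated:

```python
from collections import defaultdict

class FranchiseCounts:
    """Track n_jtk (customers per table per dish) and m_jk (tables per dish),
    the two count arrays needed by the Chinese restaurant franchise sampler."""

    def __init__(self):
        self.n = defaultdict(int)  # (j, t, k) -> number of customers
        self.m = defaultdict(int)  # (j, k)    -> number of tables serving dish k

    def seat(self, j, t, k):
        if self.n[(j, t, k)] == 0:
            self.m[(j, k)] += 1  # table t in restaurant j now serves dish k
        self.n[(j, t, k)] += 1

    def unseat(self, j, t, k):
        self.n[(j, t, k)] -= 1
        if self.n[(j, t, k)] == 0:
            self.m[(j, k)] -= 1  # table emptied; it no longer serves dish k
```

In the Gibbs sweeps below, each word is unseated before its new table and dish are sampled, then seated again, so these increments and decrements happen constantly.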

Gibbs Sampling

Teh et al describe three Gibbs samplers for this model. The first and third are most applicable to the LDA application. The second helps with more complicated applications of the HDP (e.g. the hidden Markov model).

5.3 Posterior sampling by direct assignment

Section 5.3 describes a direct assignment Gibbs sampler that directly assigns words to topics by augmenting the model with an assignment variable $z_{ji}$ that is equivalent to $k_{jt_{ji}}$. This also requires a count variable $m_{jk}$: the number of tables in document/franchise $j$ assigned to dish/topic $k$. This sampler requires less "bookkeeping" than the algorithm from 5.1; however, it requires expensive simulation or computation of recursively computed Stirling numbers of the first kind.

5.1 Posterior sampling in the Chinese restaurant franchise

Section 5.1 describes "Posterior sampling in the Chinese restaurant franchise". Given observed data $\mathbf{x}$ (i.e. documents), we sample over the index variables $t_{ji}$ (associating tables with customers/words) and $k_{jt}$ (associating tables with dishes/topics). Given these variables, we can reconstruct the distribution over topics for each document and distribution over words for each topic.

Notes on Implementing Algorithm 5.1

Teh et al's original HDP paper is sparse on details with regard to applying these samplers to the specific case of nonparametric LDA. For example, both samplers require computing the conditional distribution of word $x_{ji}$ under topic $k$ given all data items except $x_{ji}$: $f_k^{-x_{ji}}(x_{ji})$ (eq. 30).

A Blogger Formerly Known As Shuyo has a brief post where he states (with little-to-no derivation) the equations specific to the LDA case. Below I attempt to provide some of those derivations in pedantic detail.

As stated above, in the case of topic modeling, $H$ is a Dirichlet distribution over term distributions and $F$ is a multinomial distribution over terms.

By definition,

\[ h(\phi_k)=\frac{1}{Z}\prod_v[\phi_k]_v^{\beta-1} \]


\[ f(x_{ji}=v \,|\, \phi_k)=\phi_{kv}. \]

Equation (30): $f_k^{-x_{ji}}(x_{ji})$

For convenience, define $v=x_{ji}$ (word $i$ in document $j$), define $k=k_{jt_{ji}}$ (topic assignment for the table in document $j$ containing word $i$), and

\[ n_{kv}^{-ji}=\left|\left\{ x_{mn} \,|\, k_{mt_{mn}}=k,\, x_{mn}=v,\, (m,n)\neq(j,i) \right\}\right| \]

(the number of times the term $v$ is generated by the same topic as $x_{ji}$, not counting $x_{ji}$ itself).

First look at the term (for fixed $k$):

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k)= \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} [\phi_{k}]_{x_{j'i'}} \]

$[\phi_k]_v$ is the probability that term $v$ is generated by topic $k$. The double product runs over every word generated by topic $k$ in every document. Since $[\phi_{k}]_{x_{j'i'}}$ is fixed for a given word $w$, we can instead take a product over each word of the vocabulary:

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) =\prod_{w\in\mathcal{V}}[\phi_k]_w^{n_{kw}^{-ji}} \]

We use this early on in the big derivation below.

Also, note that

\[ \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \text{ and } \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \]

are the normalizing coefficients for Dirichlet distributions.

Equation (30) in Teh's paper is:

\[ \begin{align} f_k^{-x_{ji}}(x_{ji}) &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} \phi_{kv} \right]\cdot h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right]h(\phi_k)d\phi_k } \\ &\propto\frac{ \int \phi_{kv} \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_{w} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &=\frac{ \int \phi_{kv}\cdot \phi_{kv}^{n_{kv}^{-ji}}\cdot \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}}\cdot \phi_{kv}^{\beta-1}\cdot \prod_{w\neq v} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &= \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \cdot \frac{ 1 }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\neq v} \left[ \beta+n_{kw}^{-ji} \right]+ (\beta+n_{kv}^{-ji}+1)\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] \right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ 
\Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) \Gamma\left(\beta+n_{kv}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) }{ \Gamma\left(\beta+n_{kv}^{-ji}\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) }\\ &=\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \end{align} \]

This seems to be validated in the appendix of this paper.
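As a quick numerical sanity check of this closed form (my own illustration): under the posterior, $\phi_k \sim \text{Dirichlet}(\beta + n_{k1}^{-ji}, \ldots, \beta + n_{kV}^{-ji})$, and $f_k^{-x_{ji}}(x_{ji})$ is the mean of component $v$, which we can estimate by sampling:

```python
import random

random.seed(1)  # make the check reproducible

def dirichlet_mean_component(counts, beta, v, draws=20000, rng=random):
    """Estimate E[phi_v] for phi ~ Dirichlet(beta + counts) by sampling
    (each Dirichlet draw is built from normalized Gamma draws)."""
    total = 0.0
    for _ in range(draws):
        gammas = [rng.gammavariate(beta + c, 1.0) for c in counts]
        total += gammas[v] / sum(gammas)
    return total / draws

counts = [3, 0, 1, 2]  # toy values of n_kw^{-ji} for a 4-word vocabulary
beta = 0.5
V = len(counts)
# Closed form: (beta + n_kv) / (V*beta + n_k.)
closed_form = (beta + counts[1]) / (V * beta + sum(counts))
estimate = dirichlet_mean_component(counts, beta, v=1)
```

The Monte Carlo estimate lands within a fraction of a percent of the closed-form value $(\beta + n_{kv}^{-ji})/(V\beta + n_{k\cdot}^{-ji})$.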


We also need the prior density of $x_{ji}$ to compute the likelihood that $x_{ji}$ will be seated at a new table.

\[ \begin{align} f_{k^{\text{new}}}^{-x_{ji}}(x_{ji}) &= \int f(x_{ji} \,|\, \phi) h(\phi)d\phi_k \\ &=\int \phi_v \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v^\beta \cdot \prod_{w\neq v} \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \Gamma(\beta+1)\prod_{w\neq v}\Gamma(\beta) }{ \Gamma(V\beta+1) }\\ &=\frac{ \Gamma(V\beta) }{ \Gamma(V \beta+1) }\cdot \frac{ \beta\prod_{w}\Gamma(\beta) }{ \prod_w \Gamma(\beta) }\\ &=\frac{ 1 }{ V \beta }\cdot \beta\\ &= \frac{1}{V} \end{align} \]

Equation (31): $p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})$

These last two derivations give us Equation (31), the likelihood that $t_{ji}=t^{new}$:

\[ \begin{align} p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k}) &=\sum_{k=1}^K \left[ \frac{ m_{\cdot k} }{ m_{\cdot\cdot}+\gamma }\cdot \frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \right] \\ &\phantom{=}+ \frac{ \gamma }{ m_{\cdot\cdot}+\gamma } \cdot \frac{1}{V} \end{align} \]

Equation (32): $p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k})$

From this, we know the conditional distribution of $t_{ji}$ is:

\[ p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k}) \propto \begin{cases} n_{jt\cdot}^{-ji} \cdot \frac{\beta+n_{k_{jt}v}^{-ji}}{V\beta+n_{k_{jt}\cdot}^{-ji}} & {\tiny \text{if } t \text{ previously used,}}\\ {\tiny \alpha_0 \cdot p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})} & {\tiny \text{if } t=t^{\text{new}}}. \end{cases} \]
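Sampling from an unnormalized two-case distribution like this just requires drawing an index with probability proportional to the case weights; a generic sketch (the helper name `sample_proportional` is my own):

```python
import random

def sample_proportional(options, weights, rng=random):
    """Draw one option with probability proportional to its (unnormalized)
    nonnegative weight."""
    total = sum(weights)
    r = rng.uniform(0, total)
    for option, w in zip(options, weights):
        if r < w:
            return option
        r -= w
    return options[-1]  # guard against floating-point round-off
```

In the sampler, `options` would be the existing tables plus $t^{\text{new}}$, and `weights` the two-case expressions above.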

Equation (33): $p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}})$

If the sampled value of $t_{ji}$ is $t^{\text{new}}$, we sample a dish $k_{jt^\text{new}}$ for the table with:

\[ p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}}) \propto \begin{cases} m_{\cdot k}\cdot\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} & {\tiny \text{if } k \text{ previously used,}}\\ \frac{ \gamma }{V} & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

My notes on deriving the final equation (34) for sampling $k$ are in rough shape. I intend to follow this with a post on that. In the meantime, you can see Shuyo's post, his implementation, and equation 7 in this paper.

Tim HopperClasses Future Programmers Should Take

I appreciated James Hague's post on Computer Science Courses that Don't Exist, But Should.

I really like Dave Thomas's idea of a Software Archeology class. I have spent a huge amount of time as a developer reading (vs. writing) code. I wish I'd been taught how to read code effectively.

Similarly, I have thought there should be a class (or series of classes) in "interacting with others' code". Topics could include inheriting a software project, handing off a software project, writing code using 3rd party libraries, using package repositories, and understanding software licenses. These are such important skills in real world software development, but they seem to be rarely taught in the classroom. Perhaps a follow-up class could be "contributing to open source".

Tim HopperFitting a Mixture Model with Gibbs Sampling

In an attempt to clarify my own understanding of how Gibbs samplers are derived for Bayesian models, I collected some notes on the sampler for a finite mixture model into an IPython notebook.

I think most of the literature on Bayesian computation is terrible. There's lots of handwaving with cavalier appeals to conjugacy and "integrating out". There aren't a lot of clear derivations of the equations needed for doing sampling in MCMC. As I have tried to write about it myself, I have a better appreciation of why that might be: this is complex math with a lot of moving parts. Given that, I would appreciate constructive feedback on how I could make this more clear or helpful. Even better, fork my repo and submit a pull request.

Tim HopperDealing with Outliers

From Testicular volume is inversely correlated with nurturing-related brain activity in human fathers in PNAS:

One participant's testes volume measurement was excluded because his value was 2.8 SDs above the mean (mean = 38,064; SD = 11,183) and was more than 13,000 mm^3 larger than any recorded value found in the literature. Of the more than 1,500 healthy, age-matched men in these studies, the largest reported value was 56,000 mm^3, and this participant’s measurement was 69,736 mm^3.

Tim HopperCross Entropy and KL Divergence

As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$

Kullback and Leibler defined a similar measure now known as KL divergence. This measure quantifies how similar a probability distribution $p$ is to a candidate distribution $q$.

$$D_{\text{KL}}(p\| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$

$D_\text{KL}$ is non-negative and zero if and only if $p_i=q_i$ for all $i$. However, it is important to note that it is not in general symmetric: $D_{\text{KL}}(p\| q) \neq D_{\text{KL}}(q\| p)$.
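A small numeric example (my own illustration) makes the asymmetry concrete:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities;
    assumes q_i > 0 wherever p_i > 0 (terms with p_i = 0 contribute nothing)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# kl_divergence(p, p) is 0, while the two directions between p and q differ.
```

Here $D_{\text{KL}}(p\| q) \approx 0.511$ but $D_{\text{KL}}(q\| p) \approx 0.368$, in nats.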

Jonathon Shlens explains that KL Divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p\| q)=0$, we can guarantee that $p$ is generated by $q$. As $D_{\text{KL}}(p\| q)\rightarrow\infty$, we can say that it is increasingly unlikely that $p$ was generated by $q$.

Algebraically, we can rewrite the definition as

$$ \begin{array}{rl} D_{\text{KL}}(p\| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i-H(p) \\ &=\sum_i p_i \log \frac{1}{q_i}-H(p)\\ \end{array} $$

KL Divergence breaks down as something that looks similar to entropy (but combining $p$ and $q$) minus the entropy of $p$. This first term is often called cross entropy:

$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$

We could alternatively use this relationship to define cross entropy as:

$$H(p, q)=H(p) + D_\text{KL}(p\| q).$$

Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the extra cost incurred by assuming the data could have been generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by an additive constant (namely $H(p)$) for all $q$.
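
As a quick sanity check, we can verify the identity $H(p, q) = H(p) + D_\text{KL}(p\| q)$ numerically. This is a small Python sketch with helper names of my own choosing:

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i * log(1 / q_i)."""
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

# H(p, q) == H(p) + D_KL(p || q), up to floating-point error
print(abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12)  # True
```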

Tim HopperEntropy of a Discrete Probability Distribution

Suppose we have a discrete set of possible events $1,\ldots, n$ that occur with probabilities $p_1, p_2, \ldots, p_n$. Claude Shannon asked the question

Can we find a measure of how much "choice" is involved in the selection of the event or of how uncertain we are of the outcome?

For example, suppose we have a coin that lands on heads with probability $p$ and tails with probability $1-p$. If $p=1$, the coin always lands on heads. Since there is no uncertainty, we might want to say the uncertainty is 0. However, if the coin is fair and $p=0.5$, we maximize our uncertainty: it's a complete tossup whether the coin lands heads or tails. We might want to say the uncertainty in this case is 1.
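
The coin example can be sketched in a few lines of Python. Note that getting the value 1 for a fair coin assumes we measure entropy in bits, i.e. use $\log_2$ (the function name is my own):

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty at all
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(1.0))  # 0.0: the outcome is certain
print(binary_entropy(0.5))  # 1.0: maximal uncertainty, one full bit
```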

In general, Shannon wanted to devise a function $H(p_1, p_2, \ldots, p_n)$ describing the uncertainty of an arbitrary set of discrete events (e.g., an $n$-sided die). He thought that "it is reasonable" that $H$ should have three properties:

  1. $H$ should be a continuous function of each $p_i$. A small change in a single probability should result in a similarly small change in the entropy (uncertainty).
  2. If each event is equally likely ($p_i=\frac{1}{n}$), $H$ should increase as a function of $n$: the more events there are, the more uncertain we are.
  3. Finally, entropy should be recursive with respect to independent events. Suppose we generate a random variable $x$ by the following process: Flip a fair coin. If it is heads, $x=0$. If the flip was tails, flip the coin again. If the second flip is heads, $x=1$; if tails, $x=2$.

    We can compute the entropy as $H(p_0=1/2, p_1=1/4, p_2=1/4)$. However, the independence property tells us that this relationship should hold:

    $$H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}\right)=H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{1}{2}, \frac{1}{2}\right).$$

    As David MacKay explains, this is the general claim that

    $$H(\mathbf{p})=H(p_1, 1-p_1)+(1-p_1)H \left( \frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \ldots, \frac{p_n}{1-p_1} \right).$$
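
The recursion property for the two-stage coin flip above can be checked numerically. This is a minimal Python sketch (entropy in bits, helper name my own):

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# H(1/2, 1/4, 1/4) == H(1/2, 1/2) + (1/2) * H(1/2, 1/2)
lhs = entropy([0.5, 0.25, 0.25])
rhs = entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])
print(lhs, rhs)                  # both 1.5 bits
print(abs(lhs - rhs) < 1e-12)    # True
```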

Shannon showed that given these three assumptions, there is a unique form that $H$ must take:

$$H\propto -\sum_{i=1}^n p_i \log p_i=\sum_{i=1}^n p_i \log \frac{1}{p_i}.$$

He named this measure of uncertainty entropy, because the form of $H$ bears striking similarity to that of Gibbs Entropy in statistical thermodynamics.1

Shannon observes that $H$ has many other interesting properties:

  1. Entropy $H$ is 0 if and only if exactly one event has probability 1 and the rest have probability 0. (Uncertainty vanishes only when we are certain about the outcomes.)
  2. Entropy $H$ is maximized when the $p_i$ values are equal.
  3. The joint entropy of two events is less than or equal to the sum of the individual entropies, with $H(x, y)=H(x)+H(y)$ only when $x$ and $y$ are independent events.
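
The third property can be illustrated numerically. In this Python sketch (helper names and distributions my own), an independent joint distribution attains equality, while a correlated one with the same marginals has strictly smaller joint entropy:

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

px = [0.7, 0.3]
py = [0.6, 0.4]

# Independent joint distribution: p(x, y) = p(x) * p(y)
joint_indep = [a * b for a, b in product(px, py)]
print(abs(entropy(joint_indep) - (entropy(px) + entropy(py))) < 1e-12)  # True: equality

# A correlated joint distribution with the same marginals (0.7, 0.3) and (0.6, 0.4)
joint_corr = [0.5, 0.2, 0.1, 0.2]
print(entropy(joint_corr) < entropy(px) + entropy(py))  # True: strictly less
```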

You can read more about this in Shannon's seminal paper A Mathematical Theory of Communication.

  1. Caianiello and Aizerman say the name "entropy" is thanks to von Neumann, who said

    You should call it entropy and for two reasons: first, the function is already in use in thermodynamics under that name; second and more importantly, most people don't know what entropy really is, and if you use the word entropy in an argument, you will win every time.

    They argue that the name "uncertainty" would have been much more helpful since "Shannon entropy is simply and avowedly the 'measure of the uncertainty inherent in a pre-assigned probability scheme.'"