A planet of blogs from our members...

Tim Hopper: My Python Environment Workflow with Conda

Many new Python programmers rely on their system install of Python to run their scripts. There are several good reasons to stop using the system Python. First, it's probably an old version of Python. Second, if you install 3rd party packages with pip, every package is installed into the same globally accessible directory. While this may sound convenient, it causes problems if you (1) install different packages with the same name, (2) need to use different versions of the same package, or (3) upgrade your operating system (OS X will delete all the packages you have installed).

For many years, best practice for Python developers was to use virtualenv to create a sandboxed environment for each project. If you use virtualenv, each project you work on can have its own version of Python with its own 3rd party packages (hopefully specified in a requirements.txt file). In my experience, getting started with virtualenv is cumbersome and confusing; to this day, I have to look up the command to create a Python 3 virtualenv.1

In 2015, I have almost exclusively used Python installations provided through Continuum Analytics's Conda/Anaconda platform. I have also switched from using virtualenvs to using conda environments, and I am loving it.

Before explaining my workflow, here's a quick glossary of the similarly-named products that Continuum offers.

  1. conda: "Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software."2 A conda install provides a whole suite of command line tools for installing and managing packages and environments. Because conda works for any software, it can even install different versions of Python (unlike pip).
  2. Anaconda: "Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 300 of the most popular Python packages for science, math, engineering, and data analysis." It is available across platforms and installable through a binary.
  3. Anaconda Cloud: Also known as Anaconda.org and formerly known as Binstar, "Anaconda Cloud is a package management service where you can host software packages of all kinds." Anaconda Cloud is a package repository analogous to PyPI. Packages are installed via the conda command line tool instead of pip. By default, the conda install command installs packages from a curated collection of packages (a superset of those in Anaconda). Continuum allows users to host their own packages on Anaconda Cloud; these packages can also be installed through conda install using the -c flag with the username.

Conda, Anaconda, and Anaconda Cloud are distinct but interrelated tools; keeping them straight can be hard, but doing so is helpful.

Conda (the package manager) can be installed in two ways: through the Miniconda installer or the Anaconda installer. Both install the package manager, but the latter also installs the 300+ packages for scientific Python. (I believe this is equivalent to installing Miniconda and then running conda install anaconda.)

Conda Environment Files

It has become standard for pip users to create a requirements.txt file for specifying dependencies for a particular project. Often, a developer working on a project will (1) create and activate a virtual environment and (2) run pip install -r requirements.txt to build an isolated development environment with the needed packages.

Conda provides an analogous (but more powerful) file: environment.yml.3

A simple environment.yml file might look like this:

name: numpy-env
dependencies:
- python=3
- numpy

If you are in a directory containing this file, you can run $ conda env create to create a Conda environment named numpy-env that runs Python 3 and has numpy installed.4 Run $ source activate numpy-env to activate this environment. Once activated, running $ python will run Python 3 from your environment instead of the globally installed Python for your system. Moreover, you will be able to import numpy but not any of the 3rd party packages installed globally.

environment.yml can also install packages via pip with this syntax:

name: pip-env
dependencies:
- python
- pip
- pip:
    - pypi-package-name

I see environment.yml files as a positive development from requirements.txt files for several reasons. Foremost, they allow you to specify the version of Python you want to use. At Pydata NYC 2015, many presenters provided their code in Github repositories without specifying anywhere whether they were using Python 2 or 3. Because I included an environment.yml file with my own materials, attendees could see exactly which version I was using and quickly recreate my environment with conda env create. I also like being able to specify the name of the environment in the file; this is particularly helpful when working with others. Finally, because conda can install from PyPI via pip, environment.yml files provide no less functionality than a requirements.txt file provides.

My Python Environment Workflow

Lately, whenever I am working on a new project (however big or small), I take the following steps:

  1. Create a project folder in the ~/repos/ directory on my computer.
  2. Create an environment.yml file in the directory. Typically the environment name will be the same as the folder name. At minimum, it will specify the version of Python I want to use; it will often include anaconda as a dependency.5
  3. Create the conda environment with $ conda env create.
  4. Activate the conda environment with $ source activate ENV_NAME.
  5. Create a .env file containing the line source activate ENV_NAME. Because I have autoenv installed, this file will be run every time I navigate to the project folder in the Terminal. Therefore, my conda environment will be activated as soon as I navigate to the folder.
  6. Run $ git init to make the folder a Git repository. I then run $ git add environment.yml && git commit -m 'initial commit' to add the YAML file to the repository.
  7. If I want to push the repository to Github, I use $ git create using Github's hub commands. I then push the master branch with $ git push -u origin master.

As I add dependencies to my project, I try to be sure I add them to my environment.yml file.

A major benefit of all this is how easily reproducible a development environment becomes. If a colleague or conference attendee wants to run my code, they can set up the dependencies (including the Python version) by (1) cloning the repository, (2) running $ conda env create, and (3) running $ source activate ENV_NAME. It's easy enough for me to drop those instructions, along with further instructions for running the code, into a README file. If I'm feeling especially helpful, I'll create a Makefile or Fabfile to encapsulate commands for core functionality of the code.

An even larger benefit is that I can return to a project after days, months, or years and quickly start developing without first having to hunt for print statements to figure out whether I was using Python 2 or 3.

I've come to love environment.yml files, and I think you might too.

  1. virtualenv also provides no help in actually managing Python versions. You have to install each version yourself and then tell virtualenv to use it.

  2. From the conda docs

  3. Though there is currently a pull request for adding requirements.txt support to conda: https://github.com/conda/conda-env/pull/172

  4. Numpy will be installed from a binary from Anaconda Cloud, not built from source. 

  5. I created a bash command conda-env-file to automatically create an environment.yml file named after the current directory. 

Tim Hopper: Sequential Minimal Optimization Algorithm for Support Vector Machines

In my nonlinear optimization class in grad school at North Carolina State University, I wrote a paper on the famed SMO algorithm for support vector machines. In particular, I derive the Lagrangian dual of the classic formulation of the SVM optimization model and show how it can be solved using the stochastic gradient descent algorithm.

You can find the paper here.

Caktus Group: What human-centered design can do for international development

Cross-posted with Creative Associates International. Written by Gina Assaf (Creative Associates International) and Tania Lee (Caktus Group). Image courtesy of Creative Associates International.

A human-centered design process is a critical step that is often overlooked when making important decisions in technology for development.

In the development space, funding usually comes from a donor or an agency, while the services or products it pays for go to a different place: local communities in need around the world. The indirect link between the donors and the beneficiaries can lead to a lack of accountability, as Dave Algoso explains in his blog on Reboot.org.

But practicing human-centered design, or user-centered design, which puts the needs, wants and limitations of users first in each stage of the design process, can help ensure that the services and products actually meet the needs of the end users—the communities we seek to empower and serve.

For example, funding a software product that requires high bandwidth connectivity when the population it is meant to serve has minimal Internet connectivity will not benefit that community. Similarly, a mobile banking tool utilizing lots of text may not be appropriate for low-literacy communities.

Not practicing human-centered design could mean wasting funding on initiatives that simply aren’t effective and losing collaborative opportunities for designing better solutions.

At this year’s UXDC conference, organized by the DC Chapter of the User Experience Professionals Association, technologists, development practitioners and design professionals gathered to discuss “Lessons and Challenges in Implementing User Centered Design and UX Practices in the International Development Sector.” Human-centered design is gaining more and more traction in the international development realm.

Is international development ready for human-centered design?

International development funding mechanisms are mainly donor-driven, and convincing donors to invest in user research, user experience design or iterative prototyping (all important tenets of human-centered design) can be a tough sell. Funding models in development are not ideally set up for iteration, a key component for human-centered design.

Development funds often come from taxpayers and are somewhat restricted in how they can be spent. Investing in a process that may require more upfront work and iteration may be considered wasteful.

The Request for Proposal process is set up in a way that assumes we have enough information to make decisions when pitching a proposal, and the turnaround timelines are very short. This is challenging for proposing and implementing technology innovations since it leaves little room for a human-centered design process, which requires more upfront thought, needs assessment and iteration.

To add to the hurdles, there is a lot of confusion in the vocabulary between the user experience community and the development and humanitarian communities. The two groups speak different languages. For example, humanitarian and development communities use terms like “access,” “capture,” “accountability,” “dignity” and “livelihoods.” User experience communities use terms such as “co-design,” “user-centered,” “journey mapping” and “need-finding.”

But there is room for finding common ground, if not a common language.

Some parts of the human-centered design process are already implemented (to some degree) in development but described and worded differently.

For example, the “capture” phase in development is when “need-finding” occurs in design. “Need-finding” is an essential part of human-centered design, where we understand the needs, limitations and concerns of the target users. The “capture” phase in development performs a similar task, but it could be further enhanced by making the beneficiary more central to this research and by making use of some of the design research methods used in the human-centered design process.

Importance of participatory design

Despite the challenges, there are many opportunities to promote and improve on the use of human-centered design practices within development.

For example, participatory design attempts to actively involve all stakeholders (e.g. employees, partners, customers, citizens, and end users) in the design process to help ensure the result meets their needs and is usable. This directly relates to the first principle of the Principles for Digital Development: “Design with the user.” A participatory design process builds collaboration across various internal teams that often work in silos: programs, technology, grants, procurement, and more.

However, “design with the user” isn’t always easy.

Within development and humanitarian spaces, often the user or the person most affected could be a part of an extremely vulnerable group or hard to reach, and working directly with extremely vulnerable groups is not always possible or advised. In addition, funding restrictions may inhibit participatory design. For example, some U.S. government agencies won’t allow more than 10 external persons to participate in projects due to funding restrictions.

Also, there are multiple groups of stakeholders in development: community members, community leaders, development staff, development staff managers, donors, and more. Balancing and prioritizing needs is a challenge.

Bridging the development-user experience divide

There is a lack of understanding across user experience professionals and international development communities—a fact confirmed by panelists and participants at the UXDC session. But it is not an unbridgeable gap. More conversations and discussions around this topic are needed and panels, like the one at UXDC, help bridge this gap.

While there are many obstacles to implementing the entire human-centered design process and methods in international development projects, there are positive trends—including the emergence of development labs and innovation hubs with an emphasis on capacity building for teams and human-centered design training. There also seems to be receptiveness and an eagerness among local innovators in developing countries to move forward with human-centered design.

For example, Creative Associates International worked in Honduras using a human-centered design approach, and performed research in partnership with the local field staff, to understand how the computer lab could better serve the wants and need of the community.

Creative trained the local Honduran staff and performed user interviews together, synthesized the results and brainstormed solutions together. Creative is now in the process of testing and iterating some of the proposed solutions. The team there is very positive and enthusiastic about this process, and they are leading it forward and are seeing how the work they are doing can make a big impact.

Caktus Group offers similar services to clients and recently conducted a discovery workshop in Turkey with a humanitarian NGO serving local refugee communities. The discovery workshop was conducted with roughly 15 client staff and utilized a series of brainstorming, mapping, outlining, crowdsourcing and voting activities to develop terms of success, user personas, and journey maps, and to define a minimum viable product.

Caktus has since been working with this client to co-design and iteratively build a custom web application that fits their unique needs.

Human-centered design principles are already being utilized in many development programs, but there are many more opportunities for improvement. Better metrics and further discussion and collaboration of human-centered design practice in development can help to address the challenges, demonstrate value and achieve lasting results in people’s lives.

Gina Assaf is a Product and User Experience Manager with the Technology for Development team at Creative Associates International. Tania Lee is an SMS & Web Product Strategist at Caktus Group.

For more information about the UXDC community or to join the human-centered design working group, please contact Gina Assaf, ginaa@creativedc.com; Tania Lee, tlee@caktusgroup.com; or Ayan Kishore, ayank@creativedc.com.

Tim Hopper: Knitting

True story: I'm a closet knitter. I don't have much time for it these days, but it helped keep me sane in grad school. Here are some things I've made over the years.

Caktus Group: Initial Data in Django

I've struggled to find an ideal way to load initial data for Django projects. By “initial data,” I'm referring to the kind of data that you need on a new system for it to be functional, but could change later. These are largely lists of possible choices, such as time zones, countries, or crayon colors.

Here are my requirements:

  • Fairly simple to run on initial server deploy, initial development environment setup, and when starting a test run.
  • Does not risk overwriting changes that are made to records in the live database after they're initially created.
  • Not too hard to update from the current live data, so that future new deploys etc get the latest data
  • Copes well as models evolve, because they will
  • Well supported by Django
  • Not too great a performance impact on testing

Here are some of the approaches that I've tried.

Fixtures

Fixtures are how Django used to recommend loading initial data.

Pros:
  • It's fairly easy to update the fixtures as the "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata.
  • Fixtures don't slow down test startup if they're not named "initial.XXXX" or you're using a recent version of Django, because they don't get loaded automatically.
  • Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.

Cons:
  • fatal - If a fixture is loaded again, it overwrites any changed data in the database with the original values
  • Discouraged by current Django documentation
  • Hard to keep valid when models evolve. The right way would be: every time a model changes, update the fixtures from the current data, then create a fresh temporary database without applying the new migrations, load the current fixtures, apply the new migrations, and make a fresh dump of the initial data. But that’s a lot of work, and hard to remember to do every time models change.
  • Data is not automatically available during tests, and since our system won't run correctly without some of this data, you have to arrange to load or create it at test setup.
  • Not loaded automatically so:
  • When setting up new development environments, you must document it and it’s still easily overlooked, or else get a developer to run some script that includes it
  • For automated deploys, not safe to run on every deploy. Probably the only safe approach is to run manually after the first deploy.

Summary: rejected due to risk of data loss, inconvenience during development, and negative recommendation from Django documentation.

Fixture hack

I played around with a modified loaddata command that checked (using natural keys) if a record in the fixture was already in the database and did not overwrite any data if the record had previously been loaded.

This means it's safer to add to scripts and automatic deploys.

Pros:
  • Fairly easy to update as "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata
  • Fixtures don't slow down test startup if they're not named "initial.XXXX" or you're using a recent version of Django, because they don't get loaded automatically
  • Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.
  • Can add to env setup scripts and automated deploys safely

Cons:
  • Hard to keep valid when models evolve
  • Data is not automatically available during tests
  • Not loaded automatically, so when setting up new development environments, you must document it and it’s still easily overlooked, or else get a developer to run some script that includes it

Summary: rejected; it mitigates one problem with fixtures, but all the others remain.

Post-migrate signal

Something else I experimented with was running code to create the new records in a post-migrate signal, even though the docs warn against data modification in that signal.
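
For illustration, here is a minimal sketch of what that experiment might look like, assuming a hypothetical crayons app with a Color model (both names invented for this example); the handler is connected in the app's AppConfig and uses get_or_create so existing records are left alone:

    # crayons/apps.py (hypothetical app and model names)
    from django.apps import AppConfig
    from django.db.models.signals import post_migrate

    def create_initial_colors(sender, **kwargs):
        # Import lazily so the app registry is fully loaded.
        from crayons.models import Color
        for name in ["red", "blue", "green"]:
            # get_or_create leaves existing (possibly edited) rows alone.
            Color.objects.get_or_create(name=name)

    class CrayonsConfig(AppConfig):
        name = "crayons"

        def ready(self):
            # Fires after every migrate call, including reverse migrations,
            # which is exactly the problem described below.
            post_migrate.connect(create_initial_colors, sender=self)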

Pros:
  • Runs automatically each time migrations are run, so will automatically get run during most automated deploys
  • Runs automatically when tests are setting up the test database, so all tests have the data available - but is part of the initial database, so we don't have the overhead of loading initial data during every test's setUp.

Cons:
  • fatal - Runs every time migrations are run, even reverse migrations - so it runs when tables are in the wrong state and breaks development when you might be migrating forward and back
  • If it fails, the whole migration fails, so you can't just ignore a failure even if you didn't care about creating the initial data that time
  • Slows down database creation when running tests, unless you use --keepdb

Summary: rejected; not a valid way to load initial data.

In a migration

Add a migration that creates the initial records.
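
As a rough sketch (again with an invented crayons app and Color model), such a data migration might use RunPython with get_or_create so that records which already exist are not overwritten:

    # crayons/migrations/0002_initial_colors.py (hypothetical)
    from django.db import migrations

    def load_initial_colors(apps, schema_editor):
        # Use the historical model state, not the current model class.
        Color = apps.get_model("crayons", "Color")
        for name in ["red", "blue", "green"]:
            Color.objects.get_or_create(name=name)

    class Migration(migrations.Migration):

        dependencies = [
            ("crayons", "0001_initial"),
        ]

        operations = [
            # The no-op reverse function lets the migration be unapplied.
            migrations.RunPython(load_initial_colors, migrations.RunPython.noop),
        ]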

Pros:
  • This is what the Django documentation currently recommends
  • Runs automatically
  • The migration only runs when the database schema matches what it was when you wrote it, so it won't break as models evolve
  • You can write it to ignore records that already exist, so it won't overwrite later changes in the database

Cons:
  • fatal in some cases - migrations don't use the actual model class, so models with custom behavior (like MPTTModel) won't get created correctly. You might be able to find workarounds for this on a case-by-case basis.
  • Slows down database creation when running tests, unless you use --keepdb
  • Harder than fixtures to update as the initial data evolves. Options:
    • Go back and edit the original migration - but then it won't run on existing databases and they won't get the new records
    • Add a new migration that adds the whole updated initial data set, then go back and comment out the code in the previous initial data migration since there's no point running it twice on new database setup
    • Add yet another migration for just the new data - probably the simplest in terms of updating the migrations, but it'll be harder to extract just the new data from the current database than to just extract the whole dataset again. Also, you don't preserve any edits that might have been made over time to older records.

Summary: best option so far. It has some drawbacks, but not as bad as the other options.

Conclusion
The best approach in most cases is probably to load initial data in a migration, as the Django documentation recommends. It's not perfect, but it avoids some of the fatal flaws of other approaches. And the new (in Django 1.8) --keepdb option helps ameliorate the slow test startup.

I'm still curious if there are other approaches that I haven't considered, though.

Tim Hopper: My First Publication

When I worked at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Medical Internet Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.1

Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. ... We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).

  1. I have no idea what "Infoveillance" means. 

Caktus Group: The Long, Hard Haul to Uncovering a Single, Tiny CSS Property Bug in IE10

There’s a very small but devastatingly crash-inducing bug in Internet Explorer 10. Watch out for setting a background-color to inherit on any pseudo element (like ::before and ::after), because this will crash IE completely every single time.

#element::after {
    background-color: inherit; /* CRASH! */
}
Rather than dedicate an entire post to a single CSS property, let’s trace the work that was required to track this down. The payoff is obviously important (a supported browser crashing is a big problem!) but it can feel frustrating when a big issue comes down to something so seemingly tiny. It is important to remember the real measure of the bug is not the number of lines involved in fixing it (one, in our case) but the amount of effort it took to track down. Don’t let the small fix (just give it a color directly) betray the hard work you put into understanding the issue.

First, how did we get here, anyway? A crashing browser is something you hope doesn’t go unnoticed for long.

It is pretty common to try to avoid running full regression tests across a site, both early on and throughout the project. When you have a dozen supported browsers on the low end, the most efficient workflow tends to be getting a lot of feature work and layout under your belt early, and then attacking each browser’s quirks after testing what you’ve built across them. If you try to test the entire supported set of features for every small changeset, especially on a pre-launch product, you’re going to be repeating a lot of work and sinking a lot of hours you could better spend getting more of the initial work wrapped up.

So a couple of days of iteration had been done with new feature work and a few bug fixes, and then we found this crash that perplexed us. At that point, it could have been any one of a dozen commits across a number of components. We could have expected some tweaks would be needed to bring Safari and Internet Explorer in line, but the outright crash certainly took us by surprise. We began working to narrow down the cause. We ruled out SauceLabs, the excellent browser testing service, by reproducing it locally on our own setup. IE 11 didn’t crash, and most of the site was just fine except this one page, so we knew it had to be a pretty specific issue.

We had enough to start tracking it down. Being able to reproduce it reliably and only on one browser meant we could really focus any tests we did, so we got started right away.

Having confirmed this is reproducible, how do we start to narrow this down when we aren’t sure when it happened and the crashing nature of the bug precludes us from looking into what’s going on?

While the problem happened on a single page, that wasn’t enough to go on to tell us where it was happening. We had not even discovered that it was a CSS issue at this point, so basically anything was suspect.

We turned to the ever useful Git Bisect command, which every developer should have in their toolbelt. Git Bisect, if you haven’t had an opportunity to learn about it yet, helps you locate a specific commit in your history that caused a bug. You start by telling it the last known commit where you can confirm the bug did not exist and the latest commit (usually the head of your branch or master) where the bug does exist. Git begins narrowing it down, basically stepping halfway between the two known good and bad commits and at each step asking you “What about this commit, is it good or bad?” At the end of the process, which usually takes only a handful of steps, you’ll have a single commit where the problem first appears. This is great when you have a problem that went undetected for a while and you need a smaller target to aim for.

The diff in question is CSS only, so now we know the problem is in our CSS! This really helped to narrow it down, both to one aspect of our changes (neither markup nor Javascript seemed to be at fault) and in the number of changes to look closer at (it was not a very large diff). It was, however, much more surprising than a strange Javascript bug might have been. Who ever heard of CSS crashing a browser, after all? The first thought was a parser bug and we pored over the diff looking for errant curly braces or other oddities that might have gone unnoticed and caused this crash. Everything looked flawless, from a syntax perspective, so we were left with the conclusion that it was some actual style causing the failure, rather than a parser issue.

As both a sanity check and a confirmation of our findings so far, we cleared the entire CSS file to which these changes were isolated and pulled the page back up in Internet Explorer 10. This confirmed our findings when a very ugly, but non-crash-inducing, version of the page was rendered. We now had clear confirmation that something in this simple CSS file was the culprit, and a quick look at the short commit we had narrowed down as the time when the problem was introduced gave us no good candidates.

There were a few color changes. A bit more padding on one element and a slimmer margin on another. Heck, several of the changes were just fixing indentation in a few places! Nothing stood out.

One by one we commented out the rulesets that had been changed in the diff until we found one that only caused the problem when it was present:

#cssmenu::after {
    position: absolute;
    top: 0;
    left: -50%;
    width: 150vw;
    height: 100%;
    content: "";
    background-color: inherit;
    z-index: -1;
}
This is a standard trick to create a wider background padding around a section that needed to expand beyond the content well. It is just a simple pseudo element with all the necessary properties. These pseudo elements require the position, top, left, width, height, and content properties. This leaves the final two properties, background-color and z-index, as likely causes.

Finding that the answer was background-color was only a matter of commenting each property out one by one to test, and seeing that this combination from the start of the article was our problem all along:

#cssmenu::after {
    background-color: inherit;
}
We had the problem and immediately knew the solution was easy. The color being inherited was just “white” and we had only used “inherit” to remove a minor redundancy. Explicitly stating the color here fixed the problem immediately.

But it fixed the problem in seconds, while it had taken the two of us nearly an hour to track down the bug. We had to know more about what had caused us so much headache and followed up with a few experiments.

We initially had a theory, supported by a false-positive test, that this affected ::after but not ::before elements, but we quickly learned it affected both varieties of pseudo element.

We dug around as much as we could without sinking time, but could not find any mention of this problem in the usual places people complain about things (StackOverflow, MSDN forums).

If you’ve found this post because you googled the terms “css after background-color inherit crash” then I hope you can breathe a sigh of relief and move on with your day.

For the reader following this story for the journey and not the solution, remember this the next time you feel betrayed by a long debug session that reveals what seems like a small issue: you did not waste your time. The measure of a bug should not be in lines of code or the time to fix the problem once you understand it, but in the effort it requires to uncover, understand, and correct. If we let ourselves be deceived by false measurements of how big or difficult a bug really is, focusing too much on only the very final and sometimes minor step of fixing it, we devalue the very important and difficult work of uncovering complicated problems. And that is where real skill and experience lie, not in the few lines at most it might take to correct.

Joe Gregorio: Halloween Dragon

We do large displays for Halloween, usually involving converting the garage into a haunted house. They're major undertakings, with four or more actors, so we only do them every other year. I also try to roll a "build" into every Halloween. For example, 6 years ago I built this as part of the haunted house:

And two years ago the same skeleton got taken out of the jail and laid out on a table, this time animated with pneumatics. Neat tip on this one: those sprinkler valves for watering systems work just as well for controlling air.

This year Lynne requested that we do something less intensive. Now I agreed that the haunted houses were a lot of work, but I still wanted to do something big, or at least give the appearance of something big. That's when I hit on the idea of doing a dragon in the garage, but only opening the garage door a bit, so you could only see the nose and the tail of the dragon. Let the kids' imaginations fill in the rest of the beast.

The head was paper-mache, with a hot-glued cardboard frame, and two nostrils that could be hooked up to a fog machine. Here are photos of the build. And here's a shot of the fog test:

Next came the build for the articulated tail, with a wooden frame and bicycle brake cabling for the controls. Here are some photos from the build. And here's the tail in action, first w/o the scales, and then with the scales.

And finally putting it all together, with a "Beware of Dragon" sign, and a scattering of bones across the driveway, and the Dragon Sound Effects Panel I posted yesterday:

Now it doesn't look too scary during the day, but on Halloween night, only lit by a strobe light, with the candy bowl between the nose and the tail, and the proper "story" it was quite effective. Here's Lynne's description:

The kids were told the dragon was sleeping (cue - snoring noise) but could smell the kids when they got close (cue - sniffing noise) and "probably" couldn't get out of the garage, was "probably" full from all the kids he ate earlier, and "probably" wouldn't eat them if they were careful. They had to walk to candy bowl between head and tail. As they did the dragon awoke, started to move, and then huge banging and movement of garage door as he tried to get to them. They would run and scream and then bravely try again! It was precious.

I think we had over 100 people see the dragon, some coming back for multiple attempts. It was a blast to run and just as fun to build, as the three youngest kids all participated in the build. I can't wait for 2017, I'm thinking a giant pneumatic octopus...

Joe Gregorio: Dragon Sound Effects Web Audio API

A sound effects board with the sounds of a dragon built using the Web Audio API. Because Halloween.

Chains: [Source]

Roar: [Source]

Sniff: [Source]

Snore: [Source]


Caktus Group: Open Sourcing SmartElect: Libya's SMS Voter Registration System

We are proud to say that, with the Libyan High National Elections Commission (HNEC) and consultative support from the United Nations Support Mission to Libya, we have open sourced their elections management platform today under a permissive Apache 2.0 license. Open sourcing means other governments and organizations can freely adopt and adapt the elections tools, which cover nine functional areas. The tools range from SMS voter registration, the first of its kind, to bulk alerts to voters and call center support software. You can learn more at our brand new SmartElect homepage. This is the culmination of two years of work, so we’re incredibly excited to share SmartElect with the rest of the world.

With an established partner network, others interested in adopting or adapting this code can easily find experienced partners who've worked on this project before. The network consists of those of us that worked on the original code base and can provide commercial support.

How SmartElect got its start

Initially, the building of SmartElect was entirely driven by the Libyan High National Elections Commission’s need for a voter registration alternative. The interim Libyan government and HNEC faced extraordinary challenges in access to elections: challenging terrain, nomadic populations, and civil unrest. They turned to text message voter registration as a low-cost and more accessible alternative to reach more people safely.

Caktus joined the team at the start of HNEC’s efforts in Fall 2013. We then took over as a primary contractor after the February 2014 elections and open sourcing was immediately one of HNEC’s priorities for us. This thrilled us since we have long believed in the importance of open source to building affordable tools for others. We’ve been steadily working on open sourcing the project since then while simultaneously continuing to develop and strengthen the system for Libya.

SmartElect took shape as we discussed amongst ourselves the best way to grow the open source community around it.

What can SmartElect be used for

We hope to see SmartElect used by other organizations and governments. While the system used for Libya was highly customized to support Libya's electoral process, the SmartElect features have been generalized for open source use and must be adapted to each unique electoral process. Since each government is different, customization will be necessary. SmartElect includes the following adaptable components for governments to select from:

  • Voter registration and polling center management
  • SMS voter registration
  • Citizen call center support
  • Voter roll generation (a truly tricky task that involves literal tons of printed rolls)
  • Dashboards that show the public real-time registration data
  • Bulk messaging to registered voters
  • Voter registration verification tools
  • Text message reporting tools for poll workers
  • Quick vote tabulations to improve electoral integrity (built by Ona.io)

How we open sourced SmartElect

To open source Libya’s system, we examined two years of work—approximately 8,482 team hours since Fall 2013 and 83,095 lines of code. (We had some fun crunching the numbers: the code is equivalent in length to 1,179 sedans lined up end to end.) Parts of the system were built with open sourcing in mind, but there was much untangling to do between Libya’s system and what could be released securely and usefully. Our team of developers worked incredibly hard with our partners at HNEC to ensure the open source code met the highest security and usability standards. We also had a third party security firm audit both Libya’s system and SmartElect for potential technical vulnerabilities. We’re happy to say they gave us a clean bill of health.

The technology behind SmartElect

SmartElect is built using Django, the open source framework we specialize in and RapidSMS, a Django-based framework we helped UNICEF build especially for SMS tools. Django’s tagline, “the Web framework for perfectionists with deadlines,” is a true reflection of what this project needed. In the original development, there were elections deadlines looming and the weight of knowing we needed to build something quickly and well to support a new democracy.

Who was involved in building SmartElect

A system as complex and robust as SmartElect had many key players. Reboot, a consulting firm, first got the technical experts together so we could partner with HNEC; we owe them great thanks for connecting us before they departed in the initial phase of the Libya project. As lead on the project, we subcontracted with Elliott Wilkes, the future principal and co-founder of Forsa Consulting, for in-country coordination and project management (an invaluable role). Together with our developers, Praekelt Foundation, Ona Data, and the MIS Department all provided critical components of Libya’s system. At HNEC’s behest, Caktus took the lead on open sourcing the code base.

What our hopes are for SmartElect

We’re looking forward to seeing new iterations of SmartElect take place as other developers build upon HNEC’s work. We’ve created a developer’s mailing list and a partner network to support both developers and organizations interested in adopting or adapting SmartElect. We’re excited to see what new versions of the software will take shape.

Read our official press release

Lead Developer Vinod Kurup makes the SmartElect repo public.

All the players: HNEC, UNDP, IOM, us, and those who subcontracted with us in building Libya’s SMS voter registration system.

Frank Wierzbicki: Jython 2.7.1 beta2 released!

On behalf of the Jython development team, I'm pleased to announce that the second beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include better unit testing under Windows and the restoration of the os.getpid() function.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:
To see all of the files available including checksums, go to the maven query for Jython and navigate to the appropriate distribution and version.

Caktus Group: Identifying Racial Bias in Policing with a Data-driven App

Recently, Caktus co-founder Colin Copeland spoke about the creation of a web app that analyzes North Carolina traffic stop data to identify racial bias during the Code for America 2015 Summit. The website allows both police departments and community members to visualize a dataset of more than 18 million stops statewide. Colin spoke with Ian Mance, the originator of the app idea and staff attorney with the Southern Coalition for Social Justice. Together with fellow community members, Andy Shapiro and Dylan Young, they used Django, an open source web framework, to make policing data more accessible.

Tim Hopper: Tweets I'm Proud Of


On nonparametric Bayesian methods:

On code:

On key pressers:

Tim Hopper: Wrapping Up on Nonparametric Bayes

Today is my last day at Qadium.

I've been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of nonparametric latent Dirichlet allocation.

Here is an IPython notebook I prepared showing how that library can be used to analyze the topics discussed in Econtalk, my favorite economics podcast.

I also published a notebook deriving the collapsed Gibbs sampler for Bayesian mixture models, showing a toy example in Python, and showing how the model can be extended to the nonparametric case (with a Dirichlet process prior).

Next week, I am joining the data science team at Distil Networks.

Caktus Group: A Peek Behind the Scenes at Caktus

Over the years, Caktus has had the opportunity to employ a wonderfully talented staff from a wide range of backgrounds. This summer we decided we would conduct a survey of Caktus staff. This infographic is the result of that study, and allows a little peek into the Cakti way of life. Conclusions: cats are great and coffee keeps you sharp!

Caktus Group: ShipIt Day ReCap: Q4 2015

Members of one team joined forces with local meetup Code for Durham to help with the alpha launch of the School Navigator App. Using publicly available data, the School Navigator allows users to geolocate nearby Durham schools and view information like performance ratings. The team included Code for Durham co-founder Colin Copeland, who upgraded the Django template for the latest version of SALT. Erin Mullaney helped expand a feature denoting different school zones on the map, using Angular for the first time to do so. She also worked on a query change to more closely match the rules of districting on the map’s display. Victor Rocha developed various bug fixes and merged pull requests. In the meantime, David Ray put his new Ionic skills to the test by building a mobile app version of the School Navigator, now available from the Google Play store. (David’s starting Ship It Day project was working through an Ionic tutorial to create a Reddit viewing app with pull refresh and infinite scrolling.)

Mark Lavin, Rebecca Conley, Dmitriy Chukhin, and Ben Phillips built a Taylor Swift song generator that uses her lyrics to create new songs. Not only can you create new songs out of beloved Swift lyrics, but you can save your favorites to reference later. The app was built using a library called Markovify to create a Markov chain model. This model analyzes a body of text to determine probabilities of word occurrence. This then allows the app to generate next words based on these probabilities. The team even designed a cheeky 404 error page. Readers, I now encourage you to break the internet.
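
For a rough idea of how that works (the corpus file name here is invented), Markovify builds a model from a plain-text body of lyrics and then emits new sentences based on the learned word-transition probabilities:

    import markovify

    # Build a Markov chain model from a plain-text corpus of lyrics.
    with open("swift_lyrics.txt") as f:
        model = markovify.Text(f.read())

    # Generate a few new lines; make_sentence() may return None if it
    # can't build something sufficiently different from the corpus.
    for _ in range(3):
        print(model.make_sentence())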

Rob Lineberger also had music on the brain. He developed the A-Ching app (based on the Ancient Chinese I-Ching) to disperse advice based on ABBA lyrics. To achieve this, Rob wrote a management command that used Beautiful Soup to parse an open source lyrics site. He then sorted the data by verses and programmed the app to produce a random call-up. Users can ask a question and the app will provide them with the answer in the form of the inimitable wisdom of the 1970s Swedish pop group. Having just finished coaching the Django Girls RDU workshop, Rob followed the Django Girls Tutorial to deploy his site, in order to further familiarize himself with the tutorial and to better coach for our next Django Girls RDU workshop.

Dan Poirier, Jeff Bradberry, and Scott Morningstar fleshed out Tequila, our Ansible alternative to SALT. Dan worked on getting the Caktus Group Django project template to deploy through Tequila in place of SALT. Now our developers can choose to deploy on Salt or Ansible. Though this transition is still a work in progress and in its testing phases, the team made a good step forward.

Rebecca Muraya spent her day importing her expense tracking data from Google Sheets to Django in an attempt to allow more complex analysis across multiple years. She hit a lot of bumps along the road but made significant headway in transferring the data.

Karen Tracey looked into using Django’s new HStoreField for a client project involving surveys. In these surveys information pulled from the roster for the individual survey-taker can be used to personalize the survey, and over time various fixed roster fields have been added to access different information about the survey-taker. Each of these additions has required code changes to support these new roster fields, many of which are used only very rarely. Though she didn’t get beyond messing with roster import code, she was heartened to find that the experiment worked, and could enable our client to add in extra roster fields at their own discretion.
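
Roughly speaking, with invented model and field names, and assuming django.contrib.postgres and the Postgres hstore extension are enabled, the idea looks like this:

    from django.contrib.postgres.fields import HStoreField
    from django.db import models

    class RosterEntry(models.Model):
        name = models.CharField(max_length=100)
        # Arbitrary key/value roster fields, added by the client at their
        # own discretion without a schema change or a code deploy.
        extra_fields = HStoreField(default=dict, blank=True)

    # Survey personalization can then look up whatever keys exist, e.g.:
    # RosterEntry.objects.filter(extra_fields__has_key="department")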

Tobias McNulty spent the day testing a base operating system and Postgres database upgrade for a client project. He found the base image could be switched from Ubuntu 12.04 to 14.04 with few other changes. The Postgres upgrade was more difficult to automate because cluster file formats change between major releases. Rather than upgrade the cluster directly, he found it easier to use the project’s existing feature to backup the database from Postgres 9.1 and restore it to a new server running Postgres 9.4. When load testing the performance before and after the Postgres upgrade, he found little to no difference for this app.

Our Project Management team, Ben Riesling, Sarah Gray, Daryl Riethof, and Katie Twilley, reevaluated their weekly reporting methods and developed a series of questions to help them create a new format for these reports.

Though Hunter MacDermut and Calvin Spealman were both out of the office for the day, they couldn’t help but participate in ShipIt Day from afar. Hunter explored the vast world of geolocation. Using a browser-based HTML geolocation API, he built a simple compass that works on any device with built-in GPS; open it on your smartphone to see the compass in action. The source code can be found on Github. Calvin attempted prototyping a CSS framework based on Flexbox rules with mixed results.

Nicole Foster and Liza Chabot were able to break away from their duties for a few hours to get some reading done. Nicole read a few articles from The Harvard Business Review’s 10 Must-Reads on Managing People. Liza read ”Hello World!”: The History of Programming and started on Charles Petzold’s Code: The Hidden Language of Computer Hardware and Software.

Caktus Group: Django Girls Workshop Recap

This past Saturday we had the pleasure of holding a Django Girls RDU workshop at Caktus in downtown Durham, NC. We hosted 22 students and 10 coaches for a free, all-day coding workshop. Aimed at increasing diversity in tech and encouraging women to gain the skills they need to fall in love with coding, Django Girls is a global nonprofit that provides the framework for these workshops. In regular Django Girls style, the day was part party, part coding marathon and every student left having deployed their very own website!

We were incredibly lucky to have great partners from the local Django and Python community supporting our work that day, including PyLadies RDU, Durham Women in Tech, Code for Durham, and Code for Raleigh. Coaches donated their time and their weekends and attendees came from all walks of life (even as far as Atlanta, Georgia!) to spend their Saturdays teaching and learning to code.

Working in groups of 3 with one coach per group, attendees worked their way through a tutorial covering the development of a blog application and deploying it to PythonAnywhere. “It was such a warm and open environment,” remarked Viola, a workshop attendee. “It was very invigorating being part of it.” Viola, who hadn’t made a web page since 1999, was one of many on a vast spectrum of coding experience to attend the workshop.

“It was... immensely satisfying to finish the day with a completed, attractive website that even had some rudimentary security. I loved feeling as if I had worked on a puzzle all day and managed to solve it,” wrote another attendee.

We are thrilled to have had the opportunity to support this movement and can’t wait to help coordinate another workshop in the future. Keep a sharp eye for more events from Django Girls RDU!

Og Maciel: Books - September 2015

Caktus Group: Colin Copeland to Speak on Police Data and Racial Bias at Code for America Summit

This Thursday, Colin Copeland, CTO and Caktus Group Co-founder, will be co-presenting “Case Study from North Carolina: Identifying Racial Bias in Policing Practices” during the prestigious 2015 Code for America Summit in Oakland, CA. This invite-only event brings together technologists, activists, and officials ranging from mayors to White House staff to discuss technology’s role in civic participation.

Colin, as volunteer co-founder and co-captain of Code for Durham, will be discussing how he and fellow developers transformed over 20 million data points and twelve years of NC police data into a website that can help identify race-based policing practices. His co-presenter will be civil rights lawyer Ian Mance of Southern Coalition for Social Justice, who originated the idea and guided the site’s development. Last year, Ian's work with a local coalition of Durham activists received front-page attention in the New York Times for using a data-driven approach to combat racial profiling.

For more about Code for America’s role in the open data and civic technology movement, visit codeforamerica.org.

Tim Hopper: A Joke

Caktus Group: Introduction to Monte Carlo Tree Search

For DjangoCon 2015, Jeff Bradberry created an A.I. for our booth game, Ultimate Tic Tac Toe. Reprinted here from jeffbradberry.com is his explanation of the Monte Carlo Tree Search used to build the A.I.

The subject of game AI generally begins with so-called perfect information games. These are turn-based games where the players have no information hidden from each other and there is no element of chance in the game mechanics (such as by rolling dice or drawing cards from a shuffled deck). Tic Tac Toe, Connect 4, Checkers, Reversi, Chess, and Go are all games of this type. Because everything in this type of game is fully determined, a tree can, in theory, be constructed that contains all possible outcomes, and a value assigned corresponding to a win or a loss for one of the players. Finding the best possible play, then, is a matter of doing a search on the tree, with the method of choice at each level alternating between picking the maximum value and picking the minimum value, matching the different players' conflicting goals, as the search proceeds down the tree. This algorithm is called Minimax.
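
To make the idea concrete, here is a bare-bones sketch of Minimax in Python, written against a hypothetical game interface (winner, legal_moves, and next_state are invented names) where +1 is a win for the maximizing player, -1 a loss, and 0 a draw:

    def minimax(state, game, maximizing):
        # A finished game is scored directly: +1, -1, or 0.
        result = game.winner(state)
        if result is not None:
            return result
        # Otherwise score every legal move recursively, alternating between
        # taking the maximum and the minimum as we descend the tree.
        values = [
            minimax(game.next_state(state, move), game, not maximizing)
            for move in game.legal_moves(state)
        ]
        return max(values) if maximizing else min(values)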

The problem with Minimax, though, is that it can take an impractical amount of time to do a full search of the game tree. This is particularly true for games with a high branching factor, or high average number of available moves per turn. This is because the basic version of Minimax needs to search all of the nodes in the tree to find the optimal solution, and the number of nodes in the tree that must be checked grows exponentially with the branching factor. There are methods of mitigating this problem, such as searching only to a limited number of moves ahead (or ply) and then using an evaluation function to estimate the value of the position, or by pruning branches to be searched if they are unlikely to be worthwhile. Many of these techniques, though, require encoding domain knowledge about the game, which may be difficult to gather or formulate. And while such methods have produced Chess programs capable of defeating grandmasters, similar success in Go has been elusive, particularly for programs playing on the full 19x19 board.

However, there is a game AI technique that does do well for games with a high branching factor and has come to dominate the field of Go playing programs. It is easy to create a basic implementation of this algorithm that will give good results for games with a smaller branching factor, and relatively simple adaptations can build on it and improve it for games like Chess or Go. It can be configured to stop after any desired amount of time, with longer times resulting in stronger game play. Since it doesn't necessarily require game-specific knowledge, it can be used for general game playing. It may even be adaptable to games that incorporate randomness in the rules. This technique is called Monte Carlo Tree Search. In this article I will describe how Monte Carlo Tree Search (MCTS) works, specifically a variant called Upper Confidence bounds applied to Trees (UCT), and then will show you how to build a basic implementation in Python.

Imagine, if you will, that you are faced with a row of slot machines, each with different payout probabilities and amounts. As a rational person (if you are going to play them at all), you would prefer to use a strategy that will allow you to maximize your net gain. But how can you do that? For whatever reason, there is no one nearby, so you can't watch someone else play for a while to gain information about which is the best machine. Clearly, your strategy is going to have to balance playing all of the machines to gather that information yourself, with concentrating your plays on the observed best machine. One strategy, called Upper Confidence Bound 1 (UCB1), does this by constructing statistical confidence intervals for each machine:

    x̄i + √(2 ln n / ni)

where:

  • x̄i: the mean payout for machine i
  • ni: the number of plays of machine i
  • n: the total number of plays

Then, your strategy is to pick the machine with the highest upper bound each time. As you do so, the observed mean value for that machine will shift and its confidence interval will become narrower, but all of the other machines' intervals will widen. Eventually, one of the other machines will have an upper bound that exceeds that of your current one, and you will switch to that one. This strategy has the property that your regret, the difference between what you would have won by playing solely on the actual best slot machine and your expected winnings under the strategy that you do use, grows only as O(ln n). This is the same big-O growth rate as the theoretical best for this problem (referred to as the multi-armed bandit problem), and has the additional benefit of being easy to calculate.
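
A small sketch of that choice in Python (the shape of the stats dictionary is just an assumption for this example) makes the formula concrete:

    import math

    def ucb1_choice(stats, total_plays):
        # stats maps each machine to a (wins, plays) tuple.
        # Try every machine at least once before trusting the bounds.
        for machine, (wins, plays) in stats.items():
            if plays == 0:
                return machine
        # Otherwise pick the machine with the highest upper confidence bound:
        # mean payout plus an exploration term that shrinks with more plays.
        return max(
            stats,
            key=lambda m: stats[m][0] / stats[m][1]
            + math.sqrt(2 * math.log(total_plays) / stats[m][1]),
        )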

And here's how Monte Carlo comes in. In a standard Monte Carlo process, a large number of random simulations are run, in this case, from the board position that you want to find the best move for. Statistics are kept for each possible move from this starting state, and then the move with the best overall results is returned. The downside to this method, though, is that for any given turn in the simulation, there may be many possible moves, but only one or two that are good. If a random move is chosen each turn, it becomes extremely unlikely that the simulation will hit upon the best path forward. So, UCT has been proposed as an enhancement. The idea is this: any given board position can be considered a multi-armed bandit problem, if statistics are available for all of the positions that are only one move away. So instead of doing many purely random simulations, UCT works by doing many multi-phase playouts.

The first phase, selection, lasts while you have the statistics necessary to treat each position you reach as a multi-armed bandit problem. The move to use, then, would be chosen by the UCB1 algorithm instead of randomly, and applied to obtain the next position to be considered. Selection would then proceed until you reach a position where not all of the child positions have statistics recorded.

Selection. Here the positions and moves selected by the UCB1 algorithm at each step are marked in bold. Note that a number of playouts have already been run to accumulate the statistics shown. Each circle contains the number of wins / number of times played.

The second phase, expansion, occurs when you can no longer apply UCB1. An unvisited child position is randomly chosen, and a new record node is added to the tree of statistics.

Expansion. The position marked 1/1 at the bottom of the tree has no further statistics records under it, so we choose a random move and add a new record for it (bold), initialized to 0/0.

After expansion occurs, the remainder of the playout is in phase 3, simulation. This is done as a typical Monte Carlo simulation, either purely random or with some simple weighting heuristics if a light playout is desired, or by using some computationally expensive heuristics and evaluations for a heavy playout. For games with a lower branching factor, a light playout can give good results.

Simulation. Once the new record is added, the Monte Carlo simulation begins, here depicted with a dashed arrow. Moves in the simulation may be completely random, or may use calculations to weight the randomness in favor of moves that may be better.

Finally, the fourth phase is the update or back-propagation phase. This occurs when the playout reaches the end of the game. All of the positions visited during this playout have their play count incremented, and if the player for that position won the playout, the win count is also incremented.

Back-Propagation. After the simulation reaches an end, all of the records in the path taken are updated. Each has its play count incremented by one, and each that matches the winner has its win count incremented by one, here shown by the bolded numbers.

This algorithm may be configured to stop after any desired length of time, or on some other condition. As more and more playouts are run, the tree of statistics grows in memory and the move that will finally be chosen will converge towards the actual optimal play, though that may take a very long time, depending on the game.

For more details about the mathematics of UCB1 and UCT, see Finite-time Analysis of the Multiarmed Bandit Problem and Bandit based Monte-Carlo Planning.

Now let's see some code. To separate concerns, we're going to need a Board class, whose purpose is to encapsulate the rules of a game and which will care nothing about the AI, and a MonteCarlo class, which will only care about the AI algorithm and will query into the Board object in order to obtain information about the game. Let's assume a Board class supporting this interface:

class Board(object):
    def start(self):
        # Returns a representation of the starting state of the game.
        pass

    def current_player(self, state):
        # Takes the game state and returns the current player's
        # number.
        pass

    def next_state(self, state, play):
        # Takes the game state, and the move to be applied.
        # Returns the new game state.
        pass

    def legal_plays(self, state_history):
        # Takes a sequence of game states representing the full
        # game history, and returns the full list of moves that
        # are legal plays for the current player.
        pass

    def winner(self, state_history):
        # Takes a sequence of game states representing the full
        # game history.  If the game is now won, return the player
        # number.  If the game is still ongoing, return zero.  If
        # the game is tied, return a different distinct value, e.g. -1.
        pass
For the purposes of this article I'm not going to flesh this part out any further, but for example code you can find one of my implementations on github. However, it is important to note that we will require that the state data structure is hashable and equivalent states hash to the same value. I personally use flat tuples as my state data structures.
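
As an illustration of what a suitable state might look like (this is a hypothetical encoding for tic-tac-toe, not the representation used in the linked implementations), a flat tuple of cell values plus the number of the player to move works nicely:

# A hypothetical tic-tac-toe state: nine cells (0 = empty, 1 or 2 = a
# player's mark) followed by the number of the player to move.
empty_board = (0, 0, 0,
               0, 0, 0,
               0, 0, 0,
               1)

# Tuples are immutable and hashable, and equal states hash to the same
# value, so they can be used directly as dictionary keys.
stats = {empty_board: (0, 0)}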

The AI class we will be constructing will support this interface:

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # Takes an instance of a Board and optionally some keyword
        # arguments.  Initializes the list of game states and the
        # statistics tables.
        pass

    def update(self, state):
        # Takes a game state, and appends it to the history.
        pass

    def get_play(self):
        # Causes the AI to calculate the best move from the
        # current game state and return it.
        pass

    def run_simulation(self):
        # Plays out a "random" game from the current position,
        # then updates the statistics tables with the result.
        pass

Let's begin with the initialization and bookkeeping. The board object is what the AI will be using to obtain information about where the game is going and what the AI is allowed to do, so we need to store it. Additionally, we need to keep track of the state data as we get it.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        self.board = board
        self.states = []

    def update(self, state):
        self.states.append(state)

The UCT algorithm relies on playing out multiple games from the current state, so let's add that next.

import datetime

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        seconds = kwargs.get('time', 30)
        self.calculation_time = datetime.timedelta(seconds=seconds)

    # ...

    def get_play(self):
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()

Here we've defined a configuration option for the amount of time to spend on a calculation, and get_play will repeatedly call run_simulation until that amount of time has passed. This code won't do anything particularly useful yet, because we still haven't defined run_simulation, so let's do that now.

# ...
from random import choice

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.max_moves = kwargs.get('max_moves', 100)

    # ...

    def run_simulation(self):
        states_copy = self.states[:]
        state = states_copy[-1]

        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            winner = self.board.winner(states_copy)
            if winner:
                break

This adds the beginnings of the run_simulation method, which for now simply chooses a random move from the set of legal moves each turn until the end of the game is reached (UCB1-based selection will be added later). We have also introduced a configuration option for limiting the number of moves forward that the AI will play.

You may notice at this point that we are making a copy of self.states and adding new states to it, instead of adding directly to self.states. This is because self.states is the authoritative record of what has happened so far in the game, and we don't want to mess it up with these speculative moves from the simulations.

Now we need to start keeping statistics on the game states that the AI hits during each run of run_simulation. The AI should pick the first unknown game state it reaches to add to the tables.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.wins = {}
        self.plays = {}

    # ...

    def run_simulation(self):
        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in self.plays:
                expand = False
                self.plays[(player, state)] = 0
                self.wins[(player, state)] = 0

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in self.plays:
                continue
            self.plays[(player, state)] += 1
            if player == winner:
                self.wins[(player, state)] += 1

Here we've added two dictionaries to the AI, wins and plays, which will contain the counts for every game state that is being tracked. The run_simulation method now checks to see whether the current state is the first new one it has encountered this call and, if so, adds it to both plays and wins, setting both counts to zero. This method also adds every game state that it passes through to a set, and at the end updates plays and wins for those states in the set that are present in the tables. We are now ready to base the AI's final decision on these statistics.

from __future__ import division
# ...

class MonteCarlo(object):
    # ...

    def get_play(self):
        self.max_depth = 0
        state = self.states[-1]
        player = self.board.current_player(state)
        legal = self.board.legal_plays(self.states[:])

        # Bail out early if there is no real choice to be made.
        if not legal:
            return
        if len(legal) == 1:
            return legal[0]

        games = 0
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()
            games += 1

        moves_states = [(p, self.board.next_state(state, p)) for p in legal]

        # Display the number of calls of `run_simulation` and the
        # time elapsed.
        print games, datetime.datetime.utcnow() - begin

        # Pick the move with the highest percentage of wins.
        percent_wins, move = max(
            (self.wins.get((player, S), 0) /
             self.plays.get((player, S), 1),
             p)
            for p, S in moves_states
        )

        # Display the stats for each possible play.
        for x in sorted(
            ((100 * self.wins.get((player, S), 0) /
              self.plays.get((player, S), 1),
              self.wins.get((player, S), 0),
              self.plays.get((player, S), 0), p)
             for p, S in moves_states),
            reverse=True
        ):
            print "{3}: {0:.2f}% ({1} / {2})".format(*x)

        print "Maximum depth searched:", self.max_depth

        return move

We have added three things in this step. First, we allow get_play to return early if there are no choices or only one choice to make. Next, we've added output of some debugging information, including the statistics for the possible moves this turn and an attribute that will keep track of the maximum depth searched in the selection phase of the playouts. Finally, we've added code that picks out the move with the highest win percentage out of the possible moves, and returns it.

But we are not quite finished yet. Currently, our AI is using pure randomness for its playouts. We need to implement UCB1 for positions where the legal plays are all in the stats tables, so the next trial play is based on that information.

# ...
from math import log, sqrt

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.C = kwargs.get('C', 1.4)

    # ...

    def run_simulation(self):
        # A bit of an optimization here, so we have a local
        # variable lookup instead of an attribute access each loop.
        plays, wins = self.plays, self.wins

        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(1, self.max_moves + 1):
            legal = self.board.legal_plays(states_copy)
            moves_states = [(p, self.board.next_state(state, p)) for p in legal]

            if all(plays.get((player, S)) for p, S in moves_states):
                # If we have stats on all of the legal moves here, use them.
                log_total = log(
                    sum(plays[(player, S)] for p, S in moves_states))
                value, move, state = max(
                    ((wins[(player, S)] / plays[(player, S)]) +
                     self.C * sqrt(log_total / plays[(player, S)]), p, S)
                    for p, S in moves_states
                )
            else:
                # Otherwise, just make an arbitrary decision.
                move, state = choice(moves_states)

            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in plays:
                expand = False
                plays[(player, state)] = 0
                wins[(player, state)] = 0
                if t > self.max_depth:
                    self.max_depth = t

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in plays:
                continue
            plays[(player, state)] += 1
            if player == winner:
                wins[(player, state)] += 1

The main addition here is the check to see if all of the results of the legal moves are in the plays dictionary. If they aren't available, it defaults to the original random choice. But if the statistics are all available, the move with the highest value according to the confidence interval formula is chosen. This formula adds together two parts. The first part is just the win ratio, but the second part is a term that grows slowly as a particular move remains neglected. Eventually, if a node with a poor win rate is neglected long enough, it will begin to be chosen again. This term can be tweaked using the configuration parameter C added to __init__ above. Larger values of C will encourage more exploration of the possibilities, and smaller values will cause the AI to prefer concentrating on known good moves. Also note that the self.max_depth attribute from the previous code block is now updated when a new node is added and its depth exceeds the previous self.max_depth.

So there we have it. If there are no mistakes, you should now have an AI that will make reasonable decisions for a variety of board games. I've left a suitable implementation of Board as an exercise for the reader, but one thing I've left out here is a way of actually allowing a user to play against the AI. A toy framework for this can be found at jbradberry/boardgame-socketserver and jbradberry/boardgame-socketplayer.
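
To tie the pieces together, here is a rough sketch of how a self-play loop might be wired up; the TicTacToeBoard class is hypothetical, standing in for whatever Board implementation you write:

board = TicTacToeBoard()
ai = MonteCarlo(board, time=10, max_moves=100, C=1.4)

state = board.start()
history = [state]
ai.update(state)

while not board.winner(history):
    play = ai.get_play()
    state = board.next_state(state, play)
    history.append(state)
    ai.update(state)

print board.winner(history)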

This version that we've just built uses light playouts. Next time, we'll explore improving our AI by using heavy playouts, by training some evaluation functions using machine learning techniques and hooking in the results.

UPDATE: The diagrams have been corrected to more accurately reflect the possible node values.

Caktus GroupPyCon 2016: Behind the Design

Having helped to design an award-winning event site for last year’s PyCon in Montreal, we are thrilled to collaborate again with the Python Software Foundation (PSF) on this year’s site for PyCon 2016.

PyCon 2016 will be held in Portland, OR and PSF knew they wanted to convey the distinctive mood and look of that city with the 2016 website. Working collaboratively with PSF, Designer Trevor Ray drew on everything from the unique architecture of the city’s craftsman-style bungalow houses and the surrounding mountainous landscape to the cool color scheme of the Pacific-Northwest region. The team developed a site that leads the user on a journey. As he or she scrolls, the user is brought further into the city, from the low, rolling, misty, forest-topped mountains on the outskirts, into the very heart of its neighborhoods.

Trevor also redesigned the PyCon logo for 2016, giving it a peaked shape, hand-lettered typography (a reference to the thriving craft community of Portland), and a classic, balanced layout. Ultimately, our team and PSF worked together to achieve a site that we hope is welcoming and functional.

We’re excited for this year’s PyCon, and we hope the site excites attendees as well. Only 249 days till PyCon!

Dave O'ConnorMySQL SQLAlchemy OS X Reference

Whenever I learn something new, I enjoy putting together a reference sheet of whatever it is I am learning. In regards to Python, this process typically involves working in the Python interpreter and diligently recording commands that worked as intended. Today I want to share the reference sheet I put together while exploring MySQL and [SQLAlchemy](http://www.sqlalchemy.org/) on OS

Tim HopperData Science and Agriculture

I'm excited to see tools developed for the web being applied to offline domains like agriculture and health. I posed a question on Twitter yesterday:

I got a number of replies. Companies in this space (not all hiring for DS) include:

  • The Climate Corporation: Known for their high profile acquisition by Monsanto, Climate Corp "combines hyper-local weather monitoring, agronomic modeling, and high-resolution weather simulations" "that help farmers improve profitability by making better informed operating and financing decisions". They are based in San Francisco.
  • FarmLogs: A YCombinator-backed startup in Ann Arbor, MI, FarmLogs is combining satellite maps, soil surveys, GPS data, and more to "expose critical operational data insights" to row crop farmers.
  • Dairymaster: In Ireland, Dairymaster is hiring data scientists for their business of dairy equipment manufacturing.
  • aWhere: aWhere "delivers agricultural intelligence into the hands of farmers, commercial growers, and policymakers everywhere" by collecting detailed weather data from around the world. They are based in Denver.
  • pulsepod: This small startup is building hardware to help farmers "know when to plant, fertilize, irrigate, and harvest to achieve quality and yield goals using data from your own field". They are based in Princeton, NJ.
  • Granular: "Granular, a new kind of farm management software and analytics platform, helps you improve efficiency, profit and yield so you are ready to farm more acres." They are based in San Francisco.
  • PrecisionHawk: Based in my home of Raleigh, NC, PrecisionHawk is "an information delivery company that combines unmanned aerial systems, cutting-edge artificial intelligence and remote sensing technologies to improve business operations and day-to-day decision making."
  • mavrx: "We use aerial imagery to provide actionable insights to the global agriculture industry." They are based in San Francisco.
  • AgSmarts: Memphis-based AgSmarts "is a Precision Agriculture company that automates existing agricultural irrigation systems with our universal retrofit solution to optimize crop yield, save water and minimize input costs via mesh networks of IP-enabled controllers & environmental sensors."

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

In an earlier post, I derived the necessary equations to sample table assignments when applying the "Posterior sampling in the Chinese restaurant franchise" algorithm to the case of latent Dirichlet allocation. I complete the derivation of necessary equations here.

We need to sample $k_{jt}$ (the dish/topic for table $t$ in restaurant $j$):

\[\displaystyle p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt}\cdot f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k \text{ previously used,}}\\ \gamma\cdot f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

where $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$ is the "conditional density of ${\bf x}_{jt}$ given all data items associated with mixture component $k$ leaving out ${\bf x}_{jt}$" (Teh, et al). (${\bf x}_{jt}$ is every customer in restaurant $j$ seated at table $t$). $m_{\cdot k}^{-jt}$ is the number of tables (across all restaurants) serving dish $k$ when we remove table $jt$.

This requires $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$; this is different from Equation (30), though they look quite similar.

\[ \begin{align} f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} f(x_{ji} \,|\, \phi_k) \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \end{align} \]

The denominator is

\[ \begin{align} \text{denominator}&= \int\left[ \prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}} \prod_w \phi_{kw}^{\beta-1} \right] d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}+\beta-1} \right] d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( \sum_w n_{kw}^{-jt}+\beta \right) } \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) } \end{align} \]

The numerator is

\[ \begin{align} \text{numerator} &=\int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int \prod_{w} \phi_{kw}^{ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta - 1 } d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( \sum_w \left[ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right] \right) }\\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

This gives us a closed form version of this conditional distribution:

\[ \begin{align} f_k^{-{\bf x_{jt}}}({\bf x}_{jt}) &= \displaystyle\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) } \frac{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

We also need the conditional density for the case when $k$ is a new dish: $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$. Shuyo provides it without derivation:

\[ \begin{align} f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\int \left[ \prod_{x_{ji}\in \mathbf{x_{jt}}} f(x_{ji} \,|\, \phi) \right] h(\phi)d\phi_k \\ &=\int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}}^{\left(\beta+1\right)-1} \cdot \prod_{x_{jt}\not\in \mathbf{x_{jt}}} \phi_{x_{jt}}^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \prod_{x_{ji}\in \mathbf{x_{jt}}} \Gamma(\beta+1) \prod_{x_{ji}\not\in \mathbf{x_{jt}}}\Gamma(\beta) }{ \Gamma(V\beta+\sum_{x_{ji}\in \mathbf{x_{jt}}} 1) }\\ &=\frac{ \Gamma(V\beta) \prod_v \Gamma(\beta+n_{\cdot v}^{jt}) }{ \Gamma(V\beta+n_{\cdot\cdot}^{jt}) \prod_v \Gamma(\beta) } \end{align} \]

Given these equations for $f_{k}^{-{\bf x}_{jt}}({\bf x}_{jt})$ and $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$, we can draw samples from $p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt})$ by enumeration over topics. Combined with sampling from table assignment described here, we now have a complete Gibbs sampler for the Posterior sampling in the Chinese restaurant franchise in Teh, et al.
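
As a small illustration of what "sampling by enumeration over topics" can look like in code (a generic sketch, independent of the bookkeeping above): compute an unnormalized weight for each choice, then draw an index in proportion to those weights.

import random

def sample_index(weights):
    # Draw an index in proportion to a list of unnormalized,
    # non-negative weights.
    total = sum(weights)
    r = random.uniform(0, total)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1  # guard against floating point round-off

# Hypothetical unnormalized weights: one entry per existing topic, plus
# a final entry for k_new.
weights = [2.7, 0.4, 1.1, 0.3]
k = sample_index(weights)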

Dave O'ConnorWeb Scraping with Scrapy

In this post we will use the fast and powerful web scraper [Scrapy](http://scrapy.org/) to comb [theonion.com](http://www.theonion.com/) for horoscopes. Web scraping is one of the many tools in the Python tool belt and can be incredibly helpful in retrieving information from a website that does not provide an API endpoint. Unfortunately web pages do change and as such web

Dave O'ConnorBuilding a Simple REST API With Django Rest Framework

Today we will focus on building a simple REST API using the [Django Rest Framework](http://www.django-rest-framework.org/). RESTful services give access, typically over HTTP, to resources containing a wide variety of data and/or media. One of the great benefits of the Django Rest Framework is that it is able to get something tangible up and running very quickly. It integrates

Dave O'ConnorLet's Build a Web App: Part 4 - Model, Template, View

Before we jump into designing the front-end of our website, we should configure Django to render content in the browser. Django is a [MTV framework](https://docs.djangoproject.com/en/1.8/faq/general/), short for Model, Template, View. Remember, we are trying to enable an end user to query PostgreSQL and visualize access log data. We already have our Entry model so now we need a

Dave O'ConnorLet's Build a Web App: Part 5 - Bootstrap

In this article we will dive into Bootstrap and begin working on a front end for querying the data in our Entry models. [Bootstrap](http://getbootstrap.com/) is a popular front-end framework for quickly and cleanly styling a website. Created by the folks at Twitter, it uses a combination of HTML, CSS, and Javascript to standardize views across mobile, tablet, and

Dave O'ConnorExploring Python 3

In this article we will discuss Python 3 as it relates to Python 2 and explore some of the more obvious differences between the two versions. Python 3 was released on December 3rd, 2008 and has been a point of contention within the Python community. Unlike previous iterations of Python, this version broke backwards compatibility and effectively erased

Frank WierzbickiJython 2.7.1 beta1 released!

On behalf of the Jython development team, I'm pleased to announce that the first beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include:
  • Import systems fixes for relative imports and some circular imports.
  • -m command now executes scripts from inside a jar file.
  • bytearray matches cpython's behavior better.
Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:
To see all of the files available including checksums, go to the maven query for Jython and navigate to the appropriate distribution and version.

Tim HopperDistribution of Number of Tables in Chinese Restaurant Process

The Chinese Restaurant Process (related to the Dirichlet process) often comes up in Bayesian nonparametrics. Some related MCMC algorithms require drawing samples from the distribution of the number of tables created by a Chinese restaurant process with a given concentration parameter after a given number of patrons are seated. Of course, you could sample from this distribution by simulating the CRP and counting tables; however, Antoniak provided a closed form of the probability mass function. Here is a helpful introduction to the "Antoniak equation".

I wrote some quick and dirty Python code to sample from the Antoniak distribution. It's based on Matlab code by Yee Whye Teh.
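
For comparison with the closed form, simulating the CRP and counting tables takes only a few lines. This is a generic sketch, not the linked code; the concentration parameter is called alpha here:

import random

def crp_table_count(alpha, num_customers):
    # Seat customers one at a time: the (n+1)-th customer starts a new
    # table with probability alpha / (alpha + n), and otherwise joins an
    # existing table in proportion to its current size.
    table_sizes = []
    for n in range(num_customers):
        r = random.uniform(0, n + alpha)
        cumulative = 0.0
        new_table = True
        for i, size in enumerate(table_sizes):
            cumulative += size
            if r <= cumulative:
                table_sizes[i] += 1
                new_table = False
                break
        if new_table:
            table_sizes.append(1)
    return len(table_sizes)

# Empirical distribution of the table count for 100 customers.
samples = [crp_table_count(1.0, 100) for _ in range(1000)]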

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Yee Whye Teh et al's 2005 paper Hierarchical Dirichlet Processes describes a nonparametric prior for grouped clustering problems. For example, the HDP helps in generalizing the latent Dirichlet allocation model to the case where the number of topics in the data is discovered by the inference algorithm instead of being specified as a parameter of the model.

The authors describe three MCMC-based algorithms for fitting HDP based models. Unfortunately, the algorithms are described somewhat vaguely and in general terms. A fair bit of mathematical leg work is required before the HDP algorithms can be applied to the specific case of nonparametric latent Dirichlet allocation.

Here are some notes I've compiled in my effort to understand these algorithms.

HDP-LDA Generative Model

The generative model for Hierarchical Dirichlet Process Latent Dirichlet Allocation is as follows:

\[ \begin{align} H & \sim \text{Dirichlet}(\beta) \\ G_0 \,|\, \gamma, H & \sim \text{DP}(\gamma, H) \\ G_j \,|\, \alpha_0, G_0 & \sim \text{DP}(\alpha_0, G_0) \\ \theta_{ji} \,|\, G_j & \sim G_j \\ x_{ij} \,|\, \theta_{ji} & \sim \text{Categorical}(\theta_{ji}) \\ \end{align} \]

  • $H$ is a Dirichlet distribution whose dimension is the size of the vocabulary, i.e. it is a distribution over the (uncountably many) possible term distributions (topics).
  • $G_0$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • For each document $j$, $G_j$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • $\theta_{ji}$ is a categorical distribution over terms, i.e. a topic.
  • $x_{ij}$ is a term.

To see code for sampling from this generative model, see this post.

Chinese Restaurant Franchise Approach

Instead of the above Dirichlet process model, we can think of an identical "Chinese Restaurant Franchise" model.

Each $\theta_{ji}$ is a customer in restaurant $j$. Each customer is sitting at a table, and each table has multiple customers.

There is a global menu of $K$ dishes that the restaurants serve, $\phi_1,\ldots,\phi_K$.

Some other definitions:

  • $\psi_{jt}$ is the dish served at table $t$ in restaurant $j$; i.e. each $\psi_{jt}$ corresponds to some $\phi_k$.
  • $t_{ji}$ is the index of the $\psi_{jt}$ associated with $\theta_{ji}$.
  • $k_{jt}$ is the index of $\phi_k$ associated with $\psi_{jt}$.

Customer $i$ in restaurant $j$ sits at table $t_{ji}$ while table $t$ in restaurant $j$ serves dish $k_{jt}$.

There are two arrays of count variables we will want to track:

  • $n_{jtk}$ is the number of customers in restaurant $j$ at table $t$ eating dish $k$.
  • $m_{jk}$ is the number of tables in restaurant $j$ serving dish $k$ (multiple tables may serve the same dish).

To summarize:

$x_{ij}$ are observed data (words). We assume $x_{ij}\sim F(\theta_{ij})$. Further, we assume $\theta_{ji}$ is associated with table $t_{ji}$, that is $\theta_{ji}=\psi_{jt_{ji}}$. Further, we assume the topic for table $t$ in restaurant $j$ is indexed by $k_{jt}$, i.e. $\psi_{jt}=\phi_{k_{jt}}$. Thus, if we know $t_{ji}$ (the table assignment for $x_{ij}$) and $k_{jt}$ (the dish assignment for table $t$) for all $i, j, t$, we could determine the remaining parameters by sampling.
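
If it helps to make the bookkeeping concrete, here is one hypothetical way to hold the index and count variables in Python (my own sketch, not code from the paper):

from collections import defaultdict

# Index variables: t_ji and k_jt from the text.
table_of_word = {}   # (j, i) -> t, the table at which word i of document j sits
dish_of_table = {}   # (j, t) -> k, the dish/topic served at table t of document j

# Count variables: n_jtk and m_jk from the text.
n = defaultdict(int)  # (j, t, k) -> customers at table t of restaurant j eating dish k
m = defaultdict(int)  # (j, k) -> tables in restaurant j serving dish k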

Gibbs Sampling

Teh et al describe three Gibbs samplers for this model. The first and third are most applicable to the LDA application. The second helps with more complicated applications of the LDA algorithm (e.g. the hidden Markov model).

5.3 Posterior sampling by direct assignment

Section 5.3 describes a direct assignment Gibbs sampler that directly assigns words to topics by augmenting the model with an assignment variable $z_{ji}$ that is equivalent to $k_{jt_{ji}}$. This also requires a count variable $m_{jk}$: the number of tables in document/franchise $j$ assigned to dish/topic $k$. This sampler requires less "bookkeeping" than the algorithm from 5.1, however it requires expensive simulation or computation of recursively computed Stirling numbers of the first kind.
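
For reference, the unsigned Stirling numbers of the first kind mentioned here satisfy a simple recurrence that can be tabulated directly; a small sketch (in practice you would want to work in log space, since these numbers overflow quickly):

def stirling_first_kind(n_max):
    # Table of unsigned Stirling numbers of the first kind, built from the
    # recurrence s(n, k) = s(n-1, k-1) + (n-1) * s(n-1, k), with s(0, 0) = 1.
    s = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    s[0][0] = 1
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            s[n][k] = s[n - 1][k - 1] + (n - 1) * s[n - 1][k]
    return s

table = stirling_first_kind(10)
# table[4][2] == 11, for example.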

5.1 Posterior sampling in the Chinese restaurant franchise

Section 5.1 describes "Posterior sampling in the Chinese restaurant franchise". Given observed data $\mathbf{x}$ (i.e. documents), we sample over the index variables $t_{ji}$ (associating tables with customers/words) and $k_{jt}$ (associating tables with dishes/topics). Given these variables, we can reconstruct the distribution over topics for each document and distribution over words for each topic.

Notes on Implementing Algorithm 5.1

Teh et al's original HDP paper is sparse on details with regard to applying these samplers to the specific case of nonparametric LDA. For example, both samplers require computing the conditional distribution of word $x_{ji}$ under topic $k$ given all data items except $x_{ji}$: $f_k^{-x_{ji}}(x_{ji})$ (eq. 30).

A Blogger Formerly Known As Shuyo has a brief post where he states (with little-to-no derivation) the equations specific to the LDA case. Below I attempt to provide some of those derivations in pedantic detail.

As stated above, in the case of topic modeling, $H$ is a Dirichlet distribution over term distributions and $F$ is a multinomial distribution over terms.

By definition,

\[ h(\phi_k)=\frac{1}{Z}\prod_v[\phi_k]_v^{\beta-1} \]


\[ f(x_{ji}=v \,|\, \phi_k)=\phi_{kv}. \]

Equation (30): $f_k^{-x_{ji}}(x_{ji})$

For convenience, define $v=x_{ji}$ (word $i$ in document $j$), define $k=k_{jt_{ji}}$ (topic assignment for the table in document $j$ containing word $i$), and

\[ n_{kv}^{-ji}=\left|\left\{ x_{mn} \,|\, k_{mt_{mn}}=k,\, x_{mn}=v,\, (m,n)\neq(j,i) \right\}\right| \]

(the number of times the term $x_{ji}$, besides $x_{ji}$ itself, is generated by the same topic as was $x_{ji}$).

First look at the term (for fixed $k$):

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k)= \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} [\phi_{k}]_{x_{j'i'}} \]

$[\phi_k]_v$ is the probability that term $v$ is generated by topic $k$. The double product runs over every word generated by topic $k$ in every document. Since $[\phi_{k}]_{x_{j'i'}}$ is fixed for a given word $w$, we could instead take a product over each word of the vocabulary:

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) =\prod_{w\in\mathcal{V}}[\phi_k]_w^{n_{kw}^{-ji}} \]

We use this early on in the big derivation below.

Also, note that

\[ \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \text{ and } \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \]

are the normalizing coefficients for Dirichlet distributions.

Equation (30) in Teh's paper is:

\[ \begin{align} f_k^{-x_{ji}}(x_{ji}) &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} \phi_{kv} \right]\cdot h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right]h(\phi_k)d\phi_k } \\ &\propto\frac{ \int \phi_{kv} \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_{w} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &=\frac{ \int \phi_{kv}\cdot \phi_{kv}^{n_{kv}^{-ji}}\cdot \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}}\cdot \phi_{kv}^{\beta-1}\cdot \prod_{w\neq v} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &= \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \cdot \frac{ 1 }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\neq v} \left[ \beta+n_{kw}^{-ji} \right]+ (\beta+n_{kv}^{-ji}+1)\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] \right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kn}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) \Gamma\left(\beta+n_{kv}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kn}^{-ji}+1\right) }{ \Gamma\left(\beta+n_{kv}^{-ji}\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) }\\ &=\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \end{align} \]

This seems to be validated in the appendix of this paper.


We also need the prior density of $x_{ji}$ to compute the likelihood that $x_{ji}$ will be seated at a new table.

\[ \begin{align} f_{k^{\text{new}}}^{-x_{ji}}(x_{ji}) &= \int f(x_{ji} \,|\, \phi) h(\phi)d\phi_k \\ &=\int \phi_v \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v^\beta \cdot \prod_{w\neq v} \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \Gamma(\beta+1)\prod_{w\neq v}\Gamma(\beta) }{ \Gamma(V\beta+1) }\\ &=\frac{ \Gamma(V\beta) }{ \Gamma(V \beta+1) }\cdot \frac{ \beta\prod_{w}\Gamma(\beta) }{ \prod_w \Gamma(\beta) }\\ &=\frac{ 1 }{ V \beta }\cdot \beta\\ &= \frac{1}{V} \end{align} \]
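
In code, these two densities reduce to very simple expressions once the counts are available. A hypothetical numpy sketch, where n_kv is a topics-by-vocabulary count matrix with $x_{ji}$ itself removed, beta is the symmetric Dirichlet parameter, and v is the index of the term $x_{ji}$:

import numpy as np

def word_likelihoods(n_kv, v, beta):
    # f_k^{-x_ji}(x_ji = v) for every existing topic k, plus the
    # new-topic density 1 / V, following the two derivations above.
    V = n_kv.shape[1]
    f_existing = (beta + n_kv[:, v]) / (V * beta + n_kv.sum(axis=1))
    f_new = 1.0 / V
    return f_existing, f_new

# Hypothetical counts: 3 topics over a vocabulary of 5 terms.
n_kv = np.array([[4, 0, 1, 2, 0],
                 [0, 3, 0, 0, 1],
                 [1, 1, 1, 1, 1]])
f_existing, f_new = word_likelihoods(n_kv, v=0, beta=0.5)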

Equation (31): $p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})$

These last two derivations give us Equation (31), the likelihood that $t_{ji}=t^{new}$:

\[ \begin{align} p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k}) &=\sum_{k=1}^K \left[ \frac{ m_{\cdot k} }{ m_{\cdot\cdot}+\gamma }\cdot \frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \right] \\ &\phantom{=}+ \frac{ \gamma }{ m_{\cdot\cdot}+\gamma } \cdot \frac{1}{V} \end{align} \]

Equation (32): $p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k})$

From this, we know the conditional distribution of $t_{ji}$ is:

\[ p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k}) \propto \begin{cases} n_{jt\cdot}^{-ji} \cdot \frac{\beta+n_{k_{jt}v}^{-ji}}{V\beta+n_{k_{jt}\cdot}^{-ji}} & {\tiny \text{if } t \text{ previously used,}}\\ {\tiny \alpha_0 \cdot p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})} & {\tiny \text{if } t=t^{\text{new}}}. \end{cases} \]

Equation (33): $p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}})$

If the sampled value of $t_{ji}$ is $t^{\text{new}}$, we sample a dish $k_{jt^\text{new}}$ for the table with:

\[ p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}}) \propto \begin{cases} m_{\cdot k}\cdot\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} & {\tiny \text{if } k \text{ previously used,}}\\ \frac{ \gamma }{V} & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

My notes on deriving the final equation (34) for sampling $k$ are in rough shape. I intend to follow this with a post on that. In the meantime, you can see Shuyo's post, his implementation, and equation 7 in this paper.

Tim HopperClasses Future Programmers Should Take

I appreciated James Hague's post on Computer Science Courses that Don't Exist, But Should.

I really like Dave Thomas's idea of a Software Archeology class. I have spent a huge amount of time as a developer reading (vs. writing) code. I wish I'd been taught how to read code effectively.

Similarly, I have thought there should be a class (or series of classes) in "interacting with others' code". Topics could include inheriting a software project, handing off a software project, writing code using 3rd party libraries, using package repositories, and understanding software licenses. These are such important skills in real world software development, but they seem to be rarely taught in the classroom. Perhaps a follow-up class could be "contributing to open source".

Tim HopperFitting a Mixture Model with Gibbs Sampling

In attempt to clarify my own understanding of how Gibbs samplers are derived for Bayesian models, I collected some notes on the sampler for a finite mixture model into an IPython notebook.

I think most of the literature on Bayesian computation is terrible. There's lots of handwaving with cavalier appeals to conjugacy and "integrating out". There aren't a lot of clear derivations of the equations needed for doing sampling in MCMC. As I have tried to write about it myself, I have a better appreciation of why that might be: this is complex math with a lot of moving parts. Given that, I would appreciate constructive feedback on how I could make this more clear or helpful. Even better, fork my repo and submit a pull request.

Tim HopperDealing with Outliers

From Testicular volume is inversely correlated with nurturing-related brain activity in human fathers in PNAS:

One participant's testes volume measurement was excluded because his value was 2.8 SDs above the mean (mean = 38,064; SD = 11,183) and was more than 13,000 mm^3 larger than any recorded value found in the literature. Of the more than 1,500 healthy, age-matched men in these studies, the largest reported value was 56,000 mm^3, and this participant’s measurement was 69,736 mm^3.

Tim HopperCross Entropy and KL Divergence

As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$

Kullback and Leibler defined a similar measure now known as KL divergence. This measure quantifies how similar a probability distribution $p$ is to a candidate distribution $q$.

$$D_{\text{KL}}(p\| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$

$D_\text{KL}$ is non-negative and zero if and only if $p_i=q_i$ for all $i$. However, it is important to note that it is not in general symmetric: $D_{\text{KL}}(p\| q) \neq D_{\text{KL}}(q\| p)$.

Jonathon Shlens explains that KL Divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p\| q)=0$, we can guarantee that $p$ is generated by $q$. As $D_{\text{KL}}(p\| q)\rightarrow\infty$, we can say that it is increasingly unlikely that $p$ was generated by $q$.

Algebraically, we can rewrite the definition as

$$ \begin{array}{rl} D_{\text{KL}}(p\| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i-H(p) \\ &=\sum_i p_i \log \frac{1}{q_i}-H(p)\\ \end{array} $$

KL Divergence breaks down as something that looks similar to entropy (but combining $p$ and $q$) minus the entropy of $p$. This first term is often called cross entropy:

$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$

We could alternatively use this relationship to define cross entropy as:

$$H(p, q)=H(p) + D_\text{KL}(p\| q).$$

Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the likelihood that $p$ could have been generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by a constant, $H(p)$, for all $q$.
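
These relationships are easy to check numerically. A short sketch with a made-up pair of distributions (base-2 logarithms throughout):

from math import log

def entropy(p):
    return -sum(p_i * log(p_i, 2) for p_i in p if p_i > 0)

def kl_divergence(p, q):
    return sum(p_i * log(p_i / q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

def cross_entropy(p, q):
    return sum(p_i * log(1.0 / q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

# H(p, q) == H(p) + D_KL(p || q), up to floating point error.
print(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q)))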

Tim HopperEntropy of a Discrete Probability Distribution

Suppose we have a discrete set of possible events $1,\ldots, n$ that occur with probabilities $p_1, p_2, \ldots, p_n$. Claude Shannon asked the question

Can we find a measure of how much "choice" is involved in the selection of the event or of how uncertain we are of the outcome?

For example, suppose we have a coin that lands on heads with probability $p$ and tails with probability $1-p$. If $p=1$, the coin always lands on heads. Since there is no uncertainty, we might want to say the uncertainty is 0. However, if the coin is fair and $p=0.5$, we maximize our uncertainty: it's a complete tossup whether the coin is heads or tails. We might want to say the uncertainty in this case is 1.

In general, Shannon wanted to devise a function $H(p_1, p_2, \ldots, p_n)$ describing the uncertainty of an arbitrary set of discrete events (i.e. a $n$-sided die). He thought that "it is reasonable" that $H$ should have three properties:

  1. $H$ should be a continuous function of each $p_i$. A small change in a single probability should result in a similarly small change in the entropy (uncertainty).
  2. If each event is equally likely ($p_i=\frac{1}{n}$), $H$ should increase as a function of $n$: the more events there are, the more uncertain we are.
  3. Finally, entropy should be recursive with respect to independent events. Suppose we generate a random variable $x$ by the following process: Flip a fair coin. If it is heads, $x=0$. If the flip was tails, flip the coin again. If the second flip is heads, $x=1$; if tails, $x=2$.

    We can compute the entropy as $H(p_0=1/2, p_1=1/4, p_2=1/4)$. However, the independence property tells us that this relationship should hold:

    $$H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}\right)=H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{1}{2}, \frac{1}{2}\right).$$

    As David MacKay explains, this is the general claim that

    $$H(\mathbf{p})=H(p_1, 1-p_1)+(1-p_1)H \left( \frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \ldots, \frac{p_n}{1-p_1} \right).$$

Shannon showed that given these three assumptions, there is a unique form (up to a constant factor) that $H$ must take:

$$H\propto -\sum_{i=1}^n p_i \log p_i=\sum_{i=1}^n p_i \log \frac{1}{p_i}.$$

He named this measure of uncertainty entropy, because the form of $H$ bears striking similarity to that of Gibbs Entropy in statistical thermodynamics.1
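
As a quick sanity check of the formula, here is a small sketch using base-2 logarithms (so that a fair coin gives exactly 1), verifying the coin example and the recursive property above:

from math import log

def entropy(probabilities):
    # H(p) = -sum p_i * log2(p_i), skipping zero-probability events.
    return -sum(p * log(p, 2) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0

# Recursive property: H(1/2, 1/4, 1/4) = H(1/2, 1/2) + (1/2) H(1/2, 1/2)
lhs = entropy([0.5, 0.25, 0.25])
rhs = entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])
print(lhs - rhs)  # 0.0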

Shannon observes that $H$ has many other interesting properties:

  1. Entropy $H$ is 0 if and only if exactly one event has probability 1 and the rest have probability 0. (Uncertainty vanishes only when we are certain about the outcomes.)
  2. Entropy $H$ is maximized when the $p_i$ values are equal.
  3. The joint entropy of two events is less than or equal to the sum of the individual entropies. $H(x, y)=H(x)+H(y)$ only when $x$ and $y$ are independent events.

You can read more about this in Shannon's seminal paper A Theory of Mathematical Communication.

  1. Caianiello and Aizerman say the name entropy is thanks to von Neumann who said

    You should call it entropy and for two reasons: first, the function is already in use in thermodynamics under that name; second and more importantly, most people don't know what entropy really is, and if you use the word.

    They argue that the name "uncertainty" would have been much more helpful since "Shannon entropy is simply and avowedly the 'measure of the uncertainty inherent in a pre-assigned probability scheme.'"

Caktus GroupCaktus at DjangoCon 2015

Django is kind of our thing, so we’ve been looking forward to DjangoCon 2015 for months! Now that it is finally here, we thought we would give a little preview of what Cakti will be up to at this year’s conference.

Tuesday: Django Authors Panel (1:30 pm)

Meet Caktus Technical Director and co-author of Lightweight Django Mark Lavin at this panel of people who write about Django. Learn about the experience of writing or recording video about Django from a collection of talented tech authors!

Wednesday: Intro to Client-Side Testing (11:50 am)

You've just built a REST API and client-side application to consume it. With all your focus on architecture, you missed a critical piece: how do you test it? In this intermediate-level talk, Mark Lavin will look at some of the tools available to test this style of application with both Javascript unit tests and integration tests written in Python.

Wednesday: Lightweight Django Book Signing (2:50 pm)

Come meet co-author of Lightweight Django Mark Lavin and have him sign your copy. You can also say hello to everyone at our Caktus Group table!

Thursday: DjangoGirls Austin Workshop (9:00 am)

We’re a proud DjangoGirls sponsor and organizer of DjangoGirls RDU. For this workshop, David Ray and Mark Lavin will coach participants, teaching them to create their first Django app.

Tim HopperProfile in Computational Imagination

I recently had the honor of being interviewed by Michael Swenson for his interview series called "Profiles in Computational Imagination". I talked a bit about my current work, my wandering road to data science, and my love for remote work. You can read it here.

Og MacielBooks - August 2015

Books - August 2015

This August 2015 I took a break from work and spent about 6 days enjoying some R&R down the North Carolina shore with my family. I managed to get through some of the books that were waiting for a long time for me to get to them, as well as try some new authors.


  • The Sentinel by Arthur C. Clarke

    I forgot where I read about how the short story "The Sentinel" was the inspiration for "2001: A Space Odyssey", but being that I have always considered the latter a great book and movie, I managed to grab a copy of the anthology "The Sentinel" just so that I could read the short story by the same name and see what else Arthur C. Clarke "had to offer." Interestingly enough (to me), most if not all the other short stories included in this collection could easily be published today and still feel just as futuristic as they probably were back when they were first published! This was yet another one of the books that I read by the beach this Summer and though it didn't blow me away, it was still a very relaxing read.

  • A Princess of Mars by Edgar Rice Burroughs

    This last June I got from my family for my birthday "John Carter of Mars" containing the complete series and I was just itching for a good opportunity to start reading it. That chance came up this week as I started reading some of Melville's short stories and found that I needed a bit of a "break". First off, I have never read Edgar Rice Burroughs before but I do have a copy of "Tarzan of the Apes" also waiting for a chance, so I had an idea about what to expect from his style. Sure enough, reading "A Princess of Mars" felt like taking a trip down memory lane, back when it was easy to tell who the good and the bad guys were, and there was always a damsel in distress somewhere waiting to be rescued. I have to confess that it took me a few chapters to get re-acclimated with this style, but once I got into it, it was easy reading, which is exactly what I was looking for anyhow.

    John Carter, the main character, shows all the cliché virtues one would expect from a "hero", but one thing that bothered me a bit was the language he used to describe those who were different from him (which was mostly everyone in the story, since they were all Martians) and the way he treated them. It felt a bit abusive and even a bit racist? I don't know if someone could get away with writing in the same style today, but then again I remembered that back then people were not as politically correct as we are today... or maybe I was reading too much into it? Anyhow, it was a fun read and I think I will try to add the next 4 books of the series in the coming months so that I can hopefully get a better opinion formed about the author.

Caktus GroupComposting at Caktus

At Caktus we have an employee suggestion policy that has been the birthplace of tons of great ideas, from tech community yoga to a pair programming station.

The most recent addition to our workplace comes from the suggestion of Developer Rebecca Muraya: composting at Caktus! We’re partnering with Compost Now to decrease our waste. Compost Now helps to divert more than ⅓ of waste away from landfills while improving local soil for farmers by increasing the availability of all natural compost. There is nothing better for keeping soil healthy than a good supply of organic compost, and we’re thrilled to be contributing to keeping worms, plants, and farmers happy.

Caktus GroupMaking Clean Code a Part of Your Build Process (And More!)

At Caktus, "clean" (in addition to "working"!) code is an important part of our delivery. For all new projects, we achieve that by using flake8. flake8 is a wrapper around several tools: pep8, pyflakes, and McCabe. pep8 checks to make sure your code matches the PEP 0008 style guidelines, pyflakes looks for a few additional things like unused imports or variables, and McCabe raises warnings about overly complex sections of code.

We usually check code formatting locally before committing, but we also have safety checks in place in case someone forgets (as I more than anyone have been known to do!). This prevents code formatting standards from slowly eroding over time. We accomplish this by making our continuous integration (CI) server "fail" the build if unclean code is committed. For example, using Travis CI, simply adding the following line to the script: section of your .travis.yml will fail the build if flake8 detects any formatting issues (the command returns a non-zero exit code if errors are found):

- flake8 .

You can adjust flake8 defaults by adding a file named setup.cfg to the top level of your repository. For example, we usually relax the 80 character limit a little and exclude certain automatically generated files:


As a result, you not only have code that is more readable for everyone, but you also avoid actual errors. For example, flake8 will detect missing imports or undefined variables in code paths that might not be tested by your unit test suite.

Adding flake8 to an older project

This is all well and good for new projects, but bringing old projects up to speed with this approach can be a challenge. I recently embarked on such a task myself and thought I'd share what I learned.

I started by adding a setup.cfg to the project and running flake8 on my source tree:

$ flake8 --count

The result: a whopping 1798 warnings. Many of these turned out to be pep8's "E128 continuation line under-indented for visual indent":

$ flake8 --select=E128 --count

In other words, in a huge number of cases, we weren't indenting multi-line continuations the way pep8 wanted us to. Other errors included things like not having a space after commas (E231), or not having two spaces before inline comments (E261). While many editors do support automatically fixing errors like this, fixing them by hand file by file would still be tedious: this project had nearly 250 Python source files.

Enter autopep8 and autoflake. These tools purport to automatically fix pep8- and pyflakes-related issues. There are two ways to use them: wholesale, fixing all errors across the project at once, or more granularly, addressing only a single group of similar errors at a time.

Addressing all errors at once

This approach is best for smaller projects or those with just a few errors (50-100) to address:

$ pip install autoflake
$ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports --remove-unused-variables

In my case, this resulted in changes across 39 files and reduced the number of flake8 errors from 1798 to 1726. Not a huge change, but a time saver nonetheless. autopep8 was even more impressive:

$ pip install autopep8
$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

This brought the number of changed files up to 160, and brought the number of warnings from 1726 down to 211. Note that autopep8 also supports an --aggressive option which allows non-whitespace changes. When I tried this, however, it only reduced the number of warnings from 211 to 198. I'll most likely fix those by hand.

Please note: If or when you're ready to try these commands on a project, you must first make sure you have no uncommitted changes locally. After each change, I also recommend (a) committing the results of each command as a separate commit so they're easier to unravel or review later, and (b) running your unit test suite to make sure nothing is inadvertently broken.

Addressing errors as groups (recommended)

While the above approach may work for smaller projects (not this one!), it can make code reviews difficult because all pyflakes or pep8 fixes are grouped together in a single commit. The more labor intensive but recommended approach is to address them in groups of similar errors. My colleague Rebecca Muraya recommended this approach and suggested the groups (thanks, Rebecca!):

  1. First, find and remove any unused imports:

    $ pip install autoflake
    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports

    After it's finished, review the changes, run your test suite, and commit the code.

  2. Now, do the same for unused variables:

    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-unused-variables

    Again, once complete, review the changes, run your test suite, and commit the code.

  3. Before moving on to pep8 errors, the following command provides an invaluable summary of errors not yet addressed:

    $ flake8 --statistics --count -qq
  4. Finally, autopep8 can be told to fix only certain error codes, like so:

    $ pip install autopep8
    $ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" --select="W291,W293" .

    This will remove trailing whitespace and trailing whitespace on blank lines. Once complete, review and commit your changes and move on to the next group of errors.

pep8's error codes are listed in detail on the autopep8 PyPI page and in the pep8 documentation. You can either group them yourself based on your preferences and the particular warnings in your project, or use the following as a guide:

  • Remove trailing whitespace, then configure your editor to keep it away:
    • W291 - Remove trailing whitespace.
    • W293 - Remove trailing whitespace on blank line.
  • Use your editor to find/replace all tabs, if any, with spaces, and then fix indentation with these error codes. This can have a semantic impact so the changes need to be reviewed carefully:
    • E101 - Reindent all lines.
    • E121 - Fix indentation to be a multiple of four.
  • Fix whitespace errors:
    • E20 - Remove extraneous whitespace.
    • E211 - Remove extraneous whitespace.
    • E22 - Fix extraneous whitespace around keywords.
    • E224 - Remove extraneous whitespace around operator.
    • E226 - Fix missing whitespace around arithmetic operator.
    • E227 - Fix missing whitespace around bitwise/shift operator.
    • E228 - Fix missing whitespace around modulo operator.
    • E231 - Add missing whitespace.
    • E241 - Fix extraneous whitespace around keywords.
    • E242 - Remove extraneous whitespace around operator.
    • E251 - Remove whitespace around parameter '=' sign.
    • E27 - Fix extraneous whitespace around keywords.
  • Adjust blank lines:
    • W391 - Remove trailing blank lines.
    • E301 - Add missing blank line.
    • E302 - Add missing 2 blank lines.
    • E303 - Remove extra blank lines.
    • E304 - Remove blank line following function decorator.
    • E309 - Add missing blank line (after class declaration).
  • Fix comment spacing:
    • E26 - Fix spacing after comment hash for inline comments.
    • E265 - Fix spacing after comment hash for block comments.
  • The following are aggressive fixes that can have semantic impact. It's best to do these one commit at a time and with careful review:
    • E711 - Fix comparison with None.
    • E712 - Fix comparison with boolean.
    • E721 - Use "isinstance()" instead of comparing types directly.
    • W601 - Use "in" rather than "has_key()".
    • W602 - Fix deprecated form of raising exception.
    • W603 - Use "!=" instead of "<>"

You can repeat steps 3 and 4 in the above with each group of error codes (or in the case of more aggressive fixes, single error codes) until they're all resolved. Once all the automatic fixes are done, you'll likely have some manual fixes left to do. Before those, you may want to see what remaining automatic fixes, if any, autopep8 suggests:

$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

Once all the errors have been resolved, add flake8 to your build process so you never have to go through this again.

Concluding Remarks

While I haven't finished all the manual edits yet as of the time of this post, I have reduced the number to about 153 warnings across the whole project. Most of the remaining warnings are long lines that pep8 wasn't comfortable splitting, for example, strings that needed to be broken up like so:

foo = ('a very '
       'long string')

Or other similar issues that couldn't be auto-corrected. To its credit, flake8 did detect two bugs, namely, a missing import (in some unused test code that should probably be deleted), and an instance of if not 'foo' in bar (instead of the correct version, if 'foo' not in bar).

My colleague Mark Lavin also remarked that flake8 does not raise warnings about variable naming, but the pep8-naming plugin is available to address this. The downside is that it doesn't like custom assertions which match the existing unittest style (e.g., assertOk, assertNotFound, etc.).

Good luck, and I hope this has been helpful!

Tim HopperOn Showing Your Work

Caktus GroupAWS load balancers with Django

We recently had occasion to reconfigure some of our existing servers to use Amazon Web Services Elastic Load Balancers in front of them. Setting this up isn't hard, exactly, but there are a lot of moving parts that have to mesh correctly before things start to work, so I thought I'd write down what we did.

All of these tools have lots of options and ways to use them. I'm not trying to cover all the possibilities here. I'm just showing what we ended up doing.

Our requirements

We had some specific goals we wanted to achieve in this reconfiguration.

  • There should be no outside requests sneaking in -- the only requests that should reach the backend servers are those that come through the load balancer. We'll achieve this by setting the backend servers' security group(s) to only allow incoming traffic on those ports from the load balancer's security group.
  • The site should only handle requests that have the right Host header. We achieve this already by Nginx configuration (server_name) and won't have to change anything.
  • Redirect any non-SSL requests to SSL. The load balancer can't do this for us (as far as I could see), so we just forward incoming port 80 requests to the server’s port 80, and let our existing port 80 Nginx configuration continue to redirect all requests to our https: URL.
  • All SSL connections are terminated at the load balancer. Our site certificate and key are only needed on the load balancer. The backend servers don't need to process encryption, nor do we need to maintain SSL configuration on them. We'll have the load balancers forward the connections, unencrypted, to a new listening port in nginx, 8088, because we're redirecting everything from port 80 to https. (We could have configured the port 80 server to figure out from the headers whether the connection came into the load balancer over SSL, but we didn't, figuring that using a separate port would be fool-proof.) If we were concerned about security of the data between the load balancer and the backend, for example if financial or personal information was included, we could re-encrypt the forwarded connections, maybe using self-signed certificates on the backend servers to simplify managing their configurations.
  • Strict-Transport-Security header - we add this already in our Nginx configuration and will include it in our port 8088 configuration.
  • We need to access backend servers directly for deploys (via ssh). We achieve this by keeping our elastic IP addresses on our backend servers so they have stable IP addresses, even though the load balancers don't need them.
  • Some of our servers use basic auth (to keep unreleased sites private). This is in our Nginx configuration, but we'll need to open up the health check URL to bypass basic auth, since the load balancers can't provide basic auth on health checks (see the Nginx sketch just after this list).
  • Sites stay up through the change. We achieve this by making the changes incrementally, and making sure at all times there's a path for incoming requests to be handled.
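To make the basic auth exception above concrete, here is a rough sketch of what the new listener's Nginx configuration might look like. The server name, upstream address, and htpasswd path are placeholders, not our actual configuration:

server {
    listen 8088;
    server_name example.com;
    add_header Strict-Transport-Security "max-age=31536000";

    location /health/ {
        # Health checks from the load balancer skip basic auth.
        auth_basic off;
        proxy_pass http://127.0.0.1:8000;
    }

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:8000;
    }
}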

All the pieces

Here are all the pieces that we had to get in place:

  • The site's hostname is a CNAME for the elastic load balancer’s hostname, so that requests for the site go to the load balancer instead of the backend servers. Don’t use the load balancer IP addresses directly, since they’ll change over time.
  • The backend servers' security group allows incoming requests on ports 80 and 8088, but only from the load balancer's security group. That allows the load balancer to forward requests, but requests cannot be sent directly to the backend servers even if someone knows their addresses.
  • There's a health check URL on the backend server that the load balancer can access, and that returns a 200 status (not 301 or 401 or anything else), so the load balancers can determine if the backend servers are up.
  • Apart from the health check, redirect port 80 requests to the https URL of the server (non-SSL to SSL), so that any incoming requests that aren't over SSL will be redirected to SSL.
  • Get the data about the request's origin from the headers where the load balancer puts it, and pass it along to Django in the headers that our Django configuration is expecting. This lets Django tell whether a request came in securely. (A sketch of one common way to wire this up follows this list.)
  • The load balancer must be in the same region as the servers (AWS requirement).
  • Keep the elastic IP on our backend server so we can use that to get to it for administration. Deploys and other administrative tasks can no longer use the site domain name to access the backend server, since it now points at the load balancer.
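One common way to handle the header forwarding mentioned in the list above, sketched here rather than copied from our actual configuration, is to have Nginx pass the load balancer's X-Forwarded-Proto header through to the application and tell Django to trust it:

# nginx, inside the port 8088 server block that proxies to gunicorn:
#     proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;

# Django settings.py: trust that header when deciding request.is_secure()
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')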

Where we started

Before adding the load balancer, our site was running on EC2 servers with Ubuntu. Nginx was accepting incoming requests on ports 80 and 443, redirecting all port 80 requests to https, adding basic auth on port 443 on some servers, proxying some port 443 requests to gunicorn with our Django application, and serving static files for the rest.

To summarize our backend server configuration before and after the change:

Before:

  • Port 80 redirects all requests to https://server_URL
  • Port 443 terminates SSL and processes requests
  • Server firewall and AWS security group allow all incoming connections on ports 80 and 443

After:

  • Port 80 redirects all requests to https://server_URL
  • Port 8088 processes requests
  • Server firewall and AWS security group allow port 80 and 8088 connections from the load balancer only, and no port 443 connections at all.

Steps in order

  • DNS: shorten the DNS cache time for the site domain names to something like 5 minutes, so when we start changing them later, clients will pick up the change quickly. We'll lengthen these again when we're done.
  • Django: if needed, create a new view for health checks. We made one at /health/ that simply returned a response with status 200, bypassing all authentication. We can enhance that view later to do more checking, such as making sure the database is accessible. (A minimal sketch of such a view appears after these steps.)
  • Nginx: We added a new port 8088 server, copying the configuration from our existing port 443 server, but removing the ssl directives. We did keep the line that added the Strict-Transport-Security header.
  • Nginx: Added configuration in our new port 8088 to bypass basic auth for the /health/ URL only.
  • Ufw: opened port 8088 in the Linux firewall.
  • AWS: opened port 8088 in the servers' security group - for now, from all source addresses so we can test easily as we go.
  • AWS: add the SSL certificate in IAM
  • AWS: create a new load balancer in the same region as the servers
  • AWS: configure the new load balancer:
  • configure to use the SSL certificate
  • set up a security group for the load balancer. It needs to accept incoming connections from the internet on ports 80 and 443.
  • instances: the backend servers this load balancer will forward to
  • health check: port 8088, URL /health/. Set the period and number of checks small for now, e.g. 30 seconds and 2 checks.
  • listeners: 80->80, 443 ssl -> 8088 non-ssl
  • Tests: Now stop to make sure things are working right so far:
  • The load balancer should show the instance in service (after the health check period has passed).
  • With the site domain set in your local /etc/hosts file to point at one of the load balancer's IP addresses, the site should work on ports 80 & 443
  • undo your local /etc/hosts changes since the load balancer IPs will change over time!

  • AWS: update the backend servers' security group to only accept 8088 traffic from the load balancer's security group

  • Test: the health check should still pass, since it's coming in on port 8088 from the load balancer.

  • DNS: update DNS to make the site domain a CNAME for the load balancer's A name

  • wait for DNS propagation
  • test: site should still work when accessed using its hostname.


These steps should now be safe to do, but it doesn't hurt to test again after each step, just to be sure.

  • Nginx: remove port 443 server from nginx configuration.
  • AWS: remove port 443 from backend servers’ security group. Configure ports 80 and 8088 to only accept incoming connections from the load balancer's security group.
  • Ufw: block port 443 in server firewall
  • AWS: in the load balancer health check configuration, lengthen the time between health checks, and optionally require more passing checks before treating an instance as live.
  • docs: Update your deploy docs with the changes to how the servers are deployed!
  • DNS: lengthen the cache time for the server domain name(s) if you had shortened it before.
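For reference, the health check view mentioned in the steps above can be as small as this; the view and URL names are illustrative, not necessarily what we used:

# views.py
from django.http import HttpResponse

def health_check(request):
    # Always return 200 so the load balancer counts this instance as healthy.
    # Later, this could also verify that the database and other services are reachable.
    return HttpResponse("OK")

# urls.py (Django 1.x style)
# url(r'^health/$', health_check),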

Tim HopperDirichlet Process Notebooks

I created a dedicated Github repository for the recent posts I've been doing on Dirichlet processes.

I also added a note responding to some of Dan Roy's helpful push-back on my understanding of the term Dirichlet process.

Tim HopperReleasing to Anaconda.org via Travis-CI

I've spent the last few days figuring out how to co-opt Travis CI as a build server for Anaconda.org (the package management service previously known as Binstar). I need to be able to release five conda packages to Anaconda.org, built for both Linux and OS X. Travis is not the right tool for this job, but I have managed to make it work. Since I've talked to others who are wrestling with this same problem, I am sharing my solution here.

Each of my five projects lives in a separate repository on Github. We have each of them as a submodule in a common repository called release. Each of these projects has a build.sh and meta.yaml file (required by Anaconda.org) in a subdirectory called conda/PACKAGE-NAME.

In my release directory, I have a simple Python script called build.py. It takes the name of a submodule and an Anaconda.org channel as command line arguments. It then uses the Python sh module to call the conda build and binstar upload commands.
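I won't reproduce the real script, but a stripped-down sketch of the idea looks something like this; the token environment variable, flags, and paths are illustrative rather than exact:

# build.py -- illustrative sketch, not the actual script
import os
import sys
import sh

package = sys.argv[1]      # e.g. "mypackage"
channel = sys.argv[2]      # e.g. "main" or "dev"
recipe = "conda/{}".format(package)

# Build the conda package from its recipe, then ask conda where it put the result.
sh.conda("build", recipe)
built_package = str(sh.conda("build", recipe, "--output")).strip()

# Upload to Anaconda.org, authenticating with a token from an encrypted
# Travis environment variable.
sh.binstar("-t", os.environ["BINSTAR_TOKEN"],
           "upload", "--channel", channel, built_package)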

binstar upload requires authentication at Anaconda.org. I did this by using a Binstar token, which I add to the binstar command after fetching it from an encrypted Travis environment variable.

I call build.py for each of the five projects from the script: section of my .travis.yml file. This will build each package on the Travis workers and then release it to Anaconda.org.

There is no easy way to get Travis to build for Linux and OS X simultaneously. However, it can be tricked into building for one or the other by changing the language: specified in the .travis.yml file. (language: objective-c will force Travis to use an OS X worker; by default, it uses a Linux worker.) I wrote a fabric script that provides command line commands which will modify the language value in my .travis.yml file and then push the release repository to Github. Github triggers a Travis CI build which then deploys the repository to Anaconda.org!
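A rough sketch of that fabfile follows; the environment variable name and the exact string manipulation are illustrative, not the real implementation:

# fabfile.py -- illustrative sketch
from fabric.api import local

def _release(language, channel="main"):
    # Generate .travis.yml from the canonical travis.yml, forcing the worker
    # OS via the language: key and recording the target Anaconda.org channel.
    with open("travis.yml") as f:
        config = f.read()
    config = config.replace("language: python", "language: " + language)
    config += "\nenv:\n  - ANACONDA_CHANNEL={}\n".format(channel)
    with open(".travis.yml", "w") as f:
        f.write(config)
    local("git add .travis.yml")
    local('git commit -m "Release build for {} ({})"'.format(language, channel))
    local("git push origin master")  # this push triggers the Travis CI build

def release_linux(channel="main"):
    _release("python", channel)

def release_osx(channel="main"):
    _release("objective-c", channel)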

If I want to cut a new release for OS X, I simply call $ fab release_osx; for Linux, I call $ fab release_linux. By default, this will release to the "main" channel on Anaconda.org. I can release to a different channel (e.g. "dev") with $ fab release_linux:dev. When specifying the channel, the fabfile will modify my .travis.yml file to set an environment variable that is picked up when calling build.py.

Finally, my .travis.yml file instructs Travis on preparing the build environment differently between operating systems by using Travis's built-in $TRAVIS_OS_NAME environment variable and calling appropriate setup scripts.1
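The relevant part of the generated .travis.yml looks roughly like this; the setup script names are placeholders:

before_install:
  - if [ "$TRAVIS_OS_NAME" == "osx" ]; then ./setup_osx.sh; else ./setup_linux.sh; fi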

Also, to update the submodules to origin/master, I created a fabric command: $ fab update. This command calls git submodule update --remote --rebase.

This certainly isn't a perfect solution, but it'll work for the time being. I certainly look forward to easier solutions being developed in the future!2

Thanks to my colleague Stephen Tu who laid a lot of the groundwork for this!

  1. You may notice I have files called travis.yml and .travis.yml. Originally, my fabfile just modified the latter on the fly. For clarity, I started using travis.yml as the canonical file and .travis.yml is what is generated by the fabfile commands and used by Travis. 

  2. Continuum is starting to provide paid, hosted build servers to do this very task 

Caktus GroupAnnouncing the Caktus Open Source Fellowship

We are excited to announce the creation and funding of a pilot program for open source contributions here at Caktus Group. This program is inspired by the Django Software Foundation’s fellowship as well as the Two Day Manifesto. For this program, Caktus seeks to hire a part-time developer for twelve weeks this fall for the sole purpose of contributing back to open source projects. Caktus builds web applications based on open source tools and the continued growth of these projects is important to us. Open source projects such as Python and Django have given so much to this company and this is one of many ways we are trying to give back.

We are looking for candidates who would love to collaborate with us in our offices in downtown Durham, NC to directly contribute to both open source projects created at Caktus as well as open source projects used by Caktus, such as Django. As previously mentioned, this will be a part-time position and will be taking the place of our normal fall internship program. If successful, we may expand this into a full-time position for future iterations.

I think this will be a great opportunity for a developer to get experience working with and contributing to open source. It could be a great resume builder for a relatively new developer, since all of your work will be publicly visible. It could also be a fun break from the ordinary for a more experienced developer who would love to be paid to work on open source. Through our mentorship, we hope this program will empower people who otherwise would not contribute to open source.

Prior experience with Python is a general requirement, but no prior open source contributions are required. If you have specific projects you would like to contribute to, we would love to know about them during the application process.

We are looking forward to reviewing and selecting applicants over the next few weeks. You can find more details on the position as well as the application form here: https://www.caktusgroup.com/careers/#op-73393-caktus-open-source-fellow

Caktus GroupAnnouncing Django Girls RDU: Free Coding Workshop for Women

Django Girls Logo

We’re incredibly excited to announce the launch of Django Girls RDU, a group in NC’s Triangle region that hosts free one-day Django coding workshops for women. Django Girls is part of an international movement that’s helped 1,600 (and counting!) women learn how to code.

We originally got involved with Django Girls at the urging of our Technical Director, Mark Lavin. While discussing our efforts to support women in tech, Mark was emphatic: there was this group, Django Girls, that was doing extraordinary work engaging women in Django. Clearly, we needed something like that locally. Luckily for us, Django Girls was coming to PyCon and we'd get to see firsthand just how wonderful they are.

Mark Lavin coaching at Django Girls PyCon 2015

Four of our team members volunteered as coaches for a Django Girls workshop during PyCon 2015. There was nothing quite like seeing the impact Django Girls had on each attendee. The environment was warm and friendly. The tutorials for students, coaches, and organizers were detailed and extremely well thought out. The passion of the Django Girls organizers was simply infectious. Out of a desire to prolong this excitement and share it with everyone we knew, we put together a team and applied to have a Django Girls RDU chapter. We’re honored to partner with such a wonderful group!

The first workshop will be on October 3rd; applications are due September 4th. We have five great coaches from Caktus and PyLadies RDU, and each coach will work one-on-one with three students to build their first Django website, a blog. We’re looking for volunteers to coach and sponsor the workshop. Each additional coach means three more students we can accept. If you’d like to get involved, please email us at durham@djangogirls.org. And, of course, we’re also looking for women who want to learn how to code.

Not able to make the October 3rd meetup? You'll also find members of our team coaching at DjangoCon's Django Girls Austin. To learn about more Django Girl activities, please follow us @djangogirlsRDU or visit the DjangoGirls website.

Tim HopperA Programmer's Portfolio

I am convinced that a programming student hoping to get a job in that field should be actively building a portfolio online. Turn those class projects, presentations, and reports into Github repositories or blog posts! I felt vindicated as I read this anecdote in Peopleware:

In the spring of 1979, while teaching together in western Canada, we got a call from a computer science professor at the local technical college. He proposed to stop by our hotel after class one evening and buy us beers in exchange for ideas. That's the kind of offer we seldom turn down. What we learned from him that evening was almost certainly worth more than whatever he learned from us.

The teacher was candid about what he needed to be judged a success in his work: He needed his students to get good job offers and lots of them. "A Harvard diploma is worth something in and of itself, but our diploma isn't worth squat. If this year's graduates don't get hired fast, there are no students next year and I'm out of a job." So he had developed a formula to make his graduates optimally attractive to the job market. Of course he taught them modern techniques for system construction, including structured analysis and design, data-driven design, information hiding, structured coding, walkthroughs, and metrics. He also had them work on real applications for nearby companies and agencies. But the centerpiece of his formula was the portfolio that all students put together to show samples of their work.

He described how his students had been coached to show off their portfolios as part of each interview:

"I've brought along some samples of the kind of work I do. Here, for instance, is a subroutine in Pascal from one project and a set of COBOL paragraphs from another. As you can see in this portion, we use the loop-with-exit extension advocated by Knuth, but aside from that, it's pure structured code, pretty much the sort of thing that your company standard calls for. And here is the design that this code was written from. The hierarchies and coupling analysis use Myers' notation. I designed all of this particular subsystem, and this one little section where we used some Orr methods because the data structure really imposed itself on the process structure. And these are the leveled data flow diagrams that makeup the guts of our specification, and the associated data dictionary. ..."

In the years since, we've often heard more about that obscure technical college and those portfolios. We've met recruiters from as far away as Triangle Park, North Carolina, and Tampa, Florida, who regularly converge upon that distant Canadian campus for a shot at its graduates.

Of course, this was a clever scheme of the professor's to give added allure to his graduates, but what struck us most that evening was the report that interviewers were always surprised by the portfolios. That meant they weren't regularly requiring all candidates to arrive with portfolios. Yet why not? What could be more sensible than asking each candidate to bring along some samples of work to the interview?

Tim HopperNonparametric Latent Dirichlet Allocation

I wrote this in an IPython Notebook. You may prefer to view it on nbviewer.

In [1]:
%matplotlib inline
%precision 2

Latent Dirichlet Allocation is a generative model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determine (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a friendly introduction you should read.

Because LDA is a generative model, we can simulate the construction of documents by forward-sampling from the model. The generative algorithm is as follows (following Heinrich):

  • for each topic $k\in [1,K]$ do
    • sample term distribution for topic $\overrightarrow \phi_k \sim \text{Dir}(\overrightarrow \beta)$
  • for each document $m\in [1, M]$ do
    • sample topic distribution for document $\overrightarrow\theta_m\sim \text{Dir}(\overrightarrow\alpha)$
    • sample document length $N_m\sim\text{Pois}(\xi)$
    • for all words $n\in [1, N_m]$ in document $m$ do
      • sample topic index $z_{m,n}\sim\text{Mult}(\overrightarrow\theta_m)$
      • sample term for word $w_{m,n}\sim\text{Mult}(\overrightarrow\phi_{z_{m,n}})$

You can implement this with a little bit of code and start to simulate documents.
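Here is a minimal sketch of that forward-sampling procedure, using numpy and the parameter names from the algorithm above; the toy vocabulary and parameter values are just for illustration:

import numpy as np

vocabulary = ['see', 'spot', 'run']
K, M, xi = 2, 5, 5                       # topics, documents, mean document length
beta_param = np.ones(len(vocabulary))    # symmetric Dirichlet parameter over terms
alpha_param = np.ones(K)                 # symmetric Dirichlet parameter over topics

# for each topic k, sample a term distribution phi_k ~ Dir(beta)
phi = np.random.dirichlet(beta_param, size=K)

documents = []
for m in range(M):
    theta_m = np.random.dirichlet(alpha_param)   # topic distribution for document m
    N_m = np.random.poisson(xi)                  # document length
    words = []
    for n in range(N_m):
        z = np.random.choice(K, p=theta_m)                    # sample topic index
        words.append(np.random.choice(vocabulary, p=phi[z]))  # sample term
    documents.append(words)

print(documents)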

In LDA, we assume each word in the document is generated by a two-step process:

  1. Sample a topic from the topic distribution for the document.
  2. Sample a word from the term distribution from the topic.

When we fit the LDA model to a given text corpus with an inference algorithm, our primary objective is to find the topic distributions $\underline \Theta$ and term distributions $\underline \Phi$ that generated the documents, as well as the latent topic indices $z_{m,n}$ for each word.

To run the generative model, we need to specify each of these parameters:

In [2]:
vocabulary = ['see', 'spot', 'run']
num_terms = len(vocabulary)
num_topics = 2 # K
num_documents = 5 # M
mean_document_length = 5 # xi
term_dirichlet_parameter = 1 # beta
topic_dirichlet_parameter = 1 # alpha

The term distribution vector $\underline\Phi$ is a collection of samples from a Dirichlet distribution. This describes how our 3 terms are distributed across each of the two topics.

In [3]:
from scipy.stats import dirichlet, poisson
from numpy import round
from collections import defaultdict
from random import choice as stl_choice
In [4]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
term_distributions = dirichlet(term_dirichlet_vector, 2).rvs(size=num_topics)
print term_distributions
[[ 0.41  0.02  0.57]
 [ 0.38  0.36  0.26]]

Each document corresponds to a categorical distribution across this distribution of topics (in this case, a 2-dimensional categorical distribution). This categorical distribution is a distribution of distributions; we could look at it as a Dirichlet process!

The base distribution of our Dirichlet process is a uniform distribution of topics (remember, topics are term distributions).

In [5]:
base_distribution = lambda: stl_choice(term_distributions)
# A sample from base_distribution is a distribution over terms
# Each of our two topics has equal probability
from collections import Counter
for topic, count in Counter([tuple(base_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 5066 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
count: 4934 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

Recall that a sample from a Dirichlet process is a distribution that approximates (but varies from) the base distribution. In this case, a sample from the Dirichlet process will be a distribution over topics that varies from the uniform distribution we provided as a base. If we use the stick-breaking metaphor, we are effectively breaking a stick one time and the size of each portion corresponds to the proportion of a topic in the document.

To construct a sample from the DP, we need to again define our DP class:

In [6]:
from scipy.stats import beta
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse a previously drawn value (stochastic memoization).
            return self.cache[i]
        else:
            # Break off a new piece of the stick and draw a new value from
            # the base measure, remembering both for future calls.
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

For each document, we will draw a topic distribution from the Dirichlet process:

In [7]:
topic_distribution = DirichletProcessSample(base_measure=base_distribution,
                                            alpha=topic_dirichlet_parameter)

A sample from this topic distribution is a distribution over terms. However, unlike our base distribution which returns each term distribution with equal probability, the topics will be unevenly weighted.

In [8]:
for topic, count in Counter([tuple(topic_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 9589 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
count: 411 topic: [0.40999999999999998, 0.02, 0.56999999999999995]

To generate each word in the document, we draw a sample topic from the topic distribution, and then a term from the term distribution (topic).

In [9]:
topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_distribution,
                                                    alpha=topic_dirichlet_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [10]:
for doc in documents.values():
    print doc
['see', 'run', 'see', 'spot', 'see', 'spot']
['see', 'run', 'see']
['see', 'run', 'see', 'see', 'run', 'spot', 'spot']
['run', 'run', 'run', 'spot', 'run']
['run', 'run', 'see', 'spot', 'run', 'run']

We can see how each topic (term-distribution) is distributed across the documents:

In [11]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 6 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 1
      count: 3 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
Doc: 2
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 2 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 3
      count: 5 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 4
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 1 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

To recap: for each document we draw a sample from a Dirichlet Process. The base distribution for the Dirichlet process is a categorical distribution over term distributions; we can think of the base distribution as an $n$-sided die where $n$ is the number of topics and each side of the die is a distribution over terms for that topic. By sampling from the Dirichlet process, we are effectively reweighting the sides of the die (changing the distribution of the topics).

For each word in the document, we draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

Given this formulation, we might ask if we can roll an infinite-sided die to draw from an unbounded number of topics (term distributions). We can do exactly this with a Hierarchical Dirichlet process. Instead of the base distribution of our Dirichlet process being a finite distribution over topics (term distributions), we will make it an infinite distribution over topics (term distributions) by using yet another Dirichlet process! This base Dirichlet process will have as its base distribution a Dirichlet distribution over terms.

We will again draw a sample from a Dirichlet Process for each document. The base distribution for the Dirichlet process is itself a Dirichlet process whose base distribution is a Dirichlet distribution over terms. (Try saying that 5 times fast.) We can think of this as a countably infinite die where each side is a distribution over terms for that topic. The sample we draw is a topic (distribution over terms).

For each word in the document, we will draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

These last few paragraphs are confusing! Let's illustrate with code.

In [12]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
base_distribution = lambda: dirichlet(term_dirichlet_vector).rvs(size=1)[0]

base_dp_parameter = 10
base_dp = DirichletProcessSample(base_distribution, alpha=base_dp_parameter)

This sample from the base Dirichlet process is our infinite-sided die. It is a probability distribution over a countably infinite number of topics.

The fact that our die is countably infinite is important. The sampler base_distribution draws topics (term-distributions) from an uncountable set. If we used this as the base distribution of the Dirichlet process below each document would be constructed from a completely unique set of topics. By feeding base_distribution into a Dirichlet Process (stochastic memoizer), we allow the topics to be shared across documents.

In other words, base_distribution will never return the same topic twice; however, every topic sampled from base_dp would be sampled an infinite number of times (if we sampled from base_dp forever). At the same time, base_dp will also return an infinite number of topics. In our formulation of the LDA sampler above, our base distribution only ever returned a finite number of topics (num_topics); there is no num_topics parameter here.

Given this setup, we can generate documents from the hierarchical Dirichlet process with an algorithm that is essentially identical to that of the original latent Dirichlet allocation generative sampler:

In [13]:
nested_dp_parameter = 10

topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_dp,
                                                    alpha=nested_dp_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [14]:
for doc in documents.values():
    print doc
['spot', 'spot', 'spot', 'spot', 'run']
['spot', 'spot', 'see', 'spot']
['spot', 'spot', 'spot', 'see', 'spot', 'spot', 'spot']
['run', 'run', 'spot', 'spot', 'spot', 'spot', 'spot', 'spot']
['see', 'run', 'see', 'run', 'run', 'run']

And here are the latent topics used:

In [15]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
      count: 1 topic: [0.22, 0.40000000000000002, 0.38]
Doc: 1
      count: 2 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.35999999999999999, 0.55000000000000004, 0.089999999999999997]
Doc: 2
      count: 4 topic: [0.11, 0.65000000000000002, 0.23999999999999999]
      count: 2 topic: [0.070000000000000007, 0.65000000000000002, 0.27000000000000002]
      count: 1 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
Doc: 3
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 2 topic: [0.25, 0.55000000000000004, 0.20000000000000001]
      count: 2 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
Doc: 4
      count: 3 topic: [0.40000000000000002, 0.23000000000000001, 0.37]
      count: 2 topic: [0.42999999999999999, 0.17999999999999999, 0.40000000000000002]
      count: 1 topic: [0.23000000000000001, 0.29999999999999999, 0.46000000000000002]

Our documents were generated by an unspecified number of topics, and yet the topics were shared across the 5 documents. This is the power of the hierarchical Dirichlet process!

This non-parametric formulation of Latent Dirichlet Allocation was first published by Yee Whye Teh et al.

Unfortunately, forward sampling is the easy part. Fitting the model on data requires complex MCMC or variational inference. There are a limited number of implementations of HDP-LDA available, and none of them are great.

Tim HopperSampling from a Hierarchical Dirichlet Process

This may be more readable on NBViewer.

In [137]:
%matplotlib inline

As we saw earlier the Dirichlet process describes the distribution of a random probability distribution. The Dirichlet process takes two parameters: a base distribution $H_0$ and a dispersion parameter $\alpha$. A sample from the Dirichlet process is itself a probability distribution that looks like $H_0$. On average, the larger $\alpha$ is, the closer a sample from $\text{DP}(\alpha H_0)$ will be to $H_0$.

Suppose we're feeling masochistic and want to input a distribution sampled from a Dirichlet process as the base distribution to a new Dirichlet process. (It will turn out that there are good reasons for this!) Conceptually this makes sense. But can we construct such a thing in practice? Said another way, can we build a sampler that will draw samples from a probability distribution drawn from these nested Dirichlet processes? We might initially try to construct a sample (a probability distribution) from the first Dirichlet process before feeding it into the second.

But recall that fully constructing a sample (a probability distribution!) from a Dirichlet process would require drawing a countably infinite number of samples from $H_0$ and from the beta distribution to generate the weights. This would take forever, even with Hadoop!

Dan Roy, et al helpfully described a technique of using stochastic memoization to construct a distribution sampled from a Dirichlet process in a just-in-time manner. This process provides us with the equivalent of the Scipy rvs method for the sampled distribution. Stochastic memoization is equivalent to the Chinese restaurant process: sometimes you get seated at an occupied table (i.e. sometimes you're given a sample you've seen before) and sometimes you're put at a new table (given a unique sample).

Here is our memoization class again:

In [162]:
from numpy.random import choice 
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse a previously drawn value (stochastic memoization).
            return self.cache[i]
        else:
            # Break off a new piece of the stick and draw a new value from
            # the base measure, remembering both for future calls.
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

Let's illustrate again with a standard normal base measure. We can construct a function base_measure that generates samples from it.

In [95]:
from scipy.stats import norm

base_measure = lambda: norm().rvs() 

Because the normal distribution has continuous support, we can generate samples from it forever and we will never see the same sample twice (in theory). We can illustrate this by drawing from the distribution ten thousand times and seeing that we get ten thousand unique values.

In [163]:
from pandas import Series

ndraws = 10000
print "Number of unique samples after {} draws:".format(ndraws), 
draws = Series([base_measure() for _ in range(ndraws)])
print draws.unique().size
Number of unique samples after 10000 draws: 10000

However, when we feed the base measure through the stochastic memoization procedure and then sample, we get many duplicate samples. The number of unique samples goes down as $\alpha$ increases.

In [164]:
norm_dp = DirichletProcessSample(base_measure, alpha=100)

print "Number of unique samples after {} draws:".format(ndraws), 
dp_draws = Series([norm_dp() for _ in range(ndraws)])
print dp_draws.unique().size
Number of unique samples after 10000 draws: 446

At this point, we have a function dp_draws that returns samples from a probability distribution (specifically, a probability distribution sampled from $\text{DP}(\alpha H_0)$). We can use dp_draws as a base distribution for another Dirichlet process!

In [155]:
norm_hdp = DirichletProcessSample(norm_dp, alpha=10)

How do we interpret this? norm_dp is a sampler from a probability distribution that looks like the standard normal distribution. norm_hdp is a sampler from a probability distribution that "looks like" the distribution norm_dp samples from.

Here is a histogram of samples drawn from norm_dp, our first sampler.

In [152]:
import matplotlib.pyplot as plt
import pandas as pd
pd.Series(norm_dp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_dp")

And here is a histogram for samples drawn from norm_hdp, our second sampler.

In [154]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_hdp")

The second plot doesn't look very much like the first! The level to which a sample from a Dirichlet process approximates the base distribution is a function of the dispersion parameter $\alpha$. Because I set $\alpha=10$ (which is relatively small), the approximation is fairly coarse. In terms of memoization, a small $\alpha$ value means the stochastic memoizer will more frequently reuse values already seen instead of drawing new ones.

This nesting procedure, where a sample from one Dirichlet process is fed into another Dirichlet process as a base distribution, is more than just a curiosity. It is known as a Hierarchical Dirichlet Process, and it plays an important role in the study of Bayesian Nonparametrics (more on this in a future post).

Without the stochastic memoization framework, constructing a sampler for a hierarchical Dirichlet process is a daunting task. We want to be able to draw samples from a distribution drawn from the second level Dirichlet process. However, to be able to do that, we need to be able to draw samples from a distribution sampled from a base distribution of the second-level Dirichlet process: this base distribution is a distribution drawn from the first-level Dirichlet process.

At first it appeared that we would need to fully construct the first-level sample (by drawing a countably infinite number of samples from the first-level base distribution). However, stochastic memoization allows us to construct the first-level distribution just-in-time, only as it is needed at the second level.

We can define a Python class that encapsulates the Hierarchical Dirichlet Process as a subclass of the Dirichlet process sampler.

In [165]:
class HierarchicalDirichletProcessSample(DirichletProcessSample):
    def __init__(self, base_measure, alpha1, alpha2):
        first_level_dp = DirichletProcessSample(base_measure, alpha1)
        self.second_level_dp = DirichletProcessSample(first_level_dp, alpha2)

    def __call__(self):
        return self.second_level_dp()

Since the Hierarchical DP is a Dirichlet Process inside of a Dirichlet process, we must provide it with both a first and second level $\alpha$ value.

In [167]:
norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=20)

We can sample directly from the probability distribution drawn from the Hierarchical Dirichlet Process.

In [170]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")

norm_hdp is not equivalent to the Hierarchical Dirichlet Process; it samples from a single distribution sampled from this HDP. Each time we instantiate the norm_hdp variable, we are getting a sampler for a unique distribution. Below we sample five times and get five different distributions.

In [180]:
for i in range(5):
    norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=10)
    _=pd.Series(norm_hdp() for _ in range(100)).hist()
    _=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")
<matplotlib.figure.Figure at 0x112a2da50>

In a later post, I will discuss how these tools are applied in the realm of Bayesian nonparametrics.

Tim HopperHigh Quality Code at Quora

I love this new post on Quora's engineering blog. The post states "high code quality is the long-term boost to development speed" and goes on to explain how they go about accomplishing this.

I've inherited large code bases at each of my jobs out of grad school, and I've spent a lot of time thinking about this question. At least on the surface, I love the solutions Quora has in place for ensuring quality code: thoughtful code review, careful testing, style guidelines, static checking, and intentional code cleanup.

Og MacielBooks - July 2015

Books - July 2015

This July 2015 I travelled to the Red Hat office in Brno, Czech Republic to spend some time with my teammates there, and I managed to get a lot of reading done between long plane rides and being jet-lagged for many nights :) So I finally finished up some of the books that had been lingering on my ToDo list and even got around to reading a few of the books that together make up the Chronicles of Narnia, since I had never read them as a kid.


Out of all the books I read this month, I feel that All Quiet on the Western Front and The October Country were the ones I enjoyed reading the most, closely followed by Cryptonomicon, which took me a while to get through. The other books, with the exception of The Memoirs of Sherlock Holmes, helped me pass the time when I only wanted to be entertained.

All Quiet on the Western Front takes the prize for being one of the best books I have ever read! I felt that the way WWI was presented through the eyes of the main character was a great way to represent all the pain, angst and suffering that all sides of the conflict went through, without catering to any particular side or having an agenda. Erich Maria Remarque's style had me sometimes breathless, sometimes with a knot in the pit of my stomach as I 'endured' the many life-changing events that took place in the book. Is this an action-packed book about WWI? Will it read like a thriller? In my opinion, even though there are many chapters with gory details about killings and battles, the answer is a very bland 'maybe'. I think that the real 'star' of this book is its philosophical view of the war and how the main characters, all around 19-20 years of age, learn to deal with its lifelong effects.

Now, I have been a huge fan of Ray Bradbury for a while now, and when I got The October Country for my birthday last month, I just knew that it would be time well spent reading it. For those of you who are more acquainted with his science fiction works, this book will surprise you as it shows you a bit of his 'darker' side. All of the short stories included in this collection deal with death, mysterious apparitions, and inexplicable endings, and are sure to spook you a little bit.

Cryptonomicon was at times slow, at other times funny and, especially toward the end, a very entertaining book. Weighing in at a hefty 1000 pages (depending on the edition you have, plus/minus 50 odd pages), this book covers two different periods in the lives of a number of different characters, past (around WWII) and present, all different threads eventually leading to a great finale. Alternating between past and present, the story takes us to the early days of how cryptology was 'officially invented' and used during the war, and how many of the events that took place back then were affecting the lives of some of the direct descendants of the main characters in our present day. As you go through the back and forth you start to gather bits and pieces of information that eventually connect all the dots of an interesting puzzle. It definitely requires a long-term commitment to get through it, but it was enjoyable and, as I mentioned before, it made me laugh in many places.

Caktus GroupUsing Unsaved Related Models for Sample Data in Django 1.8

Note: In between the time I originally wrote this post and it getting published, a ticket and pull request were opened in Django to remove allow_unsaved_instance_assignment and move validation to the model save() method, which makes much more sense anyways. It's likely this will even be backported to Django 1.8.4. So, if you're using a version of Django that doesn't require this, hopefully you'll never stumble across this post in the first place! If this is still an issue for you, here's the original post:

In versions of Django prior to 1.8, it was easy to construct "sample" model data by putting together a collection of related model objects, even if none of those objects was saved to the database. Django 1.8 - 1.8.3 adds a restriction that prevents this behavior. Errors such as this are generally a sign that you're encountering this issue:

ValueError: Cannot assign "...": "MyRelatedModel" instance isn't saved in the database.

The justification for this is that, previously, unsaved foreign keys were silently lost if they were not saved to the database. Django 1.8 does provide a backwards compatibility flag to allow working around the issue. The workaround, per the Django documentation, is to create a new ForeignKey field that removes this restriction, like so:

class UnsavedForeignKey(models.ForeignKey):
    # A ForeignKey which can point to an unsaved object
    allow_unsaved_instance_assignment = True

class Book(models.Model):
    author = UnsavedForeignKey(Author)

This may be undesirable, however, because this approach means you lose all protection for all uses of this foreign key, even if you want Django to ensure foreign key values have been saved before being assigned in some cases.

There is a middle ground, not immediately obvious, that involves changing this attribute temporarily during the assignment of an unsaved value and then immediately changing it back. This can be accomplished by writing a context manager to change the attribute, for example:

import contextlib

@contextlib.contextmanager
def allow_unsaved(model, field):
    model_field = model._meta.get_field(field)
    saved = model_field.allow_unsaved_instance_assignment
    model_field.allow_unsaved_instance_assignment = True
    yield
    model_field.allow_unsaved_instance_assignment = saved

To use this context manager, surround any assignment of an unsaved foreign key value with it as follows:

with allow_unsaved(MyModel, 'my_fk_field'):
    my_obj.my_fk_field = unsaved_instance

The specifics of how you access the field to pass into the context manager are important; any other way will likely generate the following error:

RelatedObjectDoesNotExist: MyModel has no instance.

While strictly speaking this approach is not thread safe, it should work for any process-based worker model (such as the default "sync" worker in Gunicorn).
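
Putting the pieces together, here is a sketch of how this enables building sample data without touching the database. It assumes Book.author is a plain ForeignKey(Author) rather than the UnsavedForeignKey above, and the name and title fields and the render_preview function are hypothetical:

# Sketch only: the field names and render_preview are hypothetical.
author = Author(name="Jane Doe")     # never saved to the database
book = Book(title="Sample Chapter")  # never saved to the database

with allow_unsaved(Book, 'author'):
    book.author = author             # allowed even though author.pk is None

render_preview(book)                 # downstream code that only reads the objects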

This took a few iterations to figure out, so hopefully it will (still) prove useful to someone else!

Tim Hopper10x Engineering

Tim HopperNotes on the Dirichlet Distribution and Dirichlet Process

In [3]:
%matplotlib inline

Note: I wrote this post in an IPython notebook. It might be rendered better on NBViewer.

Dirichlet Distribution

The symmetric Dirichlet distribution (DD) can be considered a distribution of distributions. Each sample from the DD is a categorical distribution over $K$ categories. It is parameterized by $G_0$, a distribution over $K$ categories, and $\alpha$, a scale factor.

The expected value of the DD is $G_0$. The variance of the DD is a function of the scale factor. When $\alpha$ is large, samples from $DD(\alpha\cdot G_0)$ will be very close to $G_0$. When $\alpha$ is small, samples will vary more widely.

We demonstrate this below by setting $G_0=[.2, .2, .6]$ and varying $\alpha$ from 0.1 to 1000. In each case, the mean of the samples is roughly $G_0$, but the standard deviation decreases as $\alpha$ increases.

In [10]:
import numpy as np
from scipy.stats import dirichlet

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    # Draw N samples from Dir(scale_factor * G0) and print summary statistics.
    samples = dirichlet(alpha = scale_factor * np.array(G0)).rvs(N)
    print "                          alpha:", scale_factor
    print "              element-wise mean:", samples.mean(axis=0)
    print "element-wise standard deviation:", samples.std(axis=0)

for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
                          alpha: 0.1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.38  0.38  0.47]

                          alpha: 1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.28  0.28  0.35]

                          alpha: 10
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.12  0.12  0.15]

                          alpha: 100
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.04  0.04  0.05]

                          alpha: 1000
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.01  0.01  0.02]

Dirichlet Process

The Dirichlet Process can be considered a way to generalize the Dirichlet distribution. While the Dirichlet distribution is parameterized by a discrete distribution $G_0$ and generates samples that are similar discrete distributions, the Dirichlet process is parameterized by a generic distribution $H_0$ and generates samples which are distributions similar to $H_0$. The Dirichlet process also has a parameter $\alpha$ that determines how widely samples vary from $H_0$.

We can construct a sample $H$ (recall that $H$ is a probability distribution) from a Dirichlet process $\text{DP}(\alpha H_0)$ by drawing a countably infinite number of samples $\theta_k$ from $H_0$ and setting:

$$H=\sum_{k=1}^\infty \pi_k \cdot\delta(x-\theta_k)$$

where the $\pi_k$ are carefully chosen weights (more later) that sum to 1. ($\delta$ is the Dirac delta function.)

$H$, a sample from $DP(\alpha H_0)$, is a probability distribution that looks similar to $H_0$ (also a distribution). In particular, $H$ is a discrete distribution that takes the value $\theta_k$ with probability $\pi_k$. This sampled distribution $H$ is a discrete distribution even if $H_0$ has continuous support; the support of $H$ is a countably infinite subset of the support of $H_0$.

The weights ($\pi_k$ values) of a Dirichlet process sample relate the Dirichlet process back to the Dirichlet distribution.

Gregor Heinrich writes:

The defining property of the DP is that its samples have weights $\pi_k$ and locations $\theta_k$ distributed in such a way that when partitioning $S(H)$ into finitely many arbitrary disjoint subsets $S_1, \ldots, S_J$, $J<\infty$, the sums of the weights $\pi_k$ in each of these $J$ subsets are distributed according to a Dirichlet distribution that is parameterized by $\alpha$ and a discrete base distribution (like $G_0$) whose weights are equal to the integrals of the base distribution $H_0$ over the subsets $S_n$.

As an example, Heinrich imagines a DP with a standard normal base measure $H_0\sim \mathcal{N}(0,1)$. If we let $H$ be a sample from $DP(\alpha H_0)$ and partition the real line (the support of a normal distribution) as $S_1=(-\infty, -1]$, $S_2=(-1, 1]$, and $S_3=(1, \infty)$, then

$$H(S_1), H(S_2), H(S_3) \sim \text{Dir}\left(\alpha\,\Phi(-1),\; \alpha\,(\Phi(1) - \Phi(-1)),\; \alpha\,(1-\Phi(1))\right)$$

where $\Phi$ is the standard normal CDF (so each parameter is $\alpha$ times the $H_0$-mass of the corresponding subset) and $H(S_n)$ is the sum of the $\pi_k$ values whose $\theta_k$ lie in $S_n$.

These $S_n$ subsets were chosen for convenience; similar results hold for any choice of $S_n$. For any sample from a Dirichlet process, we can construct a sample from a Dirichlet distribution by partitioning the support of the sample into a finite number of bins, as the short sketch below illustrates.
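
To make this concrete, here is a minimal sketch; the pis and thetas arrays are made-up placeholders standing in for the weights and locations of a (truncated) Dirichlet process sample:

import numpy as np

# Placeholder weights and locations of a hypothetical truncated DP sample.
pis = np.array([0.45, 0.25, 0.20, 0.10])
thetas = np.array([-1.3, 0.2, 0.7, 1.8])

# Partition the real line as in Heinrich's example: (-inf, -1], (-1, 1], (1, inf).
edges = [-np.inf, -1.0, 1.0, np.inf]
binned = [pis[(thetas > lo) & (thetas <= hi)].sum()
          for lo, hi in zip(edges[:-1], edges[1:])]
print(binned)  # [0.45, 0.45, 0.1] up to float rounding: a draw from a 3-category Dirichlet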

There are several equivalent ways to choose the $\pi_k$ so that this property is satisfied: the Chinese restaurant process, the stick-breaking process, and the Pólya urn scheme.

To generate $\left\{\pi_k\right\}$ according to a stick-breaking process, we define each $\beta_k$ to be a sample from $\text{Beta}(1,\alpha)$. $\pi_1$ is equal to $\beta_1$. Successive values are defined recursively as

$$\pi_k=\beta_k \prod_{j=1}^{k-1}(1-\beta_j).$$

Thus, if we want to draw a sample from a Dirichlet process, we could, in theory, sample an infinite number of $\theta_k$ values from the base distribution $H_0$ and an infinite number of $\beta_k$ values from the Beta distribution. Of course, sampling an infinite number of values is easier in theory than in practice.

However, because the $\pi_k$ values are positive and sum to 1, they must, in expectation, get increasingly small as $k\rightarrow\infty$. Thus, we can reasonably approximate a sample $H\sim DP(\alpha H_0)$ by drawing enough samples such that $\sum_{k=1}^K \pi_k\approx 1$.

We use this method below to draw approximate samples from several Dirichlet processes with a standard normal ($\mathcal{N}(0,1)$) base distribution but varying $\alpha$ values.

Recall that a single sample from a Dirichlet process is a probability distribution over a countably infinite subset of the support of the base measure.

The blue line is the PDF for a standard normal. The black lines represent the $\theta_k$ and $\pi_k$ values; $\theta_k$ is indicated by the position of the black line on the $x$-axis; $\pi_k$ is proportional to the height of each line.

We generate enough $\pi_k$ values that their sum is greater than 0.99. When $\alpha$ is small, very few $\theta_k$'s will have corresponding $\pi_k$ values larger than $0.01$. However, as $\alpha$ grows large, the sample becomes a more accurate (though still discrete) approximation of $\mathcal{N}(0,1)$.

In [13]:
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    # Draw stick-breaking weights until they account for at least (1 - tol) of the mass.
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    while sum(pis) < (1.-tol):
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    pis = pis * (norm.pdf(0) / pis.max())  # scale bar heights to the peak of the normal PDF
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))
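
A minimal sketch of invoking this helper (the particular $\alpha$ values are chosen here only for illustration):

# Illustrative alpha values; each call draws its own figure.
for a in [0.1, 1, 10, 1000]:
    plt.figure()
    plot_normal_dp_approximation(a)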


Often we want to draw samples from a distribution sampled from a Dirichlet process instead of from the Dirichlet process itself. Much of the literature on the topic unhelpfully refers to this as sampling from a Dirichlet process.

Fortunately, we don't have to draw an infinite number of samples from the base distribution and stick breaking process to do this. Instead, we can draw these samples as they are needed.

Suppose, for example, we know a finite number of the $\theta_k$ and $\pi_k$ values for a sample $H\sim \text{DP}(\alpha H_0)$. Say we know

$$\pi_1=0.5,\; \pi_2=0.3,\; \theta_1=0.1,\; \theta_2=-0.5.$$

To sample from $H$, we can generate a uniform random number $u$ between 0 and 1. If $u$ is less than 0.5, our sample is $0.1$. If $0.5 \leq u < 0.8$, our sample is $-0.5$. If $u \geq 0.8$, our sample from $H$ will be a new sample $\theta_3$ drawn from $H_0$. At the same time, we should also sample and store a new weight $\pi_3$. When we draw our next sample, we will again draw $u\sim\text{Uniform}(0,1)$, but we will compare it against $\pi_1$, $\pi_2$, AND $\pi_3$.
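
Here is a minimal sketch of that single draw using the numbers above (the value $\alpha = 1$ is an arbitrary choice for illustration):

import numpy as np
from scipy.stats import beta, norm

alpha = 1.0              # arbitrary concentration parameter, for illustration only
pis = [0.5, 0.3]         # known weights pi_1, pi_2
thetas = [0.1, -0.5]     # known locations theta_1, theta_2

u = np.random.uniform()
cumulative = np.cumsum(pis)  # [0.5, 0.8]
if u < cumulative[-1]:
    # u landed on an existing atom: return its location.
    sample = thetas[np.searchsorted(cumulative, u, side='right')]
else:
    # u landed in the unexplored remainder: lazily create theta_3 and pi_3.
    remaining = 1.0 - cumulative[-1]
    pis.append(beta(1, alpha).rvs() * remaining)
    thetas.append(norm().rvs())  # theta_3 drawn from H_0 = N(0, 1)
    sample = thetas[-1]
print(sample)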

The class below will take a base distribution $H_0$ and $\alpha$ as arguments to its constructor. The class instance can then be called to generate samples from $H\sim \text{DP}(\alpha H_0)$.

In [20]:
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # The die landed on an existing atom: return its cached value.
            return self.cache[i]
        else:
            # The die landed on the unexplored remainder: break off a new piece
            # of the stick and draw a new value from the base measure.
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

This Dirichlet process class could be called stochastic memoization. This idea was first articulated in somewhat abstruse terms by Daniel Roy, et al.

Below are histograms of 10000 points drawn from distributions that were themselves sampled from Dirichlet processes with a standard normal base distribution and varying $\alpha$ values.

In [22]:
import pandas as pd

base_measure = lambda: norm().rvs()
n_samples = 10000
samples = {}
for alpha in [1, 10, 100, 1000]:
    dirichlet_norm = DirichletProcessSample(base_measure=base_measure, alpha=alpha)
    samples["Alpha: %s" % alpha] = [dirichlet_norm() for _ in range(n_samples)]

_ = pd.DataFrame(samples).hist()

Note that these histograms look very similar to the corresponding plots of sampled distributions above. However, these histograms plot points sampled from a distribution that was itself sampled from a Dirichlet process, while the plots above showed approximate distributions sampled from the Dirichlet process. Of course, as the number of samples from each $H$ grows large, we would expect the histogram to be a very good empirical approximation of $H$.

In a future post, I will look at how this DirichletProcessSample class can be used to draw samples from a hierarchical Dirichlet process.


Tim HopperHandy One-off Webpages

I'm starting to love single-page informational websites. For example:

My website Should I Get a Phd? is in this same vein.

Publishing a site like this is very cheap with static hosting on AWS. I would love to see more of them created!