A planet of blogs from our members...

Caktus GroupDjango Girls Workshop Recap

This past Saturday we had the pleasure of holding a Django Girls RDU workshop at Caktus in downtown Durham, NC. We hosted 22 students and 10 coaches for a free, all-day coding workshop. Aimed at increasing diversity in tech and encouraging women to gain the skills they need to fall in love with coding, Django Girls is a global nonprofit that provides the framework for these workshops. In regular Django Girls style, the day was part party, part coding marathon and every student left having deployed their very own website!

We were incredibly lucky to have great partners from the local Django and Python community supporting our work that day, including PyLadies RDU, Durham Women in Tech, Code for Durham, and Code for Raleigh. Coaches donated their time and their weekends and attendees came from all walks of life (even as far as Atlanta, Georgia!) to spend their Saturdays teaching and learning to code.

Working in groups of 3 with one coach per group, attendees worked their way through a tutorial covering the development of a blog application and deploying it to PythonAnywhere. “It was such a warm and open environment,” remarked Viola, a workshop attendee. “It was very invigorating being part of it.” Viola, who hadn’t made a web page since 1999, was one of many on a vast spectrum of coding experience to attend the workshop.

“It was... immensely satisfying to finish the day with a completed, attractive website that even had some rudimentary security. I loved feeling as if I had worked on a puzzle all day and managed to solve it,” wrote another attendee.

We are thrilled to have had the opportunity to support this movement and can’t wait to help coordinate another workshop in the future. Keep a sharp eye for more events from Django Girls RDU!

Og MacielBooks - September 2015

Caktus GroupColin Copeland to Speak on Police Data and Racial Bias at Code for America Summit

This Thursday, Colin Copeland, CTO and Caktus Group Co-founder, will be co-presenting “Case Study from North Carolina: Identifying Racial Bias in Policing Practices” during the prestigious 2015 Code for America Summit in Oakland, CA. This invite-only event brings together technologists, activists, and officials ranging from mayors to White House staff to discuss technology’s role in civic participation.

Colin, as volunteer co-founder and co-captain of Code for Durham, will be discussing how he and fellow developers transformed over 20 million data points and twelve years of NC police data into a website that can help identify race-based policing practices. His co-presenter will be civil rights lawyer Ian Mance of Southern Coalition for Social Justice, who originated the idea and guided the site’s development. Last year, Ian's work with a local coalition of Durham activists received front-page attention in the New York Times for using a data-driven approach to combat racial profiling.

For more about Code for America’s role in the open data and civic technology movement, visit codeforamerica.org.

Tim HopperA Joke

Caktus GroupIntroduction to Monte Carlo Tree Search

For DjangoCon 2015, Jeff Bradberry created an A.I. for our booth game, Ultimate Tic Tac Toe. Reprinted here from jeffbradberry.com is his explanation of the Monte Carlo Tree Search used to build the A.I.

The subject of game AI generally begins with so-called perfect information games. These are turn-based games where the players have no information hidden from each other and there is no element of chance in the game mechanics (such as by rolling dice or drawing cards from a shuffled deck). Tic Tac Toe, Connect 4, Checkers, Reversi, Chess, and Go are all games of this type. Because everything in this type of game is fully determined, a tree can, in theory, be constructed that contains all possible outcomes, and a value assigned corresponding to a win or a loss for one of the players. Finding the best possible play, then, is a matter of doing a search on the tree, with the method of choice at each level alternating between picking the maximum value and picking the minimum value, matching the different players' conflicting goals, as the search proceeds down the tree. This algorithm is called Minimax.
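
To make the algorithm concrete, here is a minimal, self-contained sketch of Minimax (a toy of my own, not one of the games above): players alternately take one or two counters from a pile, and whoever takes the last counter wins. Terminal positions are scored from the first (maximizing) player's point of view.

def minimax(pile, maximizing):
    # Terminal position: the previous player took the last counter,
    # so the player whose turn it is now has lost.
    if pile == 0:
        return -1 if maximizing else +1
    values = [minimax(pile - take, not maximizing)
              for take in (1, 2) if take <= pile]
    # Alternate between picking the maximum and the minimum value.
    return max(values) if maximizing else min(values)

print(minimax(4, True))   # +1: the first player can force a win from a pile of 4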

The problem with Minimax, though, is that it can take an impractical amount of time to do a full search of the game tree. This is particularly true for games with a high branching factor, or high average number of available moves per turn. This is because the basic version of Minimax needs to search all of the nodes in the tree to find the optimal solution, and the number of nodes in the tree that must be checked grows exponentially with the branching factor. There are methods of mitigating this problem, such as searching only to a limited number of moves ahead (or ply) and then using an evaluation function to estimate the value of the position, or by pruning branches to be searched if they are unlikely to be worthwhile. Many of these techniques, though, require encoding domain knowledge about the game, which may be difficult to gather or formulate. And while such methods have produced Chess programs capable of defeating grandmasters, similar success in Go has been elusive, particularly for programs playing on the full 19x19 board.

However, there is a game AI technique that does do well for games with a high branching factor and has come to dominate the field of Go playing programs. It is easy to create a basic implementation of this algorithm that will give good results for games with a smaller branching factor, and relatively simple adaptations can build on it and improve it for games like Chess or Go. It can be configured to stop after any desired amount of time, with longer times resulting in stronger game play. Since it doesn't necessarily require game-specific knowledge, it can be used for general game playing. It may even be adaptable to games that incorporate randomness in the rules. This technique is called Monte Carlo Tree Search. In this article I will describe how Monte Carlo Tree Search (MCTS) works, specifically a variant called Upper Confidence bounds applied to Trees (UCT), and then will show you how to build a basic implementation in Python.

Imagine, if you will, that you are faced with a row of slot machines, each with different payout probabilities and amounts. As a rational person (if you are going to play them at all), you would prefer to use a strategy that will allow you to maximize your net gain. But how can you do that? For whatever reason, there is no one nearby, so you can't watch someone else play for a while to gain information about which is the best machine. Clearly, your strategy is going to have to balance playing all of the machines to gather that information yourself, with concentrating your plays on the observed best machine. One strategy, called Upper Confidence Bound 1 (UCB1), does this by constructing statistical confidence intervals for each machine of the form

\[ \bar{x}_i \pm \sqrt{\frac{2 \ln n}{n_i}} \]

where:

  • $\bar{x}_i$: the mean payout for machine $i$
  • $n_i$: the number of plays of machine $i$
  • $n$: the total number of plays

Then, your strategy is to pick the machine with the highest upper bound each time. As you do so, the observed mean value for that machine will shift and its confidence interval will become narrower, but all of the other machines' intervals will widen. Eventually, one of the other machines will have an upper bound that exceeds that of your current one, and you will switch to that one. This strategy has the property that your regret, the difference between what you would have won by playing solely on the actual best slot machine and your expected winnings under the strategy that you do use, grows only as O(ln n). This is the same big-O growth rate as the theoretical best for this problem (referred to as the multi-armed bandit problem), and has the additional benefit of being easy to calculate.
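
In code, the strategy only takes a few lines. Here is a rough, self-contained sketch (the machines and their payout probabilities are invented for illustration):

from math import log, sqrt
from random import random

payout_probs = [0.2, 0.5, 0.7]       # hypothetical machines
plays = [0] * len(payout_probs)
wins = [0.0] * len(payout_probs)

for n in range(1, 1001):
    if 0 in plays:
        # Play each machine once before applying the formula.
        machine = plays.index(0)
    else:
        # Mean payout plus the UCB1 exploration term.
        machine = max(range(len(plays)),
                      key=lambda i: wins[i] / plays[i] +
                                    sqrt(2 * log(n) / plays[i]))
    plays[machine] += 1
    if random() < payout_probs[machine]:
        wins[machine] += 1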

And here's how Monte Carlo comes in. In a standard Monte Carlo process, a large number of random simulations are run, in this case, from the board position that you want to find the best move for. Statistics are kept for each possible move from this starting state, and then the move with the best overall results is returned. The downside to this method, though, is that for any given turn in the simulation, there may be many possible moves, but only one or two that are good. If a random move is chosen each turn, it becomes extremely unlikely that the simulation will hit upon the best path forward. So, UCT has been proposed as an enhancement. The idea is this: any given board position can be considered a multi-armed bandit problem, if statistics are available for all of the positions that are only one move away. So instead of doing many purely random simulations, UCT works by doing many multi-phase playouts.

The first phase, selection, lasts while you have the statistics necessary to treat each position you reach as a multi-armed bandit problem. The move to use, then, would be chosen by the UCB1 algorithm instead of randomly, and applied to obtain the next position to be considered. Selection would then proceed until you reach a position where not all of the child positions have statistics recorded.

Selection. Here the positions and moves selected by the UCB1 algorithm at each step are marked in bold. Note that a number of playouts have already been run to accumulate the statistics shown. Each circle contains the number of wins / number of times played.

The second phase, expansion, occurs when you can no longer apply UCB1. An unvisited child position is randomly chosen, and a new record node is added to the tree of statistics.

Expansion. The position marked 1/1 at the bottom of the tree has no further statistics records under it, so we choose a random move and add a new record for it (bold), initialized to 0/0.

After expansion occurs, the remainder of the playout is in phase 3, simulation. This is done as a typical Monte Carlo simulation, either purely random or with some simple weighting heuristics if a light playout is desired, or by using some computationally expensive heuristics and evaluations for a heavy playout. For games with a lower branching factor, a light playout can give good results.

Simulation. Once the new record is added, the Monte Carlo simulation begins, here depicted with a dashed arrow. Moves in the simulation may be completely random, or may use calculations to weight the randomness in favor of moves that may be better.

Finally, the fourth phase is the update or back-propagation phase. This occurs when the playout reaches the end of the game. All of the positions visited during this playout have their play count incremented, and if the player for that position won the playout, the win count is also incremented.

Back-Propagation. After the simulation reaches an end, all of the records in the path taken are updated. Each has its play count incremented by one, and each that matches the winner has its win count incremented by one, here shown by the bolded numbers.

This algorithm may be configured to stop after any desired length of time, or on some other condition. As more and more playouts are run, the tree of statistics grows in memory and the move that will finally be chosen will converge towards the actual optimal play, though that may take a very long time, depending on the game.

For more details about the mathematics of UCB1 and UCT, see Finite-time Analysis of the Multiarmed Bandit Problem and Bandit based Monte-Carlo Planning.

Now let's see some code. To separate concerns, we're going to need a Board class, whose purpose is to encapsulate the rules of a game and which will care nothing about the AI, and a MonteCarlo class, which will only care about the AI algorithm and will query into the Board object in order to obtain information about the game. Let's assume a Board class supporting this interface:

class Board(object):
    def start(self):
        # Returns a representation of the starting state of the game.
        pass

    def current_player(self, state):
        # Takes the game state and returns the current player's
        # number.
        pass

    def next_state(self, state, play):
        # Takes the game state, and the move to be applied.
        # Returns the new game state.
        pass

    def legal_plays(self, state_history):
        # Takes a sequence of game states representing the full
        # game history, and returns the full list of moves that
        # are legal plays for the current player.
        pass

    def winner(self, state_history):
        # Takes a sequence of game states representing the full
        # game history.  If the game is now won, return the player
        # number.  If the game is still ongoing, return zero.  If
        # the game is tied, return a different distinct value, e.g. -1.
        pass

For the purposes of this article I'm not going to flesh this part out any further, but for example code you can find one of my implementations on github. However, it is important to note that we will require that the state data structure is hashable and equivalent states hash to the same value. I personally use flat tuples as my state data structures.
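
As a concrete (if trivial) illustration of the interface, here is a sketch of a Board for a made-up subtraction game: players alternately remove one to three counters from a pile of 21, and whoever takes the last counter wins. It exists only to show the shape of the methods; it is not one of the games discussed above.

class NimBoard(object):
    def start(self):
        # State is a flat tuple: (counters remaining, player to move).
        return (21, 1)

    def current_player(self, state):
        return state[1]

    def next_state(self, state, play):
        counters, player = state
        return (counters - play, 2 if player == 1 else 1)

    def legal_plays(self, state_history):
        counters, player = state_history[-1]
        return [n for n in (1, 2, 3) if n <= counters]

    def winner(self, state_history):
        counters, player = state_history[-1]
        if counters == 0:
            # The player who just moved took the last counter and wins.
            return 2 if player == 1 else 1
        return 0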

The AI class we will be constructing will support this interface:

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # Takes an instance of a Board and optionally some keyword
        # arguments.  Initializes the list of game states and the
        # statistics tables.
        pass

    def update(self, state):
        # Takes a game state, and appends it to the history.
        pass

    def get_play(self):
        # Causes the AI to calculate the best move from the
        # current game state and return it.
        pass

    def run_simulation(self):
        # Plays out a "random" game from the current position,
        # then updates the statistics tables with the result.
        pass

Let's begin with the initialization and bookkeeping. The board object is what the AI will be using to obtain information about where the game is going and what the AI is allowed to do, so we need to store it. Additionally, we need to keep track of the state data as we get it.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        self.board = board
        self.states = []

    def update(self, state):
        self.states.append(state)

The UCT algorithm relies on playing out multiple games from the current state, so let's add that next.

import datetime

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        seconds = kwargs.get('time', 30)
        self.calculation_time = datetime.timedelta(seconds=seconds)

    # ...

    def get_play(self):
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()

Here we've defined a configuration option for the amount of time to spend on a calculation, and get_play will repeatedly call run_simulation until that amount of time has passed. This code won't do anything particularly useful yet, because we still haven't defined run_simulation, so let's do that now.

# ...
from random import choice

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.max_moves = kwargs.get('max_moves', 100)

    # ...

    def run_simulation(self):
        states_copy = self.states[:]
        state = states_copy[-1]

        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            winner = self.board.winner(states_copy)
            if winner:
                break

This adds the beginnings of the run_simulation method, which either selects a move using UCB1 or chooses a random move from the set of legal moves each turn until the end of the game. We have also introduced a configuration option for limiting the number of moves forward that the AI will play.

You may notice at this point that we are making a copy of self.states and adding new states to it, instead of adding directly to self.states. This is because self.states is the authoritative record of what has happened so far in the game, and we don't want to mess it up with these speculative moves from the simulations.

Now we need to start keeping statistics on the game states that the AI hits during each run of run_simulation. The AI should pick the first unknown game state it reaches to add to the tables.

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.wins = {}
        self.plays = {}

    # ...

    def run_simulation(self):
        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(self.max_moves):
            legal = self.board.legal_plays(states_copy)

            play = choice(legal)
            state = self.board.next_state(state, play)
            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in self.plays:
                expand = False
                self.plays[(player, state)] = 0
                self.wins[(player, state)] = 0

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in self.plays:
                continue
            self.plays[(player, state)] += 1
            if player == winner:
                self.wins[(player, state)] += 1

Here we've added two dictionaries to the AI, wins and plays, which will contain the counts for every game state that is being tracked. The run_simulation method now checks to see if the current state is the first new one it has encountered this call, and, if so, adds the state to both plays and wins, setting both values to zero. This method also adds every game state that it goes through to a set, and at the end updates plays and wins with those states in the set that are in the plays and wins dicts. We are now ready to base the AI's final decision on these statistics.

from __future__ import division
# ...

class MonteCarlo(object):
    # ...

    def get_play(self):
        self.max_depth = 0
        state = self.states[-1]
        player = self.board.current_player(state)
        legal = self.board.legal_plays(self.states[:])

        # Bail out early if there is no real choice to be made.
        if not legal:
            return
        if len(legal) == 1:
            return legal[0]

        games = 0
        begin = datetime.datetime.utcnow()
        while datetime.datetime.utcnow() - begin < self.calculation_time:
            self.run_simulation()
            games += 1

        moves_states = [(p, self.board.next_state(state, p)) for p in legal]

        # Display the number of calls of `run_simulation` and the
        # time elapsed.
        print games, datetime.datetime.utcnow() - begin

        # Pick the move with the highest percentage of wins.
        percent_wins, move = max(
            (self.wins.get((player, S), 0) /
             self.plays.get((player, S), 1),
             p)
            for p, S in moves_states
        )

        # Display the stats for each possible play.
        for x in sorted(
            ((100 * self.wins.get((player, S), 0) /
              self.plays.get((player, S), 1),
              self.wins.get((player, S), 0),
              self.plays.get((player, S), 0), p)
             for p, S in moves_states),
            reverse=True
        ):
            print "{3}: {0:.2f}% ({1} / {2})".format(*x)

        print "Maximum depth searched:", self.max_depth

        return move

We have added three things in this step. First, we allow get_play to return early if there are no choices or only one choice to make. Next, we've added output of some debugging information, including the statistics for the possible moves this turn and an attribute that will keep track of the maximum depth searched in the selection phase of the playouts. Finally, we've added code that picks out the move with the highest win percentage out of the possible moves, and returns it.

But we are not quite finished yet. Currently, our AI is using pure randomness for its playouts. We need to implement UCB1 for positions where the legal plays are all in the stats tables, so the next trial play is based on that information.

# ...
from math import log, sqrt

class MonteCarlo(object):
    def __init__(self, board, **kwargs):
        # ...
        self.C = kwargs.get('C', 1.4)

    # ...

    def run_simulation(self):
        # A bit of an optimization here, so we have a local
        # variable lookup instead of an attribute access each loop.
        plays, wins = self.plays, self.wins

        visited_states = set()
        states_copy = self.states[:]
        state = states_copy[-1]
        player = self.board.current_player(state)

        expand = True
        for t in xrange(1, self.max_moves + 1):
            legal = self.board.legal_plays(states_copy)
            moves_states = [(p, self.board.next_state(state, p)) for p in legal]

            if all(plays.get((player, S)) for p, S in moves_states):
                # If we have stats on all of the legal moves here, use them.
                log_total = log(
                    sum(plays[(player, S)] for p, S in moves_states))
                value, move, state = max(
                    ((wins[(player, S)] / plays[(player, S)]) +
                     self.C * sqrt(log_total / plays[(player, S)]), p, S)
                    for p, S in moves_states
                )
            else:
                # Otherwise, just make an arbitrary decision.
                move, state = choice(moves_states)

            states_copy.append(state)

            # `player` here and below refers to the player
            # who moved into that particular state.
            if expand and (player, state) not in plays:
                expand = False
                plays[(player, state)] = 0
                wins[(player, state)] = 0
                if t > self.max_depth:
                    self.max_depth = t

            visited_states.add((player, state))

            player = self.board.current_player(state)
            winner = self.board.winner(states_copy)
            if winner:
                break

        for player, state in visited_states:
            if (player, state) not in plays:
                continue
            plays[(player, state)] += 1
            if player == winner:
                wins[(player, state)] += 1

The main addition here is the check to see if all of the results of the legal moves are in the plays dictionary. If they aren't available, it defaults to the original random choice. But if the statistics are all available, the move with the highest value according to the confidence interval formula is chosen. This formula adds together two parts. The first part is just the win ratio, but the second part is a term that grows slowly as a particular move remains neglected. Eventually, if a node with a poor win rate is neglected long enough, it will begin to be chosen again. This term can be tweaked using the configuration parameter C added to __init__ above. Larger values of C will encourage more exploration of the possibilities, and smaller values will cause the AI to prefer concentrating on known good moves. Also note that the self.max_depth attribute from the previous code block is now updated when a new node is added and its depth exceeds the previous self.max_depth.

So there we have it. If there are no mistakes, you should now have an AI that will make reasonable decisions for a variety of board games. I've left a suitable implementation of Board as an exercise for the reader, but one thing I've left out here is a way of actually allowing a user to play against the AI. A toy framework for this can be found at jbradberry/boardgame-socketserver and jbradberry/boardgame-socketplayer.
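
For a quick self-play smoke test, though, a minimal driver loop might look like the following sketch (reusing the toy NimBoard from earlier as a stand-in for a real Board implementation):

board = NimBoard()                       # any object satisfying the Board interface
ai = MonteCarlo(board, time=10, max_moves=100, C=1.4)

state = board.start()
history = [state]
ai.update(state)

while not board.winner(history):
    move = ai.get_play()                 # run simulations, then choose a move
    state = board.next_state(state, move)
    history.append(state)
    ai.update(state)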

This version that we've just built uses light playouts. Next time, we'll explore improving our AI by using heavy playouts, by training some evaluation functions using machine learning techniques and hooking in the results.

UPDATE: The diagrams have been corrected to more accurately reflect the possible node values.

Caktus GroupPyCon 2016: Behind the Design

Having helped to design an award-winning event site for last year’s PyCon in Montreal, we are thrilled to collaborate again with the Python Software Foundation (PSF) on this year’s site for PyCon 2016.

PyCon 2016 will be held in Portland, OR and PSF knew they wanted to convey the distinctive mood and look of that city with the 2016 website. Working collaboratively with PSF, designer Trevor Ray drew on everything from the unique architecture of the city’s craftsman-style bungalow houses and the surrounding mountainous landscape to the cool color scheme of the Pacific Northwest region. The team developed a site that leads the user on a journey. As he or she scrolls, the user is brought further into the city, from the low, rolling, misty, forest-topped mountains on the outskirts, into the very heart of its neighborhoods.

Trevor also redesigned the PyCon logo for 2016, giving it a peaked shape, hand-lettered typography (a reference to the thriving craft community of Portland), and a classic, balanced layout. Ultimately, our team and PSF worked together to achieve a site that we hope is welcoming and functional.

We’re excited for this year’s PyCon, and we hope the site excites attendees as well. Only 249 days till PyCon!

Dave O'ConnorMySQL SQLAlchemy OS X Reference

Whenever I learn something new, I enjoy putting together a reference sheet of whatever it is I am learning. In regards to Python, this process typically involves working in the Python interpreter and diligently recording commands that worked as intended. Today I want to share the reference sheet I put together while exploring MySQL and [SQLAlchemy](http://www.sqlalchemy.org/) on OS

Tim HopperData Science and Agriculture

I'm excited to see tools developed for the web being applied to offline domains like agriculture and health. I posed a question on Twitter yesterday:

I got a number of replies. Companies in this space (not all hiring for DS) include:

  • The Climate Corporation: Known for their high profile acquisition by Monsanto, Climate Corp "combines hyper-local weather monitoring, agronomic modeling, and high-resolution weather simulations" "that help farmers improve profitability by making better informed operating and financing decisions". They are based in San Francisco.
  • FarmLogs: A YCombinator-backed startup in Ann Arbor, MI, FarmLogs is combining satellite maps, soil surveys, GPS data, and more to "expose critical operational data insights" to row crop farmers.
  • Dairymaster: In Ireland, Dairymaster is hiring data scientists for their business of dairy equipment manufacturing.
  • aWhere: aWhere "delivers agricultural intelligence into the hands of farmers, commercial growers, and policymakers everywhere" by collecting detailed weather data from around the world. They are based in Denver.
  • pulsepod: This small startup is building hardware to help farmers "know when to plant, fertilize, irrigate, and harvest to achieve quality and yield goals using data from your own field". They are based in Princeton, NJ.
  • Granular: "Granular, a new kind of farm management software and analytics platform, helps you improve efficiency, profit and yield so you are ready to farm more acres." They are based in San Francisco.
  • PrecisionHawk: Based in my home of Raleigh, NC, PrecisionHawk is "an information delivery company that combines unmanned aerial systems, cutting-edge artificial intelligence and remote sensing technologies to improve business operations and day-to-day decision making."
  • mavrx: "We use aerial imagery to provide actionable insights to the global agriculture industry." They are based in San Francisco.
  • AgSmarts: Memphis-based Agsmarts "is a Precision Agriculture company that automates existing agricultural irrigation systems with our universal retrofit solution to optimize crop yield, save water and minimize input costs via mesh networks of IP-enabled controllers & environmental sensors."

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models, Part 2

In an earlier post, I derived the necessary equations to sample table assignments when applying the "Posterior sampling in the Chinese restaurant franchise" algorithm to the case of latent Dirichlet allocation. I complete the derivation of necessary equations here.

We need to sample $k_{jt}$ (the dish/topic for table $t$ in restaurant $j$):

\[\displaystyle p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt}\cdot f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k \text{ previously used,}}\\ \gamma\cdot f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

where $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$ is the "conditional density of ${\bf x}_{jt}$ given all data items associated with mixture component $k$ leaving out ${\bf x}_{jt}$" (Teh, et al). (${\bf x}_{jt}$ is every customer in restaurant $j$ seated at table $t$). $m_{\cdot k}^{-jt}$ is the number of tables (across all restaurants) serving dish $k$ when we remove table $jt$.

This requires $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$; this is different from Equation (30), though they look quite similar.

\[ \begin{align} f_k^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} f(x_{ji} \,|\, \phi_k) \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} f(x_{i'j'}|\phi_k) \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \\ &=\frac{\displaystyle \int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } {\displaystyle \int \left[ \displaystyle\prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k } \end{align} \]

The denominator is

\[ \begin{align} \text{denominator}&= \int\left[ \prod_{x_{i'j'}\not\in {\bf x_{jt}}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}} \prod_w \phi_{kw}^{\beta-1} \right] d\phi_k \\ &=\int\left[ \prod_w \phi_{kw}^{n_{kw}^{-jt}+\beta-1} \right] d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( \sum_w n_{kw}^{-jt}+\beta \right) } \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) }{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) } \end{align} \]

The numerator is

\[ \begin{align} \text{numerator} &=\int {\prod_{x_{ji}\in {\bf x}_{jt}}} \phi_{k x_{ji}} \left[ \prod_{x_{i'j'}\not\in {\bf x}_{jt}, z_{i'j'}=k} \phi_{k x_{i'j'}} \right] \prod_{w} \phi_{kw}^{\beta-1} d\phi_k \\ &=\int \prod_{w} \phi_{kw}^{ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta - 1 } d\phi_k \\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( \sum_w \left[ n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right] \right) }\\ &=\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

This gives us a closed form version of this conditional distribution:

\[ \begin{align} f_k^{-{\bf x_{jt}}}({\bf x}_{jt}) &= \displaystyle\frac{ \prod_w \Gamma\left( n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta \right) }{ \prod_w \Gamma\left( n_{kw}^{-jt} + \beta \right) } \frac{ \Gamma\left( n_{k\cdot}^{-jt}+V\beta \right) }{ \Gamma \left( n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta \right) } \end{align} \]

We also need the conditional density for the case where $k$ is a new dish: $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$. Shuyo provides this without derivation:

\[ \begin{align} f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt}) &=\int \left[ \prod_{x_{ji}\in \mathbf{x_{jt}}} f(x_{ji} \,|\, \phi) \right] h(\phi)d\phi_k \\ &=\int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}} \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \prod_{x_{ji}\in \mathbf{x_{jt}}} \phi_{x_{ji}}^{\left(\beta+1\right)-1} \cdot \prod_{x_{jt}\not\in \mathbf{x_{jt}}} \phi_{x_{jt}}^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \prod_{x_{ji}\in \mathbf{x_{jt}}} \Gamma(\beta+1) \prod_{x_{ji}\not\in \mathbf{x_{jt}}}\Gamma(\beta) }{ \Gamma(V\beta+\sum_{x_{ji}\in \mathbf{x_{jt}}} 1) }\\ &=\frac{ \Gamma(V\beta) \prod_v \Gamma(\beta+n_{\cdot v}^{jt}) }{ \Gamma(V\beta+n_{\cdot\cdot}^{jt}) \prod_v \Gamma(\beta) } \end{align} \]

Given these equations for $f_{k}^{-{\bf x}_{jt}}({\bf x}_{jt})$ and $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$, we can draw samples from $p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt})$ by enumeration over topics. Combined with sampling from table assignment described here, we now have a complete Gibbs sampler for the Posterior sampling in the Chinese restaurant franchise in Teh, et al.
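
As a small illustration of that enumeration step, a dish for table $jt$ can be drawn from the unnormalized weights above roughly as follows (a sketch of my own; the variable names are illustrative, not from Teh et al or Shuyo):

import numpy as np

def sample_dish(f_used, m_k, gamma, f_new):
    # f_used[k]: f_k^{-x_jt}(x_jt) for each existing dish/topic k
    # m_k[k]:    number of tables serving dish k, with table jt removed
    # gamma:     top-level concentration parameter
    # f_new:     likelihood of x_jt under a brand-new dish
    weights = np.append(np.asarray(m_k, dtype=float) * np.asarray(f_used, dtype=float),
                        gamma * f_new)
    weights /= weights.sum()
    # Index len(f_used) means "create a new dish/topic".
    return np.random.choice(len(weights), p=weights)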

Dave O'ConnorWeb Scraping with Scrapy

In this post we will use the fast and powerful web scraper [Scrapy](http://scrapy.org/) to comb [theonion.com](http://www.theonion.com/) for horoscopes. Web scraping is one of the many tools in the Python tool belt and can be incredibly helpful in retrieving information from a website that does not provide an API endpoint. Unfortunately web pages do change and as such web

Dave O'ConnorBuilding a Simple REST API With Django Rest Framework

Today we will focus on building a simple REST API using the [Django Rest Framework](http://www.django-rest-framework.org/). RESTful services give access, typically over HTTP, to resources containing a wide variety of data and/or media. One of the great benefits of the Django Rest Framework is that it is able to get something tangible up and running very quickly. It integrates

Dave O'ConnorLet's Build a Web App: Part 4 - Model, Template, View

Before we jump into designing the front-end of our website, we should configure Django to render content in the browser. Django is a [MTV framework](https://docs.djangoproject.com/en/1.8/faq/general/), short for Model, Template, View. Remember, we are trying to enable an end user to query PostgreSQL and visualize access log data. We already have our Entry model so now we need a

Dave O'ConnorLet's Build a Web App: Part 5 - Bootstrap

In this article we will dive into Bootstrap and begin working on a front end for querying the data in our Entry models. [Bootstrap](http://getbootstrap.com/) is a popular front-end framework for quickly and cleanly styling a website. Created by the folks at Twitter, it uses a combination of HTML, CSS, and Javascript to standardize views across mobile, tablet, and

Dave O'ConnorExploring Python 3

In this article we will discuss Python 3 as it relates to Python 2 and explore some of the more obvious differences between the two versions. Python 3 was released on December 3rd, 2008 and has been a point of contention within the Python community. Unlike previous iterations of Python, this version broke backwards compatibility and effectively erased

Frank WierzbickiJython 2.7.1 beta1 released!

On behalf of the Jython development team, I'm pleased to announce that the first beta of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include:
  • Import systems fixes for relative imports and some circular imports.
  • -m command now executes scripts from inside a jar file.
  • bytearray matches cpython's behavior better.
Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:
To see all of the files available including checksums, go to the maven query for Jython and navigate to the appropriate distribution and version.

Tim HopperDistribution of Number of Tables in Chinese Restaurant Process

The Chinese Restaurant Process (related to the Dirichlet process) often comes up in Bayesian nonparametrics. Some related MCMC algorithms require drawing samples from the distribution of the number of tables created by a Chinese restaurant process with a fixed concentration parameter after a given number of patrons are seated. Of course, you could sample from this distribution by simulating the CRP and counting tables; however, Antoniak provided a closed form of the probability mass function. Here is a helpful introduction to the "Antoniak equation".

I wrote some quick and dirty Python code to sample from the Antoniak distribution. It's based on Matlab code by Yee Whye Teh.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Update (2015-09-21): I moved this post to a PDF on Github.

Tim HopperNotes on Gibbs Sampling in Hierarchical Dirichlet Process Models

Yee Whye Teh et al's 2005 paper Hierarchical Dirichlet Processes describes a nonparametric prior for grouped clustering problems. For example, the HDP helps in generalizing the latent Dirichlet allocation model to the case where the number of topics in the data is discovered by the inference algorithm instead of being specified as a parameter of the model.

The authors describe three MCMC-based algorithms for fitting HDP based models. Unfortunately, the algorithms are described somewhat vaguely and in general terms. A fair bit of mathematical leg work is required before the HDP algorithms can be applied to the specific case of nonparametric latent Dirichlet allocation.

Here are some notes I've compiled in my effort to understand these algorithms.

HDP-LDA Generative Model

The generative model for Hierarchical Dirichlet Process Latent Dirichlet Allocation is as follows:

\[ \begin{align} H & \sim \text{Dirichlet}(\beta) \\ G_0 \,|\, \gamma, H & \sim \text{DP}(\gamma, H) \\ G_j \,|\, \alpha_0, G_0 & \sim \text{DP}(\alpha_0, G_0) \\ \theta_{ji} \,|\, G_j & \sim G_j \\ x_{ij} \,|\, \theta_{ji} & \sim \text{Categorical}(\theta_{ji}) \\ \end{align} \]

  • $H$ is a Dirichlet distribution whose dimension is the size of the vocabulary, i.e. it is a distribution over an uncountably infinite set of term distributions (topics).
  • $G_0$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • For each document $j$, $G_j$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.
  • $\theta_{ji}$ is a categorical distribution over terms, i.e. a topic.
  • $x_{ij}$ is a term.

To see code for sampling from this generative model, see this post.

Chinese Restaurant Franchise Approach

Instead of the above Dirichlet process model, we can think of an identical "Chinese Restaurant Franchise" model.

Each $\theta_{ji}$ is a customer in restaurant $j$. Each customer is sitting at a table, and each table has multiple customers.

There is a global menu of $K$ dishes that the restaurants serve, $\phi_1,\ldots,\phi_K$.

Some other definitions:

  • $\psi_{jt}$ is the dish served at table $t$ in restaurant $j$; i.e. each $\psi_{jt}$ corresponds to some $\phi_k$.
  • $t_{ji}$ is the index of the $\psi_{jt}$ associated with $\theta_{ji}$.
  • $k_{jt}$ is the index of $\phi_k$ associated with $\psi_{jt}$.

Customer $i$ in restaurant $j$ sits at table $t_{ji}$ while table $t$ in restaurant $j$ serves dish $k_{jt}$.

There are two arrays of count variables we will want to track:

  • $n_{jtk}$ is the number of customers in restaurant $j$ at table $t$ eating dish $k$.
  • $m_{jk}$ is the number of tables in restaurant $j$ serving dish $k$ (multiple tables may serve the same dish).

To summarize:

$x_{ij}$ are observed data (words). We assume $x_{ij}\sim F(\theta_{ij})$. Further, we assume $\theta_{ji}$ is associated with table $t_{ji}$, that is $\theta_{ji}=\psi_{jt_{ji}}$. Further, we assume the topic for table $t$ in restaurant $j$ is indexed by $k_{jt}$, i.e. $\psi_{jt}=\phi_{k_{jt}}$. Thus, if we know $t_{ji}$ (the table assignment for $x_{ij}$) and $k_{jt}$ (the dish assignment for table $t$) for all $i, j, t$, we could determine the remaining parameters by sampling.

Gibbs Sampling

Teh et al describe three Gibbs samplers for this model. The first and third are most applicable to the LDA application. The second helps with more complicated applications (e.g. the hidden Markov model).

5.3 Posterior sampling by direct assignment

Section 5.3 describes a direct assignment Gibbs sampler that directly assigns words to topics by augmenting the model with an assignment variable $z_{ji}$ that is equivalent to $k_{jt_{ji}}$. This also requires a count variable $m_{jk}$: the number of tables in document/franchise $j$ assigned to dish/topic $k$. This sampler requires less "bookkeeping" than the algorithm from 5.1, however it requires expensive simulation or computation of recursively computed Stirling numbers of the first kind.

5.1 Posterior sampling in the Chinese restaurant franchise

Section 5.1 describes "Posterior sampling in the Chinese restaurant franchise". Given observed data $\mathbf{x}$ (i.e. documents), we sample over the index variables $t_{ji}$ (associating tables with customers/words) and $k_{jt}$ (associating tables with dishes/topics). Given these variables, we can reconstruct the distribution over topics for each document and distribution over words for each topic.

Notes on Implementing Algorithm 5.1

Teh et al's original HDP paper is sparse on details with regard to applying these samplers to the specific case of nonparametric LDA. For example, both samplers require computing the conditional distribution of word $x_{ji}$ under topic $k$ given all data items except $x_{ji}$: $f_k^{-x_{ji}}(x_{ji})$ (eq. 30).

A Blogger Formerly Known As Shuyo has a brief post where he states (with little-to-no derivation) the equations specific to the LDA case. Below I attempt to provide some of those derivations in pedantic detail.

As stated above, in the case of topic modeling, $H$ is a Dirichlet distribution over term distributions and $F$ is a multinomial distribution over terms.

By definition,

\[ h(\phi_k)=\frac{1}{Z}\prod_v[\phi_k]_v^{\beta-1} \]


\[ f(x_{ji}=v \,|\, \phi_k)=\phi_{kv}. \]

Equation (30): $f_k^{x_{ji}}(x_{ji})$

For convenience, define $v=x_{ji}$ (word $i$ in document $j$), define $k=k_{jt_{ji}}$ (topic assignment for the table in document $j$ containing word $i$), and

\[ n_{kv}^{-ji}=\left|\left\{ x_{mn} \,|\, k_{mt_{mn}}=k,\, x_{mn}=v,\, (m,n)\neq(j,i) \right\}\right| \]

(the number of times the term $x_{ji}$, besides $x_{ji}$ itself, is generated by the same topic as was $x_{ji}$).

First look at the term (for fixed $k$):

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k)= \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} [\phi_{k}]_{x_{j'i'}} \]

$[\phi_k]_v$ is the probability that term $v$ is generated by topic $k$. The double product runs over every word generated by topic $k$ in every document. Since $[\phi_{k}]_{x_{j'i'}}$ is fixed for a given word $w$, we could instead take the product over each word of the vocabulary:

\[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) =\prod_{w\in\mathcal{V}}[\phi_k]_w^{n_{kw}^{-ji}} \]

We use this early on in the big derivation below.

Also, note that

\[ \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \text{ and } \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \]

are the normalizing coefficients for Dirichlet distributions.

Equation (30) in Teh's paper is:

\[ \begin{align} f_k^{-x_{ji}}(x_{ji}) &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'} \,|\, \phi_k) \right] h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right] h(\phi_k)d\phi_k } \\ &=\frac{ \int f(x_{ji} \,|\, \phi_k) \left[ \prod_{j'} \prod_{i'\neq i, z_{j'i'}=k} \phi_{kv} \right]\cdot h(\phi_k) d\phi_k } { \int \left[ \prod_{j'i'\neq ji, z_{j'i'}=k} f(x_{j'i'}|\phi_k) \right]h(\phi_k)d\phi_k } \\ &\propto\frac{ \int \phi_{kv} \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_{w} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &=\frac{ \int \phi_{kv}\cdot \phi_{kv}^{n_{kv}^{-ji}}\cdot \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}}\cdot \phi_{kv}^{\beta-1}\cdot \prod_{w\neq v} \phi_{kw}^{\beta-1} d\phi_k }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}} \prod_w \phi_{kw}^{\beta-1} d\phi_k }\\ &= \int \phi_{kv}^{n_{kv}^{-ji}+\beta} \prod_{w\neq v} \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k \cdot \frac{ 1 }{ \int \prod_w \phi_{kw}^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\neq v} \left[ \beta+n_{kw}^{-ji} \right]+ (\beta+n_{kv}^{-ji}+1)\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ 1 }{ \int\prod_w \left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1} d\phi_k }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] +1\right) }\cdot \frac{ \Gamma\left( \sum_{w\in\mathcal{V}} \left[ \beta+n_{kw}^{-ji} \right] \right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kv}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right) }\cdot \frac{ \prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right) }{ \prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kn}^{-ji}+1\right) \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) \Gamma\left(\beta+n_{kv}^{-ji}\right) }\\ &=\frac{ \Gamma\left(\beta+n_{kn}^{-ji}+1\right) }{ \Gamma\left(\beta+n_{kv}^{-ji}\right) }\cdot \frac{ \Gamma\left(V\beta+n_{k\cdot}^{-ji}\right) }{ \Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right) }\\ &=\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \end{align} \]

This seems to be validated in the appendix of this paper.
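
This closed form is cheap to evaluate in code. A sketch of my own (assuming a hypothetical topic-by-vocabulary count matrix from which $x_{ji}$ has already been removed):

import numpy as np

def word_likelihood(n_kw, v, beta):
    # n_kw: array of shape (K, V); n_kw[k, w] counts how often term w is
    #       currently assigned to topic k, excluding the word being resampled.
    # v:    vocabulary index of x_ji.
    # Returns f_k^{-x_ji}(x_ji) = (beta + n_kv) / (V*beta + n_k.) for every topic k.
    V = n_kw.shape[1]
    return (beta + n_kw[:, v]) / (V * beta + n_kw.sum(axis=1))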


We also need the prior density of $x_{ji}$ to compute the likelihood that $x_{ji}$ will be seated at a new table.

\[ \begin{align} f_{k^{\text{new}}}^{-x_{ji}}(x_{ji}) &= \int f(x_{ji} \,|\, \phi) h(\phi)d\phi_k \\ &=\int \phi_v \cdot \frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \prod_w \phi_w^{\beta-1} d\phi \\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v \cdot \prod_w \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) } \int \phi_v^\beta \cdot \prod_{w\neq v} \phi_w^{\beta-1} d\phi\\ &=\frac{ \Gamma(V\beta) }{ \prod_w \Gamma(\beta) }\cdot \frac{ \Gamma(\beta+1)\prod_{w\neq v}\Gamma(\beta) }{ \Gamma(V\beta+1) }\\ &=\frac{ \Gamma(V\beta) }{ \Gamma(V \beta+1) }\cdot \frac{ \beta\prod_{w}\Gamma(\beta) }{ \prod_w \Gamma(\beta) }\\ &=\frac{ 1 }{ V \beta }\cdot \beta\\ &= \frac{1}{V} \end{align} \]

Equation (31): $p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})$

These last two derivations give us Equation (31), the likelihood that $t_{ji}=t^{new}$:

\[ \begin{align} p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k}) &=\sum_{k=1}^K \left[ \frac{ m_{\cdot k} }{ m_{\cdot\cdot}+\gamma }\cdot \frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} \right] \\ &\phantom{=}+ \frac{ \gamma }{ m_{\cdot\cdot}+\gamma } \cdot \frac{1}{V} \end{align} \]

Equation (32): $p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k})$

From this, we know the conditional distribution of $t_{ji}$ is:

\[ p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k}) \propto \begin{cases} n_{jt\cdot}^{-ji} \cdot \frac{\beta+n_{k_{jt}v}^{-ji}}{V\beta+n_{k_{jt}\cdot}^{-ji}} & {\tiny \text{if } t \text{ previously used,}}\\ {\tiny \alpha_0 \cdot p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})} & {\tiny \text{if } t=t^{\text{new}}}. \end{cases} \]

Equation (33): $p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}})$

If the sampled value of $t_{ji}$ is $t^{\text{new}}$, we sample a dish $k_{jt^\text{new}}$ for the table with:

\[ p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}}) \propto \begin{cases} m_{\cdot k}\cdot\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}} & {\tiny \text{if } k \text{ previously used,}}\\ \frac{ \gamma }{V} & {\tiny \text{if } k=k^{\text{new}}}. \end{cases} \]

My notes on deriving the final equation (34) for sampling $k$ are in rough shape. I intend to follow this with a post on that. In the meantime, you can see Shuyo's post, his implementation, and equation 7 in this paper.

Tim HopperClasses Future Programmers Should Take

I appreciated James Hague's post on Computer Science Courses that Don't Exist, But Should.

I really like Dave Thomas's idea of a Software Archeology class. I have spent a huge amount of time as a developer reading (vs. writing) code. I wish I'd been taught how to read code effectively.

Similarly, I have thought there should be a class (or series of classes) in "interacting with others' code". Topics could include inheriting a software project, handing off a software project, writing code using 3rd party libraries, using package repositories, and understanding software licenses. These are such important skills in real world software development, but they seem to be rarely taught in the classroom. Perhaps a follow-up class could be "contributing to open source".

Tim HopperFitting a Mixture Model with Gibbs Sampling

In an attempt to clarify my own understanding of how Gibbs samplers are derived for Bayesian models, I collected some notes on the sampler for a finite mixture model into an IPython notebook.

I think most of the literature on Bayesian computation is terrible. There's lots of handwaving with cavalier appeals to conjugacy and "integrating out". There aren't a lot of clear derivations of the equations needed for doing sampling in MCMC. As I have tried to write about it myself, I have a better appreciation of why that might be: this is complex math with a lot of moving parts. Given that, I would appreciate constructive feedback on how I could make this more clear or helpful. Even better, fork my repo and submit a pull request.

Tim HopperDealing with Outliers

From Testicular volume is inversely correlated with nurturing-related brain activity in human fathers in PNAS:

One participant's testes volume measurement was excluded because his value was 2.8 SDs above the mean (mean = 38,064; SD = 11,183) and was more than 13,000 mm^3 larger than any recorded value found in the literature. Of the more than 1,500 healthy, age-matched men in these studies, the largest reported value was 56,000 mm^3, and this participant’s measurement was 69,736 mm^3.

Tim HopperCross Entropy and KL Divergence

As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$

Kullback and Leibler defined a similar measure now known as KL divergence. This measure quantifies how similar a probability distribution $p$ is to a candidate distribution $q$.

$$D_{\text{KL}}(p\| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$

$D_\text{KL}$ is non-negative and zero if and only if $p_i=q_i$ for all $i$. However, it is important to note that it is not in general symmetric: $D_{\text{KL}}(p\| q) \neq D_{\text{KL}}(q\| p)$.

Jonathon Shlens explains that KL Divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p\| q)=0$, we can guarantee that $p$ is generated by $q$. As $D_{\text{KL}}(p\| q)\rightarrow\infty$, we can say that it is increasingly unlikely that $p$ was generated by $q$.

Algebraically, we can rewrite the definition as

$$ \begin{array}{rl} D_{\text{KL}}(p\| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i-H(p) \\ &=\sum_i p_i \log \frac{1}{q_i}-H(p)\\ \end{array} $$

KL Divergence breaks down as something that looks similar to entropy (but combining $p$ and $q$) minus the entropy of $p$. This first term is often called cross entropy:

$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$

We could alternatively use this relationship to define cross entropy as:

$$H(p, q)=H(p) + D_\text{KL}(p\| q).$$

Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the likelihood that $p$ could have been generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by a constant for all $q$.
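
These quantities are straightforward to compute for discrete distributions. A small sketch (assuming $p$ and $q$ are arrays of probabilities over the same support, with no zero entries in $q$, and using natural logarithms):

import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return np.sum(p * np.log(1.0 / q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])
# Check the identity H(p, q) = H(p) + D_KL(p || q).
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))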

Tim HopperEntropy of a Discrete Probability Distribution

Suppose we have a discrete set of possible events $1,\ldots, n$ that occur with probabilities $p_1, p_2, \ldots, p_n$. Claude Shannon asked the question

Can we find a measure of how much "choice" is involved in the selection of the event or of how uncertain we are of the outcome?

For example, suppose we have a coin that lands on heads with probability $p$ and tails with probability $1-p$. If $p=1$, the coin always lands on heads. Since there is no uncertainty, we might want to say the uncertainty is 0. However, if the coin is fair and $p=0.5$, we maximize our uncertainty: it's a complete tossup whether the coin is heads or tails. We might want to say the uncertainty in this case is 1.

In general, Shannon wanted to devise a function $H(p_1, p_2, \ldots, p_n)$ describing the uncertainty of an arbitrary set of discrete events (i.e. a $n$-sided die). He thought that "it is reasonable" that $H$ should have three properties:

  1. $H$ should be a continuous function of each $p_i$. A small change in a single probability should result in a similarly small change in the entropy (uncertainty).
  2. If each event is equally likely ($p_i=\frac{1}{n}$), $H$ should increase as a function of $n$: the more events there are, the more uncertain we are.
  3. Finally, entropy should be recursive with respect to independent events. Suppose we generate a random variable $x$ by the following process: Flip a fair coin. If it is heads, $x=0$. If the flip was tails, flip the coin again. If the second flip is heads, $x=1$; if tails, $x=2$.

    We can compute the entropy as $H(p_0=1/2, p_1=1/4, p_2=1/4)$. However, the independence property tells us that this relationship should hold:

    $$H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}\right)=H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{1}{2}, \frac{1}{2}\right).$$

    As David MacKay explains, this is the general claim that

    $$H(\mathbf{p})=H(p_1, 1-p_1)+(1-p_1)H \left( \frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \ldots, \frac{p_n}{1-p_1} \right).$$

Shannon showed that given these three assumptions, there is a unique form that $H$ must take:

$$H\propto -\sum_{i=1}^n p_i \log p_i=\sum_{i=1}^n p_i \log \frac{1}{p_i}.$$

He named this measure of uncertainty entropy, because the form of $H$ bears striking similarity to that of Gibbs Entropy in statistical thermodynamics.1
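
As a quick sanity check, here is a small sketch (using base-2 logarithms, so uncertainty is measured in bits) of the coin example and the recursion property above:

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 log(1/0) as 0
    return np.sum(p * np.log2(1 / p))

print(H([1.0]))         # 0.0 -- a certain outcome carries no uncertainty
print(H([0.5, 0.5]))    # 1.0 -- a fair coin is maximally uncertain (1 bit)

# The recursion for the two-stage coin flip:
print(H([0.5, 0.25, 0.25]))                     # 1.5
print(H([0.5, 0.5]) + 0.5 * H([0.5, 0.5]))      # also 1.5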

Shannon observes that $H$ has many other interesting properties:

  1. Entropy $H$ is 0 if and only if exactly one event has probability 1 and the rest have probability 0. (Uncertainty vanishes only when we are certain about the outcomes.)
  2. Entropy $H$ is maximized when the $p_i$ values are equal.
  3. The joint entropy of two events is less than or equal to the sum of the individual entropies. $H(x, y)=H(x)+H(y)$ only when $x$ and $y$ are independent events.
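
Here is a similar sketch checking properties 2 and 3 numerically (the joint distributions below are made up purely for illustration):

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return np.sum(p * np.log2(1 / p))

# Property 2: for a fixed number of events, the uniform distribution maximizes H.
print(H([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(H([0.7, 0.1, 0.1, 0.1]))       # ~1.36, strictly less

# Property 3: joint entropy vs. the sum of the marginal entropies.
px, py = np.array([0.5, 0.5]), np.array([0.25, 0.75])
independent = np.outer(px, py)                    # H(x, y) == H(x) + H(y)
dependent = np.array([[0.25, 0.25], [0.0, 0.5]])  # same marginals, but correlated
print(H(independent), H(px) + H(py))              # equal (~1.81)
print(H(dependent), H(px) + H(py))                # 1.5 < ~1.81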

You can read more about this in Shannon's seminal paper A Mathematical Theory of Communication.

  1. Caianiello and Aizerman say the name entropy is thanks to von Neumann who said

    You should call it entropy and for two reasons: first, the function is already in use in thermodynamics under that name; second and more importantly, most people don't know what entropy really is, and if you use the word.

    They argue that the name "uncertainty" would have been much more helpful since "Shannon entropy is simply and avowedly the 'measure of the uncertainty inherent in a pre-assigned probability scheme.'" 

Caktus GroupCaktus at DjangoCon 2015

Django is kind of our thing, so we’ve been looking forward to DjangoCon 2015 for months! Now that it is finally here, we thought we would give a little preview of what Cakti will be up to at this year’s conference.

Tuesday: Django Authors Panel (1:30 pm)

Meet Caktus Technical Director and co-author of Lightweight Django Mark Lavin at this panel of people who write about Django. Learn about the experience of writing or recording video about Django from a collection of talented tech authors!

Wednesday: Intro to Client-Side Testing (11:50 am)

You've just built a REST API and client-side application to consume it. With all your focus on architecture, you missed a critical piece: how do you test it? In this intermediate-level talk, Mark Lavin will look at some of the tools available to test this style of application with both Javascript unit tests and integration tests written in Python.

Wednesday: Lightweight Django Book Signing (2:50 pm)

Come meet co-author of Lightweight Django Mark Lavin and have him sign your copy. You can also say hello to everyone at our Caktus Group table!

Thursday: DjangoGirls Austin Workshop (9:00 am)

We’re a proud DjangoGirls sponsor and organizer of DjangoGirls RDU. For this workshop, David Ray and Mark Lavin will coach participants, teaching them to create their first Django app.

Tim HopperProfile in Computational Imagination

I recently had the honor of being interviewed by Michael Swenson for his interview series called "Profiles in Computational Imagination". I talked a bit about my current work, my wandering road to data science, and my love for remote work. You can read it here.

Og MacielBooks - August 2015

Books - August 2015

This August 2015 I took a break from work and spent about 6 days enjoying some R&R down at the North Carolina shore with my family. I managed to get through some of the books that had been waiting a long time for me to get to, as well as try some new authors.


  • The Sentinel by Arthur C. Clarke

    I forgot where I read about how the short story "The Sentinel" was the inspiration for "2001: A Space Odyssey", but being that I have always considered the latter a great book and movie, I managed to grab a copy of the anthology "The Sentinel" just so that I could read the short story by the same name and see what else Arthur C. Clarke "had to offer." Interestingly enough (to me), most if not all the other short stories included in this collection could easily be published today and still feel just as futuristic as they probably were back when they were first published! This was yet another one of the books that I read by the beach this Summer and though it didn't blow me away, it was still a very relaxing read.

  • A Princess of Mars by Edgar Rice Burroughs

    This last June my family gave me "John Carter of Mars", containing the complete series, for my birthday, and I was just itching for a good opportunity to start reading it. That chance came up this week as I started reading some of Melville's short stories and found that I needed a bit of a "break". First off, I have never read Edgar Rice Burroughs before but I do have a copy of "Tarzan of the Apes" also awaiting its chance, so I had an idea about what to expect from his style. Sure enough, reading "A Princess of Mars" felt like taking a trip down memory lane, back when it was easy to tell who the good and the bad guys were, and there was always a damsel in distress somewhere waiting to be rescued. I have to confess that it took me a few chapters to get re-acclimated with this style, but once I got into it, it was easy reading, which is exactly what I was looking for anyhow.

    John Carter, the main character, shows all the cliché virtues one would expect from a "hero", but one thing that bothered me a bit was the language he used to describe those who were different from him (which was mostly everyone in the story, since they were all Martians) and the way he treated them. It felt a bit abusive and even a bit racist? I don't know if someone could get away with writing in the same style today, but then again I remembered that back then people were not as politically correct as we are today... or maybe I was reading too much into it? Anyhow, it was a fun read and I think I will try to add the next 4 books of the series in the coming months so that I can hopefully form a better opinion about the author.

Caktus GroupComposting at Caktus

At Caktus we have an employee suggestion policy that has been the birthplace of tons of great ideas, from tech community yoga to a pair programming station.

The most recent addition to our workplace comes from the suggestion of Developer Rebecca Muraya: composting at Caktus! We’re partnering with Compost Now to decrease our waste. Compost Now helps to divert more than ⅓ of waste away from landfills while improving local soil for farmers by increasing the availability of all natural compost. There is nothing better for keeping soil healthy than a good supply of organic compost, and we’re thrilled to be contributing to keep worms, plants, and farmers happy.

Caktus GroupMaking Clean Code a Part of Your Build Process (And More!)

At Caktus, "clean" (in addition to "working"!) code is an important part of our delivery. For all new projects, we achieve that by using flake8. flake8 is a wrapper around several tools: pep8, pyflakes, and McCabe. pep8 checks to make sure your code matches the PEP 0008 style guidelines, pyflakes looks for a few additional things like unused imports or variables, and McCabe raises warnings about overly complex sections of code.

We usually check code formatting locally before committing, but we also have safety checks in place in case someone forgets (as I more than anyone have been known to do!). This prevents code formatting standards from slowly eroding over time. We accomplish this by making our continuous integration (CI) server "fail" the build if unclean code is committed. For example, using Travis CI, simply adding the following line to the script: section of your .travis.yml will fail the build if flake8 detects any formatting issues (the command returns a non-zero exit code if errors are found):

- flake8 .

You can adjust flake8 defaults by adding a file named setup.cfg to the top level of your repository. For example, we usually relax the 80 character limit a little and exclude certain automatically generated files:
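
A representative setup.cfg along these lines might look like the following (the exact limit and exclusion patterns here are illustrative):

[flake8]
max-line-length = 100
exclude = .git,migrations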


As a result, you not only have code that is more readable for everyone, but you also avoid actual errors. For example, flake8 will detect missing imports or undefined variables in code paths that might not be tested by your unit test suite.

Adding flake8 to an older project

This is all well and good for new projects, but bringing old projects up to speed with this approach can be a challenge. I recently embarked on such a task myself and thought I'd share what I learned.

I started by adding a setup.cfg to the project and running flake8 on my source tree:

$ flake8 --count

The result: a whopping 1798 warnings. Many of these turned out to be pep8's "E128 continuation line under-indented for visual indent":

$ flake8 --select=E128 --count

In other words, in a huge number of cases, we weren't indenting multi-line continuations the way pep8 wanted us to. Other errors included things like not having a space after commas (E231), or not having two spaces before inline comments (E261). While many editors do support automatically fixing errors like this, doing so manually would still be tedious: In this project we had nearly 250 Python source files.

Enter autopep8 and autoflake. These tools purport to automatically fix pep8- and pyflakes-related issues. There are two ways to use these tools: wholesale, fixing all errors across the project at once; or more granularly, addressing only a single group of similar errors at a time.

Addressing all errors at once

This approach is best for smaller projects or those with just a few errors (50-100) to address:

$ pip install autoflake
$ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports --remove-unused-variables

In my case, this resulted in changes across 39 files and reduced the number of flake8 errors from 1798 to 1726. Not a huge change, but a time saver nonetheless. autopep8 was even more impressive:

$ pip install autopep8
$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

This brought the number of changed files up to 160, and brought the number of warnings from 1726 down to 211. Note that autopep8 also supports an --aggressive option which allows non-whitespace changes. When I tried this, however, it only reduced the number of warnings from 211 to 198. I'll most likely fix those by hand.

Please note: If or when you're ready to try these commands on a project, you must first make sure you have no uncommitted changes locally. After each change, I also recommend (a) committing the results of each command as a separate commit so they're easier to unravel or review later, and (b) running your unit test suite to make sure nothing is inadvertently broken.

Addressing errors as groups (recommended)

While the above approach may work for smaller projects (not this one!), it can make code reviews difficult because all pyflakes or pep8 fixes are grouped together in a single commit. The more labor intensive but recommended approach is to address them in groups of similar errors. My colleague Rebecca Muraya recommended this approach and suggested the groups (thanks, Rebecca!):

  1. First, find and remove any unused imports:

    $ pip install autoflake
    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports

    After it's finished, review the changes, run your test suite, and commit the code.

  2. Now, do the same for unused variables:

    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-unused-variables

    Again, once complete, review the changes, run your test suite, and commit the code.

  3. Before moving on to pep8 errors, the following command provides an invaluable summary of errors not yet addressed:

    $ flake8 --statistics --count -qq
  4. Finally, autopep8 can be told to fix only certain error codes, like so:

    $ pip install autopep8
    $ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" --select="W291,W293" .

    This will remove trailing whitespace and trailing whitespace on blank lines. Once complete, review and commit your changes and move on to the next group of errors.

pep8's error codes are listed in detail on the autopep8 PyPI page and in the pep8 documentation. You can either group them yourself based on your preferences and the particular warnings in your project, or use the following as a guide:

  • Remove trailing whitespace, then configure your editor to keep it away:
    • W291 - Remove trailing whitespace.
    • W293 - Remove trailing whitespace on blank line.
  • Use your editor to find/replace all tabs, if any, with spaces, and then fix indentation with these error codes. This can have a semantic impact so the changes need to be reviewed carefully:
    • E101 - Reindent all lines.
    • E121 - Fix indentation to be a multiple of four.
  • Fix whitespace errors:
    • E20 - Remove extraneous whitespace.
    • E211 - Remove extraneous whitespace.
    • E22 - Fix extraneous whitespace around keywords.
    • E224 - Remove extraneous whitespace around operator.
    • E226 - Fix missing whitespace around arithmetic operator.
    • E227 - Fix missing whitespace around bitwise/shift operator.
    • E228 - Fix missing whitespace around modulo operator.
    • E231 - Add missing whitespace.
    • E241 - Fix extraneous whitespace around keywords.
    • E242 - Remove extraneous whitespace around operator.
    • E251 - Remove whitespace around parameter '=' sign.
    • E27 - Fix extraneous whitespace around keywords.
  • Adjust blank lines:
    • W391 - Remove trailing blank lines.
    • E301 - Add missing blank line.
    • E302 - Add missing 2 blank lines.
    • E303 - Remove extra blank lines.
    • E304 - Remove blank line following function decorator.
    • E309 - Add missing blank line (after class declaration).
  • Fix comment spacing:
    • E26 - Fix spacing after comment hash for inline comments.
    • E265 - Fix spacing after comment hash for block comments.
  • The following are aggressive fixes that can have semantic impact. It's best to do these one commit at a time and with careful review:
    • E711 - Fix comparison with None.
    • E712 - Fix comparison with boolean.
    • E721 - Use "isinstance()" instead of comparing types directly.
    • W601 - Use "in" rather than "has_key()".
    • W602 - Fix deprecated form of raising exception.
    • W603 - Use "!=" instead of "<>"

You can repeat steps 3 and 4 in the above with each group of error codes (or in the case of more aggressive fixes, single error codes) until they're all resolved. Once all the automatic fixes are done, you'll likely have some manual fixes left to do. Before those, you may want to see what remaining automatic fixes, if any, autopep8 suggests:

$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

Once all the errors have been resolved, add flake8 to your build process so you never have to go through this again.

Concluding Remarks

While I haven't finished all the manual edits yet as of the time of this post, I have reduced the number to about 153 warnings across the whole project. Most of the remaining warnings are long lines that pep8 wasn't comfortable splitting, for example, strings that needed to be broken up like so:

foo = ('a very '
       'long string')

Or other similar issues that couldn't be auto-corrected. To its credit, flake8 did detect two bugs, namely, a missing import (in some unused test code that should probably be deleted), and an instance of if not 'foo' in bar (instead of the correct version, if 'foo' not in bar).

My colleague Mark Lavin also remarked that flake8 does not raise warnings about variable naming, but the pep8-naming plugin is available to address this. The downside is that it doesn't like custom assertions which match the existing unittest style (i.e., assertOk, assertNotFound, etc.).

Good luck, and I hope this has been helpful!

Tim HopperOn Showing Your Work

Caktus GroupAWS load balancers with Django

We recently had occasion to reconfigure some of our existing servers to use Amazon Web Services Elastic Load Balancers in front of them. Setting this up isn't hard, exactly, but there are a lot of moving parts that have to mesh correctly before things start to work, so I thought I'd write down what we did.

All of these tools have lots of options and ways to use them. I'm not trying to cover all the possibilities here. I'm just showing what we ended up doing.

Our requirements

We had some specific goals we wanted to achieve in this reconfiguration.

  • There should be no outside requests sneaking in -- the only requests that should reach the backend servers are those that come through the load balancer. We'll achieve this by setting the backend servers' security group(s) to only allow incoming traffic on those ports from the load balancer's security group.
  • The site should only handle requests that have the right Host header. We achieve this already by Nginx configuration (server_name) and won't have to change anything.
  • Redirect any non-SSL requests to SSL. The load balancer can't do this for us (as far as I could see), so we just forward incoming port 80 requests to the server’s port 80, and let our existing port 80 Nginx configuration continue to redirect all requests to our https: URL.
  • All SSL connections are terminated at the load balancer. Our site certificate and key are only needed on the load balancer. The backend servers don't need to process encryption, nor do we need to maintain SSL configuration on them. We'll have the load balancers forward the connections, unencrypted, to a new listening port in nginx, 8088, because we're redirecting everything from port 80 to https. (We could have configured the port 80 server to figure out from the headers whether the connection came into the load balancer over SSL, but we didn't, figuring that using a separate port would be fool-proof.) If we were concerned about security of the data between the load balancer and the backend, for example if financial or personal information was included, we could re-encrypt the forwarded connections, maybe using self-signed certificates on the backend servers to simplify managing their configurations.
  • Strict-Transport-Security header - we add this already in our Nginx configuration and will include it in our port 8088 configuration.
  • We need to access backend servers directly for deploys (via ssh). We achieve this by keeping our elastic IP addresses on our backend servers so they have stable IP addresses, even though the load balancers don't need them.
  • Some of our servers use basic auth (to keep unreleased sites private). This is in our Nginx configuration, but we'll need to open up the health check URL to bypass basic auth, since the load balancers can't provide basic auth on health checks.
  • Sites stay up through the change. We achieve this by making the changes incrementally, and making sure at all times there's a path for incoming requests to be handled.

All the pieces

Here are all the pieces that we had to get in place:

  • The site's hostname is a CNAME for the elastic load balancer’s hostname, so that requests for the site go to the load balancer instead of the backend servers. Don’t use the load balancer IP addresses directly, since they’ll change over time.
  • The backend servers' security group allows incoming requests on ports 80 and 8088, but only from the load balancer's security group. That allows the load balancer to forward requests, but requests cannot be sent directly to the backend servers even if someone knows their addresses.
  • There's a health check URL on the backend server that the load balancer can access, and that returns a 200 status (not 301 or 401 or anything else), so the load balancers can determine if the backend servers are up (a minimal sketch of such a view appears after this list).
  • Apart from the health check, redirect port 80 requests to the https URL of the server (non-SSL to SSL), so that any incoming requests that aren't over SSL will be redirected to SSL.
  • Get the data about the request's origin from the headers where the load balancer puts it, and pass it along to Django in the headers that our Django configuration is expecting. This lets Django tell whether a request came in securely.
  • The load balancer must be in the same region as the servers (AWS requirement).
  • Keep the elastic IP on our backend server so we can use that to get to it for administration. Deploys and other administrative tasks can no longer use the site domain name to access the backend server, since it now points at the load balancer.
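
For reference, here is a minimal sketch of the kind of health check view described above; the view name, URL pattern, and the idea of later adding a database check are assumptions rather than our exact code:

# views.py -- a bare-bones health check that bypasses all authentication
from django.http import HttpResponse

def health(request):
    # Always return a plain 200; this could later be extended to verify
    # that the database or other backing services are reachable.
    return HttpResponse("OK")

# urls.py -- route /health/ to the view above (Django 1.x style)
# from django.conf.urls import url
# urlpatterns = [url(r'^health/$', health)]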

Where we started

Before adding the load balancer, our site was running on EC2 servers with Ubuntu. Nginx was accepting incoming requests on ports 80 and 443, redirecting all port 80 requests to https, adding basic auth on port 443 on some servers, proxying some port 443 requests to gunicorn with our Django application, and serving static files for the rest.

To summarize our backend server configuration before and after the change:

Before:

  • Port 80 redirects all requests to https://server_URL
  • Port 443 terminates SSL and processes requests
  • Server firewall and AWS security group allow all incoming connections on ports 80 and 443

After:

  • Port 80 redirects all requests to https://server_URL
  • Port 8088 processes requests
  • Server firewall and AWS security group allow port 80 and 8088 connections from the load balancer only, and no port 443 connections at all.

Steps in order

  • DNS: shorten the DNS cache time for the site domain names to something like 5 minutes, so when we start changing them later, clients will pick up the change quickly. We'll lengthen these again when we're done.
  • Django: if needed, create a new view for health checks. We made one at /health/ that simply returned a response with status 200, bypassing all authentication. We can enhance that view later to do more checking, such as making sure the database is accessible.
  • Nginx: We added a new port 8088 server, copying the configuration from our existing port 443 server, but removing the ssl directives. We did keep the line that added the Strict-Transport-Security header.
  • Nginx: Added configuration in our new port 8088 to bypass basic auth for the /health/ URL only.
  • Ufw: opened port 8088 in the Linux firewall.
  • AWS: opened port 8088 in the servers' security group - for now, from all source addresses so we can test easily as we go.
  • AWS: add the SSL certificate in IAM
  • AWS: create a new load balancer in the same region as the servers
  • AWS: configure the new load balancer:
  • configure to use the SSL certificate
  • set up a security group for the load balancer. It needs to accept incoming connections from the internet on ports 80 and 443.
  • instances: the backend servers this load balancer will forward to
  • health check: port 8088, URL /health/. Set the period and number of checks small for now, e.g. 30 seconds and 2 checks.
  • listeners: 80->80, 443 ssl -> 8088 non-ssl
  • Tests: Now stop to make sure things are working right so far:
  • The load balancer should show the instance in service (after the health check period has passed).
  • With the site domain set in your local /etc/hosts file to point at one of the load balancer's IP addresses, the site should work on ports 80 & 443
  • undo your local /etc/hosts changes since the load balancer IPs will change over time!

  • AWS: update the backend servers' security group to only accept 8088 traffic from the load balancer's security group

  • Test: the health check should still pass, since it's coming in on port 8088 from the load balancer.

  • DNS: update DNS to make the site domain a CNAME for the load balancer's A name

  • wait for DNS propagation
  • test: site should still work when accessed using its hostname.


These steps should now be safe to do, but it doesn't hurt to test again after each step, just to be sure.

  • Nginx: remove port 443 server from nginx configuration.
  • AWS: remove port 443 from backend servers’ security group. Configure ports 80 and 8088 to only accept incoming connections from the load balancer's security group.
  • Ufw: block port 443 in server firewall
  • AWS: in the load balancer health check configuration, lengthen the time between health checks, and optionally require more passing checks before treating an instance as live.
  • docs: Update your deploy docs with the changes to how the servers are deployed!
  • DNS: lengthen the cache time for the server domain name(s) if you had shortened it before.

Tim HopperDirichlet Process Notebooks

I created a dedicated Github repository for the recent posts I've been doing on Dirichlet processes.

I also added a note responding to some of Dan Roy's helpful push-back on my understanding of the term Dirichlet process.

Tim HopperReleasing to Anaconda.org via Travis-CI

I've spent the last few days figuring out how to co-opt Travis CI as a build server for Anaconda.org (the package management service previously known as Binstar). I need to be able release five conda packages to Anaconda built for both Linux and OS X. Travis is not the right tool for this job, but I have managed to make it work. Since I've talked to others who are wrestling with this same problem, I am sharing my solution here.

Each of my five projects lives in a separate repository on Github. We have each of them as a submodule in a common repository called release. Each of these projects has a build.sh and meta.yaml file (required by Anaconda.org) in a subdirectory called conda/PACKAGE-NAME.

In my release directory, I have a simple Python script called build.py. It takes the name of a submodule and an Anaconda.org channel as command line arguments. It then uses the Python sh module to call the conda build and binstar upload commands.

binstar upload requires authentication at Anaconda.org. I did this by using a Binstar token which I add to the binstar command after fetching it from an encrypted Travis environment variable.
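
As a rough sketch, a build.py along these lines might look something like this; the recipe layout, the BINSTAR_TOKEN variable name, and the binstar flags are assumptions for illustration:

import os
import sys

import sh  # the sh module exposes shell commands as Python functions

def build_and_upload(package, channel):
    recipe_dir = os.path.join(package, "conda", package)            # assumed repo layout
    sh.conda("build", recipe_dir)                                    # build the conda package
    built = str(sh.conda("build", recipe_dir, "--output")).strip()   # path to the built artifact
    token = os.environ["BINSTAR_TOKEN"]                              # assumed name of the encrypted Travis variable
    sh.binstar("-t", token, "upload", "--channel", channel, built)

if __name__ == "__main__":
    build_and_upload(sys.argv[1], sys.argv[2])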

I call build.py for each of the five projects from the script: section of my .travis.yml file. This will build each package on the Travis workers and then release to Anaconda.org.

There is no easy way to get Travis to build for Linux and OS X simultaneously. However, it can be tricked into building for one or the other by changing the language: specified in the .travis.yml file. (language: objective-c will force Travis to use an OS X worker; by default, it uses a Linux worker.) I wrote a fabric script that provides command line commands which will modify the language value in my .travis.yml file and then push the release repository to Github. Github triggers a Travis CI build which then deploys the repository to Anaconda.org!

If I want to cut a new release for OS X, I simply call $ fab release_osx; for Linux, I call $ fab release_linux. By default, this will release to the "main" channel on Anaconda.org. I can release to a different channel (e.g. "dev") with $ fab release_linux:dev. When specifying the channel, the fabfile will modify my .travis.yml file to set an environment variable that is picked up when calling build.py.

Finally, my .travis.yml file instructs Travis on preparing the build environment differently between operating systems by using Travis's built-in $TRAVIS_OS_NAME environment variable and calling appropriate setup scripts.1

Also, to update the submodules to origin/master, I created a fabric command: $ fab update. This command calls git submodule update --remote --rebase.

This certainly isn't a perfect solution, but it'll work for the time being. I certainly look forward to easier solutions being developed in the future!2

Thanks to my colleague Stephen Tu who laid a lot of the groundwork for this!

  1. You may notice I have files called travis.yml and .travis.yml. Originally, my fabfile just modified the latter on the fly. For clarity, I started using travis.yml as the canonical file and .travis.yml is what is generated by the fabfile commands and used by Travis. 

  2. Continuum is starting to provide paid, hosted build servers to do this very task 

Caktus GroupAnnouncing the Caktus Open Source Fellowship

We are excited to announce the creation and funding of a pilot program for open source contributions here at Caktus Group. This program is inspired by the Django Software Foundation’s fellowship as well as the Two Day Manifesto. For this program, Caktus seeks to hire a part-time developer for twelve weeks this fall for the sole purpose of contributing back to open source projects. Caktus builds web applications based on open source tools and the continued growth of these projects is important to us. Open source projects such as Python and Django have given so much to this company and this is one of many ways we are trying to give back.

We are looking for candidates who would love to collaborate with us in our offices in downtown Durham, NC to directly contribute to both open source projects created at Caktus as well as open source projects used by Caktus, such as Django. As previously mentioned, this will be a part-time position and will be taking the place of our normal fall internship program. If successful, we may expand this into a full-time position for future iterations.

I think this will be a great opportunity for a developer to get experience working with and contributing to open source. It could be a great resume builder for a relatively new developer where all of your work will be publicly visible. It could also be a fun break from the ordinary for a more experienced developer who would love to be paid to work on open source. Through our mentorship, we hope this program will empower people who otherwise would not contribute to open source.

Prior experience with Python is a general requirement, but no prior open source contributions are required. If you have specific projects you would like to contribute to, we would love to know about them during the application process.

We are looking forward to reviewing and selecting applicants over the next few weeks. You can find more details on the position as well as the application form here: https://www.caktusgroup.com/careers/#op-73393-caktus-open-source-fellow

Caktus GroupAnnouncing Django Girls RDU: Free Coding Workshop for Women

Django Girls Logo

We’re incredibly excited to announce the launch of Django Girls RDU, a group in NC’s Triangle region that hosts free one-day Django coding workshops for women. Django Girls is part of an international movement that’s helped 1,600 (and counting!) women learn how to code.

We originally got involved with Django Girls at the impetus of our Technical Director, Mark Lavin. While discussing our efforts to support women in tech, Mark was emphatic: there was this group, Django Girls, that was doing extraordinary work engaging women in Django. Clearly, we needed something like that locally. Luckily for us, Django Girls was coming to PyCon and we'd get to see first hand just how wonderful they are.

Mark Lavin coaching at Django Girls PyCon 2015

Four of our team members volunteered as coaches for a Django Girls workshop during PyCon 2015. There was nothing quite like seeing the impact Django Girls had on each attendee. The environment was warm and friendly. The tutorials for students, coaches, and organizers, were detailed and extremely well thought out. The passion of the Django Girls organizers was simply infectious. Out of a desire to prolong this excitement and share it with everyone we knew, we put together a team and applied to have a Django Girls RDU chapter. We’re honored to partner with such a wonderful group!

The first workshop will be on October 3rd; applications are due September 4th. We have five great coaches from Caktus and PyLadies RDU and each coach will work one-on-one with three students to build their first Django website, a blog. We’re looking for volunteers to coach and sponsor the workshop. Each additional coach means three more students we can accept. If you’d like to get involved, please email us at durham@djangogirls.org. And, of course, we’re also looking for women who want to learn how to code.

Not able to make the October 3rd meetup? You'll also find members of our team coaching at DjangoCon's Django Girls Austin. To learn about more Django Girl activities, please follow us @djangogirlsRDU or visit the DjangoGirls website.

Tim HopperA Programmer's Portfolio

I am convinced that a programming student hoping to get a job in that field should be actively building a portfolio online. Turn those class projects, presentations, and reports into Github repositories or blog posts! I felt vindicated as I read this anecdote in Peopleware:

In the spring of 1979, while teaching together in western Canada, we got a call from a computer science professor at the local technical college. He proposed to stop by our hotel after class one evening and buy us beers in exchange for ideas. That's the kind of offer we seldom turn down. What we learned from him that evening was almost certainly worth more than whatever he learned from us.

The teacher was candid about what he needed to be judged a success in his work: He needed his students to get good job offers and lots of them. "A Harvard diploma is worth something in and of itself, but our diploma isn't worth squat. If this year's graduates don't get hired fast, there are no students next year and I'm out of a job." So he had developed a formula to make his graduates optimally attractive to the job market. Of course he taught them modern techniques for system construction, including structured analysis and design, data-driven design, information hiding, structured coding, walkthroughs, and metrics. He also had them work on real applications for nearby companies and agencies. But the centerpiece of his formula was the portfolio that all students put together to show samples of their work.

He described how his students had been coached to show off their portfolios as part of each interview:

"I've brought along some samples of the kind of work I do. Here, for instance, is a subroutine in Pascal from one project and a set of COBOL paragraphs from another. As you can see in this portion, we use the loop-with-exit extension advocated by Knuth, but aside from that, it's pure structured code, pretty much the sort of thing that your company standard calls for. And here is the design that this code was written from. The hierarchies and coupling analysis use Myers' notation. I designed all of this particular subsystem, and this one little section where we used some Orr methods because the data structure really imposed itself on the process structure. And these are the leveled data flow diagrams that makeup the guts of our specification, and the associated data dictionary. ..."

In the years since, we've often heard more about that obscure technical college and those portfolios. We've met recruiters from as far away as Triangle Park, North Carolina, and Tampa, Florida, who regularly converge upon that distant Canadian campus for a shot at its graduates.

Of course, this was a clever scheme of the professor's to give added allure to his graduates, but what struck us most that evening was the report that interviewers were always surprised by the portfolios. That meant they weren't regularly requiring all candidates to arrive with portfolios. Yet why not? What could be more sensible than asking each candidate to bring along some samples of work to the interview?

Tim HopperNonparametric Latent Dirichlet Allocation

I wrote this in an IPython Notebook. You may prefer to view it on nbviewer.

In [1]:
%matplotlib inline
%precision 2

Latent Dirichlet Allocation is a generative model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determine (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a friendly introduction you should read.

Because LDA is a generative model, we can simulate the construction of documents by forward-sampling from the model. The generative algorithm is as follows (following Heinrich):

  • for each topic $k\in [1,K]$ do
    • sample term distribution for topic $\overrightarrow \phi_k \sim \text{Dir}(\overrightarrow \beta)$
  • for each document $m\in [1, M]$ do
    • sample topic distribution for document $\overrightarrow\theta_m\sim \text{Dir}(\overrightarrow\alpha)$
    • sample document length $N_m\sim\text{Pois}(\xi)$
    • for all words $n\in [1, N_m]$ in document $m$ do
      • sample topic index $z_{m,n}\sim\text{Mult}(\overrightarrow\theta_m)$
      • sample term for word $w_{m,n}\sim\text{Mult}(\overrightarrow\phi_{z_{m,n}})$

You can implement this with a little bit of code and start to simulate documents.
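
For instance, a minimal sketch of this forward-sampling procedure might look like the following (the parameter values are arbitrary choices for illustration):

import numpy as np

vocab = ['see', 'spot', 'run']
K, M, xi = 2, 5, 5                 # topics, documents, mean document length
alpha, beta = 1.0, 1.0             # Dirichlet parameters

phi = np.random.dirichlet([beta] * len(vocab), size=K)    # term distribution per topic
for m in range(M):
    theta = np.random.dirichlet([alpha] * K)              # topic distribution for document m
    doc = []
    for _ in range(np.random.poisson(xi)):
        z = np.random.choice(K, p=theta)                   # sample a topic index
        doc.append(np.random.choice(vocab, p=phi[z]))      # sample a term from that topic
    print(doc)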

In LDA, we assume each word in the document is generated by a two-step process:

  1. Sample a topic from the topic distribution for the document.
  2. Sample a word from the term distribution from the topic.

When we fit the LDA model to a given text corpus with an inference algorithm, our primary objective is to find the set of topic distributions $\underline \Theta$, term distributions $\underline \Phi$ that generated the documents, and latent topic indices $z_{m,n}$ for each word.

To run the generative model, we need to specify each of these parameters:

In [2]:
vocabulary = ['see', 'spot', 'run']
num_terms = len(vocabulary)
num_topics = 2 # K
num_documents = 5 # M
mean_document_length = 5 # xi
term_dirichlet_parameter = 1 # beta
topic_dirichlet_parameter = 1 # alpha

The term distribution vector $\underline\Phi$ is a collection of samples from a Dirichlet distribution. This describes how our 3 terms are distributed across each of the two topics.

In [3]:
from scipy.stats import dirichlet, poisson
from numpy import round
from collections import defaultdict
from random import choice as stl_choice
In [4]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
term_distributions = dirichlet(term_dirichlet_vector, 2).rvs(size=num_topics)
print term_distributions
[[ 0.41  0.02  0.57]
 [ 0.38  0.36  0.26]]

Each document corresponds to a categorical distribution across this distribution of topics (in this case, a 2-dimensional categorical distribution). This categorical distribution is a distribution of distributions; we could look at it as a Dirichlet process!

The base distribution of our Dirichlet process is a uniform distribution of topics (remember, topics are term distributions).

In [5]:
base_distribution = lambda: stl_choice(term_distributions)
# A sample from base_distribution is a distribution over terms
# Each of our two topics has equal probability
from collections import Counter
for topic, count in Counter([tuple(base_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 5066 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
count: 4934 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

Recall that a sample from a Dirichlet process is a distribution that approximates (but varies from) the base distribution. In this case, a sample from the Dirichlet process will be a distribution over topics that varies from the uniform distribution we provided as a base. If we use the stick-breaking metaphor, we are effectively breaking a stick one time and the size of each portion corresponds to the proportion of a topic in the document.

To construct a sample from the DP, we need to again define our DP class:

In [6]:
from scipy.stats import beta
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            return self.cache[i]
        else:
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

For each document, we will draw a topic distribution from the Dirichlet process:

In [7]:
topic_distribution = DirichletProcessSample(base_measure=base_distribution,
                                            alpha=topic_dirichlet_parameter)

A sample from this topic distribution is a distribution over terms. However, unlike our base distribution which returns each term distribution with equal probability, the topics will be unevenly weighted.

In [8]:
for topic, count in Counter([tuple(topic_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 9589 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
count: 411 topic: [0.40999999999999998, 0.02, 0.56999999999999995]

To generate each word in the document, we draw a sample topic from the topic distribution, and then a term from the term distribution (topic).

In [9]:
topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_distribution,
                                                    alpha=topic_dirichlet_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [10]:
for doc in documents.values():
    print doc
['see', 'run', 'see', 'spot', 'see', 'spot']
['see', 'run', 'see']
['see', 'run', 'see', 'see', 'run', 'spot', 'spot']
['run', 'run', 'run', 'spot', 'run']
['run', 'run', 'see', 'spot', 'run', 'run']

We can see how each topic (term-distribution) is distributed across the documents:

In [11]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 6 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 1
      count: 3 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
Doc: 2
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 2 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 3
      count: 5 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 4
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 1 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

To recap: for each document we draw a sample from a Dirichlet Process. The base distribution for the Dirichlet process is a categorical distribution over term distributions; we can think of the base distribution as an $n$-sided die where $n$ is the number of topics and each side of the die is a distribution over terms for that topic. By sampling from the Dirichlet process, we are effectively reweighting the sides of the die (changing the distribution of the topics).

For each word in the document, we draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

Given this formulation, we might ask if we can roll an infinite-sided die to draw from an unbounded number of topics (term distributions). We can do exactly this with a Hierarchical Dirichlet process. Instead of the base distribution of our Dirichlet process being a finite distribution over topics (term distributions), we will make it an infinite distribution over topics (term distributions) by using yet another Dirichlet process! This base Dirichlet process will have as its base distribution a Dirichlet distribution over terms.

We will again draw a sample from a Dirichlet Process for each document. The base distribution for the Dirichlet process is itself a Dirichlet process whose base distribution is a Dirichlet distribution over terms. (Try saying that five times fast.) We can think of this as a countably infinite die where each side is a distribution over terms for that topic. The sample we draw is a topic (distribution over terms).

For each word in the document, we will draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

These last few paragraphs are confusing! Let's illustrate with code.

In [12]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
base_distribution = lambda: dirichlet(term_dirichlet_vector).rvs(size=1)[0]

base_dp_parameter = 10
base_dp = DirichletProcessSample(base_distribution, alpha=base_dp_parameter)

This sample from the base Dirichlet process is our infinite sided die. It is a probability distribution over a countable infinite number of topics.

The fact that our die is countably infinite is important. The sampler base_distribution draws topics (term-distributions) from an uncountable set. If we used this as the base distribution of the Dirichlet process below, each document would be constructed from a completely unique set of topics. By feeding base_distribution into a Dirichlet Process (stochastic memoizer), we allow the topics to be shared across documents.

In other words, base_distribution will never return the same topic twice; however, every topic sampled from base_dp would be sampled an infinite number of times (if we sampled from base_dp forever). At the same time, base_dp will also return an infinite number of topics. In our formulation of the LDA sampler above, our base distribution only ever returned a finite number of topics (num_topics); there is no num_topics parameter here.

Given this setup, we can generate documents from the hierarchical Dirichlet process with an algorithm that is essentially identical to that of the original latent Dirichlet allocation generative sampler:

In [13]:
nested_dp_parameter = 10

topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_dp,
                                                    alpha=nested_dp_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [14]:
for doc in documents.values():
    print doc
['spot', 'spot', 'spot', 'spot', 'run']
['spot', 'spot', 'see', 'spot']
['spot', 'spot', 'spot', 'see', 'spot', 'spot', 'spot']
['run', 'run', 'spot', 'spot', 'spot', 'spot', 'spot', 'spot']
['see', 'run', 'see', 'run', 'run', 'run']

And here are the latent topics used:

In [15]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
      count: 1 topic: [0.22, 0.40000000000000002, 0.38]
Doc: 1
      count: 2 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.35999999999999999, 0.55000000000000004, 0.089999999999999997]
Doc: 2
      count: 4 topic: [0.11, 0.65000000000000002, 0.23999999999999999]
      count: 2 topic: [0.070000000000000007, 0.65000000000000002, 0.27000000000000002]
      count: 1 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
Doc: 3
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 2 topic: [0.25, 0.55000000000000004, 0.20000000000000001]
      count: 2 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
Doc: 4
      count: 3 topic: [0.40000000000000002, 0.23000000000000001, 0.37]
      count: 2 topic: [0.42999999999999999, 0.17999999999999999, 0.40000000000000002]
      count: 1 topic: [0.23000000000000001, 0.29999999999999999, 0.46000000000000002]

Our documents were generated by an unspecified number of topics, and yet the topics were shared across the 5 documents. This is the power of the hierarchical Dirichlet process!

This non-parametric formulation of Latent Dirichlet Allocation was first published by Yee Whye Teh et al.

Unfortunately, forward sampling is the easy part. Fitting the model on data requires complex MCMC or variational inference. There are a limited number of implementations of HDP-LDA available, and none of them are great.

Tim HopperSampling from a Hierarchical Dirichlet Process

This may be more readable on NBViewer.

In [137]:
%matplotlib inline

As we saw earlier, the Dirichlet process describes the distribution of a random probability distribution. The Dirichlet process takes two parameters: a base distribution $H_0$ and a dispersion parameter $\alpha$. A sample from the Dirichlet process is itself a probability distribution that looks like $H_0$. On average, the larger $\alpha$ is, the closer a sample from $\text{DP}(\alpha H_0)$ will be to $H_0$.

Suppose we're feeling masochistic and want to input a distribution sampled from a Dirichlet process as base distribution to a new Dirichlet process. (It will turn out that there are good reasons for this!) Conceptually this makes sense. But can we construct such a thing in practice? Said another way, can we build a sampler that will draw samples from a probability distribution drawn from these nested Dirichlet processes? We might initially try to construct a sample (a probability distribution) from the first Dirichlet process before feeding it into the second.

But recall that fully constructing a sample (a probability distribution!) from a Dirichlet process would require drawing a countably infinite number of samples from $H_0$ and from the beta distribution to generate the weights. This would take forever, even with Hadoop!

Dan Roy, et al helpfully described a technique of using stochastic memoization to construct a distribution sampled from a Dirichlet process in a just-in-time manner. This process provides us with the equivalent of the Scipy rvs method for the sampled distribution. Stochastic memoization is equivalent to the Chinese restaurant process: sometimes you get seated at an occupied table (i.e. sometimes you're given a sample you've seen before) and sometimes you're put at a new table (given a unique sample).

Here is our memoization class again:

In [162]:
from numpy.random import choice 
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            return self.cache[i]
        else:
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

Let's illustrate again with a standard normal base measure. We can construct a function base_measure that generates samples from it.

In [95]:
from scipy.stats import norm

base_measure = lambda: norm().rvs() 

Because the normal distribution has continuous support, we can generate samples from it forever and we will never see the same sample twice (in theory). We can illustrate this by drawing from the distribution ten thousand times and seeing that we get ten thousand unique values.

In [163]:
from pandas import Series

ndraws = 10000
print "Number of unique samples after {} draws:".format(ndraws), 
draws = Series([base_measure() for _ in range(ndraws)])
print draws.unique().size
Number of unique samples after 10000 draws: 10000

However, when we feed the base measure through the stochastic memoization procedure and then sample, we get many duplicate samples. The number of unique samples goes down as $\alpha$ increases.

In [164]:
norm_dp = DirichletProcessSample(base_measure, alpha=100)

print "Number of unique samples after {} draws:".format(ndraws), 
dp_draws = Series([norm_dp() for _ in range(ndraws)])
print dp_draws.unique().size
Number of unique samples after 10000 draws: 446

At this point, we have a function dp_draws that returns samples from a probability distribution (specifically, a probability distribution sampled from $\text{DP}(\alpha H_0)$). We can use dp_draws as a base distribution for another Dirichlet process!

In [155]:
norm_hdp = DirichletProcessSample(norm_dp, alpha=10)

How do we interpret this? norm_dp is a sampler from a probability distribution that looks like the standard normal distribution. norm_hdp is a sampler from a probability distribution that "looks like" the distribution norm_dp samples from.

Here is a histogram of samples drawn from norm_dp, our first sampler.

In [152]:
import pandas as pd
import matplotlib.pyplot as plt
pd.Series(norm_dp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_dp")

And here is a histogram for samples drawn from norm_hdp, our second sampler.

In [154]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_hdp")

The second plot doesn't look very much like the first! The level to which a sample from a Dirichlet process approximates the base distribution is a function of the dispersion parameter $\alpha$. Because I set $\alpha=10$ (which is relatively small), the approximation is fairly coarse. In terms of memoization, a small $\alpha$ value means the stochastic memoizer will more frequently reuse values already seen instead of drawing new ones.

This nesting procedure, where a sample from one Dirichlet process is fed into another Dirichlet process as a base distribution, is more than just a curiosity. It is known as a Hierarchical Dirichlet Process, and it plays an important role in the study of Bayesian Nonparametrics (more on this in a future post).

Without the stochastic memoization framework, constructing a sampler for a hierarchical Dirichlet process is a daunting task. We want to be able to draw samples from a distribution drawn from the second level Dirichlet process. However, to be able to do that, we need to be able to draw samples from a distribution sampled from a base distribution of the second-level Dirichlet process: this base distribution is a distribution drawn from the first-level Dirichlet process.

It might appear that we would need to fully construct the first-level sample (by drawing a countably infinite number of samples from the first-level base distribution). However, stochastic memoization allows us to construct the first distribution just-in-time, as it is needed at the second level.

We can define a Python class to encapsulate the Hierarchical Dirichlet Process as a subclass of the Dirichlet process sample class.

In [165]:
class HierarchicalDirichletProcessSample(DirichletProcessSample):
    def __init__(self, base_measure, alpha1, alpha2):
        first_level_dp = DirichletProcessSample(base_measure, alpha1)
        self.second_level_dp = DirichletProcessSample(first_level_dp, alpha2)

    def __call__(self):
        return self.second_level_dp()

Since the Hierarchical DP is a Dirichlet process inside of a Dirichlet process, we must provide it with both a first- and second-level $\alpha$ value.

In [167]:
norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=20)

We can sample directly from the probability distribution drawn from the Hierarchical Dirichlet Process.

In [170]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")

norm_hdp is not equivalent to the Hierarchical Dirichlet Process; it samples from a single distribution sampled from this HDP. Each time we instantiate the norm_hdp variable, we are getting a sampler for a unique distribution. Below we sample five times and get five different distributions.

In [180]:
for i in range(5):
    norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=10)
    _=pd.Series(norm_hdp() for _ in range(100)).hist()
    _=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")

In a later post, I will discuss how these tools are applied in the realm of Bayesian nonparametrics.

Tim HopperHigh Quality Code at Quora

I love this new post on Quora's engineering blog. The post states "high code quality is the long-term boost to development speed" and goes on to explain how they go about accomplishing this.

I've inherited large code bases at each of my jobs out of grad school, and I've spent a lot of time thinking about this question. At least on the surface, I love the solutions Quora has in place for ensuring quality code: thoughtful code review, careful testing, style guidelines, static checking, and intentional code cleanup.

Og MacielBooks - July 2015

Books - July 2015

This July 2015 I travelled to the Red Hat office in Brno, Czech Republic to spend some time with my teammates there, and I managed to get a lot of reading done between long plane rides and being jet lagged for many nights :) So I finally managed to finish up some of the books that had been lingering on my ToDo list and even managed to finally read a few of the books that together make up the Chronicles of Narnia, since I had never read them as a kid.


Out of all the books I read this month, I feel that All Quiet on the Western Front and The October Country were the ones I enjoyed reading the most, closely followed by Cryptonomicon, which took me a while to get through. The other books, with the exception of The Memoirs of Sherlock Holmes, helped me pass the time when I only wanted to be entertained.

All Quiet on the Western Front takes the prize for being one of the best books I have ever read! I felt that the way WWI was presented through the eyes of the main character was a great way to represent all the pain, angst and suffering that all sides of the conflict went through, without catering to any particular side or having an agenda. Erich Maria Remarque's style had me sometimes breathless, sometimes with a knot in the pit of my stomach as I 'endured' the many life-changing events that took place in the book. Is this an action-packed book about WWI? Will it read like a thriller? In my opinion, even though there are many chapters with gory details about killings and battles, the answer is a very bland 'maybe'. I think that the real 'star' of this book is its philosophical view of the war and how the main characters, all around 19-20 years of age, learn to deal with its long-lasting effects.

Now, I have been a huge fan of Ray Bradbury for a while now, and when I got The October Country for my birthday last month, I just knew that it would be time well spent reading it. For those of you who are more acquainted with his science fiction works, this book will surprise you as it shows you a bit of his 'darker' side. All of the short stories included in this collection deal with death, mysterious apparitions, and inexplicable endings, and are sure to spook you a little bit.

Cryptonomicon was at times slow, some other times funny and, especially toward the end, a very entertaining book. Weighing in at a hefty 1000 pages (depending on the edition you have, plus/minus 50 odd pages), this book covers two different periods in the lives of a number of different characters, past (around WWII) and present, all different threads eventually leading to a great finale. Alternating between past and present, the story takes us to the early days of how cryptology was 'officially invented' and used during the war, and how many of the events that took place back then were affecting the lives of some of the direct descendants of the main characters in our present day. As you go through the back and forth you start to gather bits and pieces of information that eventually connect all the dots of an interesting puzzle. It definitely requires a long-term commitment to go through it, but it was enjoyable and, as I mentioned before, it made me laugh in many places.

Caktus GroupUsing Unsaved Related Models for Sample Data in Django 1.8

Note: In between the time I originally wrote this post and it getting published, a ticket and pull request were opened in Django to remove allow_unsaved_instance_assignment and move validation to the model save() method, which makes much more sense anyways. It's likely this will even be backported to Django 1.8.4. So, if you're using a version of Django that doesn't require this, hopefully you'll never stumble across this post in the first place! If this is still an issue for you, here's the original post:

In versions of Django prior to 1.8, it was easy to construct "sample" model data by putting together a collection of related model objects, even if none of those objects was saved to the database. Django 1.8 - 1.8.3 adds a restriction that prevents this behavior. Errors such as this are generally a sign that you're encountering this issue:

ValueError: Cannot assign "...": "MyRelatedModel" instance isn't saved in the database.

The justification for this is that, previously, unsaved foreign keys were silently lost if they were not saved to the database. Django 1.8 does provide a backwards compatibility flag to allow working around the issue. The workaround, per the Django documentation, is to create a new ForeignKey field that removes this restriction, like so:

class UnsavedForeignKey(models.ForeignKey):
    # A ForeignKey which can point to an unsaved object
    allow_unsaved_instance_assignment = True

class Book(models.Model):
    author = UnsavedForeignKey(Author)

This may be undesirable, however, because this approach means you lose all protection for all uses of this foreign key, even if you want Django to ensure foreign key values have been saved before being assigned in some cases.

There is a middle ground, not immediately obvious, that involves changing this attribute temporarily during the assignment of an unsaved value and then immediately changing it back. This can be accomplished by writing a context manager to change the attribute, for example:

import contextlib

@contextlib.contextmanager
def allow_unsaved(model, field):
    model_field = model._meta.get_field(field)
    saved = model_field.allow_unsaved_instance_assignment
    model_field.allow_unsaved_instance_assignment = True
    try:
        yield
    finally:
        # Restore the original setting once the block exits.
        model_field.allow_unsaved_instance_assignment = saved

To use this context manager, surround any assignment of an unsaved foreign key value with it as follows:

with allow_unsaved(MyModel, 'my_fk_field'):
    my_obj.my_fk_field = unsaved_instance

The specifics of how you access the field to pass into the context manager are important; any other way will likely generate the following error:

RelatedObjectDoesNotExist: MyModel has no instance.
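
For example, here is a hypothetical illustration reusing the Book and Author models from above (purely illustrative, not from the Django docs): pass the model class plus the field name as a string, and let the context manager do the _meta lookup.

book = Book()
author = Author()  # deliberately not saved to the database

with allow_unsaved(Book, 'author'):
    book.author = author  # no "isn't saved in the database" ValueError

# By contrast, reading book.author on an instance with nothing assigned goes
# through the related-object descriptor rather than the field, which is what
# raises the RelatedObjectDoesNotExist error shown above.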

While strictly speaking this approach is not thread safe, it should work for any process-based worker model (such as the default "sync" worker in Gunicorn).

This took a few iterations to figure out, so hopefully it will (still) prove useful to someone else!

Tim Hopper10x Engineering

Tim HopperNotes on the Dirichlet Distribution and Dirichlet Process

In [3]:
%matplotlib inline

Note: I wrote this post in an IPython notebook. It might be rendered better on NBViewer.

Dirichlet Distribution

The symmetric Dirichlet distribution (DD) can be considered a distribution of distributions. Each sample from the DD is a categorical distribution over $K$ categories. It is parameterized by $G_0$, a distribution over $K$ categories, and $\alpha$, a scale factor.

The expected value of the DD is $G_0$. The variance of the DD is a function of the scale factor. When $\alpha$ is large, samples from $DD(\alpha\cdot G_0)$ will be very close to $G_0$. When $\alpha$ is small, samples will vary more widely.

We demonstrate below by setting $G_0=[.2, .2, .6]$ and varying $\alpha$ from 0.1 to 1000. In each case, the mean of the samples is roughly $G_0$, but the standard deviation decreases as $\alpha$ increases.

In [10]:
import numpy as np
from scipy.stats import dirichlet

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    samples = dirichlet(alpha = scale_factor * np.array(G0)).rvs(N)
    print "                          alpha:", scale_factor
    print "              element-wise mean:", samples.mean(axis=0)
    print "element-wise standard deviation:", samples.std(axis=0)
for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
    print
                          alpha: 0.1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.38  0.38  0.47]

                          alpha: 1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.28  0.28  0.35]

                          alpha: 10
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.12  0.12  0.15]

                          alpha: 100
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.04  0.04  0.05]

                          alpha: 1000
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.01  0.01  0.02]

Dirichlet Process

The Dirichlet Process can be considered a way to generalize the Dirichlet distribution. While the Dirichlet distribution is parameterized by a discrete distribution $G_0$ and generates samples that are similar discrete distributions, the Dirichlet process is parameterized by a generic distribution $H_0$ and generates samples which are distributions similar to $H_0$. The Dirichlet process also has a parameter $\alpha$ that determines how widely samples vary from $H_0$.

We can construct a sample $H$ (recall that $H$ is a probability distribution) from a Dirichlet process $\text{DP}(\alpha H_0)$ by drawing a countably infinite number of samples $\theta_k$ from $H_0$ and setting:

$$H=\sum_{k=1}^\infty \pi_k \cdot\delta(x-\theta_k)$$

where the $\pi_k$ are carefully chosen weights (more later) that sum to 1. ($\delta$ is the Dirac delta function.)

$H$, a sample from $DP(\alpha H_0)$, is a probability distribution that looks similar to $H_0$ (also a distribution). In particular, $H$ is a discrete distribution that takes the value $\theta_k$ with probability $\pi_k$. This sampled distribution $H$ is a discrete distribution even if $H_0$ has continuous support; the support of $H$ is a countably infinite subset of the support of $H_0$.

The weights (the $\pi_k$ values) of a Dirichlet process sample relate the Dirichlet process back to the Dirichlet distribution.

Gregor Heinrich writes:

The defining property of the DP is that its samples have weights $\pi_k$ and locations $\theta_k$ distributed in such a way that when partitioning $S(H)$ into finitely many arbitrary disjoint subsets $S_1, \ldots, S_J$, $J<\infty$, the sums of the weights $\pi_k$ in each of these $J$ subsets are distributed according to a Dirichlet distribution that is parameterized by $\alpha$ and a discrete base distribution (like $G_0$) whose weights are equal to the integrals of the base distribution $H_0$ over the subsets $S_n$.

As an example, Heinrich imagines a DP with a standard normal base measure $H_0\sim \mathcal{N}(0,1)$. Let $H$ be a sample from $DP(H_0)$ and partition the real line (the support of a normal distribution) as $S_1=(-\infty, -1]$, $S_2=(-1, 1]$, and $S_3=(1, \infty)$; then

$$H(S_1),H(S_2), H(S_3) \sim \text{Dir}\left(\alpha\,\text{erf}(-1), \alpha\,(\text{erf}(1) - \text{erf}(-1)), \alpha\,(1-\text{erf}(1))\right)$$

where $H(S_n)$ is the sum of the $\pi_k$ values whose $\theta_k$ lie in $S_n$.

These $S_n$ subsets are chosen for convenience; however, similar results would hold for any choice of $S_n$. For any sample from a Dirichlet process, we can construct a sample from a Dirichlet distribution by partitioning the support of the sample into a finite number of bins.

There are several equivalent ways to choose the $\pi_k$ so that this property is satisfied: the Chinese restaurant process, the stick-breaking process, and the Pólya urn scheme.

To generate $\left\{\pi_k\right\}$ according to a stick-breaking process we define $\beta_k$ to be a sample from $\text{Beta}(1,\alpha)$. $\pi_1$ is equal to $\beta_1$. Successive values are defined recursively as

$$\pi_k=\beta_k \prod_{j=1}^{k-1}(1-\beta_j).$$

Thus, if we want to draw a sample from a Dirichlet process, we could, in theory, sample an infinite number of $\theta_k$ values from the base distribution $H_0$ and an infinite number of $\beta_k$ values from the Beta distribution. Of course, sampling an infinite number of values is easier in theory than in practice.

However, by noting that the $\pi_k$ values are positive values summing to 1, we note that, in expectation, they must get increasingly small as $k\rightarrow\infty$. Thus, we can reasonably approximate a sample $H\sim DP(\alpha H_0)$ by drawing enough samples such that $\sum_{k=1}^K \pi_k\approx 1$.
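
As a quick empirical check of the partition property above, here is a small sketch (mine, not from the original notebook) that builds one truncated stick-breaking sample with a standard normal base measure and sums its weights over the three subsets from Heinrich's example. Across many such samples, those triples should be distributed approximately as a Dirichlet distribution whose parameters are $\alpha$ times the $\mathcal{N}(0,1)$ probability of each subset.

import numpy as np
from scipy.stats import beta, norm

def stick_breaking(alpha, tol=1e-4):
    # Draw beta_k ~ Beta(1, alpha) and break off pieces of the unit stick
    # until the leftover mass falls below tol.
    pis, remaining = [], 1.0
    while remaining > tol:
        b = beta(1, alpha).rvs()
        pis.append(b * remaining)
        remaining *= (1 - b)
    return np.array(pis)

alpha = 10.0
pis = stick_breaking(alpha)
thetas = norm().rvs(len(pis))  # locations theta_k drawn from H_0 = N(0, 1)

# Mass this single DP sample assigns to S_1, S_2, and S_3
masses = [pis[thetas <= -1].sum(),
          pis[(thetas > -1) & (thetas <= 1)].sum(),
          pis[thetas > 1].sum()]
print masses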

We use this method below to draw approximate samples from several Dirichlet processes with a standard normal ($\mathcal{N}(0,1)$) base distribution but varying $\alpha$ values.

Recall that a single sample from a Dirichlet process is a probability distribution over a countably infinite subset of the support of the base measure.

The blue line is the PDF for a standard normal. The black lines represent the $\theta_k$ and $\pi_k$ values; $\theta_k$ is indicated by the position of the black line on the $x$-axis; $\pi_k$ is proportional to the height of each line.

We generate enough $\pi_k$ values so that their sum is greater than 0.99. When $\alpha$ is small, very few $\theta_k$'s will have corresponding $\pi_k$ values larger than $0.01$. However, as $\alpha$ grows large, the sample becomes a more accurate (though still discrete) approximation of $\mathcal{N}(0,1)$.

In [13]:
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    # pi_k = beta_k * prod_{j<k} (1 - beta_j); keep drawing until the
    # weights nearly sum to 1.
    while sum(pis) < (1.-tol):
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    pis = pis * (norm.pdf(0) / pis.max())
    plt.vlines(thetas, 0, pis, )
    X = np.linspace(-4,4,100)
    plt.plot(X, norm.pdf(X))
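
The original notebook renders one such plot per $\alpha$ value; the figures are not reproduced here, but the calls would look something like the following (the particular $\alpha$ values are illustrative):

for a in [1, 10, 100, 1000]:
    plt.figure()
    plot_normal_dp_approximation(a)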


Often we want to draw samples from a distribution sampled from a Dirichlet process instead of from the Dirichlet process itself. Much of the literature on the topic unhelpfully refers to this as sampling from a Dirichlet process.

Fortunately, we don't have to draw an infinite number of samples from the base distribution and stick breaking process to do this. Instead, we can draw these samples as they are needed.

Suppose, for example, we know a finite number of the $\theta_k$ and $\pi_k$ values for a sample $H\sim \text{DP}(\alpha H_0)$. For example, we know

$$\pi_1=0.5,\; \pi_2=0.3,\; \theta_1=0.1,\; \theta_2=-0.5.$$

To sample from $H$, we can generate a uniform random number $u$ between 0 and 1. If $u$ is less than 0.5, our sample is $0.1$. If $0.5<=u<0.8$, our sample is $-0.5$. If $u>=0.8$, our sample (from $H$) will be a new sample $\theta_3$ from $H_0$. At the same time, we should also sample and store $\pi_3$. When we draw our next sample, we will again draw $u\sim\text{Uniform}(0,1)$ but will compare against $\pi_1, \pi_2$, AND $\pi_3$.

The class below will take a base distribution $H_0$ and $\alpha$ as arguments to its constructor. The class instance can then be called to generate samples from $H\sim \text{DP}(\alpha H_0)$.

In [20]:
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse a previously drawn value, chosen in proportion to its weight.
            return self.cache[i]
        else:
            # Break off a new piece of the stick, draw a new value from the
            # base measure, and remember both for future calls.
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

This Dirichlet process class implements what could be called stochastic memoization of the base measure. This idea was first articulated in somewhat abstruse terms by Daniel Roy, et al.

Below are histograms of 10000 samples drawn from distributions sampled from Dirichlet processes with a standard normal base distribution and varying $\alpha$ values.

In [22]:
import pandas as pd

base_measure = lambda: norm().rvs()
n_samples = 10000
samples = {}
for alpha in [1, 10, 100, 1000]:
    dirichlet_norm = DirichletProcessSample(base_measure=base_measure, alpha=alpha)
    samples["Alpha: %s" % alpha] = [dirichlet_norm() for _ in range(n_samples)]

_ = pd.DataFrame(samples).hist()

Note that these histograms look very similar to the corresponding plots of sampled distributions above. However, these histograms are plotting points sampled from a distribution sampled from a Dirichlet process, while the plots above were showing approximate distributions sampled from the Dirichlet process. Of course, as the number of samples from each $H$ grows large, we would expect the histogram to be a very good empirical approximation of $H$.

In a future post, I will look at how this DirichletProcessSample class can be used to draw samples from a hierarchical Dirichlet process.

Tim HopperHandy One-off Webpages

I'm starting to love single-page informational websites. For example:

My website Should I Get a Phd? is in this same vein.

Publishing a site like this is very cheap with static hosting on AWS. I would love to see more of them created!

Caktus GroupPyCon 2015 Workshop Video: Building SMS Applications with Django

As proud sponsors of PyCon, we hosted a free, one-and-a-half-hour workshop. We see the workshops as a wonderful opportunity to share some practical, hands-on experience in our area of expertise: building applications in Django. In addition, it’s a way to give back to the open source community.

This year, Technical Director Mark Lavin and Developers Caleb Smith and David Ray presented “Building SMS Applications with Django.” In the workshop, they taught the basics of SMS application development using Django and Django-based RapidSMS. Aside from covering the basic anatomy of an SMS-based application, as well as building SMS workflows and testing SMS applications, Mark, David, and Caleb were able to bring their practical experience with Caktus client projects to the table.

We’ve used SMS on behalf of international aid organizations and agencies like UNICEF as a cost-effective and pervasive method for conveying urgent information. We’ve built tools to help Libyans register to vote via SMS, deliver critical infant HIV/AIDS results in Zambia and Malawi, and alert humanitarian workers of danger in and around Syria.

Interested in SMS applications and Django? Don’t worry. If you missed the original workshop, we have good news: we recorded it. You can participate by watching the video above!

Caktus GroupReviews of two recent Django Books


When I started building sites in Django, I learned the basics from the excellent Django tutorial. But I had to learn by trial and error which approaches to using Django's building blocks worked well and which approaches tended to cause problems later on. I looked for more intermediate-level documentation, but beyond James Bennett's Practical Django Projects and our Karen Tracey's Django 1.1 Testing and Debugging, there wasn't much to be found.

Over the years, ever more interesting introductory material has been showing up, including recently our lead technical manager Mark Lavin's Lightweight Django.

But more experienced developers are also in luck. Some of the community's most experienced Django developers have recently taken the time to put down their experience in books so that we can all benefit. I've been reading two of those books and I can highly recommend both.

Two Scoops of Django

I guess Two Scoops of Django, by Daniel Roy Greenfeld and Audrey Roy Greenfeld, isn't that recent; its third edition just came out. It's been updated twice to keep up with changes in Django (the latest edition covers Django 1.8), improve the existing material, and add more information.

The subtitle of the most recent edition is Best Practices for Django 1.8, and that's what the book is. The authors go through most facets of Django development and share what has worked best for them and what to watch out for, so you don't have to learn it all the hard way.

For example, reading the Django documentation, you can learn what each setting does. Then you can read in chapter 5 of Two Scoops a battle-tested scheme for managing different settings and files across multiple environments, from local development to testing servers and production, while protecting your secrets (passwords, keys, etc).

Similarly, chapter 19 of Two Scoops covers what cases you should and shouldn't use the Django admin for, warns about using list_editable in multi-user environments, and gives some tips for securing the admin and customizing it.

Those are just two examples. The most recent edition of the book has 35 chapters, each covering a useful topic. It's over 500 pages.

Another great thing about the book is that the chapters stand alone - you can pick it up and read whatever chapter you need right now.

I'll be keeping this book handy when I'm working on Django projects.

High Performance Django

High Performance Django by Peter Baumgartner and Yann Malet is aimed at the same audience as Two Scoops, but is tightly focused on performance. It moves on from building a robust Django app to how to deploy and scale it. It covers load balancers, proxies, caching, and monitoring.

One of its best features is that it gives war stories of deploys gone wrong and how the problems were attacked and solved.

Like Two Scoops, this book talks about general principles, along with specific approaches and tools that the authors are familiar with and have had success with. It does a good job of showing the overall architecture that most high-performance sites use, from the load balancer out front to the database at the back, and listing some popular choices of tools at each tier. Then they go into more detail about the specific tools they favor.

It also delves into how to spot performance bottlenecks in your Django site’s code, where they’re most likely to be, and good ways to deal with them.

This is the book I'll be coming back to when I have a question about performance.


There are two things I want to say about books like these.

First, they are immensely useful to those of us who work with Django every day. There's a huge amount of experience captured here for our benefit.

Second, I cannot imagine the amount of time and work it takes to create books like these. When I flip through Two Scoops, not only is it full of useful information, almost every page has examples or diagrams that had to be prepared too.

Caktus has added both these books to the office library, and I've bought personal copies too (including all three editions of Two Scoops). I hope you'll try either or both, and if you find them useful, spread the word.

Tim HopperThinking at Work

Having worked from home for the last few years, I have a hard time understanding how people get anything done in open-floor plan offices. I would be overwhelmed and frustrated by the noise and commotion.

I assumed open-floor plans for software shops were a relatively new invention. However, I just started reading Peopleware: Productive Projects and Teams, first published in 1987, and discovered that the first third of the book rails against open floor plan offices. I particularly enjoyed this quote:

In my years at Bell Labs, we worked in two-person offices. They were spacious, quiet, and the phones could be diverted. I shared my office with Wendl Thomis, who went on to build a small empire as an electric toy maker. In those days, he was working on the Electronic Switching System fault dictionary. The dictionary scheme relied upon the notion of n-space proximity, a concept that was hairy enough to challenge even Wendl's powers of concentration. One afternoon, I was bent over a program listing while Wendl was staring into space, his feet propped up on his desk. Our boss came in and asked, "Wendl! What are you doing?" Wendl said, "I'm thinking." And the boss said, "Can't you do that at home?"

The difference between that Bell Labs environment and a typical modern-day office plan is that in those quiet offices, one at least had the option of thinking on the job. In most of the office space we encounter today, there is enough noise and interruption to make any serious thinking virtually impossible. More is the shame: Your people bring their brains with them every morning. They could put them to work for you at no additional cost if only there were a small measure of peace and quiet in the workplace.

Tim HopperTweets I'm Proud Of

Tim HopperNew Post Function for Bash

One of the things I don't like about using a static site generator is the friction required for creating a new post. I've often ended up posting things to Twitter that I would prefer to be more permanent simply because of the ease of tweeting.

To that end, I created a quick Bash function to create a new post for me. Creating this post in my Pelican directory only requires typing

$ new-post "New Post Function for Bash"

Combined with Greg Reda's Travis CI trick, the friction in getting a new post online is greatly reduced.

Caktus GroupQ3 2015 ShipIt Day ReCap

Last Friday marked another ShipIt Day at Caktus, a chance for our employees to set aside client work for experimentation and personal development. It’s always a wonderful chance for our developers to test new boundaries, learn new skills and sometimes even build something entirely new in a single day.

NC Nwoko and Mark Lavin teamed up to develop a pizza calculator app. The app simply and efficiently calculates how much pizza any host or catering planner needs to order to feed a large group of people. We eat a lot of pizza at Caktus. Noticing deficiencies in other calculators on the internet, NC and Mark built something simple, clean, and (above all) well researched. In the future, they hope to add size mixing capabilities as well as a function for calculating the necessary ratios to provide for certain dietary restrictions, such as vegan, vegetarian, or gluten-free eaters.

Jeff Bradberry and Scott Morningstar worked on getting Ansible functioning to replace SALT states in the Django project template and made a lot of progress. And Karen Tracey approached some recent test failures, importing solutions from the database, while Rebecca Muraya began the massive task of updating some of our client based projects to Python 3.

Hunter MacDermut continued building the game he started last ShipIt Day, an HTML5 game using the Phaser framework. He added logic and other game-like elements to make a travelable board with the goal of destroying opponents. He also added animated sprites, including animations for an attack, giving each character their own unique moves. The result was a lot of fun to watch!

Dmitriy Chukhin and Caleb Smith developed a YouTube listening queue using ReactJS, with jQuery for the data layer. They loved the tag functions inherent in ReactJS as well as the speed.

Victor Rocha wrote a new admin action that enables a user to export models as a CSV file. He even found time to open source his work.

Vinod Kurup spent his day fixing RapidSMS bugs, creating two new pull requests. You can find them here and here. Once reviewed, they will be incorporated in the next RapidSMS release.

Neil Ashton worked through three chapters of experiments from The Foundations of Statistics: a Simulation-based Approach using IPython Notebook. He subsequently fell in love with IPython Notebook. An interactive computational environment, the IPython notebook seems the perfect platform for Neil’s love of data visualization and interactive experimentation. The IPython notebook ultimately allows the user to combine code execution, rich text, mathematics, plots, and other rich media.

Ross Pike spent the day exploring Font Awesome, the open-source library for scalable vector icons. He also took several tutorials in Sketch, an application for designing websites, interfaces, icons, and pretty much anything else.

Tobias McNulty spent some time working on the next release of django-cache-machine, a third-party Django app that adds caching and automatic invalidation to your Django models on a per-model basis. This ShipIt Day he worked on adding Python 3 support (with help from Vinod) and added a feature to support invalidation of queries when new model instances are created.

Finally, inspired by the open data apps built by Code for Durham, Rebecca Conley used D3 to build visualizations of data on North Carolina’s public schools. Eventually, she wants to test more complex bar graph visualizations as well as learn data visualization in D3 beyond the bar graph.

We had a number of people on vacation this ShipIt Day and several administrators and team members who couldn’t put away their typical workload this time around. But no matter; there is always the next ShipIt Day!

Astro Code SchoolVideo - Why go to code school?

In this video Astro Lead Instructor Caleb Smith answers the question, "Why go to code school?". A major point is the laser focus. Code schools allow you to learn precisely the skills needed to perform a highly technical job. Watch this video for other parts of his answer. This is the first in a series of question and answer videos. More answers to real questions you may have about going to code school are on the way.

Don't forget to subscribe to the Astro Code School YouTube channel.

Astro Code SchoolDjango Bootcamp in Baltimore at Betamore

On August 7, 2015 Caleb Smith will be teaching a Django Bootcamp in Baltimore, Maryland! This short class is from 9am to 3pm and will be held at Betamore, a coworking space, incubator and campus for technology and entrepreneurship. At the class students will learn how to build a simple Django app. This bootcamp is targeted at beginners but everyone is welcome. The prerequisites are a laptop with Python and pip installed. Sign up now and reserve your spot.

Big thanks to the super kind folks at Betamore for hosting this and working with us to get this class going.

Caktus GroupLAN Party at Caktus

This past weekend, our wonderful Technology Support Specialist Scott Morningstar hosted a Local Area Network (LAN) party at Caktus HQ. Held twice a year since 2008, the event allows geeks, gamers, and retro technology lovers to relive the nostalgia of multiplayer gaming in the early days of dial-up internet. In other words, everyone brings their own computer, and uses the LAN to play online games in the company of others. These parties are a lot of fun and add a more personal social element to the online gaming community.

This year, participants played Terraria, an action game with creative world-building elements, Artemis, a spaceship bridge simulator game, and Counter Strike: Global Offensive, a team-based modern warfare game. Not only was it wonderful to see our space filled with enthusiastic gamers, but it was doubly exciting that participants joined gameplay remotely from LAN party events in Boston, Massachusetts, Farmville, Virginia, and Minneapolis, Minnesota.

We love being able to support and host such a wide variety of technology-related events in our community meeting space! For information on other functions held in our downtown Durham headquarters, or in our Astro Code School space, be sure to check out the Events page on our website.

Og MacielBooks - June 2015

Books - June 2015

Those of you who know me know that I am a huge book reader and spend most of my free time reading several books at the same time. One could say that reading is one of my passions, and having wasted so many years after high school completely ignoring this passion (in exchange for spending most of my time trying to learn about Linux, get an education, a job and, let's be frank, chasing after girls), I decided that something had to be done about it, and starting around 2008 I 'forced' myself to dedicate at least one solid hour of reading for fun every day.

I find it funny to say that I had to force myself, but this statement is very much true. Being so used to spending all of my time sitting in front of a computer and getting flooded with information every single minute of the day (IRC, Twitter, Facebook, commit emails, RSS feeds, etc), I found it difficult to 'unplug' and spend time doing nothing but focusing on only one thing. I was so used to multitasking and being constantly bombarded with lots of information that sitting quietly and reading didn't feel very productive to me... sad but true.

Anyhow, after several 'agonizing' months of getting up from my desk and making a point of turning off my cell phone and finding a quiet place somewhere in the building (or at home during the weekends), I finally got into the habit of reading for pleasure. I actually looked forward to these reading periods (imagine that, huh?) and eventually I realized that if I skipped this 'ritual' even one day, my days felt like they got longer and I felt stressed out and irritable for the remainder of the day. Reading became not only a good habit but my mechanism for relaxing and recharging my energies during the day!

Well, this passion and appetite for reading has only gotten bigger, and with time I have to say that it has become a pretty big part of who I am today! In a way I am happy that it took me this long to get back into the habit of reading... I mean, I feel that getting older was an important part of preparing myself so that I could really appreciate John Steinbeck, Ray Bradbury and the likes of them! Would I have truly appreciated The Grapes of Wrath when I was younger? Perhaps... but it took me around 40 years to get to it and I'm happy that when it did I was able to appreciate this amazing piece of art!

These last few months I decided that I wanted to start tracking all the books that I read, buy or receive as a gift every month (see my reading progress on GoodReads and add me as a friend), and jot down some of my impressions and motives for reading or buying them. Those familiar with Nick Hornby will probably associate this post (and hopefully others that will surely come) with the work he has done writing for the Believer Magazine ... and this would be correct. My intention is not to copy his style or anything like that, but I thought that the format he chose to report on his own reading 'adventures' would fit in quite nicely with what I wanted to get across to my readers... and I'm sticking with the format as long as it works for me :)

Astro Code SchoolVideo - Interview with Lead Instructor Caleb Smith

In this Astro interview video I talk with our Lead Instructor Caleb Smith. We learn about Caleb's formal education, a connection between music and computer programming, and why teaching excites him. Caleb wrote the curriculum and teaches the Astro Code School Python & Django Web Development class.

Don't forget to subscribe to the Astro Code School YouTube channel. We have a lot more videos in the works.

Caktus GroupLightning Talk Lunch: Two Useful Organizational Tools

Monthly, we organize short Lightning Talks that take place during the lunch hour here at Caktus. Not only does this allow us a wonderful excuse to have lunch delivered from one of our many local foodie options, but it’s an excellent chance to expand our knowledge on a variety of topics. Past talks have included everything from an introduction to synthesizers and other forms of electronic music, to bug fixing, to the design inspiration behind our PyCon 2015 site.

This month, we had two talks on organizational tools for project management and resource sorting. Developer Dan Poirier gave a brief talk on Pinboard, or, as he fondly refers to it, “social bookmarking for introverts.” Essentially, Pinboard is a database for storing, organizing, and sharing links and bookmarks to articles and pages on the web. Though lacking in sharp design or beautiful layout, Pinboard is useful, highly functional, and extremely intuitive. Dan was a wonderful guide in walking us through how he uses Pinboard to store development tips and articles, as well as information related to his various projects for Caktus. He even built his own front-end for the site to help organize his finds for daily use and to share with other Caktus developers.

Our second talk came from Game Designers Edward and Lucas Rowe, who are currently finishing up the work on our Epic Allies app. Before this project, Caktus wanted to try out a new management tool for development; Epic Allies turned out to be a good fit for testing JIRA, the issue and project tracking software from Atlassian. In their talk, Lucas and Edward took us on a tour of JIRA, discussed its functionality for development projects, and showed us how Epic Allies specifically used this highly customizable platform.

All in all it was an informative day, and Dan, Edward, and Lucas may have all won a few converts to their favorite organizational tools. Now we can’t wait to see what’s in the pipeline for our next set of Lightning Talks!

Astro Code SchoolAnnouncing Caktus Scholarships for Astro Code School

We’re very pleased to announce that Caktus Group will be sponsoring up to $20,000 worth of scholarships annually for Astro Code School students. There will be twenty $1,000 scholarships. We hope that these scholarships help increase access to code schools and the wider tech industry:

Caktus Group Diversity & Veterans Scholarship

This scholarship aims to support the careers of underrepresented groups in technology, specifically women, people of color, military veterans, and people with disabilities. For classrooms and for the tech industry to be the best it can be, it requires ideas from diverse groups of people.

Caktus Group North Carolinians Scholarship

Anyone who lives in North Carolina is eligible to receive this scholarship. Caktus was founded in North Carolina and we’ve benefited from the great talent here. We want tech growth in our area to include those that live here.

You can find more information about our scholarships on the financial aid page.

Astro Code SchoolWhat I Learned Teaching at UNC

This spring semester, I had the honor of teaching JOMC-583 "Multimedia Programming and Production" for the University of North Carolina at Chapel Hill School of Journalism and Mass Communication. The course requires university permission and two prior multimedia programming courses that focus on frontend web development. It was a wonderful opportunity to partner with the university, especially with a department that has shown leadership in recent years with adopting innovative programs and coursework for students interested in the data-driven area of journalism.

The subject matter of the course centered around backend web development with Python and Django and also included other technologies such as git, SQL, and the Unix command line. As a rough outline, the lecture topics were:

  1. Unix command line

  2. Git and Github

  3. Python

  4. Introductory Django

  5. Django views and templates

  6. Django models and data modeling

  7. Frontend development inside a Django project

  8. Miscellaneous topics

  9. Group project time

The course materials were based on Steven King's curriculum for the course from the year prior and are available at https://github.com/calebsmith/j583

At a high-level, the first half of the course was a mixture of lecture and individual assignments while the second half of the course was spent on two projects. The first development project was completed individually and was small in scale. The second and final project was more ambitious and required collaboration using Github. This served as a nice progression from focusing on concrete skills in isolation to applying those skills and developing further experientially.

One of the group projects was deployed successfully to Heroku and is visible here: http://rackfind.herokuapp.com/

While I think the course was a major learning experience for the students, it certainly was for me as well. It was particularly interesting to see the subject areas that students picked up easily or struggled with and how this often differed with my expectations. In particular, some areas that students picked up quickly were:

  1. The essential Unix command line tools such as: pwd, ls, cd, and so on

  2. Python basics

  3. Python packaging and setup, especially pip and virtualenv

  4. Using Git as a sole contributor

  5. Creating a data model

The students were much quicker to learn these concepts than I anticipated. For instance, we spent two lecture periods focusing on developing skills for the command line, but the first class was enough for most tasks. In the future, I would likely plan on needing only one lecture for that topic.

Some topics that required more reinforcement than anticipated were:

  1. Why writing a custom backend is desirable as opposed to a static HTML site

  2. The semantics of Django URL routing.

  3. How to glue JavaScript code into Django templates

I think the fundamental reason that students struggled with these topics more than anticipated is that they came to backend programming from a frontend web development background.

This was a great experience for me and it was rewarding to see my students succeed in programming with Python and Django. I'm very much looking forward to more opportunities to teach web development in the future.

Astro Code SchoolPython Beginner’s Night at Astro

Last night we held the first TriPython Python Beginner’s Night. About twenty-three people interested in Python attended. Many of them were very experienced developers who answered all kinds of questions, from the very basic to the advanced.

A big thanks to all the Caktus Group folks who attended. You helped a lot of people! Thanks also to the other volunteers who attended. It's really cool to live in a city with so many people who enjoy helping others.

The next free Python Beginner's Night is Monday July 6, 2015 from 6pm to 8pm here at Astro Code School (map). We'll be here on the first Monday of each month with free pizza and Python experts. If you can join us please RSVP on the Meetup page. See you soon!