A planet of blogs from our members...

Josh Johnson: Centralized Ansible Management With Knockd + Auto-provisioning with AWS

Ansible is a great tool. We’ve been using it at my job with a fair amount of success. When it was chosen, we didn’t have a requirement for supporting Auto Scaling groups in AWS. This poses a unique problem – we need machines to be able to essentially provision themselves when AWS brings them up. This has interesting implications outside of AWS as well. This article covers using the Ansible API to build just enough of a custom playbook runner to target a single machine at a time, discusses how to wire it up to knockd, a “port knocking” server and client, and finally shows how to use user data in AWS to execute this at boot – or any reboot.

Ansible – A “Push” Model

Ansible is a configuration management tool used in orchestration of large pieces of infrastructure. It’s structured as a simple layer above SSH – but it’s a very sophisticated piece of software. Bottom line, it uses SSH to “push” configuration out to remote servers – this differs from some other popular approaches (like Chef, Puppet and CFEngine) where an agent is run on each machine, and a centralized server manages communication with the agents. Check out How Ansible Works for a bit more detail.

Every approach has its advantages and disadvantages – discussing the nuances is beyond the scope of this article, but Ansible’s primary disadvantage is also one of its strongest advantages: it’s decentralized and doesn’t require agent installation. The problem arises when you don’t know your inventory (Ansible-speak for “list of all your machines”) beforehand. This can be mitigated with inventory plugins. However, when you have to configure machines that are being spun up dynamically, and that need to be configured quickly, the push model starts to break down.

Luckily, Ansible is highly compatible with automation, and provides a very useful python API for specialized cases.

Port Knocking For Fun And Profit

Port knocking is a novel way of invoking code. It involves watching the network at a very low level for attempted connections to a specific sequence of ports. No ports are opened. It has its roots in network security, where it’s used to temporarily open up firewalls. You knock, then you connect, then you knock again to close the door behind you. It’s very cool tech.
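The client side of the idea is tiny – a knock is just a sequence of failed connection attempts. A minimal sketch in Python (the host, ports, and timeout here are arbitrary examples):

```python
import socket

def knock(host, ports, timeout=0.5):
    """Attempt a TCP connection to each port in sequence.

    None of the connections are expected to succeed - no ports are
    open. A port knocking server only watches for the attempts.
    """
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        s.connect_ex((host, port))  # ignore the result; refusal is fine
        s.close()

knock("127.0.0.1", [9000, 9999])
```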

The standard implementation of port knocking is knockd, included with most major Linux distributions. It’s extremely lightweight, and uses a simple configuration file. It supports some interesting features, such as limiting the number of times a client can invoke the knock sequence (used sequences are commented out in a flat file).
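That limiting feature is knockd’s one_time_sequences option. A sketch of how it could look in knockd.conf (the section name and file path are examples; the file holds one knock sequence per line, and knockd comments a line out after it’s used):

```
[ansible]
    one_time_sequences = /etc/knockd/ansible_sequences
    seq_timeout = 5
    command     = /home/ubuntu/run.sh %IP%
```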

User Data In EC2

EC2 has a really cool feature called user data, that allows you to add some information to an instance upon boot. It works with cloud-init (installed on most AMIs) to perform tasks and run scripts when the machine is first booted, or rebooted.

Auto Scaling

EC2 provides a mechanism for spinning up instances based on need (or really any arbitrary event). The AWS documentation gives a detailed overview of how it works. It’s useful for responding to sudden spikes in demand, or contracting your running instances during low-demand periods.

Ansible + Knockd = Centralized, On-Demand Configuration

As mentioned earlier, Ansible provides a fairly robust API for use in your own scripts. Knockd can be used to invoke any shell command. Here’s how I tied the two together.


All of my experimentation was done in EC2, using the Ubuntu 12.04 LTS AMI.

To get the machine running ansible configured, I ran the following commands:

$ sudo apt-get update
$ sudo apt-get install python-dev python-pip knockd
$ sudo pip install ansible

Note: it’s important that you install the python-dev package before you install ansible. This provides the proper headers so that the C-based SSH library can be compiled, which is faster than the pure-Python version installed when the headers are not available.

You’ll notice some information from the knockd package regarding how to enable it. Take note of this for final deployment, but we’ll be running knockd manually during this proof-of-concept exercise.

On the “client” machine – the one asking to be configured – you need only install knockd. Again, the service isn’t enabled by default, but the package provides the knock command.

EC2 Setup

We require a few things to be done in the EC2 console for this all to work.

First, I created a keypair for use by the tool, which I called “bootstrap”. I downloaded it onto a freshly set up instance I designated for this purpose.

NOTE: It’s important to set the permissions of the private key correctly. They must be set to 0600.

I then needed to create a special security group. The point of the group is to allow all ports from within the current subnet. This gives us maximum flexibility when assigning port knock sequences.

Here’s what it looks like:

Depending on our circumstances, we might need to open up UDP traffic as well (port knocks can be TCP or UDP based, or a combination within a sequence).

For the sake of security, a limited range of a specific type of connection is advised, but since we’re only communicating over our internal subnet, the risk here is minimal.

Note that I’ve also opened SSH traffic to the world. This is not advisable as standard practice, but it’s necessary for me since I do not have a fixed IP address on my connection.

Making It Work

I wrote a simple python script that runs a given playbook against a given IP address:

Script to run a given playbook against a specific host

import ansible.playbook
from ansible import callbacks
from ansible import utils

import argparse
import os, sys

parser = argparse.ArgumentParser(
    description="Run an ansible playbook against a specific host.")

parser.add_argument(
    "host",
    help="The IP address or hostname of the machine to run the playbook against.")

# Flag names and defaults below are illustrative.
parser.add_argument(
    "-p", "--playbook", default="site.yml",
    help="Specify path to a specific playbook to run.")

parser.add_argument(
    "-c", "--config", default="./config.ini",
    help="Specify path to a config file. Defaults to %(default)s.")

def run_playbook(host, playbook, user, key_file):
    """
    Run a given playbook against a specific host, with the given username
    and private key file.
    """
    stats = callbacks.AggregateStats()
    playbook_cb = callbacks.PlaybookCallbacks(verbose=utils.VERBOSITY)
    runner_cb = callbacks.PlaybookRunnerCallbacks(stats, verbose=utils.VERBOSITY)

    pb = ansible.playbook.PlayBook(
        playbook=playbook,
        host_list=[host],
        remote_user=user,
        private_key_file=key_file,
        stats=stats,
        callbacks=playbook_cb,
        runner_callbacks=runner_cb)

    pb.run()

options = parser.parse_args()

playbook = os.path.abspath("./playbooks/%s" % options.playbook)

run_playbook(options.host, playbook, 'ubuntu', "./bootstrap.pem")

Most of the script is user-interface code, using argparse to bring in configuration options. One unimplemented feature is using an INI file to specify things like the default playbook, pem key, user, etc. These things are just hard coded in the call to run_playbook for this proof-of-concept implementation.
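As a sketch of how that unimplemented INI support could look (the [defaults] section name and keys here are hypothetical):

```python
import configparser  # spelled "ConfigParser" on Python 2
import io

# Hypothetical layout for the file passed to --config.
SAMPLE = """
[defaults]
playbook = site.yml
user = ubuntu
key_file = ./bootstrap.pem
"""

def load_defaults(fp):
    """Read the default playbook, user, and key file from an INI file."""
    config = configparser.ConfigParser()
    config.read_file(fp)
    return dict(config.items("defaults"))

defaults = load_defaults(io.StringIO(SAMPLE))
print(defaults["playbook"])  # site.yml
```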

The real heart of the script is the run_playbook function. Given a host (IP or hostname), a path to a playbook file (assumed to be relative to a “playbooks” directory), a user and a private key, it uses the Ansible API to run the playbook.

This function represents the bare-minimum code required to apply a playbook to one or more hosts. It’s surprisingly simple – and I’ve only scratched the surface here of what can be done. With custom callbacks, instead of the ones used by the ansible-playbook runner, we can fine tune how we collect information about each run.
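To illustrate the idea, here is a minimal sketch of the callback pattern – collecting per-host results so the script could log or notify after a run. It stands in for subclassing ansible.callbacks.PlaybookRunnerCallbacks (whose hooks include on_ok and on_failed); the class name and notification logic are illustrative:

```python
class NotifyingCallbacks(object):
    """Collects per-host results during a run. In a real script this
    logic would live in a subclass of
    ansible.callbacks.PlaybookRunnerCallbacks."""

    def __init__(self):
        self.ok = []
        self.failed = []

    def on_ok(self, host, result):
        # Called for each task that succeeds on a host.
        self.ok.append(host)

    def on_failed(self, host, result, ignore_errors=False):
        # Called for each task that fails on a host.
        self.failed.append(host)

    def summary(self):
        return "ok=%d failed=%d" % (len(self.ok), len(self.failed))


cb = NotifyingCallbacks()
cb.on_ok("10.0.0.5", {"changed": True})
print(cb.summary())  # ok=1 failed=0
```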

The playbook I used for testing this implementation is very simplistic (see the Ansible playbook documentation for an explanation of the playbook syntax):

- hosts: all
  sudo: yes
  tasks:
  - name: ensure apache is at the latest version
    apt: update_cache=yes pkg=apache2 state=latest
  - name: drop an arbitrary file just so we know something happened
    copy: src=it_ran.txt dest=/tmp/ mode=0777

It just does an apt-get update, installs apache, and drops a file into /tmp to give me a clue that it ran.

Note that the hosts: setting is set to “all” – this means that this playbook will run regardless of the role or class of the machine. This is essential, since, again, the machines are unknown when they invoke this script.

For the sake of simplicity, and to set a necessary environment variable, I wrapped the call to my script in a shell script:

#!/bin/bash

export ANSIBLE_HOST_KEY_CHECKING=False

cd /home/ubuntu
/usr/bin/python /home/ubuntu/run_playbook.py $1 >> $1.log 2>&1

The $ANSIBLE_HOST_KEY_CHECKING environment variable here is necessary, short of futzing with the ssh configuration for the ubuntu user, to tell Ansible to not bother verifying host keys. This is required in this situation because the machines it talks to are unknown to it, since the script will be used to configure newly launched machines. We’re also running the playbook unattended, so there’s no one to say “yes” to accepting a new key.

The script also does some very rudimentary logging of all output from the playbook run – it creates logs for each host that it services, for easy debugging.

Finally, the following configuration in knockd.conf makes it all work:


[options]
        UseSyslog

[ansible]
        sequence    = 9000, 9999
        seq_timeout = 5
        command     = /home/ubuntu/run.sh %IP%

The first configuration section, [options], is special to knockd – it’s used to configure the server itself. Here we’re just asking knockd to log messages to the system log (e.g. /var/log/messages).

The [ansible] section sets up the knock sequence for a machine that wants Ansible to configure it. The sequence set here is 9000, 9999 (it can be anything – any port numbers, and any number of ports >= 2). There’s a 5-second timeout – if the client doing the knocking takes longer than 5 seconds to complete the sequence, nothing happens.

Finally, the command to run is specified. The special %IP% variable is replaced when the command is executed by the IP address of the machine that knocked.

At this point, we can test the setup by running knockd. We can use the -vD options to output lots of useful information.

We just need to then do the knocking from a machine that’s been provisioned with the bootstrap keypair.

Here’s what it looks like (these are all Ubuntu 12.04 LTS instances):

On the “server” machine, the one with the ansible script:

$  sudo knockd -vD
config: new section: 'options'
config: usesyslog
config: new section: 'ansible'
config: ansible: sequence: 9000:tcp,9999:tcp
config: ansible: seq_timeout: 5
config: ansible: start_command: /home/ubuntu/run.sh %IP%
ethernet interface detected
Local IP:
listening on eth0...

On the “client” machine, the one that wants to be provisioned:

$ knock 9000 9999

Back on the server machine, we’ll see some output upon successful knock:

2014-03-23 10:32:02: tcp: -> 74 bytes ansible: Stage 1
2014-03-23 10:32:02: tcp: -> 74 bytes ansible: Stage 2 ansible: OPEN SESAME
ansible: running command: /home/ubuntu/run.sh


Making It Automatic With User Data

Now that we have a way to configure machines on demand – the knock could happen at any time, from a cron job, executed via a distributed SSH client (like fabric), etc – we can use the user data feature of EC2 with cloud-init to do the knock at boot, and every reboot.
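The cron case is simple. A hypothetical crontab entry (note that the knock client takes the address of the machine running knockd – 10.0.0.10 is a placeholder):

```
# Re-apply configuration every 30 minutes
*/30 * * * * /usr/bin/knock 10.0.0.10 9000 9999
```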

Here is the user data that I used, which is technically cloud-config code (more examples here):

#cloud-config
packages:
 - knockd

runcmd:
 - knock 9000 9999

User data can be edited at any time as long as an EC2 instance is in the “stopped” state. When launching a new instance, the field is hidden in Step 3, under “Advanced Details”:

(screenshot: the User Data field)

Once this is established, you can use the “launch more like this” feature of the AWS console to replicate the user data.

This is also a prime use case for writing your own provisioning scripts (using something like boto) or using something a bit higher level, like CloudFormation.
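For instance, a boto-based launcher might bake the knock into the user data. This is a sketch under assumptions – boto 2’s EC2 API, with placeholder AMI id, key name, and server address:

```python
# Sketch: launch an instance whose user data performs the knock at
# boot. The AMI id, key name, and <server-ip> below are placeholders.
USER_DATA = """#cloud-config
packages:
 - knockd

runcmd:
 - knock <server-ip> 9000 9999
"""

def launch(conn, ami="ami-xxxxxxxx"):
    """Launch one instance that will knock at boot (boto 2 call)."""
    return conn.run_instances(
        ami,
        key_name="bootstrap",
        instance_type="t1.micro",
        user_data=USER_DATA,
    )

# Usage would look something like:
# import boto.ec2
# conn = boto.ec2.connect_to_region("us-east-1")
# launch(conn)
```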

Auto Scaling And User Data

Auto Scaling is controlled via “auto scaling groups” and “launch configurations”. If you’re not familiar with them, these can sound like foreign concepts, but they’re quite simple.

Auto Scaling Groups define how many instances will be maintained, and set up the events to scale up or down the number of instances in the group.

Launch Configurations are nearly identical to the basic settings used when launching an EC2 instance, including user data. In fact, user data is entered in on Step 3 of the process, in the “Advanced Details” section, just like when spinning up a new EC2 instance.

In this way, we can automatically configure machines that come up via auto scaling.

Conclusions And Next Steps

This proof of concept presents an exciting opportunity for people who use Ansible and have use cases that benefit from a “pull” model – without really changing anything about their setup.

Here are a few miscellaneous notes, and some things to consider:

  • There are many implementations of port knocking beyond knockd. There is a huge amount of information available to dig into the concept itself, and its various implementations.
  • The way the script is implemented, it’s possible to have different knock sequences execute different playbooks – a “poor man’s” method of differentiating hosts.
  • The Ansible script could be coupled with the AWS API to get more information about the particular host it’s servicing. Imagine using a tag to set the “class” or “role” of the machine. The API could be used to look up that information about the host, and apply playbooks accordingly. This could also be done with variables – the values that are “punched in” when a playbook is run. This means one source of truth for configuration – just add the relevant bits to the right tags, and it just works.
  • I tested this approach with an auto scaling group, but I’ve used a trivial playbook and only launched 10 machines at a time – it would be a good idea to test this approach with hundreds of machines and more complex plays – my “free tier” t1.micro instance handled this “stampeding herd” without a blink, but it’s unclear how this really scales. If anyone gives this a try, please let me know how it went.
  • Custom callbacks could be used to enhance the script to send notifications when machines were launched, as well as more detailed logging.
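The tag-based idea above could start as small as a role-to-playbook map. The role names here, and the boto 2 lookup in the comment, are hypothetical:

```python
# Map an EC2 "role" tag to the playbook that configures that class of
# machine; fall back to the catch-all playbook for unknown roles.
ROLE_PLAYBOOKS = {
    "web": "web.yml",
    "worker": "worker.yml",
}

def playbook_for_role(role, default="site.yml"):
    return ROLE_PLAYBOOKS.get(role, default)

# The role tag could be looked up from the knocking IP with boto 2:
# import boto.ec2
# conn = boto.ec2.connect_to_region("us-east-1")
# res = conn.get_all_instances(filters={"private-ip-address": ip})
# role = res[0].instances[0].tags.get("role")

print(playbook_for_role("web"))       # web.yml
print(playbook_for_role("database"))  # site.yml
```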

Caktus Group: Caktus Attends YTH Live

Last week Tobias and I had a great time attending our first Youth+Tech+Health Live conference. I went to present along with our partners Sara LeGrand and Emily Pike, from Duke and UNC respectively, on our NIH/SBIR-funded game focused on encouraging HIV medication adherence. The panel we spoke on, “Stick to It: Tech for Medical Adherence + Health Interventions,” was rounded out by Dano Beck from the Oregon Health Authority, speaking about how they have used SMS message reminders successfully to increase HIV medication adherence in Oregon.

We had a great response to our talk. It’s not often that you get a chance to talk to other teams around North America focused on creating games to improve health outcomes. We learned about other teams making health-related educational games and lots of programs doing mobile support for youth through SMS messaging help lines. It was clear from the schedule of talks and the conversations that happened around the event space that everyone was learning a lot and getting a unique chance to share their projects.

Caktus Group: Caktus is going to Montréal for PyCon 2014!

Caktus is happy to once again be sponsoring and attending PyCon in Montreal this year. Year after year, we look forward to this conference, and we are always impressed with the quality of the speakers it draws. The team consistently walks away with new ideas from the talks, open spaces, and sprints that they are excited to implement here at Caktus and in their personal projects.

A favorite part of PyCon for us is meeting and catching up with people, so we are excited to premiere Duckling at PyCon, an app that helps people organize outings and connect at the conference. Come by our booth, #700, on Thursday night at the opening reception to say “hi” and chat with our team!

Caktus Group: Caktus has a new website!

It’s been a few years since we last updated our website, and we gave it a whole new look!

With the new site, it’s easy to see just what services we offer, and our processes for bringing our clients’ ideas to life. The new layout allows for more in-depth reviews of our projects, and also highlights our talented and growing team. We also wanted to share more information on our commitment to the open source community and social good. And the updated structure makes finding out about events and reading our ever-popular blog posts simple.

The new design utilizes a responsive grid structure and a refined typographic sensibility.

Wrap this all up with our new branding – adding a bold blue to our green/grey – and you get a polished and informative new site that reflects what Caktus does best.

We hope you find the new site more intuitive, user-friendly and as easy-on-the-eyes as we do!

Caktus Group: New for PyCon: App for Group Outings + Giant Duck

For PyCon 2014, we’ve been working for the past few months on Duckling, an app to make it easier to find and join casual group outings. Our favorite part of PyCon is meeting up with fellow Pythonistas, but without someone rounding everyone up and sorting the logistics, we’ve found it difficult to figure out who’s going where and when. Our answer to this age-old conference conundrum is Duckling.

Duckling, made for conferences, lets you find, join, and create outings to local restaurants, bars, or other entertainment venues. You can see who’s going, when, and where they’re meeting up to leave. It’s hooked up to Yelp, so you can look at reviews before heading out. And of course, it is written in Django too! 

Lots of options to keep up to date on all outings at PyCon: follow @ducklingquacks, find our Duckling sign in the lobby, or come visit our booth, #700, to see a map that shows where everyone is headed. Also, our booth will have a giant duck. Why? Because look how happy it makes people!

Caktus Group: Caktus Implementing New Policy Modeled on Foreign Corrupt Practices Act

Caktus’ successful growth in mobile messaging applications overseas has been a wonderful point of pride for us. The work we’re doing internationally has real impact on burgeoning democracies, HIV/AIDs patients, and others where technology makes all the difference.

As we continue this work—work we truly believe in—we think it’s important to step back and state our values. We are building a new policy modeled on the U.S. Foreign Corrupt Practices Act (FCPA). The FCPA was originally implemented to address bribery and other corrupt behavior. Our new policy will re-state our company philosophy: we will not give or receive bribes in any form, including gifts. We will adhere to all laws and regulations.

Creating a policy like this was obvious in many ways. Though we’d always operated with the assumption that we’d be transparent and above board (because why else be an open source firm if we didn’t believe in that?), we’re glad to be taking this official step.

Caktus Group: Caktus Presenting HIV mHealth Gamification App for Adherence in San Francisco

We’re pleased to announce that Caktus will be presenting Epic Allies, an mHealth Gamification App to improve ART drug adherence for HIV patients, at this year’s YTH Live in San Francisco.

Alex Lemann, Co-founder and Chief Business Development Officer, will be speaking on the panel, “Stick to It: Tech for Medical Adherence + Health Interventions,” along with our research partners from Duke University and UNC-Chapel Hill.

Our application, still in development, has a strong gamification aspect due to focus group feedback. We’ve been diligently working on building a complete new world within Epic Allies for our users – complete with a power-mad artificial intelligence and all sorts of monsters to defeat. It’s exciting to share all the work we’ve done!

For more information about the conference, visit YTH.

Caktus Group: Caktus Just Bought a Building in Downtown Durham!

After sustained growth that has us packed into six suites, we have spent the past year and a half seeking new space. We’ve found it! I’m pleased to announce that Caktus has bought a historic 1910 building with nearly 11,000 square feet of space in Downtown Durham. We’ll be right near Five Points at 108 Morris St. The new building will be completely renovated from top to bottom to create an open workspace that’ll make it even easier for us to collaborate and share ideas.

The renovations include plans to welcome the open source community and encourage continued small business growth in Downtown Durham. The first floor will contain a retail space up for lease plus a community meeting area for local tech events. The meeting area will have a small kitchen because what’s an event without snacks?

We’re sad to leave all the great restaurants and shops within walking distance from our Carrboro offices, but we’ll also have new opportunities to explore the amenities near the new office. There’s bakeries, burger joints, food trucks, and more. Our exercise will be the short walk to and from the local eateries.

We’ll keep everyone posted on construction updates. There is much to be done. The building has been, in its 100+ year history, a nightclub (that our very own project manager, Ben, has played his saxophone at), a funeral home, a bowling alley, and a furniture store. Currently, the construction crew is removing the giant bar and amoeba-like mirrors that line the entire first floor. We’ll let you know if we find anything interesting hidden in the walls.

We look forward to our move to downtown Durham!

Caktus Group: Congrats to PearlHacks Winners (Including Our Intern, Annie)!

Caleb Smith, Caktus developer, awarding the third place prize to TheRightFit creators Bipasa Chattopadhyay, Ping Fu, and Sarah Andrabi.

Many congratulations to the PearlHacks third place winners who won Sphero Balls! The team from UNC’s Computer Science department created TheRightFit, an Android app that helps shoppers know what sizes will fit them and their families among various brands. Their prize of Sphero Balls, programmable balls that can interact and play games via smart phones, was presented by Caktus developer and Pearl Hacks mentor Caleb Smith as part of our sponsorship. PearlHacks, held at UNC-Chapel Hill, is a conference designed to encourage female high school and college programmers from the NC and VA area.

Also, we’re deeply proud of our intern, Annie Daniel (UNC School of Journalism), who was part of the first place team for their application, The Culture of Yes. Excellent job, Annie!

This is what Annie had to say about her team's first place project:

The Culture of Yes was a web app that's meant to broaden the conversation on sexual assault on college campuses. We chose a flagship university from each of the 50 states, created a JSON file that summarized each university (population, male/female ratio, % in greek life, etc.) and wrote scrapers to pull stories on sexual assault from each of the universities' student newspapers. We also allow for user-generated content, where survivors of sexual assault are able to share their stories and experiences anonymously (or not), and present the similarities and differences of assault experiences across the U.S. It's a Django-based web app built on a bootstrap CSS framework.

Basically we wanted to focus more on the story than quantitative data. We present the university newspaper stories side-by-side with survivor experiences to show how the conversation compares to the experience of sexual assault and how the crime is treated across campuses.

Caktus Group: Caktus is sponsoring Pearl Hacks

Caktus is excited to further encourage young female programmers through our support of PearlHacks, a two-day hackathon and conference hosted by the University of North Carolina - Chapel Hill. This weekend, March 22-23rd, over 200 young women from local high schools and universities in Virginia and North Carolina will arrive for the conference.

We wanted to be more than general corporate sponsors and worked with organizers to find a way to directly engage with the students. Our participation includes serving as a judge during the competition, and one of our staff, Caleb Smith, will teach a workshop, Introduction to Python. We also hope to build excitement for the hackathon by providing the third place team prize: Sphero balls, open source robotic balls that the students can program games with.

Over the course of the hackathon, attendees will be able to attend workshops about different areas of development to improve their skills. After these workshops, attendees will create teams and put what they learned in the workshops to use. There will be an all-night hackathon followed by a judging session where the attendees will demo their projects.

It is very exciting that an event like this is being held in the area and we’re so pleased to continue supporting young programmers, especially women. Our other recent efforts include the donation of tickets to PyCon 2014 to PyLadies financial aid and hosting a Python course for Girl Develop It! at our offices.

Caktus Group: Caktus Completes RapidSMS Community Coordinator Development for UNICEF

Colin Copeland, Managing Member at Caktus, has wrapped up work, supported by UNICEF, as the Community Coordinator for the open source RapidSMS project. RapidSMS is a text messaging application development library built on top of the Django web framework. It creates a SMS provider agnostic way of sending and receiving text messages. RapidSMS has been used widely in the mobile health field, in particular in areas where internet access cannot be taken for granted and cell phones are the best communication tool available. This has included projects initiated by UNICEF country offices in Ethiopia, Madagascar, Malawi, Rwanda, Uganda, Zambia, and Zimbabwe.

Modeling RapidSMS on the Django Community

The overall goals set forth by UNICEF for this project included improving community involvement by making RapidSMS easier to use and contribute to. Colin accomplished this by using Django's large and active community as a model. The community employs common standards to maintain consistency across developers. Using this best practice for the RapidSMS developers meant easier communication, collaboration, and work transfer.

Colin shepherded six releases of the RapidSMS framework over his year-long stint, including 948 code commits to the repository. Colin broadened engagement in the RapidSMS community by tapping five others at Caktus to contribute code, including Dan, Alex, Rebecca, Tobias and Caleb. Evan Wheeler, a founder of RapidSMS, oversaw Caktus’ work on the project and offered an outside perspective. Evan acted as a liaison between Caktus and UNICEF by coordinating our work with Erica Kochi, co-lead of UNICEF’s Innovation Unit.

The major releases to the framework included releases 0.10 through 0.15 and included major updates both on the frontend and backend of RapidSMS.

  • 0.10 Pluggable Routers — This opened the door for different router algorithms for different use cases and removed support for the legacy and difficult to debug threaded router. For example, texts can be handled within the request cycle by using the blocking router or pushed off to a queue (which requires extra dependencies) for handling later. This settled a long standing debate on the mailing list, by letting users make their own decisions and having RapidSMS support different router options out of the box.
  • 0.11 Continuous Iteration — This release focused on testing & continuous iteration with the inclusion of a new RapidTest base class, PEP8 related changes, and monitoring test coverage using the TravisCI continuous integration tool.
  • 0.12 Interface Redesign — Caktus developers updated the default RapidSMS interface to now use Twitter Bootstrap. This included reviewing all of the current contrib apps and deprecating the ones that were no longer necessary.
  • 0.13 Bulk Messaging — This included adding an API for sending messages to many phones at once. As part of this change, the new database router was added which keeps track of which messages in a bulk message group have been sent and resends messages if there is an error.
  • 0.14 Production Hosting Documentation — This change includes best practices for hosting a production instance of RapidSMS. It is agnostic as to which cloud provider is chosen, or what provisioning tool is used, but encourages the use of a tool supported by the Django community to automate creating new servers and pushing out code changes.
  • 0.15 Tutorials & Contributing Documentation — The new documentation released in 0.15 was aimed at helping new users get up to speed quickly by following along with the tutorials. The Django tutorials were a strong influence on the tutorials developed for RapidSMS. The final push was to update the documentation to make it clear how to contribute back to the RapidSMS development community so that development work is not duplicated across RapidSMS implementations.

A few of the themes of Colin’s tenure as Community Coordinator of the RapidSMS project were code consistency via PEP8, a focus on automated testing, test coverage monitoring, continuous integration, and improving documentation. There were also some important new features like the Bootstrap facelift & bulk messaging (supported by the router refactor) which will make it easier to write new apps and interact with RapidSMS as an administrator on the web. Colin pushed the community to embody the spirit of the Django and Python communities in RapidSMS through detailed documentation and testing. For more details about the changes, you can see the Release Notes documentation or the commits themselves.

Enabling Sharing of Development and Field Work

Part way through the development tasks for the RapidSMS project, it was brought up on the RapidSMS community mailing list that rapidsms.org was being reported as a source of malware by Google. This reinforced an already present need to rebuild RapidSMS.org as a sharing platform for both types of RapidSMS framework users – software engineers and mobile health project coordinators.

The software engineers need a place to share their reusable RapidSMS packages on the site. This is a repository of reusable code so that new community members can build their packages using existing code. These packages include apps for appointment reminders and SMS based polling. Beyond a shopping ground for reusable code, the package repository also is a place for new developers to go to see the work of others so that they can get a sense of how to structure their own projects.

Project implementers on the other hand want a high level review of real life projects where RapidSMS has been used and what the outcomes of the projects were. That is, if they are evaluating frameworks to be used by a new SMS based product, they can look at the successes that have been attributed to RapidSMS based projects.

Taking into account the needs of both software engineers and project implementers, Caktus redesigned RapidSMS.org with leadership from Colin and Evan Wheeler. The work was done by Caktus team members, including design by Julia and backend development by Rebecca, Victor, David, and Caleb. The website is also open source and welcomes all contributions, from new bug tickets to pull requests.

Final Thoughts

Overall, Colin provided leadership to RapidSMS and pushed the development standards higher and more in line with its parent projects, Django and Python. The Caktus team, with input from Evan Wheeler and all of the RapidSMS community, rebuilt parts of RapidSMS, from deep in the core of how messages are handled to the look of the external website used by administrative staff. Colin’s leadership laid the groundwork for the next phases of RapidSMS’s codebase and the community surrounding the project.

Tim HopperShould I Do a Ph.D.? Oscar Boykin

Continuing my series of interviews on whether or not a college senior in a technical field should consider a Ph.D., I interviewed Oscar Boykin, a data scientist at Twitter. Prior to joining Twitter, Oscar was a professor of computer engineering at the University of Florida.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with 5-years of guaranteed funding. He doesn't have any job offers but suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him, or what questions might you suggest he ask himself as he goes about making the decision?

Oscar: The number one question: does he or she have a burning desire to do a PhD? If so, and funding is available without taking loans, then absolutely! Go for it! It is a wonderful experience to try to, at least in one small area, touch the absolute frontier of human knowledge. If you greatly enjoy the area of study, and you find it is the kind of thing you love thinking about day in and day out, if you imagine yourself as some kind of ascetic scholar of old, toiling to make the most minor advance simply for the joy of the work, then by all means.

The second reason to do a Ph.D., and this one hardly needs discussing, is that you want a career that requires it. If you want to teach at a university or do certain scientific occupations, a Ph.D may be required.

If the student has doubts about any of the above, I recommend a master's degree and an industry job. There are a few reasons: 1) If you lack the passion, the risk of not completing and the time investment are not worth the cost. 2) There is little, if any, direct financial benefit to having a Ph.D., and the cost in time is substantial. 3) Professions are changing very fast now and you should expect a lifetime of learning in any case, so without a strong desire for a PhD, why not do that learning in the environment where it is most relevant?

He decides he wants to do the Ph.D. but his timing is negotiable. Would you recommend he jump straight in or take some time off?

Oscar: Doing an internship or one year of work at a relevant company will often give students much better insight into choosing research topics. Choosing a great topic and a great advisor is the entire name of the game when it comes to the Ph.D. BUT, taking time off makes it very easy to get out of the habit of studying and learning, so if commitment is a concern, there is substantial risk that time off will turn into never coming back.

Are there skills developed while earning a Ph.D. that are particularly valuable to being a practicing software engineer? Are there ways in which a non-Ph.D. can work to build similar skills?

Oscar: You can't take too much math. Learning linear algebra, probability theory, information theory, Markov chains, differential equations, and how to do proofs have all been very valuable to me, and very few, if any, of my colleagues without PhDs have these skills. Of the hundreds to roughly a thousand people I've interacted with, I have seen on the order of 3-10 who picked these up on their own, so doing so is clearly possible. It is hard to say whether getting a PhD helps you learn those things: perhaps the same people who learned them with a PhD would have done so without. A safe bet seems to be structured education to pick up such classical mathematics.

In your experience, are there potential liabilities that come with getting a Ph.D.? Do doctoral students learn habits that have to be reversed for them to become successful in industry?

Oscar: One problem with the industry/academia divide is that each has a caricatured picture of the other. Academics fear entering industry means becoming a "code monkey" and often disparage strong coding as a skill. I think this is to the detriment of academia as coding is perhaps as powerful a tool as mathematics. Yet, many academics muddle through coding so much that the assumption by teams hiring academics is they will have to unlearn a lot of bad habits if they join somewhere like Twitter, Facebook, Google, etc… This assumption often means that hiring committees are a bit skeptical of an academic's ability to actually be productive. This skepticism must either be countered with strong coding in an interview or some existing coding project that gives evidence of skill.

You didn't really ask much about the career path of academia vs industry. I did want to address that a bit. First, those paths are much more similar than most people realize. As a professor, on average, your colleagues are brighter, and that is exciting. But academia today is very focused on fundraising, and that fundraising involves a lot of politics and salesmanship. In the software industry today, one has a lot of perks: great salary, lots of (even unlimited) vacation time, the freedom to focus on the things you enjoy the most (compared to being a professor, who does 3-4 different jobs concurrently). As a professor, you are running a startup that can never be profitable: you are always raising money and hiring. The caliber of the very best in industry is also just as high or higher than in academia (though the mean may be lower). I much prefer my job at Twitter to my time in academia.

Oscar Boykin is a data scientist at Twitter. He earned a Ph.D. in Physics at UCLA. You can find his scholarly work on Google Scholar. You can find him on Twitter and Github.

Joe GregorioIPython and Curves

So I've been impressed with IPython so far. It can be a little fiddly to install, but given the power I'm not going to complain. I'm now working on graphics, and to get up to speed I'm going back and learning, or relearning, some basics. Today it was Bézier curves, and thus this IPython notebook. Note that the content there isn't actually educational; you should follow the links provided to really learn about these constructs. I just wanted an excuse to try out LaTeX and plot some interesting graphs.

Tim HopperRosetta Stone for Stochastic Dynamic Programming

In grad school, I worked on what operations researchers call approximate dynamic programming. This field is virtually identical to what computer scientists call reinforcement learning. It's also been called neuro-dynamic programming. All these things are an extension of stochastic dynamic programming, which is usually introduced through Markov decision processes.

Because these overlapping fields have been developed in different disciplines, mathematical notation for them can vary dramatically. After much confusion and frustration, I created this Rosetta Stone to help me keep everything straight. I hope others will find it useful as well. A PDF and the original LaTeX code are available on my Github.

|                   | Bertsekas     | Sutton and Barto        | Puterman                   | Powell                            |
|-------------------|---------------|-------------------------|----------------------------|-----------------------------------|
| Stages            | $k$           | $t$                     | $t$                        | $t$                               |
| First Stage       | $N$           | $1$                     | $1$                        | $1$                               |
| Final Stage       | $0$           | $T$                     | $N$                        | $T$                               |
| State Space       | $S$           | $\mathcal{S}$           | $S$                        | $\mathcal{S}$                     |
| State             | $i$, $i_{k}$  | $s$                     | $s$                        | $s$, $S_{t}$                      |
| Action Space      | $U(i)$        | $\mathcal{A}(s)$        | $A_{s}$, $A=\cup_{s}A_{s}$ | $\mathcal{A}$                     |
| Action            | $u$           | $a$                     | $a$                        | $a$, $a_{t}$                      |
| Policy            | $\mu_{k}$     | $\pi(s,a)$, $\pi$       | $\pi$, $d_{t}$             | $\pi$                             |
| Transitions       | $p_{ij}(u)$   | $\mathcal{P}^{a}_{ss'}$ | $p_{t}(j \mid s,a)$        | $\mathbb{P}(s' \mid S_{t},a_{t})$ |
| Cost              | $g(i,u,j)$    | $\mathcal{R}^{a}_{ss'}$ | $r_{t}(s,a)$               | $C_{t}(S_{t},a_{t})$              |
| Terminal Cost     | $G(i_{0})$    | $r_{T}$                 | $r_{N}(s)$                 | $V_{T}$                           |
| Discount          | $\alpha$      | $\gamma$                | $\lambda$                  | $\gamma$                          |
| Q-Value (Policy)  |               | $Q^{\pi}(s,a)$          |                            |                                   |
| Q-Value (Optimal) |               | $Q^{*}(s,a)$            |                            |                                   |
| Value (Policy)    | $J_{\mu}$     | $V^{\pi}$               | $u_{t}^{\pi}$              | $V_{t}^{\pi}$                     |
| Value (Optimal)   | $J^{*}$       | $V^{*}$                 | $u_{t}^{*}$                | $V_{t}^{*}$                       |
| Bellman Operator  | $T$           |                         | $\mathscr{L}$              | $\mathcal{M}$                     |

Optimal Value Function

  • Bertsekas

$$J_{k}^{*}(i)=\min_{u\in U(i)}\sum_{j=1}^{n}p_{ij}(u)\left(g(i,u,j)+\alpha J^{*}_{k-1}(j)\right)$$

  • Sutton and Barto

$$V^{*}(s)=\max_{a}\sum_{s'}\mathcal{P}^{a}_{ss'}\left[\mathcal{R}^{a}_{ss'}+\gamma V^{*}(s')\right]$$

  • Puterman

$$u_{t}^{*}(s_{t})=\max_{a\in A_{s_{t}}}\left\{r_{t}(s_{t},a)+\sum_{j\in S}p_{t}(j\,|\,s_{t},a)u_{t+1}^{*}(j)\right\}$$

  • Powell

$$V_{t}(S_{t})=\max_{a_{t}}\left\{C_{t}(S_{t},a_{t})+\gamma\sum_{s'\in \mathcal{S}}\mathbb{P}(s'\,\vert\,S_{t},a_{t})V_{t+1}(s')\right\}$$
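To make the correspondence concrete, here is a minimal backward recursion on a made-up two-state, two-action MDP, written mostly in Puterman's notation; the transition probabilities and rewards are invented purely for illustration:

```python
# Finite-horizon optimal values: u_t(s) = max_a { r(s,a) + sum_j p(j|s,a) u_{t+1}(j) }
# The MDP below (2 states, 2 actions) is a made-up example.
T = 3                                # final stage
S = [0, 1]                           # state space
A = [0, 1]                           # action space
# p[s][a][j]: probability of landing in state j from state s under action a
p = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
# r[s][a]: one-stage reward (Bertsekas would minimize a cost g instead)
r = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}

u = [0.0 for _ in S]                 # terminal values: u_{T+1}(s) = 0
for t in range(T, 0, -1):            # sweep backward from stage T to stage 1
    u = [max(r[s][a] + sum(p[s][a][j] * u[j] for j in S) for a in A)
         for s in S]
print(u)                             # u_1(s): optimal value from each starting state
```

Renaming $u$ to $J$, flipping max to min, and indexing stages in reverse recovers Bertsekas's form; the computation itself is unchanged.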


  • D.P. Bertsekas. Dynamic Programming and Optimal Control. Number v. 2 in Athena Scientific Optimization and Computation Series. Athena Scientific, 2007. ISBN 9781886529304. URL.
  • W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. John Wiley & Sons, 2011. ISBN 9781118029152. URL.
  • M.L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and statistics. Wiley-Interscience, 1994. ISBN 9780471727828. URL.
  • R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998. ISBN 9780262193986. URL.

Tim HopperShould I Do a Ph.D.? Mike Nute

A day late, but never a dollar short, the next contributor to my series on whether or not a college senior should consider getting a Ph.D. is Mike Nute, a current Ph.D. student. Mike scoffed at my banal interview format and instead wrote an open letter to a hypothetical student.

Dear Young Student,

First, as far as whether you should do the Ph.D. program, you should think long and hard about it because there is a lot of upside and a lot of downside. Here are some things in particular you should think about:

1) Going to a Ph.D. program straight from undergrad and continuing essentially the same field of study is a lot like marrying your high school sweetheart. Certainly for some people that works out ok, but for many others it ends in a terrible divorce. In the case of the Ph.D. program, you should bear in mind that the academic world and the real world are two very different places, and that up until this point you have only seen the academic world, so you might very well enjoy the real world just as much. Bear in mind also that the professors and advisors you have dealt with up to this point tend to be people who have thrived in the academic world and their advice will come from that lens. So do your best to get at least some exposure to the variety of jobs out there where you could apply the discipline. You should especially do this if the reason that you think you'd like to get a Ph.D. is to enter academia afterward. Take a little Rumspringa from that life before you enter the order.

2) If you find yourself thinking about the Ph.D. program as a means to an end, like as a requirement for some job, then you should strongly consider whether there is an easier way. More specifically, if you don't think that getting the Ph.D. is going to be fun on its own, then there's a strong chance you'll be miserable and it will end badly. If you want to be a professor, then getting a Ph.D. is really the only way to go about it, but for virtually any other goal there is probably an alternative that doesn't require the same sacrifice.

3) The way you should be thinking about the program is like this: it's a chance to spend five years doing nothing but studying your favorite subject in great depth while keeping adequate health insurance and avoiding debt. You really won't have either the time or the money to do very much else, so you had better really love this subject. You remember that episode of the Simpsons where the devil's ironic punishment for Homer is to force feed him an endless stream of donuts, but the devil gets frustrated because Homer never stops loving the donuts? Well you have to love your subject like Homer loves donuts, because that's going to be you. If you do love it though, you won't even notice being broke or studying all the time.

4) In fact, you should come to grips right now with the fact that you may finish your Ph.D. and find that you want to change careers and never revisit that subject again. That may sound unimaginable, but it's possible mainly because you're 22 and who the hell knows what you'll want when you're 28. If that happens though, will you look back on having gotten a Ph.D. as a terrible decision and a waste of time? If so, then don't do it. If you think you'll be proud of it and will have enjoyed yourself no matter what, then it's actually a low risk move because that's basically the worst case scenario.

5) You should also note that there is a major difference between a Masters and Ph.D. program. First of all, the Ph.D. will be much more intense, even in the first two years, than a Masters. Since most Masters programs are not funded but Ph.D.s are, you can think of it as the difference between being an employee and being a customer. But on the other hand, most industry jobs are as happy to have you with a Masters as with a Ph.D., so you can easily use the extra years of work to pay off the loans from the Masters program. This reinforces the last point above: the only real reason to do a Ph.D. program is for love of the subject. 

In my case, I came back to grad school after seven years in industry. From my experience you can take two lessons: 1) try to avoid waiting seven years to go back if you can, and 2) if you do wait that long, just know that it's never too late. The longer you wait, the harder the sacrifice of time and money will be. But you gotta do what's going to make you happy, and there are a lot of ways to be happy with a job. You can literally enjoy being at work every day, you can do something that benefits others, you can do something very challenging, or you can do something that enables you to enjoy other parts of your life, like giving you schedule flexibility or a sufficiently high salary. There are others too. In my case, I had a very tiny bit of all of those but not enough of any one to really count, which is why it took me so long to leave. Grad school though is challenging, and it makes me proud as hell to tell people about because I know how hard it is. So think about which of those is the most important to you, and plan accordingly.

So anyway, good luck young man or lady, and don't stress about it too hard; you can always change your mind later.


Mike Nute

Mike Nute is a recovering actuary and a current Ph.D. student in statistics at University of Illinois at Urbana-Champaign.

Caktus GroupManaging Events with Explicit Time Zones

Recently we wanted a way to let users create real-life events which could occur in any time zone that the user desired. By default, Django interprets any date/time that the user enters as being in the user’s time zone, but it never displays that time zone, and it converts the time zone to UTC before storing it, so there is no record of what time zone the user initially chose. This is fine for most purposes, but if you want to specifically give the user the ability to choose different time zones for different events, this won’t work.

One idea I had was to create a custom model field type. It would store both the date/time (in UTC) and the preferred time zone in the database, and provide a form field and some kind of compound widget to let the user set and see the date/time with its proper time zone.

We ended up with a simpler solution. It hinged on considering the time zone separately from a time. In our case, we would set a time zone for an event. Any date/time fields in that event form would then be interpreted to be in that time zone.

Now, displaying a time in any time zone you want isn't too hard, and we weren't worried about that. More troublesome was letting a user enter an arbitrary time zone in one form field, and some date and time in other fields, and interpreting that date and time using the chosen time zone when the form was validated. Normally, Django parses a date/time form field using the user's time zone and gives you back a UTC datetime - all time zone information is lost.
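The core issue can be sketched outside Django entirely. Here is a minimal example (using the stdlib zoneinfo module in place of pytz, with invented times) showing that once a naive datetime is interpreted in a chosen zone and converted to UTC, the chosen zone is gone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

naive = datetime(2014, 7, 4, 20, 0)          # "8 PM" - but in which zone?
# Interpret it as US Eastern time (what Django does with the *user's* zone)
eastern = naive.replace(tzinfo=ZoneInfo('America/New_York'))
# Convert to UTC (what Django does before saving)
as_utc = eastern.astimezone(timezone.utc)

print(as_utc.isoformat())    # 2014-07-05T00:00:00+00:00
print(as_utc.tzname())       # UTC - no trace of America/New_York remains
```

(With pytz you would call localize() on the zone object rather than using replace(), because of pytz's historical-offset handling.)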

We started by defining a custom form to validate entry of time zone names:

class TimeZoneForm(forms.Form):
    """Form just to validate the event timezone field"""
    event_time_zone = fields.ChoiceField(choices=TIME_ZONE_CHOICES)

Then in our view, we processed the submitted form in two steps. First, we got the time zone the user entered.

from django.utils import timezone

def view(request):

    if request.method == 'POST':
        tz_form = TimeZoneForm(request.POST)
        if tz_form.is_valid():
            tz = tz_form.cleaned_data['event_time_zone']

Then, before handling the complete form, we activated that time zone in Django, so the complete form would be processed in the context of that event's time zone:

from django.utils import timezone

def view(request):

    if request.method == 'POST':
        tz_form = TimeZoneForm(request.POST)
        if tz_form.is_valid():
            tz = tz_form.cleaned_data['event_time_zone']
            timezone.activate(tz)
            # Process the full form now

When displaying the initial form, we activate the event's time zone before constructing the form, so those times are displayed using the event's time zone:

    # assuming we have an event object already
    timezone.activate(event.event_time_zone)
    # Continue to create form for display on the web page


What we just showed is a simplification of our actual solution, because we were using the Django admin to add and edit events, not custom forms. Here's how we customized the admin.

First, we wanted to display event times in a column in the admin change list, in their proper time zones. That kind of thing is pretty easy in the admin:

from pytz import timezone as pytz_timezone

class EventAdmin(admin.ModelAdmin):
    list_display = [..., 'event_datetime_in_timezone', ...]

    def event_datetime_in_timezone(self, event):
        """Display each event time on the changelist in its own timezone"""
        fmt = '%Y-%m-%d %H:%M:%S %Z'
        dt = event.event_datetime.astimezone(pytz_timezone(event.event_time_zone))
        return dt.strftime(fmt)
    event_datetime_in_timezone.short_description = _('Event time')

This uses pytz to convert the event's time into the event's time zone, then strftime to format it the way we wanted it, including the timezone.
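Stripped of the admin machinery, the conversion is just astimezone plus strftime. A standalone sketch (the helper name format_in_zone is my own, and the stdlib zoneinfo stands in for pytz here) might look like:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def format_in_zone(dt_utc, tz_name, fmt='%Y-%m-%d %H:%M:%S %Z'):
    """Render a UTC datetime in the named zone, including the zone abbreviation."""
    return dt_utc.astimezone(ZoneInfo(tz_name)).strftime(fmt)

dt = datetime(2014, 1, 15, 17, 0, tzinfo=timezone.utc)
print(format_in_zone(dt, 'America/New_York'))   # 2014-01-15 12:00:00 EST
```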

Next, when adding a new event, we wanted to interpret the times in the event's time zone. The admin's add view is just a method on the model admin class, so it's not hard to override it, and insert the same logic we showed above:

class EventAdmin(admin.ModelAdmin):
    # ...

    # Override add view so we can peek at the timezone they've entered and
    # set the current time zone accordingly before the form is processed
    def add_view(self, request, form_url='', extra_context=None):
        if request.method == 'POST':
            tz_form = TimeZoneForm(request.POST)
            if tz_form.is_valid():
                timezone.activate(tz_form.cleaned_data['event_time_zone'])
        return super(EventAdmin, self).add_view(request, form_url, extra_context)

That handles submitting a new event. When editing an existing event, we also need to display the existing time values according to the event's time zone. To do that, we override the change view:

class EventAdmin(admin.ModelAdmin):
    # ...

    # Override change view so we can peek at the timezone they've entered and
    # set the current time zone accordingly before the form is processed
    def change_view(self, request, object_id, form_url='', extra_context=None):
        if request.method == 'POST':
            tz_form = TimeZoneForm(request.POST)
            if tz_form.is_valid():
                timezone.activate(tz_form.cleaned_data['event_time_zone'])
        else:
            # On GET, display the existing times in the event's own time zone
            obj = self.get_object(request, unquote(object_id))
            if obj is not None:
                timezone.activate(obj.event_time_zone)
        return super(EventAdmin, self).change_view(request, object_id, form_url, extra_context)

One more thing. Since the single time zone field is applied to all the times in the event, if someone changes the time zone, they might need to also adjust one or more of the times. As a reminder, we added help text to the time zone field:

event_time_zone = models.CharField(
    help_text=_('All times for this event are in this time zone. If you change it, '
                'be sure all the times are correct for the new time zone.'))

Thanks to Vinod Kurup for his help with this post.

Caktus GroupCaktus is hiring a Web Designer-Contractor

Caktus is actively seeking local web design contractors in the North Carolina Triangle area. We’re looking for folks who can contribute to our growing team of designers on a per-project basis. Our team is focused on designing for the web using HTML5, CSS3, LESS, and responsive design best practices. We take an iterative approach with our clients involving them early and often. So, if you’re a designer looking for some extra work and want to sit in with our sharp team check out the job posting. It has more information about the types of projects you would be working on and some of the skills we hope to find in your toolbox. If it sounds like a good fit, drop us a line—we’d love to chat!

Josh JohnsonBuilding A DNS Sandbox

I’m developing some automation around DNS. It’s imperative that I don’t break anything that might impact any users. This post documents the process I went through to build a DNS sandbox, and serves as a crash course in DNS, which is, like most technology that’s been around since the dawn of time, a lot simpler than it seems at first glance.

Use Case

When a new machine is brought up, I need to register it with my internal DNS server. The DNS server doesn’t allow remote updating, and there’s no DHCP server in play (for the record, I’m bringing up machines in AWS). I have access to the machine via SSH, so I can edit the zone files directly. As SOP, we use hostdb and a git repository. This works really well for manual updates, but it’s a bit clunky to automate, and has one fatal flaw: what’s in DNS and what’s in the git repo can fall out of parity. My hope is to eliminate this by using the DNS server as the single source of truth, using other mechanisms for auditing and change-management.

So the point of the DNS sandbox is to make testing/debugging of hostdb easier and to facilitate rapid development of new tools.


Please be aware that this setup is designed strictly for experimentation and testing – it’s not secure, or terribly useful outside of basic DNS functionality. Please take the time to really understand what it takes to run a DNS server before you try to set something like this up outside of a laboratory setting.

Contributions Welcome

If you have any trouble using this post setting up your own DNS sandbox, please leave a comment.

If you have any suggestions or corrections, again, leave a comment!

Together we can make this better, and help make it easier to put together infrastructure for testing for everybody.

Coming Soon

There are a lot of features of DNS that this setup doesn’t take into account. I’m planning on following up this post as I add the following to my sandbox setup (suggestions for other things are welcome!)

  • Remote updates (being able to speak DNS directly to the server from a script)
  • Chrooting the bind installation for security.
  • DNS Security (DNSSEC) – makes remote updates secure.
  • Slaves – DNS can be configured so that when one server is updated, the changes propagate to any number of slave servers.

Server Setup

I started with a stock Ubuntu 12.04 server instance.

I installed bind9 using apt, and created a couple of directories for log and zone files.

$ sudo apt-get install bind9
$ sudo mkdir -p /var/log/named
$ sudo chown bind:bind /var/log/named
$ sudo mkdir /etc/bind/zones

Zone Files


See: http://www.zytrax.com/books/dns/ch8

  • Zone files have a header section called the SOA
  • Fully-qualified domain names end with a period (e.g. my.domain.com.), subdomains relative to an FQDN do not (e.g. my, www).
  • Zone files have an origin – a base added to every name in the file. The origin is defined using the $ORIGIN directive (see: http://www.zytrax.com/books/dns/ch8/#directives)
  • The @ symbol is used as a shortcut for $ORIGIN.
  • Each zone file is referenced in the named.conf file (see Configuration below for details)
  • The name of the file itself is immaterial – there are many standards in the wild – I’m opting to keep them consistent with the name of the zone in named.conf.
  • Zone files have a serial number, which consists of the current date and an increment number. Example: 20131226000 (YYYYMMDDXXX). This number must be incremented every time you make a change to a zone file, or bind will ignore the changes.
  • Comments start with a semi-colon (;) and run to the end of the line.
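The serial-number convention above is easy to automate. Here's a sketch (the helper name is my own invention, not part of any DNS tooling) that produces the next YYYYMMDDXXX serial given the current one:

```python
import datetime

def next_serial(current, today=None):
    """Return the next YYYYMMDDXXX-style zone serial after `current`."""
    today = today or datetime.date.today()
    date_part = int(today.strftime('%Y%m%d')) * 1000
    if current >= date_part:
        return current + 1    # another change today: bump the increment
    return date_part          # first change today: start over at ...000

print(next_serial(20131226000, datetime.date(2013, 12, 26)))   # 20131226001
print(next_serial(20131226005, datetime.date(2013, 12, 27)))   # 20131227000
```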

DNS lookups can happen in two directions: forward and reverse. Forward lookups resolve a domain name to an IP address; reverse lookups resolve an IP address back to a domain name. Each type of lookup is controlled from a separate zone file, with different types of records.

See http://www.zytrax.com/books/dns/ch8/#types for details about the different types of records. This post only deals with SOA, NS, A, CNAME and PTR records.

Note that reverse lookup is not required for a functioning DNS setup, but is recommended.

Forward Lookup


Our domain name is example.test.

$ORIGIN example.test.
$TTL 1h

@    IN    SOA    ns.example.test.    hostmaster.example.test. (
    20131226000  ; serial number
             1d  ; refresh
             2h  ; update retry
             4w  ; expiry
             1h  ; minimum
             )

@               NS     ns

ns               A
box1             A
alt              CNAME  box1

  • Line 1 sets the origin. All entries will be a subdomain of example.test. You can put whatever you want in this stanza, but keep it consistent in the other areas.
  • Line 2 sets the Time To Live for records in this zone.
  • Lines 4-10 are the SOA.
  • On line 4, we use @ to stand in for the $ORIGIN directive defined on line 1. We specify the authoritative server (ns.example.test.), which we will define in an A record later. Finally, we specify the e-mail address of a person responsible for this zone, replacing the at symbol (@) with a period.
  • Line 5 contains the serial number. This will need to be incremented every time we make a change. In this example, I’m starting with the current date and 000, so we’ll get 999 updates before we have to increment the date.
  • Line 12 is a requirement of Bind – we must specify at least one NS record for our DNS server. The @ symbol is used again here to avoid typing the origin again. The hostname for the NS record is ns, which means ns.example.test, defined in an A record on line 14.
  • Line 14 defines our DNS server for the NS record on line 12. We’re using localhost here to point back to the default setup we got from using the ubuntu packages.
  • Line 15 is an example of another A record, for a box named box1.example.test. Its IP address is Note that the actual IP addresses here do not need to be routable to the DNS server; all it’s doing is translating a hostname to an IP address. For testing purposes, this can be anything. Just be aware that reverse lookups are scoped to a given address range, so things will need to be consistent across the two zones.
  • Finally on line 16, we have an example of a CNAME record. This aliases the name alt.example.test to box1.example.test, and ultimately resolves to

Reverse Lookup


    We’re setting up reverse lookups for the 192.168.0.x subnet (CIDR

    $ORIGIN 0.168.192.in-addr.arpa.
    $TTL 1h

    @   IN  SOA     ns.example.test.     hostmaster.example.test. (
            20131226000  ; serial number
                     1d  ; refresh
                     2h  ; update retry
                     4w  ; expiry
                     1h  ; minimum
                     )

        IN      NS      ns.example.test.
    1   IN      PTR     box1.example.test.
    • Lines 1-10 are the SOA, and are formatted the exact same way as in our forward zone file.

      Note that the $ORIGIN is now 0.168.192.in-addr.arpa.. The in-addr.arpa domain is special: it is used for reverse lookups. The numbers before the top level domain are simply the subnet octets, reversed (192.168.0 becomes 0.168.192).

      Remember, this serves as shorthand for defining the entry records below the SOA.

    • Line 12 is the required NS record, pointing at the one that we set up an A record for in the forward zone file.
    • Finally, line 13 is a typical PTR record. It associates with box1.example.test.


    Configuration

    In the default ubuntu setup, local configuration is handled in /etc/bind/named.conf.local (this is simply included into /etc/bind/named.conf).

    See http://www.zytrax.com/books/dns/ch7/ for details about the named.conf format and what the directives mean.

    zone "example.test." {
            type master;
            file "/etc/bind/zones/example.test";
            allow-update { none; };
    };

    zone "0.168.192.in-addr.arpa." {
            type master;
            file "/etc/bind/zones/0.168.192.in-addr.arpa";
            allow-update { none; };
    };

    logging {
      channel simple_log {
        file "/var/log/named/bind.log" versions 3 size 5m;
        severity debug;
        print-time yes;
        print-severity yes;
        print-category yes;
      };
      category default {
        simple_log;
      };
    };
    • Lines 1-5 set up our forward zone “example.test.”. Note that allow-update is set to none. This simplifies our configuration and prevents updates to this zone from other servers.
    • Lines 7-11 set up the reverse zone “0.168.192.in-addr.arpa.”.
    • Lines 13-24 set up simple (and verbose) logging to /var/log/named/bind.log. See http://www.zytrax.com/books/dns/ch7/logging.html for details about the setting here.


    Configuration Syntax Check

    We can use the named-checkzone utility to verify our zone file syntax before reloading the configuration.

    You specify the name of the zone and then the filename (the -k fail parameter causes it to return a failed return code when an error is found, useful for automated scripts):

    $ named-checkzone -k fail example.test /etc/bind/zones/example.test
    zone example.test/IN: loaded serial 2951356816

    In the case of a reverse zone file:

    $ named-checkzone -k fail 0.168.192.in-addr.arpa /etc/bind/zones/0.168.192.in-addr.arpa
    zone 0.168.192.in-addr.arpa/IN: loaded serial 2951356817
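    Note that the "loaded serial" in this output doesn't match the serial written in the zone file. Zone serials are unsigned 32-bit integers, and 20131226000 is larger than 2^32, so bind reports the value modulo 2^32. A quick check:

```python
# DNS zone serials are unsigned 32-bit integers; values larger than
# 2**32 - 1 wrap around, which is what named-checkzone is reporting.
print(20131226000 % 2**32)   # 2951356816
print(20131226001 % 2**32)   # 2951356817
```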

    Reloading Config

    Configuration can be reloaded with the rndc reload command.

    $ sudo rndc reload

    It’s helpful to run tail -f /var/log/named/bind.log in another terminal window during testing.

    Testing DNS Queries

    The definitive tool is dig. nslookup is also useful for basic queries.

    With both tools, it’s possible to specify which DNS server to query. In this case, it’s assumed that we’re logged in to the sandbox DNS server, so we’ll use ( as the server to query.

    With dig

    Note: remove the +short parameter from the end of the query to get more info.

    Forward Lookup

    The A record:

    $ dig @ box1.example.test +short

    The CNAME:

    $ dig @ alt.example.test +short

    Reverse Lookup

    $ dig @ -x +short

    With nslookup

    Forward Lookup

    The A record:

    $ nslookup box1.example.test

    Name:	box1.example.test

    The CNAME:

    $ nslookup alt.example.test

    alt.example.test	canonical name = box1.example.test.
    Name:	box1.example.test

    Reverse Lookup

    $ nslookup	name = box1.example.test.

    Using Your Sandbox

    Now that the DNS sandbox is built and working correctly, you may want to add it
    to your list of DNS servers.

    This process will vary depending on what operating system you use, and is an
    exercise best left to the user. However, here are some pointers:

    Note: depending on your setup, you will likely need to put your sandbox DNS server
    first in the list.

    Mac OS X: https://www.plus.net/support/software/dns/changing_dns_mac.shtml

    Ubuntu: http://www.cyberciti.biz/faq/ubuntu-linux-configure-dns-nameserver-ip-address/

Joe GregorioThe shiny parabolic broadcast spreader of vomit

Due to suspected food poisoning I checked into the local emergency room last night around 2 AM, trusty 13 gallon plastic garbage bag in hand, because, I've been throwing up. Once they get me into a room the nurse offers me a shallow pink plastic pan in exchange for my plastic garbage bag, and I'm thinking to myself, "Really, have you never seen anyone vomit in your entire life?". Why on earth would you offer me a shallow round bottom plastic pan as an alternative to a 13 gallon plastic garbage bag? This is vomit we're trying to contain here. This reminds me of a previous visit to the same ER with my son when he had appendicitis, this time we came in with a kitchen garbage pail and the nurse laughed at us and handed him a small metal kidney dish. My son held it in his hands, looking at it perplexedly for about five seconds before he started vomiting again, turning it from a kidney dish into a shiny metal parabolic broadcast spreader of vomit.

I don't know what to make of this phenomenon, as I thought that working in an ER would expose you to a lot of puking people, and thus you'd know better than to give someone a kidney dish to throw up in. I can only come up with two possible conclusions: the first is that the ER folks just aren't quick learners and haven't picked up on the association:

    kidney dish : vomit :: broadcast spreader : grass seed

The other possibility is that my family is unique, maybe my wife and I are both descended from long lines of projectile vomitters, a long and honorable tradition of high velocity emesis, and that the rest of population is filled with polite, low volume, low velocity vomitters. If so, you people make me sick.

Joe GregorioSnow

Reilly has been learning Javascript, and one of the projects he wanted to do was a snow simulation. I guess, growing up in the South, snow is a rare and wondrous event for him.

Tim HopperNoisy Series and Body Weight, Take 2

Back in July, I posted some analysis of my attempt at weight loss. Now that I'm four months further down the line, I thought I'd post a follow-up.

I continue to be fascinated with how noisy my weight time series is. While I've continued to lose weight over time, my weight goes up two out of five mornings.

Here's a plot of the time series of my change in weight. Note how often the change is positive, i.e. I appear to have gained weight:

This volatility can hide the fact that I'm making progress! When I put a regression line through the points, you can see that the average change is slightly below zero:2

I have wondered recently if my average change in weight is correlated with the day of the week. My hypothesis is that my weight tends to go up over the weekends, so I created a boxplot of my change in weight categorized by day.

Indeed, on Sundays and Mondays (i.e. weight change from Saturday morning to Sunday morning and Sunday morning to Monday morning) my median weight change is slightly above zero. This makes sense to me: on Saturdays, I'm more likely to be doing things with friends, and thus I have less control over my meals.1
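If you want to replicate this kind of day-of-week breakdown, here's a minimal sketch using only the Python standard library. The weigh-in numbers and start date below are invented for illustration; substitute your own log:

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median

# Invented sample data: one morning weigh-in per day.
weights = [200.0, 199.5, 200.1, 199.8, 199.2, 199.9, 200.3,
           199.7, 199.4, 199.0, 199.6, 198.8, 199.1, 199.5]
start = date(2013, 11, 1)

# Day-over-day change, bucketed by the day of the week it was observed.
changes = defaultdict(list)
for i in range(1, len(weights)):
    day = (start + timedelta(days=i)).strftime('%A')
    changes[day].append(round(weights[i] - weights[i - 1], 1))

for day, deltas in sorted(changes.items()):
    print(day, round(median(deltas), 2))
```

With a real log you'd feed the same buckets to a plotting library's boxplot function instead of printing medians.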

I wish I had a good explanation for why the change on Friday is so dramatic, but I don't. Any guesses?

  1. Also, beer. 

  2. I mentioned this to my college roommate who is a financial planner. He noted how similar this is to investing; it's a constant battle for him to convince his clients to look at average behavior instead of daily changes.  

Josh JohnsonCan Scrum Scale Down?

Prompted by a discussion on a LinkedIn group, I was reminded of a presentation deck I put together a couple of years ago to capture what my cohorts and I were doing for project management at the time.

The short answer is: “why yes, yes scrum can most certainly scale down”. How far down? I think with the right frame of mind, it can scale down to a single individual.

Here are the slides as they stand. They’re a few years old, entitled “The blitZEN Method”.

I’ve seen this process work, in practice, with as few as 2 people. It’s worked with a cross-functional team of 5. I’ve applied the concepts to individual work as well. So, I’d say I’ve proven it can scale down. But is it really Scrum?

When it was written, the only experience I had with Agile development methodologies was on the team from which The blitZEN Method was born. It shows a bit in my terminology, and the simplicity of the overall approach.

Since then, I’ve left UNC and I’ve worked in several different so-called “Agile” shops. I’ve yet to see any as effective as The blitZEN Method was. These were organizations filled with a zeal for core Agile values, who had consultants and coaches and trainers – folks paid unfathomable amounts of money, all for nothing. At best, people would bypass core values just to get work done. At worst, low-quality code would get rushed through to production in spite of it all – ceremony for the sake of ceremony. Not just Scrum – I’ve seen Kanban fail too.

So, maybe everyone is just doing it wrong. Maybe there’s something really special about The blitZEN Method. Maybe the people I worked with at the time were what was really special. It’s a tough call, even in hindsight.

When you’ve seen something evolve into a proven methodology, and you go out into the world that spawned it, and find yourself constantly bombarded with contradictory information from highly dogmatic sources, you start to wonder what happened – is it me, or is it Agile? Are we all kidding ourselves?

Take a look, and let me know what you think. Feedback is greatly appreciated. I’m especially interested in hearing of any applications of the approach – I haven’t been in a position to try myself for several years.

I hope to update these slides soon, I would love to incorporate your insights.

Note: I’m also working on a project to expand on some of the concepts – it’s been on my back burner for a while, but keep an eye on How I Develop Web Apps in my github.

Tim HopperTweeting Primes

I recently discovered the Twitter account @primes. Every hour, they tweet the subsequent prime number. This made me wonder two things. First, what is the largest prime that you can tweet (in base-10 encoding in 140 characters).1 Second, how long until they get there.

Doing some quick calculations in Mathematica, I believe the largest 140 digit prime is the following:

9999999999999999999999999999999999999999999999 9999999999999999999999999999999999999999999999 999999999999999999999999999999999999999999999997

Wolfram Alpha confirms that this is prime and that the next prime is 141 characters.

As for how long it would take, recall that the number of primes less than $n$ is approximately $\frac{n}{\ln n}$. The number of primes less than $10^{140}$ is approximately

$$\pi(10^{140}) \approx \frac{10^{140}}{140\cdot \ln 10} \approx 3.1\cdot 10^{137}.$$

That's $3\cdot 10^{57}$ times the estimated number of atoms in the universe. Looks like @primes should be able to tweet for a while.
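Both claims are easy to sanity-check in Python. The sketch below uses a Miller-Rabin test with a handful of fixed bases, which is a quick probabilistic check rather than a primality proof, plus the prime-counting estimate above:

```python
import math

def is_probable_prime(n):
    """Miller-Rabin with a handful of fixed bases (a check, not a proof)."""
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 = d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small_primes:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

largest = 10**140 - 3  # 139 nines followed by a 7
print(is_probable_prime(largest))
print(len(str(largest)))  # 140 digits

# The prime-counting estimate: pi(10**140) is about 10**140 / (140 ln 10).
print('%.1e' % (10**140 / (140 * math.log(10))))  # 3.1e+137
```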

  1. The largest known prime is $2^{57,885,161} − 1$ and has 17,425,170 digits.  

Caktus GroupUsing strace to Debug Stuck Celery Tasks

Celery is a great tool for background task processing in Django. We use it in a lot of the custom web apps we build at Caktus, and it's quickly becoming the standard for all varieties of task scheduling workloads, from simple to highly complex.

Although it happens only rarely, sometimes a Celery worker may stop processing tasks and appear completely hung. In other words, issuing a restart command (through Supervisor) or kill (from the command line) doesn't immediately restart or shut down the process. This can particularly be an issue in cases where you have a queue set up with only one worker (e.g., to avoid processing any of the tasks in this queue simultaneously), because then none of the new tasks in the queue will get processed. In these cases you may find yourself resorting to manually calling kill -9 <pid> on the process to get the queue started back up again.

We recently ran into this issue at Caktus, and in our case the stuck worker process wasn't processing any new tasks and wasn't showing any CPU activity in top. That seemed a bit odd, so I thought I'd make an attempt to discover what that process was actually doing at the time that it became non-responsive. Enter strace.

strace is a powerful command-line tool for inspecting running processes to determine what "system calls" they're making. System calls are low level calls to the operating system kernel that might involve accessing hard disks, the network, creating new processes, or other such operations. First, let's find the PID of the celery process we're interested in:

ps auxww|grep celery

You'll find the PID in the second column. For the purposes of this post let's assume that's 1234. You can inspect the full command in the process list to make sure you've identified the right celery worker.

Next, run strace on that PID as follows:

sudo strace -p 1234 -s 100000

The -p flag specifies the PID, and the -s flag specifies the maximum length of strings in the output. By default strings are truncated to 32 characters, which we found isn't very helpful if the system call being made includes a long string as an argument. You might see something like this:

Process 1234 attached - interrupt to quit
write(5, "ion id='89273' responses='12'>\n     ...", 37628

In our case, the task was writing what looked like some XML to file descriptor "5". The XML was much longer and at the end included what looked like a few attributes of a pickled Python object, but I've shortened it here for clarity's sake. You can see what "5" corresponds to by looking at the output of lsof:

sudo lsof|grep 1234

The file descriptor shows up in the "FD" column; in our version of strace, that happens to be the 4th column from the left. You'll see a bunch of files that you don't care about, and then down near the bottom, the list of open file descriptors:

python    1234   myuser    0r     FIFO                0,8      0t0    6593806 pipe
python    1234   myuser    1w     FIFO                0,8      0t0    6593807 pipe
python    1234   myuser    2w     FIFO                0,8      0t0    6593807 pipe
python    1234   myuser    3u     0000                0,9        0       4738 anon_inode
python    1234   myuser    4r     FIFO                0,8      0t0    6593847 pipe
python    1234   myuser    5w     FIFO                0,8      0t0    6593847 pipe
python    1234   myuser    6r      CHR                1,9      0t0       4768 /dev/urandom
python    1234   myuser    7r     FIFO                0,8      0t0    6593850 pipe
python    1234   myuser    8w     FIFO                0,8      0t0    6593850 pipe
python    1234   myuser    9r     FIFO                0,8      0t0    6593916 pipe
python    1234   myuser   10u     IPv4            6593855      0t0        TCP ip-10-142-126-212.ec2.internal:33589->ip-10-112-43-181.ec2.internal:amqp (ESTABLISHED)

You can see "5" corresponds to a pipe, which at least in theory ends up with a TCP connection to the amqp port on another EC2 server (host names are fictional).

RabbitMQ was operating properly and not reporting any errors, so our attention turned to the Celery task in question. Upon further examination, an object we were passing to the task included a long XML string as an attribute, which was being pickled and passed to RabbitMQ. Issues have been reported with long argument sizes in Celery before, and while it appears they should be supported, an easy workaround for us (and Celery's recommended approach) was to pass an ID for this object rather than the object itself, greatly reducing the size of the task's arguments and avoiding the risk of overwriting any object attributes.
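To see why passing an ID helps, here's a toy comparison of the pickled payload sizes. The field names below are invented and no Celery or RabbitMQ is involved; we just compare the size of the serialized task arguments:

```python
import pickle

# What we were effectively sending: the object, long XML attribute and all.
# (Field names and sizes are made up for illustration.)
task_args = {'pk': 89273, 'xml': '<responses>' + 'x' * 100000 + '</responses>'}
whole_object = pickle.dumps(task_args)

# The workaround: send only the ID and re-fetch the object inside the task.
just_the_id = pickle.dumps({'pk': 89273})

print(len(whole_object), len(just_the_id))
```

The task then does one extra database lookup by primary key, in exchange for a message that is a few dozen bytes instead of tens of kilobytes.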

While there may have been other ways to fix the underlying issue, strace and lsof were crucial in helping us figure out the problem. One might be able to accomplish the same thing with a lot of logging, but if your code is stuck in a system call and doesn't appear to be showing any noticeable CPU usage in top, strace can take you immediately to the root of the problem.

Josh JohnsonWhat does it mean to be a “python shop”?

Python developers: Do you call your team or business a “python shop”? If so, what do you mean? If not, why not?

Caktus GroupShipIt Day 4: SaltStack, Front-end Exploration, and Django Core

Last week everyone at Caktus stepped away from client work for a day and a half to focus on learning and experimenting. This was our fourth ShipIt day at Caktus, our first being almost exactly a year ago. Each time we all learn a ton, not only by diving head first into something new, but also by hearing the experiences of everyone else on the team.

DevOps: Provisioning with SaltStack & LXC+Vagrant

We have found SaltStack to be useful in provisioning servers. It is a Python-based tool for spinning up and configuring servers with all of the services that are needed to run your web application. This work is a natural progression from the previous work that we have done at Caktus in deploying code in a consistent way to servers. SaltStack really shines with larger, more complicated systems designed for high availability (HA) and scaling, where each service runs on its own server. Salt will make sure that the setup is reproducible.

This is often important while testing ideas for HA and scaling. The typical cycle looks like:

  • Work on states in Salt

  • Run Salt through Fabric, building the whole system locally through Vagrant, on a cloud provider, or on physical hardware.

  • Pound on the system using benchmarking tools, narrowing in on bottlenecks and single points of failure.

  • Start fixing the problems you uncovered in your states, starting the cycle over again.

Victor, Vinod, David, and Dan all worked on learning more about SaltStack through scratching different itches they’ve felt while working on client projects. Some of the particular issues folks looked at included understanding the differences between running Salt with and without a master, how to keep passwords safe on the server while sharing them internally on the development team, and setting up Halite, a new web interface for Salt.

In order to test these complicated server configurations, we often rely on Vagrant to run these full system configurations on developers’ laptops. This is the quickest way to start building out systems. The problem with this, though, is that our laptops are not as fast as the hardware that the services will eventually be provisioned on. In order to reduce the time waiting in the code-rebuild-test cycle, Scott, our friendly system administrator, delved into running Vagrant with LXC containers on the backend on our development laptops. LXC containers are more lightweight than VirtualBox virtual machines and can be created more quickly on the fly. This involved learning how to upgrade the development laptops’ kernels on the Ubuntu long term support image. Scott was successful, and there was a response of “Oooh” and “Ahhh” from the developers when he demoed the speed of creating a new VM with LXC through Vagrant.

Front-end Web + Visualizations: Realtime Web, Cellular Automata, Javascript MVC, & SVG Splines

Caktus has a growing team of front-end developers and folks interested in user interaction. There were a number of projects this ShipIt day that explored different tools for designing visualizations and user experiences.

Mark dove into WebRTC, building a new project, rtc-demo, with the actual demo hosted on Github static pages (note: it only works on recent Firefox builds so far). It’s neat that this is hosted on a static web assets host, since the project does allow users to interact with one another. Red & Blue users go to the static site, create links that they share with one another, and connect directly through their browsers without any server in between. Both users see a tic-tac-toe board and can alternate taking turns. The moves are relayed directly to their opponent. This exploration allowed Mark to evaluate some new technologies and their support in different browsers. He was able to play around with new modes of interaction on the web, where the traditional server-client model can be challenged and hybrid peer-to-peer apps can be built.

Caleb has been taking a class on programming in C outside of work and wanted to continue to work on one of his personal projects during the ShipIt day. There’s something about pointer arithmetic and syntax that Caleb finds fascinating. Caleb extended some of the features of his cellular automata simulation engine, gameoflife. This included some of his rule files, initial states, and rule file engine. The result was experiments with the Wireworld simulation, including a working XOR gate using simple cellular automata rules, and Brian’s Brain, which produces a visually interesting chaotic simulation. Caleb’s demo was a crowd pleaser, with everyone enthralled by the puffers, spaceships, and gliders moving around and interacting on the screen.

Rebecca made some amazing usability improvements to our own django-timepiece. Timepiece is part of the suite of tools that we use internally for managing contracts, hours, timesheets, and schedules. Rebecca focused on some of the timesheet-related workflows, including verifying timesheets. She made these processes more fluid by building an API and making calls to it using Backbone. This allowed users to do things like delete or edit entries on their timesheet before verifying it, without leaving the page.

Wray also decided to work on a front-end project. His project focused on building SVGs on the fly in raw Javascript without any supporting libraries. Wray started by showing us his passion for mathematically defining curves and how useful the different definitions are from a user experience point of view as a designer working in a vector drawing program. The easier it is for the designer to get from what they meant to draw to what is represented on the screen, the better the resulting design will be. He showed a particular interest in the Spiro curve definition. Wray dug into this further by looking at the definitions of curves supported by the SVG standard and built an interactive tool for drawing curves by editing the string defining the SVG element on the fly. The resulting project is still experimental at this point, but is an interesting exploration into what can be done with SVGs in Javascript without any supporting libraries.

Django Core

Karen and Tobias waded into some internals of the bleeding edge of Django. In particular, Tobias offered up comments on a pull request related to ticket #21271. Tobias gave feedback to the original pull request author and Django core committer Tim Graham on the difference between instance and class variables and when it’s appropriate to use class variables in Python.

Karen gave an illuminating talk on transitioning a particular client project to Django 1.6 from Django 1.5. Her slides are available to view online. She generalized the testing, debugging, and eventual fixing strategies she used while discussing the particular problems she encountered on the project. The first strategy Karen used was running manage.py check to check for settings that may have changed. This is a new management command added in 1.6 and can be used from now on to ease upgrading between versions.

She found the following issues when upgrading:

  • UNUSABLE_PASSWORD was part of an internal API that changed. We used this internal API in the project, as sometimes you must, but special care must be taken when upgrading code that relies on unstable APIs.

  • MAXIMUM_PASSWORD_LENGTH was used on this project after the recent Django security fix. Based on subsequent security updates, this is no longer needed and can be removed in 1.6.

  • BooleanFields now do not automatically default to False. This was changed to encourage coders to be explicit and to increase the level of standards within the code.

  • Django debug toolbar broke in 1.6! This is a shame, but will hopefully be updated by the time 1.6 comes out or soon after.

After this, Karen ran the project’s test suite and worked through the remaining changes it uncovered.

Karen urged us to check out the Django 1.6 release notes and remember where you’re using unsupported internals and to check and write tests for that code particularly carefully. Also, she encouraged everyone to run python -Wall manage.py test to help expose more deprecation warnings and bugs.

Estimation Process Review

Ben and Daryl, our Project Managers at Caktus, worked up a full proposal and project schedule for an internal process of reworking our estimation process. We don’t shy away from large fixed bid projects and pride ourselves on meticulous estimates informed by careful client communication and requirements gathering. Our PMs wanted to help us look at this process and make it more formal as we grow to a larger group with lots of people leading the same process for different incoming projects.

Wrap Up

We had a great time with our latest ShipIt day. Each time we learn a ton from each other and by digging into some tough technical problems. We also learn a little bit more about what makes a good ShipIt day project. There were a number of crowd pleasers in this batch of projects, and while some folks decided to try a completely new library or technology, others decided to stay a bit closer to home by extending learning they did not have time to look into more during a client project. We had a lot of fun and are looking forward to our next ShipIt day.

Josh JohnsonStringList – Is It A String? Is It A List? Depends On How You Use It!

At my job, I had an API problem: I had an object property that could contain one or more strings. I assumed it would be a list; some of my users felt that ['value',] was harder to write than 'value'. I found myself making the same mistake. So, I solved the problem, then I took the rough class I wrote and polished it. It’s up on my github, at https://github.com/jjmojojjmojo/stringlist

So what is it?

It’s a class that tries to make a slick API when there can be one or many values for a property. The class can be instantiated with either a single value or many. If you use many, it will act like a list. If you use a single value (a string), it will act like a string. If it’s a string and you use the .append() method, it becomes a list. Bam.

How would I use it?

Like so (see the README for a more detailed example, and the tests for comprehensive usage):

class Thing(object):
    """Arbitrary class representing a single thing with one or more roles."""

    roles = StringList()

    def __init__(self, roles=None):
        self.roles = roles

# initialize with a single string value
obj = Thing('single role')

# initialize with multiple string values
obj = Thing(('one', 'two', 'three'))

# set to a single string value
obj.roles = 'new role'

# set to many string values
obj.roles = 'primary', 'secondary'

# when it's a string, it works like a string
obj = Thing('role1')

obj.roles = obj.roles.upper()

# convert to a list using .append()
obj.roles.append('role2')

# now it's a list
obj.roles[1] == 'role2'


  • The module sports 100% test coverage.
  • There is a buildout in the repo. Just run python bootstrap.py; bin/buildout and then you can run the tests with bin/nosetests
  • It was originally developed to avoid ['h', 'e', 'l', 'l', 'o'] mistakes when a single string was used instead of a list.
  • This is the first time I’ve used descriptors in Python. Very cool.
  • It would not have been possible without this marvelous post about Python’s magic methods (big thanks to Rafe Kettler).
  • Some things were not as straightforward as they could have been, given Python’s string implementation. Of special interest is the implementation of the __delitem__ and __iter__ methods. In both cases, the base str class doesn’t have the method, so instead of doing the sane thing and proxying the call, I had to fall back to re-doing the operation in the method.
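If you’re curious how a descriptor can pull this off, here’s a heavily simplified sketch of the idea. This is not the actual stringlist implementation (in particular, it skips the .append() trick that morphs a string into a list); all names here are made up:

```python
class StringListSketch(object):
    """Descriptor that stores either a single string or a list of strings."""

    def __init__(self, name):
        self.name = '_' + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name, None)

    def __set__(self, obj, value):
        # A lone string (or None) is kept as-is; any other iterable
        # becomes a real list.
        if value is None or isinstance(value, str):
            setattr(obj, self.name, value)
        else:
            setattr(obj, self.name, list(value))

class Thing(object):
    roles = StringListSketch('roles')

    def __init__(self, roles=None):
        self.roles = roles

t = Thing('admin')
print(t.roles.upper())   # string behavior
t.roles = ('admin', 'user')
print(t.roles[1])        # list behavior
```

The real class goes further by returning wrapper objects that proxy the full str and list interfaces, which is where the magic-method gymnastics above come in.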

Tim HopperSublime Text and Markdown

I have largely moved from Textmate to Sublime Text 2 for text editing. Among other reasons, Sublime Text is cross platform, and I use Windows at work and a Mac at home. I have also started writing as much as I can in Markdown.

I intended to write a blog post about using Sublime Text as a tool for writing Markdown. However, the inimitable Federico Viticci, of macstories.net, has already written that post, so I will simply refer you there.

Tim HopperWalmart's Command of Logistics

Steve Sashihara, The Optimization Edge:

What makes Walmart unique is its command of logistics. It continually deconstructs its entire supply chain, from supplier to distribution centers to customers, and treats each link as a decision point, asking a battery of microquestions: Where and how much to buy and at what price? Where to route goods? How to resupply and reorder?

Caktus GroupUNC, Duke Team up with Carrboro-based Caktus Group on HIV Gaming App

The following is a press release posted in partnership with our team at UNC  and Duke.

The web application development company Caktus Group has teamed up with researchers at the UNC Institute for Global Health & Infectious Diseases and the Duke Global Health Institute to develop a mobile phone app that may help patients better adhere to their medication regimens.

The new venture is funded by a $150,000 Phase I Small Business Innovation Research (SBIR) grant from the National Institutes of Health. The study team will develop a novel mobile phone game app with the goal of improving medication adherence among HIV-infected young black men who have sex with men. In the United States, this is the demographic with the highest number of new HIV infections. 

The Daily Dose app will utilize game mechanics and social networking features to improve adherence to HIV medication. In addition to scheduled medication reminders and adherence tracking, Daily Dose will use gaming features to create a compelling and engaging experience that will motivate and support behavior change. The app will also promote social interaction, which will allow users to share their successes and encourage others to maintain their adherence goals. Gamers will be awarded points for sharing their own successes and providing support and encouragement to their fellow gamers. The study team hopes this anonymous social network of peer support will drive patients to maintain their medication schedules. 

While Caktus Group leads the development of the game, the usability studies will be conducted by Lisa Hightow-Weidman, MD, associate professor of medicine at UNC, and Sara LeGrand, PhD, assistant research professor at the Duke Global Health Institute. The researchers will organize focus groups and conduct a series of in-person interviews with potential users to get feedback on the app’s design throughout the development process.

“Because the game will be designed, developed and refined based on consistent user feedback, we are confident that Daily Dose will be both engaging and effective,” said Tobias McNulty, principal investigator and managing member at Caktus Group.

Ultimately, Daily Dose aims to improve drug adherence among a population disproportionately affected by HIV by developing an effective antiretroviral therapy adherence app tailored specifically for this group. Recent research has shown that treating HIV makes people less contagious and drastically reduces the spread of the virus to sexual partners. Through careful design and development, this new project hopes to improve patient outcomes and reduce the spread of HIV among a vulnerable population.

About Caktus Group

Caktus Consulting Group, LLC is a growing team of creative developers and designers based in Carrboro, North Carolina. The company was founded in August 2007 to serve the web needs of startups, researchers, health care organizations, and established businesses in the North Carolina Triangle region and beyond. Caktus Group’s specialty is creating web and mobile applications using Django, an open source web framework that is business friendly and easily customized. By listening carefully, Caktus Group develops products that clients consistently say are intuitive and fill needs they had not yet discovered.

About the UNC Institute for Global Health & Infectious Diseases 

Founded in 2007, the Institute for Global Health & Infectious Diseases at UNC  harnesses the resources of the University and its partners to solve local and global health problems, reduce the burden of disease, and inspire and train the next generation of leaders in global health.  

About the Duke Global Health Institute

The Duke Global Health Institute, established in 2006, brings knowledge from every corner of Duke University to bear on the most important global health issues of our time. DGHI was established as a University-wide institute to coordinate, support, and implement Duke’s interdisciplinary research, education, and service activities related to global health. DGHI is committed to developing and employing new models of education and research that engage international partners and find innovative solutions to global health challenges.

Caktus GroupSkipping Test DB Creation

We are always looking for ways to make our tests run faster. That means writing tests which don't perform I/O (DB reads/writes, disk reads/writes) when possible. Django has a collection of TestCase subclasses for different use cases. The common TestCase handles fixture loading and the creation of the TestClient. It uses database transactions to ensure that the database state is reset for every test. That is, it wraps each test in a transaction and rolls it back once the test is over. Any transaction management inside the test becomes a no-op. Since TestCase overrides the transaction facilities, if you need to test the transactional behavior of a piece of code you can instead use TransactionTestCase. TransactionTestCase resets the database after the test runs by truncating all tables, which is much slower than rolling back the transaction, particularly if you have a large number of tables.

There is also SimpleTestCase, which is the base class for the previous two classes. It has some additional assertions for testing HTML and for overriding Django settings, but doesn't manage the database state. If you are testing something that doesn't need to interact with the database, such as form field/widget output or utility code, or if you have mocked all of the database interactions, you can save the overhead of the transaction by using SimpleTestCase.

Now what if you are running a set of tests which are only using SimpleTestCase or the base unittest.TestCase? Then you don't really need the test database creation at all. Depending on the backend you are using, the number of tables you have, and the number of tests you are running, the database creation can take many times longer than running the tests themselves.

Our solution for this was to extend the default test runner. A quick examination of the built-in test runner reveals a solution.

def run_tests(self, test_labels, extra_tests=None, **kwargs):
    """
    Run the unit tests for all the test labels in the provided list.

    Test labels should be dotted Python paths to test modules, test
    classes, or test methods.

    A list of 'extra' tests may also be provided; these tests
    will be added to the test suite.

    Returns the number of tests that failed.
    """
    suite = self.build_suite(test_labels, extra_tests)
    old_config = self.setup_databases()
    result = self.run_suite(suite)
    self.teardown_databases(old_config)
    return self.suite_result(suite, result)

The test suite is discovered before the test db is created. That means we can look at the set of tests which are going to be run and if none of them are using TransactionTestCase (TestCase is a subclass of TransactionTestCase) then we can skip the database creation/teardown. Here's what that looks like:

from django.test import TransactionTestCase
try:
    from django.test.runner import DiscoverRunner as BaseRunner
except ImportError:
    # Django < 1.6 fallback
    from django.test.simple import DjangoTestSuiteRunner as BaseRunner

from mock import patch


class NoDatabaseMixin(object):
    """
    Test runner mixin which skips the DB setup/teardown
    when there are no subclasses of TransactionTestCase to improve the speed
    of running the tests.
    """

    def build_suite(self, *args, **kwargs):
        """Check if any of the tests to run subclasses TransactionTestCase."""
        suite = super(NoDatabaseMixin, self).build_suite(*args, **kwargs)
        self._needs_db = any([isinstance(test, TransactionTestCase) for test in suite])
        return suite

    def setup_databases(self, *args, **kwargs):
        """Skip test DB creation if not needed. Ensure that touching the DB raises an error."""
        if self._needs_db:
            return super(NoDatabaseMixin, self).setup_databases(*args, **kwargs)
        if self.verbosity >= 1:
            print 'No DB tests detected. Skipping Test DB creation...'
        self._db_patch = patch('django.db.backends.util.CursorWrapper')
        self._db_mock = self._db_patch.start()
        self._db_mock.side_effect = RuntimeError('No testing the database!')
        return None

    def teardown_databases(self, *args, **kwargs):
        """Remove the cursor patch."""
        if self._needs_db:
            return super(NoDatabaseMixin, self).teardown_databases(*args, **kwargs)
        self._db_patch.stop()
        return None

class FastTestRunner(NoDatabaseMixin, BaseRunner):
    """Actual test runner sub-class to make use of the mixin."""

There are a couple of things to note. First, like the previous temporary MEDIA_ROOT runner, this is written as a mixin so that it can be combined with other test runner improvements. Second, it uses mock to ensure that any attempt to connect to the database will fail. This idea is borrowed from Carl Meyer's Testing and Django talk from PyCon 2012. To make use of this you would need to include this runner on your Python path and change the TEST_RUNNER setting to the full Python path to the FastTestRunner class.
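For reference, the settings wiring looks something like this (the dotted path here is hypothetical; point it at wherever you saved the runner):

```python
# settings.py
# Hypothetical module path -- use the actual location of FastTestRunner.
TEST_RUNNER = 'myproject.test_runners.FastTestRunner'
```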

With this in place, if you had the following tests

# myapp.tests.py
from django.test import TestCase, SimpleTestCase

class DbTestCase(TestCase):
    """Does something with the DB."""

class NoDbTestCase(SimpleTestCase):
    """Does something without the DB."""

in your app called myapp. If you were to run:

python manage.py test myapp

it would create the DB. However, if you run:

# For Django < 1.6
python manage.py test myapp.NoDbTestCase
# For Django 1.6+ with the DiscoverRunner
python manage.py test myapp.tests.NoDbTestCase

it would skip the test DB creation. Hooray for faster tests!

Tim HopperThe Incessant Commentary on Being Tall

Ralph Keyes, The Height of Your Life:

I've heard this sort of thing repeatedly from tall men. It's not the incessant commentary about their height that is so annoying, it's the stupefying boredom of it all. Were anyone to say something original or witty or different in any way, the constant chatter thrown their way might at least be entertaining. But soon after reaching their full height, tall people realize to their horror that the lifetime's commentary to which they've been sentenced comes mostly from those with least to say.

Tim HopperShould I Do a Ph.D.? Laura McLay

Continuing my series, I talked to Laura McLay, a professor at the University of Wisconsin-Madison.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with 5-years of guaranteed funding. He doesn't have any job offers but suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him, or what questions might you suggest he ask himself as he goes about making the decision?

Laura: I would recommend visiting Tough love: An insensitive guide to thriving in your Ph.D. by Chris Chambers as a guide for knowing if you are ready to pursue a Ph.D. I don't have too much to add. If this list doesn't frighten you much, then I highly recommend relocating to attend a top Ph.D. program in your field, such as the Department of Industrial and Systems Engineering at the University of Wisconsin-Madison.

It's important to think about how a Ph.D. fits in with other life decisions. I definitely felt like it would be hard to go back to graduate school if I started another career. And if I did go back, I was afraid that a two-body problem would mean that it would be easiest to go to the local Ph.D. program rather than move to go to a top Ph.D. program. Relationships need a lot of compromise and mutual sacrifice to work, and a Ph.D. doesn't always fit in nicely. I decided to go straight into graduate school so that life didn't get in the way later on. That's not the right decision for everyone, but it might be harder to go back than you may think.

Many people go and get Ph.D.'s later in their career as part-time students if their employers pay for them. This is a nice perk, and it is rare. This experience will be very different from that of the 22-year old student jumping straight into a Ph.D. program. The part-time career student may just be interested in getting a Ph.D. to qualify for moving up the corporate ladder. The full-time student should be more interested in building a set of skills and expertise to challenge important problems in their field for life. I believe that the motivation to get a Ph.D. should be more than just getting the diploma.

And it's worth saying that five years of guaranteed funding is a sweet deal. Think about it.

Our hypothetical student specifically aspires to be an academic and sees a Ph.D. as essential to getting there. Do you have any words of encouragement or caution about that goal?

Laura: It's a great career, but it's not for everyone. Graduate students should have plenty of time to discover if academia is right for them. I am introverted and was painfully shy at age 22. I did not enter graduate school with the goal of becoming an academic. Luckily, I was converted along the way.

Do you have any thoughts on going from undergrad into a Ph.D. program versus first enrolling in a masters program?

Laura: There are many more funding opportunities for Ph.D. students than for Masters programs. If you're on the fence, apply to Ph.D. programs.

Let me be clear: a Ph.D. is not a Masters degree plus a little more coursework and a small project. This is a sketch of what my colleague Jason Merrick uses to explain the concept of a Ph.D. to prospective Ph.D. students. The Ph.D. student first sees the "hill" of coursework, which seems like a lot of work. From where the Ph.D. student stands when he/she starts a Ph.D. program, the second, bigger "hill" of research is not visible. But it is there. "All but dissertation" (ABD) is not very close to finishing. If the idea of tackling that hill of research is daunting, maybe consider a Masters instead of a Ph.D.

Admission and funding aside, what should a potential Ph.D. student look for in a graduate program?

Laura: All graduate students should look for top programs that have friendly faculty who have interesting blogs and/or engage in human pyramids. There should also be several nearby lakes for kayaking on days off, bike trails everywhere, the best union you've ever seen, beer aplenty (and better yet - the birthplace of kegball), and a mascot who is a lovable woodland creature.1

I have a slideshare presentation on applying to Ph.D. programs that has more tips and advice.

What should a Ph.D. student look for in an advisor?

Laura: Choosing an advisor is a two-way street. One thing prospective students may not know is that no faculty member has to oversee their dissertation research. It's important to be polite, diligent, and responsible with faculty. Even in a big program, there may be only a couple of faculty members whose interests match yours. You will not finish if one of them isn't on your side. Chris Chambers put it well: "Above all, remember that you and your supervisor are in this together. Those three years can be an energizing, productive, and career-making partnership. But they can also be a frustrating waste of time and energy. If you want your supervisor to go above and beyond for you, then lead by example and work your butt off."

Laura McLay holds a Ph.D. in industrial engineering from the University of Illinois at Urbana-Champaign. She is Associate Professor of Industrial & Systems Engineering at the University of Wisconsin-Madison. She blogs at Punk Rock Operations Research and is active on Twitter.

  1. If Tim left in this shameless plug for my department, I will be extremely grateful! 

Caktus GroupCentral logging in Django with Graylog2 and graypy

Django's logging configuration facilities, which arrived in version 1.3, have greatly eased (and standardized) the process of configuring logging for Django projects. When building complex and interactive web applications at Caktus, we've found that detailed (and properly configured!) logs are key to successful and efficient debugging. Another step in that process—which can be particularly useful in environments where you have multiple web servers—is setting up a centralized logging server to receive all your logs and make them available through an easily accessible web interface. There are a number of useful tools to do this, but one we've found that works quite well is Graylog2. Installing and configuring Graylog2 is outside the scope of this post, but there are plenty of tutorials on how to do so accessible through your search engine of choice.

Once you have it set up, getting logs flowing to Graylog2 from Django is relatively straightforward. First, grab a copy of the graypy package from PyPI and add it to your requirements file:

pip install -U graypy

Next, add the following configuration inside the LOGGING['handlers'] dictionary in your settings.py, where graylog2.example.com is the hostname of your Graylog2 server:

    # ...
    'handlers': {
        # ...
        'gelf': {
            'class': 'graypy.GELFHandler',
            'host': 'graylog2.example.com',
            'port': 12201,
        },
        # ...
    },
    # ...

You'll most likely want to tell your project's top-level logger to send logs to the new gelf handler as well, like so:

    # ...
    'loggers': {
        # ...
        'projectname': {
            # mail_admins will only accept ERROR and higher
            'handlers': ['mail_admins', 'gelf'],
            'level': 'DEBUG',
        },
        # ...
    },
    # ...

With this configuration in place, log messages with a severity of DEBUG or greater that are sent to the projectname logger should begin flowing to Graylog2. You can easily test this by opening Django's python manage.py shell, grabbing the logger manually, and sending a log message:

import logging
logger = logging.getLogger('projectname')
logger.debug('testing message to graylog2')

You should see the message show up in Graylog2 almost immediately.

Now, this is all well and good, but if you want to use your Graylog2 server for multiple projects, you'll quickly find that all the log messages are interspersed and it can be difficult to tell which messages are coming from which projects. To address this issue, Graylog2 supports the concept of "streams," that is, filters that you can set up (which work only on incoming messages, not existing messages) to show only messages that match certain criteria. A simple solution here could be to filter on the hostname of the originating web servers, but this may not scale well in environments like Amazon Web Services' EC2 where you're often adding or removing web servers. As a better alternative, you can add metadata to log messages at the Python level prior to sending them to Graylog2 that will help you more easily identify the messages for different projects.

To do this, you need to use a feature of Python logging: filters. While filters are most commonly used to filter out certain types of messages from being emitted altogether (as discussed in the Django documentation), they can also be used to modify log records in transit and impart contextual metadata to be transmitted with the original message. To add this to our logging configuration, first create the following filter class in a Python module accessible from your project:

import logging

class StaticFieldFilter(logging.Filter):
    """
    Python logging filter that adds the given static contextual information
    in the ``fields`` dictionary to all logging records.
    """
    def __init__(self, fields):
        self.static_fields = fields

    def filter(self, record):
        for k, v in self.static_fields.items():
            setattr(record, k, v)
        return True
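To convince yourself the filter behaves as described, you can exercise it outside of Django entirely (the class is repeated here only so the sketch runs standalone):

```python
import logging

# Repeated from above so this sketch is self-contained.
class StaticFieldFilter(logging.Filter):
    def __init__(self, fields):
        self.static_fields = fields

    def filter(self, record):
        for k, v in self.static_fields.items():
            setattr(record, k, v)
        return True

# Build a bare LogRecord and run it through the filter; the static
# fields end up as attributes on the record itself.
record = logging.LogRecord('projectname', logging.INFO, __file__, 1,
                           'testing message', None, None)
f = StaticFieldFilter({'project': 'projectname', 'environment': 'staging'})
f.filter(record)
# record.project == 'projectname'; record.environment == 'staging'
```

Handlers like graypy's then pick up those extra record attributes and ship them along with the message.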

Next, we need to load this filter in our logging configuration and tell the gelf handler to pass records through it:

    # ...
    'filters': {
        # ...
        'static_fields': {
            '()': 'projectname.core.logfilters.StaticFieldFilter',
            'fields': {
                'project': 'projectname', # CHANGEME
                'environment': 'staging', # can be overridden in local_settings.py
            },
        },
    },
    'handlers': {
        # ...
        'gelf': {
            'class': 'graypy.GELFHandler',
            'host': 'graylog2.example.com',
            'port': 12201,
            'filters': ['static_fields'],
        },
        # ...
    },
    # ...

The configuration under filters instantiates the StaticFieldFilter class and passes in the static fields that we want to attach to all of our log records. In this case, two fields are attached: a 'project' field with value 'projectname' and an 'environment' field with value 'staging'. The configuration for the gelf handler is the same as before, with the addition of the static_fields filter.

With these two items in place, you should be able to create streams via the Graylog2 web interface to trap and display records that match the combination of project and environment names that you're looking for.

Lastly, as an optional addition to this logging configuration, it may be desirable to filter out Django request objects from being sent to Graylog2. The request is added to log messages created by Django's exception handler and may contain sensitive information or in some cases may not be capable of being pickled (which is necessary to encode and send it with the log message). You can remove them from log messages with the following filter:

class RequestFilter(logging.Filter):
    """
    Python logging filter that removes the (non-picklable) Django ``request``
    object from the logging record.
    """
    def filter(self, record):
        if hasattr(record, 'request'):
            del record.request
        return True

and this corresponding filter configuration:

    # ...
    'filters': {
        # ...
        'django_exc': {
            '()': 'projectname.core.logfilters.RequestFilter',
        },
    },
    'handlers': {
        # ...
        'gelf': {
            'class': 'graypy.GELFHandler',
            'host': 'graylog2.example.com',
            'port': 12201,
            'filters': ['static_fields', 'django_exc'],
        },
        # ...
    },
    # ...
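A quick standalone check confirms the RequestFilter strips the request before the record is shipped (a plain object stands in for Django's HttpRequest, and the class is repeated so the sketch runs on its own):

```python
import logging

# Repeated from above so this sketch is self-contained.
class RequestFilter(logging.Filter):
    def filter(self, record):
        if hasattr(record, 'request'):
            del record.request
        return True

record = logging.LogRecord('projectname', logging.ERROR, __file__, 1,
                           'Internal Server Error', None, None)
record.request = object()  # stand-in for the unpicklable request object
RequestFilter().filter(record)
# The record no longer carries a `request` attribute and can be pickled.
```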

With this configuration in place, you can have log messages flowing to Graylog2 from any number of project and server environment combinations, limited only by the resources of the log server itself.

Tim HopperShould I Do a Ph.D.? Paul Harper

This week, I talked to Paul Harper, professor of math and O.R. at Cardiff University. Paul brings perspective from inside the British university system, where Ph.D. programs are typically shorter (and involve less coursework).

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with 5-years of guaranteed funding. He doesn't have any job offers but suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him, or what questions might you suggest he ask himself as he goes about making the decision?

Paul: Put simply, students should first and foremost ask themselves if they actually need a Ph.D. If they aspire to an academic career, then almost certainly a Ph.D. will be required. An initial word of warning though: for many straight out of their first degree, what an academic career actually involves, or indeed how to climb onto the academic ladder, is often not fully understood. This isn’t surprising; after all, most undergraduate students only see their lecturers in the lecture room or in tutorials, and typically don’t appreciate the wide range of responsibilities and pressures they have across teaching, research and administration. My advice to those curious about or aspiring to an academic post (and thus the Ph.D.) is, just as one would seek career guidance for any other job in industry, to seek advice from a variety of sources: chat to a range of academic staff, research the prospects of employment in your field (see below), and grab any opportunity to work on research with potential supervisors to ensure it is something that you’re both passionate about and intellectually able to fulfill (there might well be paid summer internships as an undergraduate, or just offer to work for free if you’re that serious!). There is a reality check though: permanent academic posts are increasingly hard to come by. Typically (in the UK at least), after a Ph.D. you would still be required/expected to have completed a 2-3 year post-doctoral position before even being considered for a lectureship. In reality you might even end up on multiple post-docs before an opportunity arises. It’s tough and most certainly will require you to be willing to move locations in your quest for that first step on the ladder. That said, if deep down you know that this is your true desire, then coupled with the right attitude and willingness to put in the effort for potential reward, go for it!

Based on my own experiences of chatting to Ph.D. applicants, the majority don’t actually know what they want to do by way of career and wish to keep open the possibility of an academic career as well as a stepping-stone in to industry. In this instance the need for the Ph.D. isn’t so clear. There are some industrial jobs that do require or at least highly desire applicants to have a Ph.D., in which case the Ph.D. will again be necessary. Clearly the student should research what qualifications are required for the different careers they are considering. I am also aware of industrial jobs where those entering with a Ph.D. are fast-tracked to more senior posts with overall better prospects, hence the benefits of the Ph.D. might begin to outweigh the negatives such as 5 years of lost salary and industrial experience.

Returning to the case at hand, this student suspects he could get a decent job as a software developer, and so it seems the Ph.D. isn’t required for the job. Here the student should ask himself if he wants to get a Ph.D. (as distinct from an actual need to get it). This is more complex with multiple factors to consider. Going straight into industry should provide 5 years of good income, experience and potential to rise up the career ladder. Staying for the Ph.D. will provide 5 years of much lower income and typically a completely different life and way of working to those in industry (I worked myself in industry for 2 years as a Management Consultant before returning to University for a Ph.D., so I have some first-hand experience of the differences). Ph.D.'s require much more time working independently and in isolation, thus for sure you need to be able to motivate and organise yourself and be able to bounce back from the inevitable lows. It will be tough work, long hours etc., although that can be true of course in industry (as a management consultant I’d spend silly hours at work too). On the plus side, the highs of achievement (a breakthrough in your research) are hard to eclipse, and provide enormous amounts of personal satisfaction. For me at least, you can’t put a price on those moments, and I never personally got anything like this level of satisfaction working for those 2 years in industry. Perhaps the best way to summarise the life of a Ph.D. student is to look at the awesome Ph.D. comics (phdcomics.com) by Jorge Cham, which are spot on!

What are your thoughts on getting a masters first versus going straight to a Ph.D. in a British university? Would you differentiate that advice for an American-style Ph.D.?

Paul: I would suggest that whether to obtain a masters first versus going straight to the Ph.D. largely depends on the subject area. For instance, I obtained my masters first (and moved University to do this) because Operational Research (OR), at least in the UK, is mostly taught at masters level. (I only studied one OR module in the final year of my undergraduate degree, which sparked my interest.) Masters courses can therefore offer intensive training (in the UK they are usually 1-year full-time courses) in a particular field that can then be a stepping-stone to the Ph.D., helping you decide which aspect of the field to focus on for the Ph.D. As Director of MSc programmes at Cardiff University (OR and Applied Statistics), I get many students asking for my advice on the masters versus industry, or the masters versus going straight to the Ph.D. My advice is that for those not certain about the Ph.D. (as discussed above), the masters makes more sense as it will improve prospects for both jobs in industry and Ph.D. scholarships. For example, the majority of big employers in OR in the UK tend to recruit first from the MSc (there are exceptions, and good students with exceptional first degrees are highly employable without the MSc as well).

For those students wishing to do a Ph.D. who have high grades, my advice usually differs: the masters might not be the best investment of their time (and potentially money). An exceptional student could start the Ph.D. and, for instance, sit in on masters modules as required or attend national training events. Here the differences between the UK Ph.D. and elsewhere become more important. In the UK, a Ph.D. usually lasts for 3.5 years (and certainly not 5 as in the case at hand). Increasingly, first-year Ph.D. students are required to take taught courses (these could be internal modules or part of the national taught course schemes we have in the UK). When I was undertaking my Ph.D. (1998-2001) there were no such programmes or expectations, hence the time to completion was less, typically around 3 years. Introducing the requirement to sit in on taught courses is a good move in my view, as it allows the Ph.D. student to widen their knowledge (they may well be asked to study something not directly related to their Ph.D., for instance). So for the US system, with more formal training in the first 2 years of the Ph.D., my advice to go straight to the Ph.D. rather than the masters would probably be strengthened. In the case where the student was absolutely sure they needed or wanted the Ph.D., the masters would seem somewhat unnecessary.

I haven’t yet mentioned scholarships/bursaries, and of course these may play a large part in the decision-making process too. Typically, funds for Ph.D.'s are much harder to obtain, hence doing the masters first (where possibly more scholarships are available) might at some Universities improve one’s chances of subsequent funding for the Ph.D. Some Universities in the UK are also nowadays offering funded places for a 1+3 scheme (Masters + Ph.D.), so the bundle of funding covers you for both. This is increasingly the preferred option of some of the major UK funding councils, so in future going straight to the Ph.D. might in fact not be an option at all; instead it will be necessary to complete the Masters first.

Would you advise an American student to do a Ph.D. at a British university? Vice versa?

Paul: I suppose the most immediate difference between the US and UK Ph.D.'s is the duration (3 to 4 years compared to more like 5), largely resulting from the necessity of the taught programme components (as discussed above). In practice I imagine the decision would be largely driven by ensuring you have the right supervisor as an expert in their field (whether they happen to be based at a UK or US University), financial considerations (scholarships, cost of living etc.), and how much weight one places on the 5 years of a Ph.D. in the US (including the benefits of the taught component) compared to the shorter time commitment in the UK.

Admission and funding aside, what should a potential Ph.D. student look for in a graduate program?

Paul: First and foremost, ensure you find a supervisor/advisor with similar research interests willing to take you on; after all, they’ll be a big part of your life for the next several years, and it is very important that your research interests mesh! Consider the reputation of the department, and read publications to get a better feel for a potential supervisor’s interests and track-record. Ask what types of training programmes are available, what teaching/tutorial duties are possible, what conference funding exists, what other activities the research group runs (seminar series etc.) and the destinations of recent graduate students. Consider the number of other graduate students under their supervision, and indeed within the department as a whole, since working within a larger team of fellow grad students can be more supportive, with better chances that others are working on research similar to your own; conversely, if you are one of many grad students being supervised by the same person, you might expect to have to fight more for their time and attention!

A very common misconception is that applicants can simply pick a supervisor of their choice, but this requires mutual consent. So approach potential supervisors with the courtesy this merits, and aim to impress them both with your research ideas and interests and with the right attitude: a willingness to work hard and to learn. One of my favourite quotes (courtesy of a colleague, Dr. Vince Knight, from whom I first heard it) is by Dr Seuss: "It is better to know how to learn than to know." Hold fast to this principle and it will keep you in good stead!

Paul Harper holds a Ph.D. in mathematics from the University of Southampton. He is Professor of Operations Research and Mathematics at Cardiff University in Wales. He is active on Google+.

Og MacielUsing Python to Control Katello

Emacs editor with python code

I usually like to use python to script my day to day tests against Katello (you may have seen some of my previous posts about using the Katello CLI for the same purpose) and I figured I’d start showing some basic examples for anyone else out there who may be interested.

Assuming you have already installed and configured your Katello instance (learn how to do this here) with the default configurations, we now have a few options to proceed:

  1. Write and run your scripts in the same environment as your server.
  2. Install the katello-cli package (pip install katello-cli).
  3. Use git to clone the katello-cli repository (git clone https://github.com/Katello/katello-cli.git) and make sure to include it in your PYTHONPATH.

Option 1 is by far the easiest approach since you should have all the dependencies (namely kerberos and M2Crypto) already installed, but I like Option 3 as it allows me to always have the latest code to play with.
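For Option 3, making the checkout importable can be as simple as prepending it to sys.path (the path below, and the src subdirectory, are assumptions; check the actual layout of your clone):

```python
import sys

# Hypothetical checkout location -- adjust to wherever you cloned katello-cli.
CHECKOUT = '/path/to/katello-cli/src'
if CHECKOUT not in sys.path:
    sys.path.insert(0, CHECKOUT)
# After this, imports like `from katello.client import server` should
# resolve from the checkout instead of a system-wide install.
```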

Now we’re ready to write some code! The first thing we’ll do is import some of the Katello modules:

from katello.client import server
from katello.client.server import BasicAuthentication
from katello.client.api.organization import OrganizationAPI
from katello.client.api.system_group import SystemGroupAPI

Next, we establish a connection to the Katello server (qetello01.example.com in my case), using the default credentials of admin/admin:

katello_server = server.KatelloServer(host='qetello01.example.com', path_prefix='/katello/', port=443)
katello_server.set_auth_method(BasicAuthentication(username='admin', password='admin'))

Let's now instantiate the Organization API object and use it to fetch the "ACME_Corporation" organization that gets automatically created for a default installation:

organization_api = OrganizationAPI()
org = organization_api.organization('ACME_Corporation')
print org

{u'apply_info_task_id': None,
u'created_at': u'2013-09-12T20:15:06Z',
u'default_info': {u'distributor': [], u'system': []},
u'deletion_task_id': None,
u'description': u'ACME_Corporation Organization',
u'id': 1,
u'label': u'ACME_Corporation',
u'name': u'ACME_Corporation',
u'owner_auto_attach_all_systems_task_id': None,
u'service_level': None,
u'service_levels': [],
u'updated_at': u'2013-09-12T20:15:06Z'}

Lastly, let's create a brand new organization:

new_org = organization_api.create(name='New Org', label='new-org', description='Created via API')
print new_org

{u'apply_info_task_id': None,
u'created_at': u'2013-09-12T21:48:55Z',
u'default_info': {u'distributor': [], u'system': []},
u'deletion_task_id': None,
u'description': u'Created via API',
u'id': 283,
u'label': u'new-org',
u'name': u'New Org',
u'owner_auto_attach_all_systems_task_id': None,
u'service_level': None,
u'service_levels': [],
u'updated_at': u'2013-09-12T21:48:55Z'}

As you can see, it is pretty straightforward to use python to create some useful scripts to drive a Katello server, whether you want to populate it with a pre-defined set of data (e.g. default users, roles, permissions, organizations, content, etc.) or to test core functionality as I do with Mangonel, my pet project.

Here's a Gist of the code mentioned in this post, and let me know if this was useful to you.

Tim HopperShould I Do a Ph.D.? Melissa Santos

In this interview, I talked with Melissa Santos, a software engineer at Etsy.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. She's academically talented, and she's always enjoyed school and her subject matter. Her acceptance is accompanied with 5-years of guaranteed funding. She doesn't have any job offers but suspects she could get a decent job as a software developer. She's not sure what to do. What advice would you give her, or what questions might you suggest she ask herself as she goes about making the decision?

Melissa: Make sure to get an industry internship every summer to get a view of the real world. Have a plan to get a masters degree midway, and reassess the Ph.D. decision at that point.

Part of not being sure is not having firmed up other options - apply to some of those programming jobs! What do you learn in those interviews? How does that stipend look with a salary offer in hand?

The only reason you HAVE to do a Ph.D. is to become a professor. That is also the aim of the people training you in a Ph.D. program, which can make it hard to be realistic about your goals outside of academia - the entire structure around you will be telling you that everything else is lesser.

Do you have any thoughts on going from undergrad into a Ph.D. program versus first completing a masters?

Melissa: I did a masters first, and I don't regret it - it's part of what has let me have such a wide variety of academic experience. My thought at the time was that it was helping me decide if I liked grad school enough to go on, but it didn't help me appreciate how different a Ph.D. program is from a masters. The jump from coursework to research was a sharp break, and you will want to talk to your potential adviser's other students to learn how much help you'll get in that transition.

What benefit(s) does having a Ph.D. bring to your work in industry? Is a Ph.D. necessary for that kind of work?

Melissa: I work in tech. The Ph.D. might make my resume stand out a bit, but it's not necessary. To some extent, the process of getting the Ph.D. helped me have the mindset of putting together methods and being creative in my approach to problems that I'm not sure I would have with just the masters degrees. Masters degrees gave me toolboxes but the Ph.D. enforced that the tools come from people like me, and I can be part of building them.

Melissa Santos has a Ph.D. in applied math and statistics from the University of Colorado at Denver. She is a Senior Software Engineer at Etsy. You can find her at @ansate.

Joe GregorioInternet of Things

So while everyone was losing their minds over the addition of the word twerk to the Oxford Dictionaries Online, I'm actually more upset that the Internet of Things was added with the wrong definition, or at least an incomplete definition.

a proposed development of the Internet in which everyday objects have network connectivity, allowing them to send and receive data

The first definition of the Internet of Things I heard was from Bruce Sterling, and it went beyond just having network connectivity: it was about lifecycle management, tracking objects from cradle to grave. Now I understand that the ODO tracks popular usage of a term, and that literally everyone is using the term wrong, so the wrong definition gets added to the dictionary (did you see what I did there?). I'm OK with the simpler definition being added to the ODO, but it unfortunately means that we don't have a word for the richer definition of the Internet of Things, which is an important concept that we shouldn't lose sight of.

Tim HopperShould I Do a Ph.D.? Carl Vogel

For my next interview, I talked to Carl Vogel, an economic consultant in New York.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. She's academically talented, and she's always enjoyed school and her subject matter. Her acceptance is accompanied with 5-years of guaranteed funding. She doesn't have any job offers but suspects she could get a decent job as a software developer. She's not sure what to do. What advice would you give her, or what questions might you suggest she ask herself as she goes about making the decision?

Carl: The world is full of miserable grad students. Stressed-out, depressed, uncertain about when or if they'll graduate and what will happen to them when they do. Far more people go into Ph.D. programs than should. There are two main reasons for this, I think. One is that for kids who've only really ever gone to school, and have been successful at it, grad school seems like a natural next step, and a career in academia seems pretty great. The other is that bright, academically talented twenty-two-year-olds just don't know themselves very well; they tend to be overly optimistic about their abilities and their prospects. They've never really known failure or crippling self-doubt, and just can't imagine it as a real possibility.

So I'd suggest the healthiest way to think of grad school is not as a default next step---as "undergrad 2.0"---but to realize that it's a tremendous commitment in terms of time and psychological endurance and lost income. She's going to be giving up a huge chunk of her prime years. And during this time, the positive feedback she's gotten from professors and peers is going to disappear. The cycle of challenges and accomplishments she's been used to is going to be replaced by an intangible but ever expanding nebula of expectations and her every victory will be fleeting, unacknowledged, and Pyrrhic.

The needle on the Ph.D. gauge should start at "No." If she isn't really aware of what jobs she could do in her field, and what those look like in terms of career progression, she should definitely do that research. Software development is fine, especially if it's in a context related to her intellectual interests, and there's a possibility of learning, growing, and doing a variety of interesting work. She shouldn't take a job that doesn't excite her intellectually, unless she's really strapped for the cash or has a pile of school debt. At this age, she's got the chance to take a little financial risk for the chance to learn and get new experiences. There are a ton of interesting problems to work on in industry. She should make a real effort to see if any of those excite her. School isn't the only place to learn and do research.

To move the needle on the Ph.D. gauge to "Yes," I'd propose a 3-part process. Let's call the parts the Reality Check, the Personality Test, and the Skills Checklist.

The Reality Check:

I think a lot of students go into the graduate school decision with some overly rosy misconceptions. Mostly because their experience and advice to date has all been from within their department and university and is rife with selection bias. So it's important to burst a few bubbles for our hypothetical student.

  1. She's not going to be a professor. In almost every field, the odds are just strikingly against getting a full-time, tenure-track position.
  2. Attrition rates are higher than she probably realizes. There's a non-trivial probability that she'll drop out or flunk out before she finishes.
  3. Even if she does finish, it's not going to be in 4 or 5 years. Think seven.
  4. Grad school is not an intellectual salon, where she's going to be discussing the big questions and probing into deep insights about nature and the universe and all. There's some of that, to be sure. But it's in large part tribal initiation (with all the gratuitous nonsense that implies), and no small part straight-up hazing.

It's important to be pessimistic when making this decision. If she imagines some tough, but not-improbable realities, and finds herself flinching, then grad school is probably not the right decision for her right now. For many, the realization that a successful academic career is unlikely is enough to deter them; if they're going to end up in industry anyway, why not start there?

If this bums her out a little, but she's so devoted to her field that she can accept these, she can go ahead and tick the needle towards "Yes" a bit.

The Personality Test:

Successful grad students aren't like normal humans. The following questions should test whether she's got the necessary personality traits.

Can she give one or, preferably, more examples of times working on a research project when she was:

  1. Inquisitive: new research questions (not necessarily original ones) came to her while studying; when learning about a tool or technique she thought of new contexts it might be applied to.
  2. Disciplined: she had set daily/weekly routines for making progress on a project; she persisted in these routines even when she was bored or burnt-out on the project.
  3. Obsessive: she couldn't stop thinking or talking about her work; she couldn't tear herself away from a project without checking every detail, or testing every possible permutation of her model or experiment.
  4. Delusional: she was sure she was going to uncover something new and important with her research.

The Skills Checklist:

Not all of these skills are necessary for someone going into a Ph.D. program---indeed, part of the point of grad school is to obtain these skills. But the more she goes in with, the less pain she's going to feel. If she's answering no to most of these, she'll want to hold off on a Ph.D.

  1. Is she comfortable with at least some analytical software used in her field? R, SAS, Stata, Matlab?
  2. Is she comfortable with at least one programming language like Python, Perl, C/C++, Java, etc.? (R or Matlab count if she's done more than just import data and run built-in functions.)
  3. Does she have an organized and efficient workflow system? See, e.g., Keiran Healy's Choosing Your Workflow Applications.
  4. Is she comfortable with mathematical proofs at the level of a first- or second-semester real analysis course?
  5. Is she comfortable reading through an upper-level undergraduate or lower-level graduate textbook in her field? Does she have experience doing self-study at this level---not for a class or required project, but for her own interest?
  6. Can she read and follow some recent empirical literature in her field?
  7. Is she a competent, conscientious writer? Does she understand how to write in a clear, concise style, using plain English? Does she understand how to structure an argument, and compose clear paragraphs and sentences? Has she consciously tried to improve her writing, either in a class or by reading writing/style guides?

If she's gone through these three tests and the needle has moved most of the way towards "Yes," then I have a whole other slew of advice for choosing a program (more importantly, avoiding bad programs). The most effective of which is to buy beers for some grad students in the department. After two rounds you'll probably know whether or not you want to be in that program.

She's decided to do her Ph.D. Would you recommend she take some time off prior to grad school, or should she jump right in?

Carl: There's a balance to strike here. On the one hand, I find people with a little more experience and maturity, as well as stability in their personal lives cope better with the stresses of a Ph.D. program (and tend to finish faster). On the other hand, the older you get and the more responsibilities you accumulate, the harder it is to bear the costs in time and income required by a Ph.D.

But if she's 22 or 23, and all she's known is high school and undergrad, then yes, I'd definitely suggest doing something else a year or two before her Ph.D. See the Skills Checklist above. Try and find a job that will let her check off some of those boxes. There are lots of them out there. A normal job also confers a lot of meta-skills useful for grad school: working on teams, putting up with assholes, communicating effectively, dealing with hard deadlines, structuring her time and work-flow.

While she's doing that, she should take some night classes, work on side projects, and keep a notebook of research questions she'd like to look into. She should also save up some money and try to get into a stable, monogamous relationship with someone whom she can go to for emotional support during her Ph.D. And she should do some traveling if she can.

Do you have any thoughts on going from undergrad into a Ph.D. program versus first completing a masters?

Carl: In most cases (at least in the U.S.) an M.A. isn't really a meaningful prelude to a PhD. It's a different track. Indeed, I've spoken to professors at programs who've told me they avoid M.A. students---especially their own---in their Ph.D. admissions.

I wouldn't unequivocally say you shouldn't get an M.A.---I have two of them---but I'm a little down on them. I think many departments treat their M.A. programs as cash cows, and the programs don't provide a good return-on-investment.

So, given that, don't pay for an M.A. I didn't, and wouldn't have, paid for either of mine. Get a full-time job, look for a program with a part-time curriculum, and try to get your company to pay for most, or all, of your tuition. Alternatively, some of the better programs will provide financial support or merit scholarships.

Also, pick your courses with an eye toward Ph.D. admissions. Talk to program advisors for advice on this. This typically means spending your electives taking Ph.D.-level courses. Whenever a course has an M.A. and a Ph.D. version, take the latter.

What benefit(s) does having a Ph.D. bring for someone working in industry?

Carl: It really depends on the industry. For some it opens doors, often a lot of them; for others it's absolutely necessary to progress past a certain point. In some cases, it can actually be a negative---you get pegged as an egghead and put in the back office. Avoid those places, even if you don't have a Ph.D.

For the most part it's an easy signal to recruiters, hiring managers, bosses, colleagues, and clients that you've got a certain set of skills (even if you don't really have those skills). Obviously, you may end up using a lot of what you learned in your Ph.D. to do your job, but in my limited experience, someone without a Ph.D., but with enough experience on the job and who is a motivated learner can do pretty much the same work as someone with a Ph.D. Again, depends on the industry.

If you're working in a technical or research-oriented place where there are a lot of Ph.D.s, not having one may nix a lot of options for you, and you may have to fight to prove you can do the same work.

Carl Vogel is an economic consultant at Navigant Economics. He has a masters in economics from the University of Toronto and a masters in statistics from Columbia University. You can find more of him on his website and on Twitter.

Caktus GroupCaktus Participates in DjangoCon 2013

Caktus is happy to be involved in this year’s DjangoCon hosted in Chicago.  We are pumped about the great lineup of speakers and can’t wait to see some of our old friends as well as meet some new folks. Beyond going to see the wonderful talks, Caktus is participating as a sponsor and Tobias McNulty will be speaking on scaling Django web apps. Come stop by our booth or see Tobias’ talk to connect with us.

Tobias’ talk "Scaling Your Write-Heavy Django App: A Case Study" will delve into some Django related scaling challenges we encountered while developing a write-heavy web app that would be deployed to millions of students and parents in the Chicago public school system. Through the lens of these specific problems he will show widely applicable solutions to forecasting web app loads, scaling, and automating the configuration so that it is completely repeatable.

We are proud to support the Django community through our sponsorship and involvement in DjangoCon. We’re all looking forward to the event and hope to see some of you there!

Tim HopperShould I Do a Ph.D.? Eric Jonas

For my third interview on this question, I talked with (soon-to-be Dr.) Eric Jonas. Eric is a Ph.D. candidate in computational neurobiology and, because of a very unique Ph.D. experience, brings a unique perspective to this question.

If you have any feedback on this series, please share it with me. I'm compiling feedback for a wrapup post at the end.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with 5-years of guaranteed funding. He doesn't have any job offers but suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him, or what questions might you suggest he ask himself as he goes about making the decision?

Eric: The first thing to recognize is just how different a PhD program is from being an undergrad:

Time allocation and duration: In undergrad, your life moves in term-sized chunks -- you almost never have projects stretch over multiple terms. Something is expected of you every week. In a PhD, you are often much more at the mercy of your own motivation.

Learning: In undergrad, the focus is on learning as much new information as quickly as possible, and all of the information is well-organized. As hard as it may be to believe sometimes, there has been a real focus on pedagogy by your instructors, textbook authors, and institution. Problem sets give you an instant sense of "getting it right" and often have well-defined answers. Graduate school classes are often quite different -- they are much smaller and often built on top of reading cutting-edge research, with ambiguity, contradiction, and even politics.

School status: Most of us went to the best undergraduate institution we could get into. Undergraduate "school name" carries weight, and people (however unfortunately) pay attention to rankings such as those in US News & World Report. This is much, much less the case in graduate school. What really matters is the advisor you work with, then the department, and only then the school. The smartest undergraduates I knew at MIT scattered into the wind, some going to schools I had only barely heard of, to work with the best set theoreticians, or the best logicians, or the best experimental biologists, in the world.

My best advice to an undergraduate curious about the "experience" of graduate school is: work in a lab while you're an undergrad. Or even, take a year off and work at a low wage as a tech in a lab. Watch how the students work with their advisor. Observe what their lives are like. Figure out what makes someone a happy and successful graduate student.

But above all, to speak directly to your hypothetical, never ever go to graduate school as a "backup" to getting an industry job. I have never seen this work out well. In graduate school for CS, your stipend will be $30k a year and you will work 80 hours a week. Your friends working 80h/week in finance will be making literally 10x that. You say "I don't care about money" but how will you feel when you can't afford to fly home because your parents are sick?

You left your Ph.D. program and then returned. Why did you decide to go back?

Eric: I never intended to "leave" my PhD program, and was actually (unfortunately) a student the entire time I did both startups. I always intended to finish. My work at the first startup, Navia, was much closer to my core PhD research than my role as CEO at the second startup, Prior Knowledge. My thesis was roughly finished around 2010, but various aspects of running a company prohibited me from finishing the "last bits", writing up the results for publication, and assembling the committee. I would not recommend this to many people -- it caused a lot of additional cognitive burden and stress! Things worked out well for me, but I want to stress this was the result of both luck and working with the most brilliant people I have ever met -- not through my own doing!

We also had incredibly supportive investors (Peter Thiel's group, The Founders Fund), who were committed to advancing the long-term state of the art. Prior Knowledge's acquiring company, Salesforce, has also been very supportive in letting me take a little time here and there to finish my degree.

You also started a company while working on your degree. What advice would you give to a current Ph.D. student considering the same thing?

Eric: It depends -- if you're just leaving your PhD to go start a company, that can be fine. I've heard (probably apocryphal) stories of Stanford CS PhD students viewing graduate school as a holding pattern until they do a startup.

If your entrepreneurial ambitions are related to your research, FINISH FIRST. You will have more credibility with investors, and you will most likely have to deal with your school's Technology Licensing Office to resolve IP issues anyway. Patents and scientific publications can interact in many ways, some of which are good and some of which are bad. Talk to other successful graduate student entrepreneurs from your school to figure out the hoops to jump through. Feel free to e-mail me if you're in this boat, I'm happy to talk.

Finally, if you're going to ignore the above advice, check to see if your school has some sort of "all but dissertation" (ABD) status, which can save your advisor and department a great deal of money by reducing your tuition costs.

What benefit(s) does having a Ph.D. bring to your work industry? Is a Ph.D. necessary for that kind of work?

Eric: Beyond the credentialing problem in science (people take you much more seriously if you have a PhD), learning how to execute and direct research is the most important part of a PhD. I plan on living at the interface between science and engineering for the rest of my life, and formal training in that process has been crucial to my success.

A lot of people think of a PhD as being like an undergraduate degree in that you've "learned a lot of material". This is false. You've learned a set of skills to produce knowledge -- to iterate on the scientific process and engage with the scientific community. That's one of the reasons why people basically never have "two" PhDs: once you've learned how to "do" the science, it's supposed to transfer across disciplines.

Eric Jonas is Predictive Scientist at salesforce.com and a Ph.D. student in Computational Neurobiology at MIT. He can be found at ericjonas.com and @stochastician.

Barry Peddycord IIII entered Ludum Dare and came out alive!

I entered Ludum Dare and came out alive!:

I created a game in the Ludum Dare 48-hour game compo! I wrote it all from scratch using #8bitmooc, and I only used the documentation that I put in the help pages, without referring to the NESDev wiki! I found a few bugs in my emulator and my assembler as a result, and had plenty of fun!

Tim HopperShould I Do a Ph.D.? Paul Rubin

For the second interview, I talked to Paul Rubin, Professor Emeritus of Management Science at Michigan State University.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with 5-years of guaranteed funding. He doesn't have any job offers but suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him, or what questions might you suggest he ask himself as he goes about making the decision?


Paul:

  1. Do you want to be a college professor who teaches and does research, do you want to teach college but not do research, or do you want to get a job in industry (including government, non-profits, etc. in that category)? The first path requires a Ph.D. The second path may require a Ph.D. if you want to work at some private colleges, but it may also be possible to get a good teaching job with just a masters degree (which will put you into the salary pool faster). With the third category, it varies. There are quite a few industry jobs that require only a bachelor's or masters degree, while some require a Ph.D.

  2. Do you enjoy (or do you think you will enjoy) research? Getting a Ph.D. is a major commitment of time, a substantial portion of which is quite likely to be devoted to original research. Some institutions, in some fields, allow "expository" dissertations -- such as an annotated history of military purchasing patterns, from Goliath to the 21st century Pentagon -- but in technical fields and/or at reasonably prestigious schools, dissertations are very likely to require original research. Some people enjoy doing research, some hate it, some fall in between. If you do not enjoy doing research, pursuing a Ph.D. will be difficult, unfulfilling and possibly pointless (since you will not want a job with research expectations).

  3. How important is compensation to you versus how interesting the work is, personal prestige, etc.? A Ph.D. is often the ticket to jobs that are more intellectually stimulating than those requiring just a bachelor's degree, with the caveat that "often" does not equal "always". For some people, being able to identify themselves as "John Doe, Ph.D." also carries some value. The compensation side is less clear. Compensation tends to go up with job experience (at least in industry; not as reliably in academe), and a lengthy postgraduate education delays your entry into the workforce. Even if you have an assistantship, you will likely make less money as a student than you would as an employee with a bachelor's degree. Add in educational expenses you will incur and the fact that you will enter the workforce with less accrued history and experience, and possibly in a lower position or at a lower salary (depending on the job) than where you would have been at graduation date if you had gone straight into industry, and you may be better off from a purely economic perspective going straight to work. (On the other hand, you may miss out on a fair bit of fun as a graduate student. Work is, after all, a four letter word.) When I was completing my Ph.D., my intention was to go into industry. In one of my job interviews, I discovered that the employer counted a Ph.D. as the equivalent of something between two and four years of work experience for salary purposes, although it was taking me six years (post-bachelor's) and change.

  4. Are you geeked by the idea of going as far as you can in a particular field? As a high school student, I made up my mind that I would get a Ph.D., not because I had any real sense of comparative job prospects, but simply because I wanted to take my education as far as I could in my chosen discipline (mathematics).

Would you recommend he take some time off prior to grad school, or should he jump right in?

Paul: This varies according to the individual. A Ph.D. program can be a long slog, especially starting from a bachelor's degree, so if you are a bit burned out on school after getting that bachelor's, definitely take some time off and work. Working helps recharge the bank account, and is also a useful motivator. Once you have experienced the 9-to-5 grind, you may see that Ph.D. in a whole new light. (I experienced the wonders of office drone life working summers while I was in college, so I did not need a gap between undergraduate and Ph.D. studies to convince me that anything that delayed having to work for a living was a good idea.)

In some disciplines, having relevant industry experience will help put course work in perspective, will help reinforce the relevance of some theoretical results, and will help to assess the low likelihood of satisfying the assumptions behind other theories. A special case is the desire to teach and do research in a technical field (such as Management Science) in a business school. Some business schools are skeptical of purely technical types. When it comes time to find a faculty position, having "real world business experience" in addition to that technical Ph.D. may well make you more attractive. (Having an MBA would make you still more attractive.) In part this has to do with somewhat fuzzy accreditation standards; in part it stems from a not-unreasonable belief that knowing something about how businesses actually operate will help you in the classroom, even when teaching technical subjects; and in part it reflects the desire for greater staffing flexibility (being able to plug you into a less technical, more business-oriented course if push comes to shove). That last consideration may well exist in other application areas besides business.

If you are not burnt out, your bank account is not drained, and you do not need to acquire "real world credibility", going straight into the Ph.D. program has two major advantages. First, it gets you out the other end and into your career trajectory that much sooner. (See previous comments about the economics of the decision.) Second, you will have developed certain studying skills and habits along the way, and if your record is good enough to get you into a Ph.D. program, those skills and habits are apparently fairly effective. For many people, there is a nontrivial atrophy of their study skills while they are employed, particularly in jobs not strongly related to their primary discipline. Some students have attributed lower than anticipated scores on graduate entrance examinations to a dulling of their test-taking abilities. A seamless transition from undergraduate to graduate studies is in some ways analogous to a distance runner not having to break stride.

Do you have any thoughts on going from undergrad into a Ph.D. program versus first completing a masters?

Paul: In business, an MBA is not strictly an interior point on the line segment connecting bachelor's degree to Ph.D. It covers substantial material outside your discipline, it brings perspectives distinct from those you will have in the Ph.D. program (and very often distinct from those in the undergraduate program), and (as noted above) can be a valuable credential.

In most technical disciplines, on the other hand, a Ph.D. thoroughly subsumes a masters degree. Having both a masters and doctorate in mathematics is no better than having just a doctorate. The primary, if not exclusive, virtue of the intermediate masters degree is to allow you to dip your toes into the graduate education pool. Often, though, a doctoral program has an early exit strategy that allows students who lose interest in the Ph.D. or fail to get past some hurdle (usually comprehensive examinations) to earn a masters with little or no extra course work.

The remaining case is when the masters is in a different, complementary discipline. In my case, I earned bachelors and doctoral degrees in mathematics but a masters degree in probability and statistics. For someone in operations research, even if they work primarily on deterministic models (as in my case), an understanding of statistics is useful when estimating model parameters from actual data, an understanding of hypothesis testing is useful when trying to prove that your algorithm is better than the previous benchmark, and an understanding of probability theory is useful in a variety of contexts. I also considered a masters in computer science -- knowing data structures, database management principles and the essence of how compiled code works have all proven useful -- but ended up filling those gaps just by course work.

Admission and funding aside, what should a potential Ph.D. student look for in a graduate program?

Paul: If you are aiming at an academic career, look at the job placements for recent graduates. Take a look at publication records for faculty (these days, you can usually find that on a web site at the school), and see if they are active in research. If you know you are interested in a particular subdiscipline (integer programming, compiler design, bionic limbs), look specifically for faculty active in that area. Not only will you want courses or seminars of relevance, but you will need a dissertation chair who has some clue about your topic. Finally, if you can get in touch with a few of their students, try to sniff out what the culture of the relevant department is like, and whether the faculty you most likely would prefer as mentors are accessible, pleasant to work with, and willing to take on students. (A somewhat common misconception among applicants is that you can pick your dissertation advisor. It generally requires mutual consent.) Senior students (the kind you want to quiz) often attend professional meetings (for job hunting as well as for the other attractions of the meeting); that might be a good place to try to make new friends.

If you could do it over again, would you make any changes to the way you went about getting a Ph.D.?

Paul: There are small glitches that I would repair if I could -- for instance, my advisor took a sabbatical just as I was finishing my opus, and it sat unread for a full term -- but overall, I have no significant regrets, and there is nothing substantial I would change.

Paul Rubin holds a Ph.D. in mathematics from Michigan State University. He is Professor Emeritus at Michigan State where he started in 1980. He blogs at OR in an OB World and is active on Twitter.

Tim HopperShould I Do a Ph.D.? John D. Cook

To start my series of interviews on the question "Should I do a Ph.D.?", I asked John Cook several questions.

A 22-year old college student has been accepted to a Ph.D. program in a technical field. He's academically talented, and he's always enjoyed school and his subject matter. His acceptance is accompanied with five years of guaranteed funding. He doesn't have any job offers, but he suspects he could get a decent job as a software developer. He's not sure what to do. What advice would you give him?

John: There are basically two reasons to get a Ph.D.: personal satisfaction, and credentials for a job requiring a Ph.D. It's hard to say what the financial return is on a Ph.D. Some say it lowers your earning potential, but that's confounded by the fact that people with Ph.D.'s tend to seek high security rather than high compensation employment. Some say a Ph.D. increases your earning potential, but they often don't account for the work experience you could have gained by working during the time it takes to complete a degree.

Would you recommend he take some time off prior to grad school, or should he jump right in?

John: Whether to go straight into grad school would depend on personal circumstances. It could be good to take some time to evaluate whether you really want to go to grad school or whether you are doing it just because it's a natural continuation of what you've been doing for the past 17 years. If you're sure you want to do grad school, it's probably best to keep going and not lose momentum.

Do you have any thoughts on going from undergrad into a Ph.D. program versus first completing a masters?

John: There's more difference between a Ph.D. and a masters degree than many people realize. A masters program is a fairly direct continuation of an undergraduate degree. It's more specialized and more advanced, but it's still mostly based on class work.

The goal of a Ph.D. program is to produce original research and prepare for a career as a researcher. You won't necessarily even become very knowledgeable of the area you get your degree in. But you will develop the discipline of working on a long-term project with little external guidance.

In the humanities, you're often required to complete a master's degree on the way to a Ph.D. That's not as common in the sciences. By going straight for a Ph.D., you might graduate a semester sooner. It's nice to grab a master's degree along the way so that you have something to show for yourself if you don't finish the Ph.D. But I think most schools will give you a masters on your way out if you go straight for a Ph.D. but don't finish.

If you could do it over again, would you make any changes to the way you went about getting a Ph.D.?

John: If I were to do my Ph.D. over again, and if I wanted to be an academic, I'd learn more about how the academic game is played. When I was in grad school, I learned a lot of math, but I didn't learn much about how to succeed in an academic career. I knew nothing about grants, for example, or strategies for publishing papers. I also would be more selective of my research topic, picking something I was more deeply interested in.

You've spent much of your career doing work different from your field of expertise. What would you say to a current Ph.D. student who is concerned that his field or research is too specialized?

John: You've got to set your own agenda for your time in grad school. In addition to what is required to graduate, you may have additional things you want to learn while you have the opportunity: access to libraries, access to people to ask questions of, and most of all, free time. (When I was in college, a pastor once told a group of students that we had more free time now than we'd ever have again. Of course we all thought he was crazy, but he was right.) You ought to have a survey knowledge of your field, even if that's not required for graduation. You also need to have some idea how your field relates to the rest of the world, and your professors might not be the best people to learn this from. If a professor has never worked outside of academia, I'd be skeptical of anything he or she says about "the real world."

John D. Cook holds a Ph.D. in mathematics from University of Texas at Austin. He is an independent consultant. He blogs at The Endeavor and is active on Twitter.

Tim HopperShould I do a Ph.D.?

Over the coming weeks, I am going to be publishing a series of interviews on the question "Should I do a Ph.D.?" These interviews are with a variety of people: Ph.D.'s and not, academics and not. It's a question I've wrestled with at great length, and I know many others have as well. I hope reading the interviews will help some answer the question for themselves and help others provide counsel to those asking it.

I want to qualify myself slightly. I am primarily orienting this towards those with technical backgrounds going into graduate programs in the United States. This has several implications. First, I'm assuming that the student asking this question will have his tuition waived and be provided with a humble stipend. Second, I'm assuming a typical degree program starts with two years of classes and is followed by 2-4 years of research.

Finally, you should look to two other resources in your quest to answer this question. First, read Vivek Haldar's excellent post on the very topic. Read it carefully, go for a long walk while thinking over it, and then read it again.

Second, read Matt Might's blog. Start with The Illustrated Guide to a Ph.D., follow with reading his entire Graduate School section, and finish by reading all the rest of his posts.1

Caktus GroupRaspberry IO open sourced

Back in March, at PyCon 2013, the PSF provided each attendee with a Raspberry Pi, a tiny credit-card sized computer meant to be paired with the Python programming language. The power and portability of the Raspberry Pi has stimulated an explosion of interest among hobbyists and educators. Their uses seem to be limited only by our collective imagination.

Along with that generous gift, the PSF contracted with Caktus to help tap into this collective imagination. Raspberry IO is a site dedicated to Python and Raspberry Pi enthusiasts. The goal of the site is to serve as a friendly place where anyone can learn and teach about using Python with a Raspberry Pi.

We are proud to announce that Raspberry IO is now an open source project. We've written tests and documentation to help you get started and contribute your ideas to the site. Please create an issue at the github repository.

We're excited about the possibilities for a site like this in the hands of the Raspberry Pi and Python communities.  If you have an interesting Raspberry Pi project, then we'd love for you to tell us about it! Razzy welcomes you!

Barry Peddycord IIIThe Commodore 64 in 64 minutes.

The Commodore 64 in 64 minutes.

Barry Peddycord IIIThe Challenges with Challenges

These past few days, I’ve been working on the auto-graded assignments for #8bitmooc. I’ve managed to get three out of the nine assignments done, two of the easy ones (doing controller input and drawing sprites) and one of the hard ones (Game of Life).

So far, each of these has taken about eight hours, since I first have to develop the game around the solution I want students to develop, and then I have to solve it myself. After writing a solution (to make sure it’s solvable and not insanely tedious) I come up with acceptance tests and then make sure my solution passes those. However, since I haven’t had the opportunity to actually test my emulator, I end up using this experience to debug my emulator, and I’ve found some particularly horrifying bugs. Luckily the emulator I’ve written is only looking at a subset of the architecture, but the last thing I want to happen is for a student to make a working solution and then wonder why they aren’t getting credit for it.

One thing I’ve been grappling with lately is this notion of “speed”. Whenever you get a correct solution, you are scored on how small and fast your program is. The size is easy to deal with, since it’s just a ROM. But the speed is another matter - how do I decide how fast your program went? Do I have a single test case for speed? Or do I average the speed across all of the test cases?  I’m leaning towards the average approach, since it seems the most fair, but I’m still on the lookout for a better idea.
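If it helps to make that trade-off concrete, the averaging approach might be sketched like this (a hypothetical `speed_score` helper, not code from the actual grader):

```python
def speed_score(cycle_counts):
    """Average the emulator cycle counts across all test cases.

    Averaging keeps any single pathological test case from dominating
    the score, which seems fairer than timing just one case.
    """
    return sum(cycle_counts) / float(len(cycle_counts))

# Three test cases that took 900, 1100, and 1000 cycles average out
# to a score of 1000.
print(speed_score([900, 1100, 1000]))
```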

The reason I’m working so quickly is because I have a workshop on NES programming coming up in Atlanta, GA next week, and I really want to use #8bitmooc as my environment. This weekend, I’m going to be doing a few more autograded exercises and bringing up the quality on some of the wiki pages - especially the pages on the hardware and memory mapped registers. The tutorial has been put on hold for the time being, since I’m questioning some of its utility (especially since there are very few interactive segments, it’s basically a written lecture). Sometimes I feel like I need very concise reference pages, but I really want the help center to be highly conversational with an approachable reading level.

Caktus GroupMigrating to a Custom User Model in Django

The new custom user model configuration that arrived in Django 1.5 makes it relatively straightforward to swap in your own model for the Django user model. In most cases, Django's built-in User model works just fine, but there are times when certain limitations (such as the length of the email field) require a custom user model to be installed. If you're starting out with a custom user model, setup and configuration are relatively straightforward, but if you need to migrate an existing legacy project (e.g., one that started out in Django 1.4 or earlier), there are a few gotchas that you might run into. We did this recently for one of our larger, long-term client projects at Caktus, and here's an outline of how we'd recommend tackling this issue:

  1. First, assess any third party apps that you use to make sure they either don't have any references to the Django's User model, or if they do, that they use Django's generic methods for referencing the user model.

  2. Next, do the same thing for your own project. Go through the code looking for any references you might have to the User model, and replace them with the same generic references. In short, you can use the get_user_model() method to get the model directly, or if you need to create a ForeignKey or other database relationship to the user model, you can use settings.AUTH_USER_MODEL (which is just a string corresponding to the appname.ModelName path to the user model).

    Note that get_user_model() cannot be called at the module level in any models.py file (and by extension any file that a models.py imports), due to circular reference issues. Generally speaking it's easier to keep calls to get_user_model() inside a method whenever possible (so it's called at run time rather than load time), and use settings.AUTH_USER_MODEL in all other cases. This isn't always possible (e.g., when creating a ModelForm), but the less you use it at the module level, the fewer circular references you'll have to stumble your way through.

    This is also a good time to add the AUTH_USER_MODEL setting to your settings.py, just for the sake of explicitness:

# settings.py

AUTH_USER_MODEL = 'auth.User'
  3. Now that you've done a good bit of the leg work, you can turn to actually creating the custom user model itself. How this looks is obviously up to you, but here's a sample that duplicates the functionality of Django's built-in User model, removes the username field, and extends the length of the email field:
# appname/models.py

from django.db import models
from django.utils import timezone
from django.utils.http import urlquote
from django.utils.translation import ugettext_lazy as _
from django.core.mail import send_mail
from django.contrib.auth.models import AbstractBaseUser, PermissionsMixin

class CustomUser(AbstractBaseUser, PermissionsMixin):
    """
    A fully featured User model with admin-compliant permissions that uses
    a full-length email field as the username.

    Email and password are required. Other fields are optional.
    """
    email = models.EmailField(_('email address'), max_length=254, unique=True)
    first_name = models.CharField(_('first name'), max_length=30, blank=True)
    last_name = models.CharField(_('last name'), max_length=30, blank=True)
    is_staff = models.BooleanField(_('staff status'), default=False,
        help_text=_('Designates whether the user can log into this admin '
                    'site.'))
    is_active = models.BooleanField(_('active'), default=True,
        help_text=_('Designates whether this user should be treated as '
                    'active. Unselect this instead of deleting accounts.'))
    date_joined = models.DateTimeField(_('date joined'), default=timezone.now)

    objects = CustomUserManager()

    USERNAME_FIELD = 'email'

    class Meta:
        verbose_name = _('user')
        verbose_name_plural = _('users')

    def get_absolute_url(self):
        return "/users/%s/" % urlquote(self.email)

    def get_full_name(self):
        "Returns the first_name plus the last_name, with a space in between."
        full_name = '%s %s' % (self.first_name, self.last_name)
        return full_name.strip()

    def get_short_name(self):
        "Returns the short name for the user."
        return self.first_name

    def email_user(self, subject, message, from_email=None):
        "Sends an email to this User."
        send_mail(subject, message, from_email, [self.email])

Note that this duplicates all aspects of the built-in Django User model except the get_profile() method, which you may or may not need in your project. Unless you have third party apps that depend on it, it's probably easier simply to extend the custom user model itself with the fields that you need (since you're already overriding it) than to rely on the older get_profile() method. It is worth noting that, unfortunately, since Django does not support overriding model fields, you do need to copy all of this from the AbstractUser class within django.contrib.auth.models rather than simply extending and overriding the email field.

  4. You might have noticed the Manager specified in the model above doesn't actually exist yet. In addition to the model itself, you need to create a custom manager that supports methods like create_user(). Here's a sample manager that creates users without a username (just an email):
# appname/models.py

from django.contrib.auth.models import BaseUserManager
from django.utils import timezone

class CustomUserManager(BaseUserManager):

    def _create_user(self, email, password,
                     is_staff, is_superuser, **extra_fields):
        """Creates and saves a User with the given email and password."""
        now = timezone.now()
        if not email:
            raise ValueError('The given email must be set')
        email = self.normalize_email(email)
        user = self.model(email=email,
                          is_staff=is_staff, is_active=True,
                          is_superuser=is_superuser, last_login=now,
                          date_joined=now, **extra_fields)
        user.set_password(password)
        user.save(using=self._db)
        return user

    def create_user(self, email, password=None, **extra_fields):
        return self._create_user(email, password, False, False,
                                 **extra_fields)

    def create_superuser(self, email, password, **extra_fields):
        return self._create_user(email, password, True, True,
                                 **extra_fields)
  5. If you plan to edit users in the admin, you'll most likely also need to supply custom forms for your new user model. In this case, rather than copying and pasting the complete forms from Django, you can extend Django's built-in UserCreationForm and UserChangeForm to remove the username field (and optionally add any others that are required) like so:
# appname/forms.py

from django.contrib.auth.forms import UserCreationForm, UserChangeForm

from appname.models import CustomUser

class CustomUserCreationForm(UserCreationForm):
    """A form that creates a user, with no privileges, from the given email and
    password.
    """

    def __init__(self, *args, **kargs):
        super(CustomUserCreationForm, self).__init__(*args, **kargs)
        del self.fields['username']

    class Meta:
        model = CustomUser
        fields = ("email",)

class CustomUserChangeForm(UserChangeForm):
    """A form for updating users. Includes all the fields on
    the user, but replaces the password field with admin's
    password hash display field.

    def __init__(self, *args, **kargs):
        super(CustomUserChangeForm, self).__init__(*args, **kargs)
        del self.fields['username']

    class Meta:
        model = CustomUser

Note that in this case we do not use the generic accessors for the user model; rather, we import the CustomUser model directly since this form is tied to this (and only this) model. The benefit of this approach is that it also allows you to test your model via the admin in parallel with your existing user model, before you migrate all your user data to the new model.

  6. Next, you need to create a new admin.py entry for your user model, mimicking the look and feel of the built-in admin as needed. Note that for the admin, similar to what we did for forms, you can extend the built-in UserAdmin class and modify only the attributes that you need to change, keeping the other behavior intact.
# appname/admin.py

from django.contrib import admin
from django.contrib.auth.admin import UserAdmin
from django.utils.translation import ugettext_lazy as _

from appname.models import CustomUser
from appname.forms import CustomUserChangeForm, CustomUserCreationForm

class CustomUserAdmin(UserAdmin):
    # The forms to add and change user instances
    form = CustomUserChangeForm
    add_form = CustomUserCreationForm

    # The fields to be used in displaying the User model.
    # These override the definitions on the base UserAdmin
    # that reference the removed 'username' field
    fieldsets = (
        (None, {'fields': ('email', 'password')}),
        (_('Personal info'), {'fields': ('first_name', 'last_name')}),
        (_('Permissions'), {'fields': ('is_active', 'is_staff', 'is_superuser',
                                       'groups', 'user_permissions')}),
        (_('Important dates'), {'fields': ('last_login', 'date_joined')}),
    )
    add_fieldsets = (
        (None, {
            'classes': ('wide',),
            'fields': ('email', 'password1', 'password2')}
        ),
    )
    list_display = ('email', 'first_name', 'last_name', 'is_staff')
    search_fields = ('email', 'first_name', 'last_name')
    ordering = ('email',)

admin.site.register(CustomUser, CustomUserAdmin)
  7. Once you're happy with the fields in your model, use South to create the schema migration to create your new table:
python manage.py schemamigration appname --auto
  8. This is a good point to pause, check out your user model via the admin, and make sure it looks and functions as expected. You should still see both user models at this point, because we haven't yet adjusted the AUTH_USER_MODEL setting to point to our new model (this is intentional). You may have to delete the migration file and repeat the previous step a few times if you don't get it quite right the first time.
  9. Next, we need to write a data migration using South to copy the data from our old user model to our new user model. This is relatively straightforward, and you can get a template for the data migration as follows:
python manage.py datamigration appname --freeze otherapp1 --freeze otherapp2

Note that the --freeze arguments are optional and should be used only if you need to access the models of these other apps in your data migration. If you have foreign keys in these other apps to Django's built-in auth.User model, you'll likely need to include them in the data migration. Again, you can experiment and repeat this step until you get it right, deleting the incorrect migrations as you go.

  10. Once you have the template for your data migration created, you can write the content for your migration. A simple forward migration that copies the users, maintaining primary key IDs (this has been verified to work with PostgreSQL but no other backend), might look something like this:
# appname/migrations/000X_copy_auth_user_data.py

from south.v2 import DataMigration

class Migration(DataMigration):

    def forwards(self, orm):
        "Write your forwards methods here."
        for old_u in orm['auth.User'].objects.all():
            new_u = orm.CustomUser.objects.create(
                id=old_u.id,  # keep the same primary key
                email=old_u.email and old_u.email or '%s@example.com' % old_u.username,
                first_name=old_u.first_name, last_name=old_u.last_name,
                password=old_u.password, is_staff=old_u.is_staff,
                is_active=old_u.is_active, is_superuser=old_u.is_superuser,
                last_login=old_u.last_login, date_joined=old_u.date_joined)
            for perm in old_u.user_permissions.all():
                new_u.user_permissions.add(perm)
            for group in old_u.groups.all():
                new_u.groups.add(group)

Since we ensure that the primary keys stay the same from one table to another, we can just adjust the foreign keys in our other models to point to this new custom user model, rather than needing to update each row in turn.

Note 1: This migration does not account for any duplicate emails that exist in the database, so if this is a problem in your case, you may need to write a separate migration first that resolves any such duplicates (and/or manually resolve them with the users in question).

Note 2: This migration does not update the content types for any generic relations in your database. If you use generic relations and one or more of them points to the old user model, you'll also need to update the content type foreign keys in these relations to reference the content type of the new user model.
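The duplicate check mentioned in Note 1 might start with something like this (a hypothetical `find_duplicate_emails` helper operating on plain dicts, not code from the original migration; in a real project you'd iterate over the old auth.User queryset instead):

```python
from collections import Counter

def find_duplicate_emails(users):
    """Return the set of (lowercased) emails shared by more than one user.

    `users` is any iterable of dicts with an 'email' key; empty emails
    are skipped, since the migration substitutes a placeholder for those.
    """
    counts = Counter(u['email'].lower() for u in users if u['email'])
    return {email for email, n in counts.items() if n > 1}

users = [{'email': 'a@example.com'},
         {'email': 'A@example.com'},
         {'email': 'b@example.com'}]
print(find_duplicate_emails(users))  # {'a@example.com'}
```

Once you know which addresses collide, you can resolve them manually or in a separate data migration before running the copy above.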

  11. Once you have this migration written and tested to your liking (and have resolved any duplicate user issues), you can run the migration and verify that it did what you expected via the Django admin.
python manage.py migrate

Needless to say, we recommend doing this testing on a local, development copy of the production database (using the same database server) rather than on a live production or even staging server. Once you have the entire process complete, you can test it on a staging (and finally on the production) server.

  12. Before we switch to our new model, let's create a temporary initial migration for the auth app which we can later use as the basis for creating a migration to delete the obsolete auth_user and related tables. First, create a temporary module for auth migrations in your settings file:

# settings.py

SOUTH_MIGRATION_MODULES = {
    'auth': 'myapp.authmigrations',
}

Then, "convert" the auth app to South like so:

python manage.py convert_to_south auth

This won't do anything to your database, but it will create (and fake the run of) an initial migration for the auth app.

  13. Let's review where we stand. You've (a) updated all your code to use the generic interface for accessing the user model, (b) created a new model and the corresponding forms to access it through the admin, and (c) written a data migration to copy your user data to the new model. Now it's time to make the switch. Open up your settings.py and adjust the AUTH_USER_MODEL setting to point to the appname.ModelName path of your new model:
# settings.py

AUTH_USER_MODEL = 'appname.CustomUser'
  14. Since any foreign keys to the old auth.User model have now been updated to point to your new model, we can create migrations for each of these apps to adjust the corresponding database tables as follows:
python manage.py schemamigration --auto otherapp1

You'll need to repeat this for each of the apps in your project that were affected by the change in the AUTH_USER_MODEL setting. If this includes any third party apps, you may want to store those migrations in an app inside your project (rather than use the SOUTH_MIGRATION_MODULES setting) so as not to break future updates to those apps that may include additional migrations.

  15. Additionally, at this time you'll likely want to create the previously-mentioned South migration to delete the old auth_user and associated tables from your database:
python manage.py schemamigration --auto auth

Since South is not actually intended to work with Django's contrib apps, you need to copy the migration this command creates into the app that contains your custom model, renumbering it along the way to make sure that it's run after your data migration to copy your user data. Once you have that created, be sure to delete the SOUTH_MIGRATION_MODULES setting from your settings file and remove the unneeded, initial migration from your file system.

  16. Last but not least, let's run all these migrations and make sure that all the pieces work together as planned:
python manage.py migrate

Note: If you get prompted to delete a stale content type for auth | user, don't answer "yes" yet. Despite Django's belief to the contrary, this content type is not actually stale yet, because the South migration to remove the auth.User model has not yet run. If you accidentally answer "yes," you might see an error stating InternalError: current transaction is aborted, commands ignored until end of transaction block. No harm was done, just run the command again and answer "no" when prompted. If you'd like to remove the stale auth | user content type, just run python manage.py syncdb again after all the migrations have completed.

That completes the tutorial on creating (and migrating to) a custom user model with Django 1.5's new support for the same. As you can see, it's a pretty involved process, even for something as simple as lengthening and requiring a unique email field in place of a username, so it's not something to be taken lightly. Nonetheless, some projects can benefit from investing in a custom user model that affords all the necessary flexibility. Hopefully this post will help you make that decision (and if needed the migration itself) with all the information necessary to make it go as smoothly as possible.

Barry Peddycord IIIProgress Update and Radio Silence

Hello everyone! Long time no post! I’ve been working on other projects this summer (namely the ones that keep the bills paid and food on the table) so #8bitmooc has been suffering from some radio silence. However, big things have been happening lately!

New Interface

I’ve recently overhauled and streamlined the user interface for the website to make it much more efficient for students to see where they are in the class. I’ve also decided to do away with my own login system and offload that to Github instead, so that Github can deal with account verification and whatnot - I just use them for logging in. Not only will this make my life easier; at some point, I would also like to make it possible to publish your NES games to Github from #8bitmooc. I think that would be pretty awesome, since it would hopefully encourage more programming in the future. :)

Even though it seems like a waste of time to be redoing the interface when I should be working on content, I simply can’t keep my thoughts straight when I’m not happy with the interface. By redoing the interface, I become happier, and that encourages me to keep working on other things, such as the course content!

An In-Person Version

At one point, I considered doing a podcast or something to have some sort of verbal interaction with the course content. Even if there aren’t that many students, talking things out is therapeutic and helps me think about what I want folks to get out of this course. However, I thought that an even better way to help me evaluate my process would be to hold an informal, in-person version of #8bitmooc. So I was thinking that I could take over a classroom one day a week on campus and see who I could get to attend an in-person work session around the content. I could record the interaction, and use it as an opportunity to answer questions about the content.

The only problem with this plan is that I would need more time to publicize the course. Granted, I’ve been doing a pretty sorry job of publicizing if I want the course to go live in September, but I feel like that’s not a big deal, since I should probably have a smaller “MOOC” before trying to cast a large net.

Design Changes

I’ve made a lot of design decisions lately around simplifying what I’m trying to do. I’ve scrapped the entire levels and experience points system of the course in favor of the much simpler “gamification” that I enjoyed when I took my assembly course. I feel like the high score board that motivates students to optimize their solutions does a good enough job of requiring students to explore the language space, so I shouldn’t have to add arbitrary experience points on top of that. While it complicates my message board system (I was really looking forward to the level-locked message boards), I feel like this puts less cognitive load on the user, leaving more mental capacity for actually engaging in the course content.

I’m going to start working on the challenges now - I’ve got them drafted, but I need to try to solve them myself before I’ll be happy trying to make others do them. Also, I’m giving a workshop on NES programming in two weeks, so I’m hoping to use #8bitmooc as a demo platform for this purpose.

Tim HopperOmnifocus Todo List to Pelican Website

Being the nerd that I am, I spent several hours today writing scripts to automatically add to this site a todo list of posts I hope to write and papers/books/articles I'd like to read. These are automatically scraped from Omnifocus using ofexport.

To assist my future self or anyone else who might be interested in such an undertaking, I put up a Github gist giving some of the details on making this happen.

Barry Peddycord IIIhttp://boingboing.net/2013/08/01/marriage-proposal-by-way-of-a.html?utm_source=dlvr.it&utm_medium=twitter

Tim HopperOnline Labor Markets and Static Site Generators

Several months ago, I decided to move this site from Wordpress to a static site generator. I've never loved the Wordpress backend, and I'd prefer to deal more directly with text files.

For those who don't know, static site generators create static HTML pages to be loaded to the web. Unlike a service like Wordpress, they don't call a database to load the post when you go to a URL. This has several advantages: it's harder to bring down, it can be hosted very cheaply on Amazon S3, and you don't deal with the security risks (and constant security updates) that are a liability with Wordpress.

My blogposts exist on my computer as a folder of text documents (specifically Markdown documents). To update my site, I run a script that compiles the text documents into the HTML files you see here (that's the site generation part) and then pushes the updated files to S3. At some point, I'm going to put the posts in a git repo and use a post-commit hook to do the updating.
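If you're curious what "site generation" amounts to, here's a deliberately tiny, hypothetical sketch of the idea - this has nothing to do with Pelican's actual internals, and real generators add templating, feeds, tag pages, and proper Markdown rendering:

```python
import pathlib

TEMPLATE = "<html><body><h1>{title}</h1>\n<p>{body}</p></body></html>"

def build_site(src_dir, out_dir):
    """Toy static site generator: turn each .md file in src_dir into an
    HTML page in out_dir. The first line is treated as the title and the
    rest as the body; no actual Markdown parsing is done here."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for post in sorted(pathlib.Path(src_dir).glob('*.md')):
        text = post.read_text()
        title, _, body = text.partition('\n')
        html = TEMPLATE.format(title=title.lstrip('# '), body=body)
        (out / (post.stem + '.html')).write_text(html)
```

The resulting folder of plain HTML files is what gets synced to S3; there's no database and nothing to execute on the server.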

Because I know Python and it seems to be an active project, I decided to go with the Pelican site generator. Unfortunately, Pelican's Wordpress conversion tool crashed on my site. After several attempts to reconcile the problems, I gave up.

A few weeks ago on a Saturday night, it occurred to me that someone else could probably sort out my problems and would be willing to do it for reasonable compensation. I signed up for an account on the freelancer marketplace oDesk. I described the problem and put out an offer to hire someone to convert my 21 extant posts to Pelican-formatted Markdown. Within 24 hours, I had three offers to do it at a very reasonable price.

I ended up hiring @abdullahkhalids, who is a physicist and programmer in Pakistan. He'd used Pelican before and was eager to take on the task. Within two days, he sent me a first draft of the Markdown posts. By the following Saturday, I made my payment and we were done. We'd have been done sooner except that I delayed in getting back to him.

Using oDesk was an incredible experience. I'm a big believer in trade and markets, and this experience validated those beliefs. I'm impressed at how well oDesk is able to break down national and geographic barriers to hiring workers. As far as I can tell, Abdullah and I both came out ahead in this transaction.

I can't promise that I'll keep up the blogging habit, but I'm going to work on making the task of writing and uploading posts as easy as possible. For example, I'm hoping to set up a system that will allow me to write a post in Drafts on iOS and have it pushed live with only a single click.

Tim HopperNoisy Series and Body Weight

I put on some weight during my time in grad school, and this spring, I decided to do something about it. In April, I started using MyFitnessPal to track my food intake and exercise, and I run a net calorie deficit every day. Thankfully this seems to be working.

In May, I bought a Withings WS-30 wireless scale. When I first heard about wifi scales, I thought they sounded like a gimmick; however, the Withings has become a helpful tool in the weight loss process.

Every morning, I step on the scale and my weight is automatically broadcast to MyFitnessPal, Monitor Your Weight on iOS, and a text file in my Dropbox folder (via IFTTT and Withings' API). MyFitnessPal adjusts my daily calorie limit by my weight, Monitor Your Weight is a great tool for visualizing progress, and I use the text file to import a ggplot time series of my weight into Day One each month.

An interesting aspect of my weight time series is how noisy it is. (No doubt this is true for others as well.) On many mornings, my weight is up from the day before (despite a fairly consistent net caloric deficit). As you can see from the plot, my weight jumps up and down daily even though the overall trend is downward.

I have been wondering what percentage of days I actually lose weight, so I decided to find out. The plot below is a histogram of my weight change from day to day.1

The data appear nearly Gaussian around 0! (In fact, the p-value on the Shapiro-Wilk normality test is 0.11, arguably not small enough to reject the null hypothesis that the data are normally distributed.) Fortunately, the mean of the differences is about -0.24 (pounds/day), and my progress is downward.

In total, I lost weight on 48 days, gained on 33, and stayed the same on 4 days. That means I've steadily lost weight while only moving down on 56% of days. I guess I don't need to be depressed every time my weight jumps up slightly....
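The tally itself is easy to reproduce; here's a sketch with made-up numbers (not my actual data):

```python
def tally_weight_changes(weights):
    """Classify each day-to-day change as a loss, gain, or no change,
    and report the mean daily change in the same units as the input."""
    diffs = [b - a for a, b in zip(weights, weights[1:])]
    losses = sum(1 for d in diffs if d < 0)
    gains = sum(1 for d in diffs if d > 0)
    same = sum(1 for d in diffs if d == 0)
    mean = sum(diffs) / float(len(diffs))
    return losses, gains, same, mean

# Five mornings of fake weigh-ins: down, up, down, unchanged.
losses, gains, same, mean = tally_weight_changes(
    [200.0, 199.2, 199.6, 198.9, 198.9])
print(losses, gains, same)  # 2 1 1; the mean change is negative overall
```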

  1. This isn't 100% true. I'm hiding the fact that I missed weighing-in on some days. 

Tim HopperGuide to Monte Carlo Methods?

I have started to realize that Monte Carlo methods of various kinds keep coming up in my work. Despite significant application of Monte Carlo in my grad school research, I think I only know enough to be dangerous. I'd like to get a better grasp on Monte Carlo methods (especially MCMC and simulation).

I asked on Twitter if anyone had a recommended reference that was readable and practical. Despite my love of measure theory, what I want is Monte Carlo Methods for the Very Applied Mathematician, not a theoretical text.

I got several recommendations. I'm not sure that any are exactly what I'm looking for, but I am certainly going to look deeper into them. Interestingly, they are all Springer books.

Several people recommended Glasserman's Monte Carlo Methods in Financial Engineering. I don't work in the financial sector, so it's hard for me to evaluate the table of contents to tell how well it generalizes.

Someone else recommended both Explorations in Monte Carlo Methods and Handbook of Markov Chain Monte Carlo for two levels of MCMC.

Finally, I got a recommendation for Introducing Monte Carlo Methods with R. This might be closest to what I'm looking for. It appears to cover a breadth of topics, and it includes lots of code.

Caktus GroupMark & Julia Speaking at OSCON. “Developers + Designers: Collaborating on your Open Source project”

On Wednesday, Mark and Julia are teaming up to deliver their talk "Developers + Designers: Collaborating on your Open Source project" at OSCON 2013. We’re thrilled that their talk got accepted and know that they have a lot to contribute to the Open Source conversation, which too often lacks the voice of designers. Julia and Mark will be discussing the importance of collaborating on your ideas early, before writing code. We’re happy to have such adept developers and designers here at Caktus pushing each other to work outside of their strict job titles and collaborating to build projects that couldn’t be built anywhere else. If you are at OSCON this week, stop by and hear Mark and Julia’s talk. Or, if you miss it, they will be around all week and are excited to meet new people.

Joe GregorioPlatonic Programs

One of the important types of open source that doesn't get talked about is what I call Platonic Programs: small programs that exhibit the base level of functionality for some domain, but no more, and have small, clear, and concise code. These projects have outsized impacts on the ecosystem and yet aren't discussed much.

For example, look at MicroEmacs, which not only has a string of variants that are all named a variation of "MicroEmacs", but also EmACT, Jasspa emacs, NanoEmacs, mg, and vile. My contention is that there's a plethora of child projects of MicroEmacs not only because it was eventually open source, but because the source code was simple and clean enough, and the base set of functionality small enough, that many people would look at the code and think, "that would be my perfect editor if only they added X, Y and Z, and the code is simple enough that I can see exactly how to add all those features."

I ran into the idea of Platonic Program shortly after I wrote Robaccia, which was an example of the rudiments of a Python web framework. I was shocked by the uptake, including ports to Groovy, and inclusion in academic class coursework. I'm not saying Robaccia is a quintessential example of the Platonic Program, but after the experience I had with Robaccia, I started to see similar patterns in other projects.

As for the future, I think Fogleman Minecraft has a lot of potential: it takes a wildly popular paradigm and boils it down to less than 1,000 lines of Python. I just scroll through the code and think to myself, I could add X, Y and Z, and it would be so easy...

Caktus GroupFactory Boy as an Alternative to Django Testing Fixtures

When testing a Django application you often need to populate the test database with some sample data. The standard Django TestCase has support for fixture loading but there are a number of problems with using fixtures:

  • First, they must be updated each time your schema changes.
  • Second, they force you to hard-code dates which can create test failures when your date, which was “very far in the future” when the fixture was created, has now just passed.
  • Third, fixtures are painfully slow to load. They are discovered, deserialized and the data inserted at the start of every test method. Then at the end of the test that transaction is rolled back. Many times you didn’t even use the data in the fixture.

What is the alternative to fixtures? It’s simple: your test cases should create the data they need. Let’s take a simple model:

# models.py
from django.db import models

class Thing(models.Model):
    name = models.CharField(max_length=100)
    description = models.CharField(max_length=100)

    def __unicode__(self):
        return self.name

Now we want to write some tests which need some Things in the database.

# tests.py
import random
import string

from django.test import TestCase

from .models import Thing

def random_string(length=10):
    return u''.join(random.choice(string.ascii_letters) for x in range(length))

class ThingTestCase(TestCase):

    def create_thing(self, **kwargs):
        "Create a random test thing."
        options = {
            'name': random_string(),
            'description': random_string(),
        }
        options.update(kwargs)
        return Thing.objects.create(**options)

    def test_something(self):
        # Get a completely random thing
        thing = self.create_thing()
        # Test assertions would go here

    def test_something_else(self):
        # Get a thing with an explicit name
        thing = self.create_thing(name='Foo')
        # Test assertions would go here

Instead of using a fixture we have a create_thing method to create a new Thing instance. In our tests we can get a new random thing object. In the cases where some of the fields need explicit values we can pass those into the creation. The tests create the exact number of Things that they need, and any requirements about these instances are explicit in how they are created.

Writing these methods to create new instances can be rather repetitive. If you prefer, you can use something like Factory Boy to help you. Factory Boy is a Python port of a popular Ruby project called Factory Girl. It provides a declarative syntax for how new instances should be created. It also has helpers for common patterns such as sub-factories for foreign keys and other inter-dependencies. Rewriting our tests to use Factory Boy would look like this:

# tests.py
import random
import string

import factory
from django.test import TestCase

from .models import Thing

def random_string(length=10):
    return u''.join(random.choice(string.ascii_letters) for x in range(length))

class ThingFactory(factory.DjangoModelFactory):
    FACTORY_FOR = Thing

    name = factory.LazyAttribute(lambda t: random_string())
    description = factory.LazyAttribute(lambda t: random_string())

class ThingTestCase(TestCase):

    def test_something(self):
        # Get a completely random thing
        thing = ThingFactory.create()
        # Test assertions would go here

    def test_something_else(self):
        # Get a thing with an explicit name
        thing = ThingFactory.create(name='Foo')
        # Test assertions would go here

Here the create_thing method is removed in favor of the ThingFactory. Calling it is similar to how we were previously calling the method. One advantage of the ThingFactory is that we can also call ThingFactory.build(), which will create an unsaved instance for tests where we don’t need the instance to be saved. Every write to the database we can avoid saves time for the test suite.

One handy pattern when working with models which have images is to have a test image file which is used by default. Let’s create a new model with an image.

# models.py
from django.db import models

class ImageThing(models.Model):
    name = models.CharField(max_length=100)
    image = models.ImageField(upload_to='images/')

    def __unicode__(self):
        return self.name

And a factory to test it:

# tests.py
import os
import random
import string

import factory
from django.core.files import File

from .models import ImageThing

# Assumes there is a test.png next to our tests.py
TEST_IMAGE = os.path.join(os.path.dirname(__file__), 'test.png')

def random_string(length=10):
    return u''.join(random.choice(string.ascii_letters) for x in range(length))

class ImageThingFactory(factory.DjangoModelFactory):
    FACTORY_FOR = ImageThing

    name = factory.LazyAttribute(lambda t: random_string())
    image = factory.LazyAttribute(lambda t: File(open(TEST_IMAGE, 'rb')))

Unlike the name, this image isn’t random, but by default all of your model instances will have an image associated with them.

In summary, using fixtures for complex data structures in your tests is fraught with peril. They are hard to maintain and they make your tests slow. Creating model instances as they are needed is a cleaner way to write your tests which will make them faster and more maintainable. With tools like Factory Boy it can be very easy.

Joe GregorioDriving and kids these days.

This NYTimes article on The End of Car Culture very much rings true for me; none of our kids look forward to having a car like we did. It felt like we had to push the oldest one to get his learner's permit. Made even odder by the fact that we don't live anywhere near any sort of useful public transportation.

Joe GregorioScreencast recording under Ubuntu

I'm sure this evolves over time, but for today, the best command-line incantation I found for recording screencasts is:

$ avconv -strict experimental -f x11grab -r 25 -s 1024x768 -i :0.0 \
 -pre medium -f alsa -ac 2  -ar 22050  -i pulse cast.webm