A planet of blogs from our members...

Caktus Group: Composting at Caktus

At Caktus we have an employee suggestion policy that has been the birthplace of tons of great ideas, from tech community yoga to a pair programming station.

The most recent addition to our workplace comes from the suggestion of Developer Rebecca Muraya: composting at Caktus! We’re partnering with Compost Now to decrease our waste. Compost Now helps to divert more than ⅓ of waste away from landfills while improving local soil for farmers by increasing the availability of all natural compost. There is nothing better for keeping soil healthy than a good supply of organic compost, and we’re thrilled to be contributing to keeping worms, plants, and farmers happy.

Caktus Group: Making Clean Code a Part of Your Build Process (And More!)

At Caktus, "clean" (in addition to "working"!) code is an important part of our delivery. For all new projects, we achieve that by using flake8. flake8 is a wrapper around several tools: pep8, pyflakes, and McCabe. pep8 checks to make sure your code matches the PEP 0008 style guidelines, pyflakes looks for a few additional things like unused imports or variables, and McCabe raises warnings about overly complex sections of code.

We usually check code formatting locally before committing, but we also have safety checks in place in case someone forgets (as I more than anyone have been known to do!). This prevents code formatting standards from slowly eroding over time. We accomplish this by making our continuous integration (CI) server "fail" the build if unclean code is committed. For example, using Travis CI, simply adding the following line to the script: section of your .travis.yml will fail the build if flake8 detects any formatting issues (the command returns a non-zero exit code if errors are found):

- flake8 .

You can adjust flake8 defaults by adding a file named setup.cfg to the top level of your repository. For example, we usually relax the 80 character limit a little and exclude certain automatically generated files:

[flake8]
max-line-length=100
exclude=migrations

As a result, you not only have code that is more readable for everyone, but you also avoid actual errors. For example, flake8 will detect missing imports or undefined variables in code paths that might not be tested by your unit test suite.
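
As a quick illustration, here is a hypothetical module (not from our project) showing the kind of problems flake8 reports even though the file imports cleanly; the codes in the comments are pyflakes' F401 (unused import) and F821 (undefined name):

# reports.py -- a hypothetical module used only to illustrate flake8 warnings
import os  # F401: 'os' imported but unused


def normalize(path):
    if not path:
        # F821: undefined name 'DEFAULT_PATH' -- a real bug that only shows
        # up at runtime if this branch is ever reached
        return DEFAULT_PATH
    return path.strip().lower()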

Adding flake8 to an older project

This is all well and good for new projects, but bringing old projects up to speed with this approach can be a challenge. I recently embarked on such a task myself and thought I'd share what I learned.

I started by adding a setup.cfg to the project and running flake8 on my source tree:

$ flake8 --count
...
1798

The result: a whopping 1798 warnings. Many of these turned out to be pep8's "E128 continuation line under-indented for visual indent":

$ flake8 --select=E128 --count
...
1010

In other words, in a huge number of cases, we weren't indenting multi-line continuations the way pep8 wanted us to. Other errors included things like not having a space after commas (E231) or not having two spaces before inline comments (E261). While many editors do support automatically fixing errors like this, doing so manually would still be tedious: this project had nearly 250 Python source files.

Enter autopep8 and autoflake. These tools purport to automatically fix pep8- and pyflakes-related issues. There are two ways to use them: wholesale, fixing all errors across the project at once, or more granularly, addressing only a single group of similar errors at a time.

Addressing all errors at once

This approach is best for smaller projects or those with just a few errors (50-100) to address:

$ pip install autoflake
$ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports --remove-unused-variables

In my case, this resulted in changes across 39 files and reduced the number of flake8 errors from 1798 to 1726. Not a huge change, but a time saver nonetheless. autopep8 was even more impressive:

$ pip install autopep8
$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

This brought the number of changed files up to 160, and brought the number of warnings from 1726 down to 211. Note that autopep8 also supports an --aggressive option which allows non-whitespace changes. When I tried this, however, it only reduced the number of warnings from 211 to 198. I'll most likely fix those by hand.

Please note: If or when you're ready to try these commands on a project, you must first make sure you have no uncommitted changes locally. After each change, I also recommend (a) committing the results of each command as a separate commit so they're easier to unravel or review later, and (b) running your unit test suite to make sure nothing is inadvertently broken.

Addressing errors as groups (recommended)

While the above approach may work for smaller projects (not this one!), it can make code reviews difficult because all pyflakes or pep8 fixes are grouped together in a single commit. The more labor intensive but recommended approach is to address them in groups of similar errors. My colleague Rebecca Muraya recommended this approach and suggested the groups (thanks, Rebecca!):

  1. First, find and remove any unused imports:

    $ pip install autoflake
    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports
    

    After it's finished, review the changes, run your test suite, and commit the code.

  2. Now, do the same for unused variables:

    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-unused-variables
    

    Again, once complete, review the changes, run your test suite, and commit the code.

  3. Before moving on to pep8 errors, the following command provides an invaluable summary of errors not yet addressed:

    $ flake8 --statistics --count -qq
    
  4. Finally, autopep8 can be told to fix only certain error codes, like so:

    $ pip install autopep8
    $ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" --select="W291,W293" .
    

    This will remove trailing whitespace and trailing whitespace on blank lines. Once complete, review and commit your changes and move on to the next group of errors.

pep8's error codes are listed in detail on the autopep8 PyPI page and in the pep8 documentation. You can either group them yourself based on your preferences and the particular warnings in your project, or use the following as a guide:

  • Remove trailing whitespace, then configure your editor to keep it away:
    • W291 - Remove trailing whitespace.
    • W293 - Remove trailing whitespace on blank line.
  • Use your editor to find/replace all tabs, if any, with spaces, and then fix indentation with these error codes. This can have a semantic impact so the changes need to be reviewed carefully:
    • E101 - Reindent all lines.
    • E121 - Fix indentation to be a multiple of four.
  • Fix whitespace errors:
    • E20 - Remove extraneous whitespace.
    • E211 - Remove extraneous whitespace.
    • E22 - Fix extraneous whitespace around keywords.
    • E224 - Remove extraneous whitespace around operator.
    • E226 - Fix missing whitespace around arithmetic operator.
    • E227 - Fix missing whitespace around bitwise/shift operator.
    • E228 - Fix missing whitespace around modulo operator.
    • E231 - Add missing whitespace.
    • E241 - Fix extraneous whitespace around keywords.
    • E242 - Remove extraneous whitespace around operator.
    • E251 - Remove whitespace around parameter '=' sign.
    • E27 - Fix extraneous whitespace around keywords.
  • Adjust blank lines:
    • W391 - Remove trailing blank lines.
    • E301 - Add missing blank line.
    • E302 - Add missing 2 blank lines.
    • E303 - Remove extra blank lines.
    • E304 - Remove blank line following function decorator.
    • E309 - Add missing blank line (after class declaration).
  • Fix comment spacing:
    • E26 - Fix spacing after comment hash for inline comments.
    • E265 - Fix spacing after comment hash for block comments.
  • The following are aggressive fixes that can have semantic impact. It's best to do these one commit at a time and with careful review:
    • E711 - Fix comparison with None.
    • E712 - Fix comparison with boolean.
    • E721 - Use "isinstance()" instead of comparing types directly.
    • W601 - Use "in" rather than "has_key()".
    • W602 - Fix deprecated form of raising exception.
    • W603 - Use "!=" instead of "<>"

You can repeat steps 3 and 4 in the above with each group of error codes (or in the case of more aggressive fixes, single error codes) until they're all resolved. Once all the automatic fixes are done, you'll likely have some manual fixes left to do. Before those, you may want to see what remaining automatic fixes, if any, autopep8 suggests:

$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

Once all the errors have been resolved, add flake8 to your build process so you never have to go through this again.

Concluding Remarks

While I haven't finished all the manual edits as of the time of this post, I have reduced the number to about 153 warnings across the whole project. Most of the remaining warnings are long lines that pep8 wasn't comfortable splitting, for example, strings that needed to be broken up like so:

foo = ('a very '
       'long string')

Or other similar issues that couldn't be auto-corrected. To its credit, flake8 did detect two bugs, namely, a missing import (in some unused test code that should probably be deleted), and an instance of if not 'foo' in bar (instead of the correct version, if 'foo' not in bar).

My colleague Mark Lavin also remarked that flake8 does not raise warnings about variable naming, but the pep8-naming plugin is available to address this. The downside is that it doesn't like custom assertions which match the existing unittest style (e.g., assertOk, assertNotFound, etc.).

Good luck, and I hope this has been helpful!

Tim Hopper: On Showing Your Work

Caktus Group: AWS load balancers with Django

We recently had occasion to reconfigure some of our existing servers to use Amazon Web Services Elastic Load Balancers in front of them. Setting this up isn't hard, exactly, but there are a lot of moving parts that have to mesh correctly before things start to work, so I thought I'd write down what we did.

All of these tools have lots of options and ways to use them. I'm not trying to cover all the possibilities here. I'm just showing what we ended up doing.

Our requirements

We had some specific goals we wanted to achieve in this reconfiguration.

  • There should be no outside requests sneaking in -- the only requests that should reach the backend servers are those that come through the load balancer. We'll achieve this by setting the backend servers' security group(s) to only allow incoming traffic on those ports from the load balancer's security group.
  • The site should only handle requests that have the right Host header. We achieve this already by Nginx configuration (server_name) and won't have to change anything.
  • Redirect any non-SSL requests to SSL. The load balancer can't do this for us (as far as I could see), so we just forward incoming port 80 requests to the server’s port 80, and let our existing port 80 Nginx configuration continue to redirect all requests to our https: URL.
  • All SSL connections are terminated at the load balancer. Our site certificate and key are only needed on the load balancer. The backend servers don't need to process encryption, nor do we need to maintain SSL configuration on them. We'll have the load balancers forward the connections, unencrypted, to a new listening port in nginx, 8088, because we're redirecting everything from port 80 to https. (We could have configured the port 80 server to figure out from the headers whether the connection came into the load balancer over SSL, but we didn't, figuring that using a separate port would be fool-proof.) If we were concerned about security of the data between the load balancer and the backend, for example if financial or personal information was included, we could re-encrypt the forwarded connections, maybe using self-signed certificates on the backend servers to simplify managing their configurations.
  • Strict-Transport-Security header - we add this already in our Nginx configuration and will include it in our port 8088 configuration.
  • We need to access backend servers directly for deploys (via ssh). We achieve this by keeping our elastic IP addresses on our backend servers so they have stable IP addresses, even though the load balancers don't need them.
  • Some of our servers use basic auth (to keep unreleased sites private). This is in our Nginx configuration, but we'll need to open up the health check URL to bypass basic auth, since the load balancers can't provide basic auth on health checks.
  • Sites stay up through the change. We achieve this by making the changes incrementally, and making sure at all times there's a path for incoming requests to be handled.

All the pieces

Here are all the pieces that we had to get in place:

  • The site's hostname is a CNAME for the elastic load balancer’s hostname, so that requests for the site go to the load balancer instead of the backend servers. Don’t use the load balancer IP addresses directly, since they’ll change over time.
  • The backend servers' security group allows incoming requests on ports 80 and 8088, but only from the load balancer's security group. That allows the load balancer to forward requests, but requests cannot be sent directly to the backend servers even if someone knows their addresses.
  • There's a health check URL on the backend server that the load balancer can access, and that returns a 200 status (not 301 or 401 or anything else), so the load balancers can determine if the backend servers are up.
  • Apart from the health check, redirect port 80 requests to the https URL of the server (non-SSL to SSL), so that any incoming requests that aren't over SSL will be redirected to SSL.
  • Get the data about the request's origin from the headers where the load balancer puts it, and pass it along to Django in the headers that our Django configuration is expecting. This lets Django tell whether a request came in securely (see the settings sketch after this list).
  • The load balancer must be in the same region as the servers (AWS requirement).
  • Keep the elastic IP on our backend server so we can use that to get to it for administration. Deploys and other administrative tasks can no longer use the site domain name to access the backend server, since it now points at the load balancer.
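
On the Django side, the relevant settings are small. Here is a sketch of what this relies on (assuming Nginx passes the load balancer's X-Forwarded-Proto and Host headers through to the application; your header names may differ):

# settings.py (sketch)
# Trust the forwarded-protocol header set upstream so that
# request.is_secure() reflects the client's original connection.
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')

# Build absolute URLs from the Host header forwarded by the proxy.
USE_X_FORWARDED_HOST = True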

Where we started

Before adding the load balancer, our site was running on EC2 servers with Ubuntu. Nginx was accepting incoming requests on ports 80 and 443, redirecting all port 80 requests to https, adding basic auth on port 443 on some servers, proxying some port 443 requests to gunicorn with our Django application, and serving static files for the rest.

To summarize our backend server configuration before and after the change:

Before

  • Port 80 redirects all requests to https://server_URL
  • Port 443 terminates SSL and processes requests
  • Server firewall and AWS security group allow all incoming connections on port 80 and 443

After

  • Port 80 redirects all requests to https://server_URL
  • Port 8088 processes requests
  • Server firewall and AWS security group allow port 80 and 8088 connections from the load balancer only, and no port 443 connections at all.

Steps in order

  • DNS: shorten the DNS cache time for the site domain names to something like 5 minutes, so when we start changing them later, clients will pick up the change quickly. We'll lengthen these again when we're done.
  • Django: if needed, create a new view for health checks. We made one at /health/ that simply returned a response with status 200, bypassing all authentication (a minimal sketch of such a view follows this list). We can enhance that view later to do more checking, such as making sure the database is accessible.
  • Nginx: We added a new port 8088 server, copying the configuration from our existing port 443 server, but removing the ssl directives. We did keep the line that added the Strict-Transport-Security header.
  • Nginx: Added configuration in our new port 8088 to bypass basic auth for the /health/ URL only.
  • Ufw: opened port 8088 in the Linux firewall.
  • AWS: opened port 8088 in the servers' security group - for now, from all source addresses so we can test easily as we go.
  • AWS: add the SSL certificate in IAM
  • AWS: create a new load balancer in the same region as the servers
  • AWS: configure the new load balancer:
    • configure it to use the SSL certificate
    • set up a security group for the load balancer. It needs to accept incoming connections from the internet on ports 80 and 443.
    • instances: the backend servers this load balancer will forward to
    • health check: port 8088, URL /health/. Set the period and number of checks small for now, e.g. 30 seconds and 2 checks.
    • listeners: 80 -> 80, 443 ssl -> 8088 non-ssl
  • Tests: Now stop to make sure things are working right so far:
    • The load balancer should show the instance in service (after the health check period has passed).
    • With the site domain set in your local /etc/hosts file to point at one of the load balancer's IP addresses, the site should work on ports 80 and 443.
    • Undo your local /etc/hosts changes, since the load balancer IPs will change over time!

  • AWS: update the backend servers' security group to only accept 8088 traffic from the load balancer's security group

  • Test: the health check should still pass, since it's coming in on port 8088 from the load balancer.

  • DNS: update DNS to make the site domain a CNAME for the load balancer's A name

  • wait for DNS propagation
  • test: site should still work when accessed using its hostname.
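
A health check view can be very small. Here is a minimal sketch of the kind of view described above (the names and URL pattern are illustrative, in the Django 1.x url() style):

# views.py (sketch)
from django.http import HttpResponse


def health(request):
    # Later this can be extended to check the database, cache, etc.
    return HttpResponse("OK", status=200)


# urls.py (sketch)
# from django.conf.urls import url
# urlpatterns = [
#     url(r'^health/$', health),
# ]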

Cleanup

These steps should now be safe to do, but it doesn't hurt to test again after each step, just to be sure.

  • Nginx: remove port 443 server from nginx configuration.
  • AWS: remove port 443 from backend servers’ security group. Configure ports 80 and 8088 to only accept incoming connections from the load balancer's security group.
  • Ufw: block port 443 in server firewall
  • AWS: in the load balancer health check configuration, lengthen the time between health checks, and optionally require more passing checks before treating an instance as live.
  • docs: Update your deploy docs with the changes to how the servers are deployed!
  • DNS: lengthen the cache time for the server domain name(s) if you had shortened it before.

Tim Hopper: Dirichlet Process Notebooks

I created a dedicated Github repository for the recent posts I've been doing on Dirichlet processes.

I also added a note responding to some of Dan Roy's helpful push-back on my understanding of the term Dirichlet process.

Tim Hopper: Releasing to Anaconda.org via Travis-CI

I've spent the last few days figuring out how to co-opt Travis CI as a build server for Anaconda.org (the package management service previously known as Binstar). I need to be able to release five conda packages to Anaconda.org, built for both Linux and OS X. Travis is not the right tool for this job, but I have managed to make it work. Since I've talked to others who are wrestling with this same problem, I am sharing my solution here.

Each of my five projects lives in a separate repository on Github. We have each of them as a submodule in a common repository called release. Each of these projects has a build.sh and meta.yaml file (required by Anaconda.org) in a subdirectory called conda/PACKAGE-NAME.

In my release directory, I have a simple Python script called build.py. It takes the name of a submodule and an Anaconda.org channel as command line arguments. It then uses the Python sh module to call the conda build and binstar upload commands.
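
build.py itself is short. The sketch below shows the general shape, not the exact script: the recipe path follows the conda/PACKAGE-NAME layout described above, and the channel and authentication-token handling described next are omitted.

# build.py (sketch)
import sys

import sh


def build_and_upload(package):
    recipe_dir = "{0}/conda/{0}".format(package)
    # Build the conda package from its recipe directory.
    sh.conda("build", recipe_dir, _out=sys.stdout)
    # Ask conda where the built artifact landed, then upload it.
    built_path = str(sh.conda("build", recipe_dir, "--output")).strip()
    # The real script also passes the channel and a Binstar auth token.
    sh.binstar("upload", built_path)


if __name__ == "__main__":
    build_and_upload(sys.argv[1])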

binstar upload requires authentication at Anaconda.org. I did this by using a Binstar token which I add to the binstar command after fetching it from an encrypted Travis environmental variable.

I call build.py for each of the five projects from the script: section of my .travis.yml file. This will build each package on the Travis workers and then release it to Anaconda.org.

There is no easy way to get Travis to build for Linux and OS X simultaneously. However, it can be tricked into building for one or the other by changing the language: specified in the .travis.yml file. (language: objective-c will force Travis to use an OS X worker; by default, it uses a Linux worker.) I wrote a fabric script that provides command line commands which will modify the language value in my .travis.yml file and then push the release repository to Github. Github triggers a Travis CI build which then deploys the repository to Anaconda.org!

If I want to cut a new release for OS X, I simply call $ fab release_osx, for Linux, I call $ fab release_linux. By default, this will release to the "main" channel on Anaconda.org. I can release to a different channel (e.g. "dev") with $ fab release_linux:dev. When specifying the channel, the fabfile will modify my .travis.yml file to set an environmental variable that is picked up when calling build.py.
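
The fabfile tasks look roughly like the sketch below (Fabric 1.x style). The helper name, the exact string substitution, and the ANACONDA_CHANNEL variable name are illustrative rather than the real implementation:

# fabfile.py (sketch)
from fabric.api import local, task


def _write_travis_config(language, channel):
    # Generate .travis.yml from the canonical travis.yml, forcing the
    # worker OS via the language setting and recording the channel.
    with open("travis.yml") as f:
        config = f.read().replace("language: python", "language: " + language)
    config += "\nenv:\n  global:\n    - ANACONDA_CHANNEL={0}\n".format(channel)
    with open(".travis.yml", "w") as f:
        f.write(config)


@task
def release_linux(channel="main"):
    _write_travis_config("python", channel)
    local("git add .travis.yml && git commit -m 'Release (Linux)' && git push")


@task
def release_osx(channel="main"):
    _write_travis_config("objective-c", channel)
    local("git add .travis.yml && git commit -m 'Release (OS X)' && git push")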

Finally, my .travis.yml file instructs Travis on preparing the build environment differently between operating systems by using Travis's built-in $TRAVIS_OS_NAME environmental variable and calling the appropriate setup scripts.[1]

Also, to update the submodules to origin/master, I created a fabric command: $ fab update. This command calls git submodule update --remote --rebase.

This certainly isn't a perfect solution, but it'll work for the time being. I certainly look forward to easier solutions being developed in the future![2]

Thanks to my colleague Stephen Tu who laid a lot of the groundwork for this!


  1. You may notice I have files called travis.yml and .travis.yml. Originally, my fabfile just modified the latter on the fly. For clarity, I started using travis.yml as the canonical file and .travis.yml is what is generated by the fabfile commands and used by Travis. 

  2. Continuum is starting to provide paid, hosted build servers to do this very task 

Caktus Group: Announcing the Caktus Open Source Fellowship

We are excited to announce the creation and funding of a pilot program for open source contributions here at Caktus Group. This program is inspired by the Django Software Foundation’s fellowship as well as the Two Day Manifesto. For this program, Caktus seeks to hire a part-time developer for twelve weeks this fall for the sole purpose of contributing back to open source projects. Caktus builds web applications based on open source tools and the continued growth of these projects is important to us. Open source projects such as Python and Django have given so much to this company and this is one of many ways we are trying to give back.

We are looking for candidates who would love to collaborate with us in our offices in downtown Durham, NC to directly contribute to both open source projects created at Caktus as well as open source projects used by Caktus, such as Django. As previously mentioned, this will be a part-time position and will be taking the place of our normal fall internship program. If successful, we may expand this into a full-time position for future iterations.

I think this will be a great opportunity for a developer to get experience working with and contributing to open source. It could be a great resume builder for a relatively new developer where all of your work will be publicly visible. It could also be a fun break from the ordinary for a more experienced developer who would love to be paid to work on open source. Through our mentorship, we hope this program will empower people who otherwise would not contribute to open source.

Prior experience with Python is a general requirement, but no prior open source contributions are required. If you have specific projects you would like to contribute to, we would love to know about them during the application process.

We are looking forward to reviewing and selecting applicants over the next few weeks. You can find more details on the position as well as the application form here: https://www.caktusgroup.com/careers/#op-73393-caktus-open-source-fellow

Caktus Group: Announcing Django Girls RDU: Free Coding Workshop for Women

Django Girls Logo

We’re incredibly excited to announce the launch of Django Girls RDU, a group in NC’s Triangle region that hosts free one-day Django coding workshops for women. Django Girls is part of an international movement that’s helped 1,600 (and counting!) women learn how to code.

We originally got involved with Django Girls at the impetus of our Technical Director, Mark Lavin. While discussing our efforts to support women in tech, Mark was emphatic: there was this group, Django Girls, that was doing extraordinary work engaging women in Django. Clearly, we needed something like that locally. Luckily for us, Django Girls was coming to PyCon and we'd get to see first hand just how wonderful they are.

Mark Lavin coaching at Django Girls PyCon 2015

Four of our team members volunteered as coaches for a Django Girls workshop during PyCon 2015. There was nothing quite like seeing the impact Django Girls had on each attendee. The environment was warm and friendly. The tutorials for students, coaches, and organizers were detailed and extremely well thought out. The passion of the Django Girls organizers was simply infectious. Out of a desire to prolong this excitement and share it with everyone we knew, we put together a team and applied to have a Django Girls RDU chapter. We’re honored to partner with such a wonderful group!

The first workshop will be on October 3rd; applications are due September 4th. We have five great coaches from Caktus and PyLadies RDU, and each coach will work one-on-one with three students to build their first Django website, a blog. We’re looking for volunteers to coach and sponsor the workshop. Each additional coach means three more students we can accept. If you’d like to get involved, please email us at durham@djangogirls.org. And, of course, we’re also looking for women who want to learn how to code.

Not able to make the October 3rd meetup? You'll also find members of our team coaching at DjangoCon's Django Girls Austin. To learn about more Django Girl activities, please follow us @djangogirlsRDU or visit the DjangoGirls website.

Tim Hopper: A Programmer's Portfolio

I am convinced that a programming student hoping to get a job in that field should be actively building a portfolio online. Turn those class projects, presentations, and reports into Github repositories or blog posts! I felt vindicated as I read this anecdote in Peopleware:

In the spring of 1979, while teaching together in western Canada, we got a call from a computer science professor at the local technical college. He proposed to stop by our hotel after class one evening and buy us beers in exchange for ideas. That's the kind of offer we seldom turn down. What we learned from him that evening was almost certainly worth more than whatever he learned from us.

The teacher was candid about what he needed to be judged a success in his work: He needed his students to get good job offers and lots of them. "A Harvard diploma is worth something in and of itself, but our diploma isn't worth squat. If this year's graduates don't get hired fast, there are no students next year and I'm out of a job." So he had developed a formula to make his graduates optimally attractive to the job market. Of course he taught them modern techniques for system construction, including structured analysis and design, data-driven design, information hiding, structured coding, walk throughs, and metrics. He also had them work on real applications for nearby companies and agencies. But the center piece of his formula was the portfolio that all students put together to show samples of their work.

He described how his students had been coached to show off their portfolios as part of each interview:

"I've brought along some samples of the kind of work I do. Here, for instance, is a subroutine in Pascal from one project and a set of COBOL paragraphs from another. As you can see in this portion, we use the loop-with-exit extension advocated by Knuth, but aside from that, it's pure structured code, pretty much the sort of thing that your company standard calls for. And here is the design that this code was written from. The hierarchies and coupling analysis use Myers' notation. I designed all of this particular subsystem, and this one little section where we used some Orr methods because the data structure really imposed itself on the process structure. And these are the leveled data flow diagrams that makeup the guts of our specification, and the associated data dictionary. ..."

In the years since, we've often heard more about that obscure technical college and those portfolios. We've met recruiters from as far away as Triangle Park, North Carolina, and Tampa, Florida, who regularly converge upon that distant Canadian campus for a shot at its graduates.

Of course, this was a clever scheme of the professor's to give added allure to his graduates, but what struck us most that evening was the report that interviewers were always surprised by the portfolios. That meant they weren't regularly requiring all candidates to arrive with portfolios. Yet why not? What could be more sensible than asking each candidate to bring along some samples of work to the interview?

Tim Hopper: Nonparametric Latent Dirichlet Allocation

I wrote this in an IPython Notebook. You may prefer to view it on nbviewer.

In [1]:
%matplotlib inline
%precision 2
Out[1]:
u&apos%.2f&apos

Latent Dirichlet Allocation is a generative model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determine (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a friendly introduction you should read.

Because LDA is a generative model, we can simulate the construction of documents by forward-sampling from the model. The generative algorithm is as follows (following Heinrich):

  • for each topic $k\in [1,K]$ do
    • sample term distribution for topic $\overrightarrow \phi_k \sim \text{Dir}(\overrightarrow \beta)$
  • for each document $m\in [1, M]$ do
    • sample topic distribution for document $\overrightarrow\theta_m\sim \text{Dir}(\overrightarrow\alpha)$
    • sample document length $N_m\sim\text{Pois}(\xi)$
    • for all words $n\in [1, N_m]$ in document $m$ do
      • sample topic index $z_{m,n}\sim\text{Mult}(\overrightarrow\theta_m)$
      • sample term for word $w_{m,n}\sim\text{Mult}(\overrightarrow\phi_{z_{m,n}})$

You can implement this with a little bit of code and start to simulate documents.
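
As a rough sketch of that generative process (using numpy's samplers and illustrative parameter values; the parameter names mirror those defined later in this post, but this is not the code used for the examples below):

# A compact forward sampler for the finite LDA model described above.
import numpy as np

vocabulary = ['see', 'spot', 'run']
num_topics = 2                          # K
num_documents = 5                       # M
mean_document_length = 5                # xi
beta = np.ones(len(vocabulary))         # term Dirichlet parameter
alpha = np.ones(num_topics)             # topic Dirichlet parameter

phi = np.random.dirichlet(beta, size=num_topics)      # term distribution per topic
documents = []
for m in range(num_documents):
    theta = np.random.dirichlet(alpha)                # topic distribution for document m
    N_m = np.random.poisson(mean_document_length)     # document length
    words = []
    for n in range(N_m):
        z = np.random.choice(num_topics, p=theta)             # sample topic index
        words.append(np.random.choice(vocabulary, p=phi[z]))  # sample term
    documents.append(words)
print documents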

In LDA, we assume each word in the document is generated by a two-step process:

  1. Sample a topic from the topic distribution for the document.
  2. Sample a word from the term distribution from the topic.

When we fit the LDA model to a given text corpus with an inference algorithm, our primary objective is to find the set of topic distributions $\underline \Theta$, term distributions $\underline \Phi$ that generated the documents, and latent topic indices $z_{m,n}$ for each word.

To run the generative model, we need to specify each of these parameters:

In [2]:
vocabulary = ['see', 'spot', 'run']
num_terms = len(vocabulary)
num_topics = 2 # K
num_documents = 5 # M
mean_document_length = 5 # xi
term_dirichlet_parameter = 1 # beta
topic_dirichlet_parameter = 1 # alpha

The term distribution vector $\underline\Phi$ is a collection of samples from a Dirichlet distribution. This describes how our 3 terms are distributed across each of the two topics.

In [3]:
from scipy.stats import dirichlet, poisson
from numpy import round
from collections import defaultdict
from random import choice as stl_choice
In [4]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
term_distributions = dirichlet(term_dirichlet_vector, 2).rvs(size=num_topics)
print term_distributions
[[ 0.41  0.02  0.57]
 [ 0.38  0.36  0.26]]

Each document corresponds to a categorical distribution across this distribution of topics (in this case, a 2-dimensional categorical distribution). This categorical distribution is a distribution of distributions; we could look at it as a Dirichlet process!

The base distribution of our Dirichlet process is a uniform distribution of topics (remember, topics are term distributions).

In [5]:
base_distribution = lambda: stl_choice(term_distributions)
# A sample from base_distribution is a distribution over terms
# Each of our two topics has equal probability
from collections import Counter
for topic, count in Counter([tuple(base_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 5066 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
count: 4934 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

Recall that a sample from a Dirichlet process is a distribution that approximates (but varies from) the base distribution. In this case, a sample from the Dirichlet process will be a distribution over topics that varies from the uniform distribution we provided as a base. If we use the stick-breaking metaphor, we are effectively breaking a stick one time and the size of each portion corresponds to the proportion of a topic in the document.

To construct a sample from the DP, we need to again define our DP class:

In [6]:
from scipy.stats import beta
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights) :
            return self.cache[i]
        else:
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value
      
    @staticmethod 
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

For each document, we will draw a topic distribution from the Dirichlet process:

In [7]:
topic_distribution = DirichletProcessSample(base_measure=base_distribution, 
                                            alpha=topic_dirichlet_parameter)

A sample from this topic distribution is a distribution over terms. However, unlike our base distribution which returns each term distribution with equal probability, the topics will be unevenly weighted.

In [8]:
for topic, count in Counter([tuple(topic_distribution()) for _ in range(10000)]).most_common():
    print "count:", count, "topic:", [round(prob, 2) for prob in topic]
count: 9589 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
count: 411 topic: [0.40999999999999998, 0.02, 0.56999999999999995]

To generate each word in the document, we draw a sample topic from the topic distribution, and then a term from the term distribution (topic).

In [9]:
topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_distribution, 
                                                    alpha=topic_dirichlet_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [10]:
for doc in documents.values():
    print doc
[&apossee&apos, &aposrun&apos, &apossee&apos, &aposspot&apos, &apossee&apos, &aposspot&apos]
[&apossee&apos, &aposrun&apos, &apossee&apos]
[&apossee&apos, &aposrun&apos, &apossee&apos, &apossee&apos, &aposrun&apos, &aposspot&apos, &aposspot&apos]
[&aposrun&apos, &aposrun&apos, &aposrun&apos, &aposspot&apos, &aposrun&apos]
[&aposrun&apos, &aposrun&apos, &apossee&apos, &aposspot&apos, &aposrun&apos, &aposrun&apos]

We can see how each topic (term-distribution) is distributed across the documents:

In [11]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 6 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 1
      count: 3 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
Doc: 2
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 2 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 3
      count: 5 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
Doc: 4
      count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
      count: 1 topic: [0.38, 0.35999999999999999, 0.26000000000000001]

To recap: for each document we draw a sample from a Dirichlet Process. The base distribution for the Dirichlet process is a categorical distribution over term distributions; we can think of the base distribution as an $n$-sided die where $n$ is the number of topics and each side of the die is a distribution over terms for that topic. By sampling from the Dirichlet process, we are effectively reweighting the sides of the die (changing the distribution of the topics).

For each word in the document, we draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

Given this formulation, we might ask if we can roll an infinite-sided die to draw from an unbounded number of topics (term distributions). We can do exactly this with a Hierarchical Dirichlet process. Instead of the base distribution of our Dirichlet process being a finite distribution over topics (term distributions), we will instead make it an infinite distribution over topics (term distributions) by using yet another Dirichlet process! This base Dirichlet process will have as its base distribution a Dirichlet distribution over terms.

We will again draw a sample from a Dirichlet Process for each document. The base distribution for the Dirichlet process is itself a Dirichlet process whose base distribution is a Dirichlet distribution over terms. (Try saying that five times fast.) We can think of this as a countably infinite die where each side is a distribution over terms for that topic. The sample we draw is a topic (distribution over terms).

For each word in the document, we will draw a sample (a term distribution) from the distribution (over term distributions) sampled from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.

These last few paragraphs are confusing! Let's illustrate with code.

In [12]:
term_dirichlet_vector = num_terms * [term_dirichlet_parameter]
base_distribution = lambda: dirichlet(term_dirichlet_vector).rvs(size=1)[0]

base_dp_parameter = 10
base_dp = DirichletProcessSample(base_distribution, alpha=base_dp_parameter)

This sample from the base Dirichlet process is our infinite-sided die. It is a probability distribution over a countably infinite number of topics.

The fact that our die is countably infinite is important. The sampler base_distribution draws topics (term-distributions) from an uncountable set. If we used this as the base distribution of the Dirichlet process below, each document would be constructed from a completely unique set of topics. By feeding base_distribution into a Dirichlet Process (stochastic memoizer), we allow the topics to be shared across documents.

In other words, base_distribution will never return the same topic twice; however, every topic sampled from base_dp would be sampled an infinite number of times (if we sampled from base_dp forever). At the same time, base_dp will also return an infinite number of topics. In our formulation of the LDA sampler above, our base distribution only ever returned a finite number of topics (num_topics); there is no num_topics parameter here.

Given this setup, we can generate documents from the hierarchical Dirichlet process with an algorithm that is essentially identical to that of the original latent Dirichlet allocation generative sampler:

In [13]:
nested_dp_parameter = 10

topic_index = defaultdict(list)
documents = defaultdict(list)

for doc in range(num_documents):
    topic_distribution_rvs = DirichletProcessSample(base_measure=base_dp, 
                                                    alpha=nested_dp_parameter)
    document_length = poisson(mean_document_length).rvs()
    for word in range(document_length):
        topic_distribution = topic_distribution_rvs()
        topic_index[doc].append(tuple(topic_distribution))
        documents[doc].append(choice(vocabulary, p=topic_distribution))

Here are the documents we generated:

In [14]:
for doc in documents.values():
    print doc
[&aposspot&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos, &aposrun&apos]
[&aposspot&apos, &aposspot&apos, &apossee&apos, &aposspot&apos]
[&aposspot&apos, &aposspot&apos, &aposspot&apos, &apossee&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos]
[&aposrun&apos, &aposrun&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos, &aposspot&apos]
[&apossee&apos, &aposrun&apos, &apossee&apos, &aposrun&apos, &aposrun&apos, &aposrun&apos]

And here are the latent topics used:

In [15]:
for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):
    print "Doc:", i
    for topic, count in doc:
        print  5*" ", "count:", count, "topic:", [round(prob, 2) for prob in topic]
Doc: 0
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
      count: 1 topic: [0.22, 0.40000000000000002, 0.38]
Doc: 1
      count: 2 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 1 topic: [0.35999999999999999, 0.55000000000000004, 0.089999999999999997]
Doc: 2
      count: 4 topic: [0.11, 0.65000000000000002, 0.23999999999999999]
      count: 2 topic: [0.070000000000000007, 0.65000000000000002, 0.27000000000000002]
      count: 1 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
Doc: 3
      count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]
      count: 2 topic: [0.25, 0.55000000000000004, 0.20000000000000001]
      count: 2 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]
      count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]
      count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]
Doc: 4
      count: 3 topic: [0.40000000000000002, 0.23000000000000001, 0.37]
      count: 2 topic: [0.42999999999999999, 0.17999999999999999, 0.40000000000000002]
      count: 1 topic: [0.23000000000000001, 0.29999999999999999, 0.46000000000000002]

Our documents were generated by an unspecified number of topics, and yet the topics were shared across the 5 documents. This is the power of the hierarchical Dirichlet process!

This non-parametric formulation of Latent Dirichlet Allocation was first published by Yee Whye Teh et al.

Unfortunately, forward sampling is the easy part. Fitting the model on data requires complex MCMC or variational inference. There are a limited number of implementations of HDP-LDA available, and none of them are great.

Tim Hopper: Sampling from a Hierarchical Dirichlet Process

This may be more readable on NBViewer.

In [137]:
%matplotlib inline

As we saw earlier, the Dirichlet process describes the distribution of a random probability distribution. The Dirichlet process takes two parameters: a base distribution $H_0$ and a dispersion parameter $\alpha$. A sample from the Dirichlet process is itself a probability distribution that looks like $H_0$. On average, the larger $\alpha$ is, the closer a sample from $\text{DP}(\alpha H_0)$ will be to $H_0$.

Suppose we're feeling masochistic and want to input a distribution sampled from a Dirichlet process as the base distribution to a new Dirichlet process. (It will turn out that there are good reasons for this!) Conceptually this makes sense. But can we construct such a thing in practice? Said another way, can we build a sampler that will draw samples from a probability distribution drawn from these nested Dirichlet processes? We might initially try to construct a sample (a probability distribution) from the first Dirichlet process before feeding it into the second.

But recall that fully constructing a sample (a probability distribution!) from a Dirichlet process would require drawing a countably infinite number of samples from $H_0$ and from the beta distribution to generate the weights. This would take forever, even with Hadoop!

Dan Roy, et al. helpfully described a technique of using stochastic memoization to construct a distribution sampled from a Dirichlet process in a just-in-time manner. This process provides us with the equivalent of the Scipy rvs method for the sampled distribution. Stochastic memoization is equivalent to the Chinese restaurant process: sometimes you get seated at an occupied table (i.e. sometimes you're given a sample you've seen before) and sometimes you're put at a new table (given a unique sample).

Here is our memoization class again:

In [162]:
from numpy.random import choice 
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights) :
            return self.cache[i]
        else:
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value
        
    @staticmethod 
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

Let's illustrate again with a standard normal base measure. We can construct a function base_measure that generates samples from it.

In [95]:
from scipy.stats import norm

base_measure = lambda: norm().rvs() 

Because the normal distribution has continuous support, we can generate samples from it forever and we will never see the same sample twice (in theory). We can illustrate this by drawing from the distribution ten thousand times and seeing that we get ten thousand unique values.

In [163]:
from pandas import Series

ndraws = 10000
print "Number of unique samples after {} draws:".format(ndraws), 
draws = Series([base_measure() for _ in range(ndraws)])
print draws.unique().size
Number of unique samples after 10000 draws: 10000

However, when we feed the base measure through the stochastic memoization procedure and then sample, we get many duplicate samples. The number of unique samples goes down as $\alpha$ increases.

In [164]:
norm_dp = DirichletProcessSample(base_measure, alpha=100)

print "Number of unique samples after {} draws:".format(ndraws), 
dp_draws = Series([norm_dp() for _ in range(ndraws)])
print dp_draws.unique().size
Number of unique samples after 10000 draws: 446

At this point, we have a function dp_draws that returns samples from a probability distribution (specifically, a probability distribution sampled from $\text{DP}(\alpha H_0)$). We can use dp_draws as a base distribution for another Dirichlet process!

In [155]:
norm_hdp = DirichletProcessSample(norm_dp, alpha=10)

How do we interpret this? norm_dp is a sampler from a probability distribution that looks like the standard normal distribution. norm_hdp is a sampler from a probability distribution that "looks like" the distribution norm_dp samples from.

Here is a histogram of samples drawn from norm_dp, our first sampler.

In [152]:
import matplotlib.pyplot as plt
import pandas as pd

pd.Series(norm_dp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_dp")

And here is a histogram for samples drawn from norm_hdp, our second sampler.

In [154]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of Samples from norm_hdp")

The second plot doesn't look very much like the first! The level to which a sample from a Dirichlet process approximates the base distribution is a function of the dispersion parameter $\alpha$. Because I set $\alpha=10$ (which is relatively small), the approximation is fairly coarse. In terms of memoization, a small $\alpha$ value means the stochastic memoizer will more frequently reuse values already seen instead of drawing new ones.

This nesting procedure, where a sample from one Dirichlet process is fed into another Dirichlet process as a base distribution, is more than just a curiosity. It is known as a Hierarchical Dirichlet Process, and it plays an important role in the study of Bayesian Nonparametrics (more on this in a future post).

Without the stochastic memoization framework, constructing a sampler for a hierarchical Dirichlet process is a daunting task. We want to be able to draw samples from a distribution drawn from the second level Dirichlet process. However, to be able to do that, we need to be able to draw samples from a distribution sampled from a base distribution of the second-level Dirichlet process: this base distribution is a distribution drawn from the first-level Dirichlet process.

It might appear that we would need to fully construct the first-level sample (by drawing a countably infinite number of samples from the first-level base distribution). However, stochastic memoization allows us to construct the first distribution just-in-time, only as it is needed at the second level.

We can define a Python class that encapsulates the Hierarchical Dirichlet Process as a subclass of our Dirichlet process sampler.

In [165]:
class HierarchicalDirichletProcessSample(DirichletProcessSample):
    def __init__(self, base_measure, alpha1, alpha2):
        first_level_dp = DirichletProcessSample(base_measure, alpha1)
        self.second_level_dp = DirichletProcessSample(first_level_dp, alpha2)

    def __call__(self):
        return self.second_level_dp()

Since the Hierarchical DP is a Dirichlet process inside of a Dirichlet process, we must provide it with both a first and second level $\alpha$ value.

In [167]:
norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=20)

We can sample directly from the probability distribution drawn from the Hierarchical Dirichlet Process.

In [170]:
pd.Series(norm_hdp() for _ in range(10000)).hist()
_=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")

norm_hdp is not equivalent to the Hierarchical Dirichlet Process; it samples from a single distribution sampled from this HDP. Each time we instantiate the norm_hdp variable, we are getting a sampler for a unique distribution. Below we sample five times and get five different distributions.

In [180]:
for i in range(5):
    norm_hdp = HierarchicalDirichletProcessSample(base_measure, alpha1=10, alpha2=10)
    _=pd.Series(norm_hdp() for _ in range(100)).hist()
    _=plt.title("Histogram of samples from distribution drawn from Hierarchical DP")
    _=plt.figure()
<matplotlib.figure.Figure at 0x112a2da50>

In a later post, I will discuss how these tools are applied in the realm of Bayesian nonparametrics.

Tim Hopper: High Quality Code at Quora

I love this new post on Quora's engineering blog. The post states "high code quality is the long-term boost to development speed" and goes on to explain how they go about accomplishing this.

I've inherited large code bases at each of my jobs out of grad school, and I've spent a lot of time thinking about this question. At least on the surface, I love the solutions Quora has in place for ensuring quality code: thoughtful code review, careful testing, style guidelines, static checking, and intentional code cleanup.

Og Maciel: Books - July 2015

Books - July 2015

This July 2015 I travelled to the Red Hat office in Brno, Czech Republic to spend some time with my teammates there, and I managed to get a lot of reading done between long plane rides and being jet lagged for many nights :) So I finally managed to finish up some of the books that had been lingering on my ToDo list and even managed to finally read a few of the books that together make up the Chronicles of Narnia, since I had never read them as a kid.

Read

Out of all the books I read this month, I feel that All Quiet on the Western Front and The October Country were the ones I enjoyed reading the most, closely followed by Cryptonomicon, which took me a while to get through. The other books, with the exception of The Memoirs of Sherlock Holmes, helped me pass the time when I only wanted to be entertained.

All Quiet on the Western Front takes the prize for being one of the best books I have ever read! I felt that the way WWI was presented through the eyes of the main character was a great way to represent all the pain, angst and suffering that all sides of the conflict went through, without catering to any particular side or having an agenda. Erich Maria Remarque's style had me sometimes breathless, sometimes with a knot in the pit of my stomach as I 'endured' the many life changing events that took place in the book. Is this an action-packed book about WWI? Will it read like a thriller? In my opinion, even though there are many chapters with gory details about killings and battles, the answer is a very bland 'maybe'. I think that the real 'star' of this book is its philosophical view of the war and how the main characters, all around 19-20 years of age, learn to deal with its life lasting effects.

I have been a huge fan of Ray Bradbury for a while now, and when I got The October Country for my birthday last month, I just knew that it would be time well spent reading it. For those of you who are more acquainted with his science fiction works, this book will surprise you as it shows you a bit of his 'darker' side. All of the short stories included in this collection deal with death, mysterious apparitions, and inexplicable endings, and are sure to spook you a little bit.

Cryptonomicon was at times slow, at other times funny and, especially toward the end, a very entertaining book. Weighing in at a hefty 1000 pages (depending on the edition you have, plus/minus 50 odd pages), this book covers two different periods in the lives of a number of different characters, past (around WWII) and present, with all the different threads eventually leading to a great finale. Alternating between past and present, the story takes us to the early days of how cryptology was 'officially invented' and used during the war, and how many of the events that took place back then were affecting the lives of some of the direct descendants of the main characters in our present day. As you go through the back and forth you start to gather bits and pieces of information that eventually connect all the dots of an interesting puzzle. It definitely requires a long term commitment to go through it, but it was enjoyable and, as I mentioned before, it made me laugh in many places.

Caktus GroupUsing Unsaved Related Models for Sample Data in Django 1.8

Note: Between the time I originally wrote this post and its publication, a ticket and pull request were opened in Django to remove allow_unsaved_instance_assignment and move validation to the model save() method, which makes much more sense anyway. It's likely this will even be backported to Django 1.8.4. So, if you're using a version of Django that doesn't require this, hopefully you'll never stumble across this post in the first place! If this is still an issue for you, here's the original post:

In versions of Django prior to 1.8, it was easy to construct "sample" model data by putting together a collection of related model objects, even if none of those objects was saved to the database. Django 1.8 - 1.8.3 adds a restriction that prevents this behavior. Errors such as this are generally a sign that you're encountering this issue:

ValueError: Cannot assign "...": "MyRelatedModel" instance isn't saved in the database.
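
For example, building up sample data entirely in memory along these lines (a minimal sketch using the Author and Book models from the workaround shown below) will raise that error:

author = Author()      # built in memory only; never saved
book = Book()
book.author = author   # raises ValueError on Django 1.8 - 1.8.3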

The justification for this change is that, previously, an unsaved object assigned to a foreign key could be silently lost when the assigning instance was saved to the database. Django 1.8 does provide a backwards-compatibility flag for working around the issue. The workaround, per the Django documentation, is to create a new ForeignKey field that removes this restriction, like so:

class UnsavedForeignKey(models.ForeignKey):
    # A ForeignKey which can point to an unsaved object
    allow_unsaved_instance_assignment = True

class Book(models.Model):
    author = UnsavedForeignKey(Author)

This may be undesirable, however, because it removes the protection for every use of this foreign key, even in cases where you do want Django to ensure the assigned value has been saved.

There is a middle ground, not immediately obvious, that involves changing this attribute temporarily during the assignment of an unsaved value and then immediately changing it back. This can be accomplished by writing a context manager to change the attribute, for example:

import contextlib

@contextlib.contextmanager
def allow_unsaved(model, field):
    model_field = model._meta.get_field(field)
    saved = model_field.allow_unsaved_instance_assignment
    model_field.allow_unsaved_instance_assignment = True
    try:
        yield
    finally:
        # Restore the original value even if the block raises.
        model_field.allow_unsaved_instance_assignment = saved

To use this context manager, surround any assignment of an unsaved foreign key value with it as follows:

with allow_unsaved(MyModel, 'my_fk_field'):
    my_obj.my_fk_field = unsaved_instance
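
Putting the pieces together, a sample-data helper for the Book and Author models above might look something like this sketch:

def make_sample_book():
    # Build the related objects entirely in memory; nothing touches the database.
    author = Author()
    book = Book()
    with allow_unsaved(Book, 'author'):
        book.author = author
    return book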

The specifics of how you access the field to pass into the context manager are important; any other way will likely generate the following error:

RelatedObjectDoesNotExist: MyModel has no instance.

While strictly speaking this approach is not thread safe, it should work for any process-based worker model (such as the default "sync" worker in Gunicorn).

This took a few iterations to figure out, so hopefully it will (still) prove useful to someone else!

Tim Hopper10x Engineering

Tim HopperNotes on the Dirichlet Distribution and Dirichlet Process

In [3]:
%matplotlib inline

Note: I wrote this post in an IPython notebook. It might be rendered better on NBViewer.

Dirichlet Distribution

The symmetric Dirichlet distribution (DD) can be considered a distribution of distributions. Each sample from the DD is a categorical distribution over $K$ categories. It is parameterized by $G_0$, a distribution over $K$ categories, and $\alpha$, a scale factor.

The expected value of the DD is $G_0$. The variance of the DD is a function of the scale factor. When $\alpha$ is large, samples from $DD(\alpha\cdot G_0)$ will be very close to $G_0$. When $\alpha$ is small, samples will vary more widely.

We demonstrate below by setting $G_0=[.2, .2, .6]$ and varying $\alpha$ from 0.1 to 1000. In each case, the mean of the samples is roughly $G_0$, but the standard deviation decreases as $\alpha$ increases.

In [10]:
import numpy as np
from scipy.stats import dirichlet
np.set_printoptions(precision=2)

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    samples = dirichlet(alpha=scale_factor * np.array(G0)).rvs(N)
    print("                          alpha:", scale_factor)
    print("              element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))
    print()
    
for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
                          alpha: 0.1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.38  0.38  0.47]

                          alpha: 1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.28  0.28  0.35]

                          alpha: 10
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.12  0.12  0.15]

                          alpha: 100
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.04  0.04  0.05]

                          alpha: 1000
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.01  0.01  0.02]

Dirichlet Process

The Dirichlet Process can be considered a way to generalize the Dirichlet distribution. While the Dirichlet distribution is parameterized by a discrete distribution $G_0$ and generates samples that are similar discrete distributions, the Dirichlet process is parameterized by a generic distribution $H_0$ and generates samples which are distributions similar to $H_0$. The Dirichlet process also has a parameter $\alpha$ that determines how widely samples vary from $H_0$.

We can construct a sample $H$ (recall that $H$ is a probability distribution) from a Dirichlet process $\text{DP}(\alpha H_0)$ by drawing a countably infinite number of samples $\theta_k$ from $H_0$ and setting:

$$H=\sum_{k=1}^\infty \pi_k \cdot\delta(x-\theta_k)$$

where the $\pi_k$ are carefully chosen weights (more later) that sum to 1. ($\delta$ is the Dirac delta function.)

$H$, a sample from $DP(\alpha H_0)$, is a probability distribution that looks similar to $H_0$ (also a distribution). In particular, $H$ is a discrete distribution that takes the value $\theta_k$ with probability $\pi_k$. This sampled distribution $H$ is a discrete distribution even if $H_0$ has continuous support; the support of $H$ is a countably infinite subset of the support of $H_0$.

The weights ($\pi_k$ values) of a Dirichlet process sample relate the Dirichlet process back to the Dirichlet distribution.

Gregor Heinrich writes:

The defining property of the DP is that its samples have weights $\pi_k$ and locations $\theta_k$ distributed in such a way that when partitioning $S(H)$ into finitely many arbitrary disjoint subsets $S_1, \ldots, S_J$, $J<\infty$, the sums of the weights $\pi_k$ in each of these $J$ subsets are distributed according to a Dirichlet distribution that is parameterized by $\alpha$ and a discrete base distribution (like $G_0$) whose weights are equal to the integrals of the base distribution $H_0$ over the subsets $S_n$.

As an example, Heinrich imagines a DP with a standard normal base measure $H_0\sim \mathcal{N}(0,1)$. Let $H$ be a sample from $DP(\alpha H_0)$ and partition the real line (the support of a normal distribution) as $S_1=(-\infty, -1]$, $S_2=(-1, 1]$, and $S_3=(1, \infty)$; then

$$H(S_1),H(S_2), H(S_3) \sim \text{Dir}\left(\alpha\,\text{erf}(-1), \alpha\,(\text{erf}(1) - \text{erf}(-1)), \alpha\,(1-\text{erf}(1))\right)$$

where $H(S_n)$ is the sum of the $\pi_k$ values whose $\theta_k$ lie in $S_n$.

These $S_n$ subsets were chosen for convenience; similar results hold for any choice of $S_n$. For any sample from a Dirichlet process, we can construct a sample from a Dirichlet distribution by partitioning the support of the sample into a finite number of bins.
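
For instance, given the pis and thetas arrays returned by the sampling function defined later in this notebook, a sketch of that binning for the three subsets above takes only a few lines of numpy. The resulting triple is (approximately) a single draw from the Dirichlet distribution described above.

import numpy as np

def bin_weights(pis, thetas, edges=(-1.0, 1.0)):
    # Sum the pi_k whose theta_k fall into S_1, S_2, and S_3.
    pis, thetas = np.asarray(pis), np.asarray(thetas)
    h1 = pis[thetas <= edges[0]].sum()
    h2 = pis[(thetas > edges[0]) & (thetas <= edges[1])].sum()
    h3 = pis[thetas > edges[1]].sum()
    return h1, h2, h3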

There are several equivalent ways to choose the $\pi_k$ so that this property is satisfied: the Chinese restaurant process, the stick-breaking process, and the Pólya urn scheme.

To generate $\left\{\pi_k\right\}$ according to a stick-breaking process we define $\beta_k$ to be a sample from $\text{Beta}(1,\alpha)$. $\pi_1$ is equal to $\beta_1$. Successive values are defined recursively as

$$\pi_k=\beta_k \prod_{j=1}^{k-1}(1-\beta_j).$$
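
A direct translation of this recursion into numpy might look like the following sketch, which draws a fixed, illustrative number of weights $K$ (the approximation function later in this notebook instead keeps drawing until the weights nearly sum to 1):

import numpy as np
from scipy.stats import beta

def stick_breaking_weights(alpha, K=25):
    # beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j)
    betas = beta(1, alpha).rvs(K)
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    return betas * remaining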

Thus, if we want to draw a sample from a Dirichlet process, we could, in theory, sample an infinite number of $\theta_k$ values from the base distribution $H_0$ and an infinite number of $\beta_k$ values from the Beta distribution. Of course, sampling an infinite number of values is easier in theory than in practice.

However, since the $\pi_k$ values are positive and sum to 1, they must, in expectation, get increasingly small as $k\rightarrow\infty$. Thus, we can reasonably approximate a sample $H\sim DP(\alpha H_0)$ by drawing enough samples such that $\sum_{k=1}^K \pi_k\approx 1$.

We use this method below to draw approximate samples from several Dirichlet processes with a standard normal ($\mathcal{N}(0,1)$) base distribution but varying $\alpha$ values.

Recall that a single sample from a Dirichlet process is a probability distribution over a countably infinite subset of the support of the base measure.

The blue line is the PDF for a standard normal. The black lines represent the $\theta_k$ and $\pi_k$ values; $\theta_k$ is indicated by the position of the black line on the $x$-axis; $\pi_k$ is proportional to the height of each line.

We generate enough $\pi_k$ values so that their sum is greater than 0.99. When $\alpha$ is small, very few $\theta_k$'s will have corresponding $\pi_k$ values larger than $0.01$. However, as $\alpha$ grows large, the sample becomes a more accurate (though still discrete) approximation of $\mathcal{N}(0,1)$.

In [13]:
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    # Draw stick-breaking weights until they account for at least (1 - tol)
    # of the total probability mass.
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    while sum(pis) < (1.-tol):
        # pi_k = beta_k * prod_{j<k} (1 - beta_j); the product over the
        # previously drawn betas is computed in log space for stability.
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    # Each weight gets an independent draw (location) from the base measure.
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.figure()
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    # Rescale the weights so the tallest line matches the peak of the normal PDF.
    pis = pis * (norm.pdf(0) / pis.max())
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))

plot_normal_dp_approximation(.1)
plot_normal_dp_approximation(1)
plot_normal_dp_approximation(10)
plot_normal_dp_approximation(1000)

Often we want to draw samples from a distribution sampled from a Dirichlet process instead of from the Dirichlet process itself. Much of the literature on the topic unhelpfully refers to this as sampling from a Dirichlet process.

Fortunately, we don't have to draw an infinite number of samples from the base distribution and stick breaking process to do this. Instead, we can draw these samples as they are needed.

Suppose we know a finite number of the $\theta_k$ and $\pi_k$ values for a sample $H\sim \text{DP}(\alpha H_0)$. For example, we know

$$\pi_1=0.5,\; \pi_2=0.3,\; \theta_1=0.1,\; \theta_2=-0.5.$$

To sample from $H$, we can generate a uniform random number $u$ between 0 and 1. If $u$ is less than 0.5, our sample is $0.1$. If $0.5 \leq u < 0.8$, our sample is $-0.5$. If $u \geq 0.8$, our sample (from $H$) will be a new sample $\theta_3$ from $H_0$. At the same time, we should also sample and store $\pi_3$. When we draw our next sample, we will again draw $u\sim\text{Uniform}(0,1)$ but will compare it against $\pi_1, \pi_2$, AND $\pi_3$.

The class below will take a base distribution $H_0$ and $\alpha$ as arguments to its constructor. The class instance can then be called to generate samples from $H\sim \text{DP}(\alpha H_0)$.

In [20]:
from numpy.random import choice

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        # Probability of needing a brand-new value from the base measure.
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse a previously drawn value with its stick-breaking weight.
            return self.cache[i]
        else:
            # Break off a new piece of the stick and draw a new value.
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value
        
    @staticmethod 
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None

This Dirichlet process class could be called stochastic memoization. This idea was first articulated in somewhat abstruse terms by Daniel Roy, et al.

Below are histograms of 10,000 samples drawn from distributions that were themselves drawn from Dirichlet processes with a standard normal base distribution and varying $\alpha$ values.

In [22]:
import pandas as pd

base_measure = lambda: norm().rvs()
n_samples = 10000
samples = {}
for alpha in [1, 10, 100, 1000]:
    dirichlet_norm = DirichletProcessSample(base_measure=base_measure, alpha=alpha)
    samples["Alpha: %s" % alpha] = [dirichlet_norm() for _ in range(n_samples)]

_ = pd.DataFrame(samples).hist()

Note that these histograms look very similar to the corresponding plots of sampled distributions above. However, these histograms are plotting points sampled from a distribution sampled from a Dirichlet process, while the plots above showed approximate distributions sampled from the Dirichlet process itself. Of course, as the number of samples from each $H$ grows large, we would expect the histogram to be a very good empirical approximation of $H$.

In a future post, I will look at how this DirichletProcessSample class can be used to draw samples from a hierarchical Dirichlet process.

Tim HopperHandy One-off Webpages

I'm starting to love single-page informational websites. For example:

My website Should I Get a Phd? is in this same vein.

Publishing a site like this is very cheap with static hosting on AWS. I would love to see more of them created!

Caktus GroupPyCon 2015 Workshop Video: Building SMS Applications with Django

As proud sponsors of PyCon, we hosted a one and a half hour free workshop. We see the workshops as a wonderful opportunity to share some practical, hands-on experience in our area of expertise: building applications in Django. In addition, it’s a way to give back to the open source community.

This year, Technical Director Mark Lavin and Developers Caleb Smith and David Ray presented “Building SMS Applications with Django.” In the workshop, they taught the basics of SMS application development using Django and Django-based RapidSMS. Aside from covering the basic anatomy of an SMS-based application, as well as building SMS workflows and testing SMS applications, Mark, David, and Caleb were able to bring their practical experience with Caktus client projects to the table.

We’ve used SMS on behalf of international aid organizations and agencies like UNICEF as a cost-effective and pervasive method for conveying urgent information. We’ve built tools to help Libyans register to vote via SMS, deliver critical infant HIV/AIDS results in Zambia and Malawi, and alert humanitarian workers of danger in and around Syria.

Interested in SMS applications and Django? Don’t worry. If you missed the original workshop, we have good news: we recorded it. You can participate by watching the video above!

Caktus GroupReviews of two recent Django Books

Introduction

When I started building sites in Django, I learned the basics from the excellent Django tutorial. But I had to learn by trial and error which approaches to using Django's building blocks worked well and which approaches tended to cause problems later on. I looked for more intermediate-level documentation, but beyond James Bennett's Practical Django Projects and our Karen Tracey's Django 1.1 Testing and Debugging, there wasn't much to be found.

Over the years, ever more interesting introductory material has been showing up, including recently our lead technical manager Mark Lavin's Lightweight Django.

But more experienced developers are also in luck. Some of the community's most experienced Django developers have recently taken the time to put down their experience in books so that we can all benefit. I've been reading two of those books and I can highly recommend both.

Two Scoops of Django

I guess Two Scoops of Django, by Daniel Roy Greenfeld and Audrey Roy Greenfeld, isn't that recent; its third edition just came out. It's been updated twice to keep up with changes in Django (the latest edition covers Django 1.8), improve the existing material, and add more information.

The subtitle of the most recent edition is Best Practices for Django 1.8, and that's what the book is. The authors go through most facets of Django development and share what has worked best for them and what to watch out for, so you don't have to learn it all the hard way.

For example, reading the Django documentation, you can learn what each setting does. Then you can read in chapter 5 of Two Scoops a battle-tested scheme for managing different settings and files across multiple environments, from local development to testing servers and production, while protecting your secrets (passwords, keys, etc).

Similarly, chapter 19 of Two Scoops covers what cases you should and shouldn't use the Django admin for, warns about using list_editable in multi-user environments, and gives some tips for securing the admin and customizing it.

Those are just two examples. The most recent edition of the book has 35 chapters, each covering a useful topic. It's over 500 pages.

Another great thing about the book is that the chapters stand alone - you can pick it up and read whatever chapter you need right now.

I'll be keeping this book handy when I'm working on Django projects.

High Performance Django

High Performance Django by Peter Baumgartner and Yann Malet is aimed at the same audience as Two Scoops, but is tightly focused on performance. It moves on from building a robust Django app to how to deploy and scale it. It covers load balancers, proxies, caching, and monitoring.

One of its best features is that it gives war stories of deploys gone wrong and how the problems were attacked and solved.

Like Two Scoops, this book talks about general principles, along with specific approaches and tools that the authors are familiar with and have had success with. It does a good job of showing the overall architecture that most high-performance sites use, from the load balancer out front to the database at the back, and listing some popular choices of tools at each tier. Then they go into more detail about the specific tools they favor.

It also delves into how to spot performance bottlenecks in your Django site’s code, where they’re most likely to be, and good ways to deal with them.

This is the book I'll be coming back to when I have a question about performance.

Summary

There are two things I want to say about books like these.

First, they are immensely useful to those of us who work with Django every day. There's a huge amount of experience captured here for our benefit.

Second, I cannot imagine the amount of time and work it takes to create books like these. When I flip through Two Scoops, not only is it full of useful information, almost every page has examples or diagrams that had to be prepared too.

Caktus has added both these books to the office library, and I've bought personal copies too (including all three editions of Two Scoops). I hope you'll try either or both, and if you find them useful, spread the word.

Tim HopperThinking at Work

Having worked from home for the last few years, I have a hard time understanding how people get anything done in open-floor plan offices. I would be overwhelmed and frustrated by the noise and commotion.

I assumed open-floor plans for software shops were a relatively new invention. However, I just started reading Peopleware: Productive Projects and Teams, first published in 1987, and discovered that the first third of the book rails against open floor plan offices. I particularly enjoyed this quote:

In my years at Bell Labs, we worked in two-person offices. They were spacious, quiet, and the phones could be diverted. I shared my office with Wendl Thomis, who went on to build a small empire as an electric toy maker. In those days, he was working on the Electronic Switching System fault dictionary. The dictionary scheme relied upon the notion of n-space proximity, a concept that was hairy enough to challenge even Wendl's powers of concentration. One afternoon, I was bent over a program listing while Wendl was staring into space, his feet propped up on his desk. Our boss came in and asked, "Wendl! What are you doing?" Wendl said, "I'm thinking." And the boss said, "Can't you do that at home?"

The difference between that Bell Labs environment and a typical modern-day office plan is that in those quiet offices, one at least had the option of thinking on the job. In most of the office space we encounter today, there is enough noise and interruption to make any serious thinking virtually impossible. More is the shame: Your people bring their brains with them every morning. They could put them to work for you at no additional cost if only there were a small measure of peace and quiet in the workplace.

Tim HopperTweets I'm Proud Of

Tim HopperNew Post Function for Bash

One of the things I don't like about using a static site generator is the friction required for creating a new post. I've often ended up posting things to Twitter that I would prefer to be more permanent, simply because of the ease of tweeting.

To that end, I created a quick Bash function to create a new post for me. Creating this post in my Pelican directory only requires typing

$ new-post "New Post Function for Bash"

Combined with Greg Reda's Travis CI trick, the friction in getting a new post online is greatly reduced.

Caktus GroupQ3 2015 ShipIt Day ReCap

Last Friday marked another ShipIt Day at Caktus, a chance for our employees to set aside client work for experimentation and personal development. It’s always a wonderful opportunity for our developers to test new boundaries, learn new skills, and sometimes even build something entirely new in a single day.


NC Nwoko and Mark Lavin teamed up to develop a pizza calculator app. The app simply and efficiently calculates how much pizza any host or catering planner needs to order to feed a large group of people. We eat a lot of pizza at Caktus. Noticing deficiencies in other calculators on the internet, NC and Mark built something simple, clean, and (above all) well researched. In the future, they hope to add size-mixing capabilities as well as a function for calculating the necessary ratios to provide for certain dietary restrictions, such as vegan, vegetarian, or gluten-free eaters.

Jeff Bradberry and Scott Morningstar worked on getting Ansible functioning to replace SALT states in the Django project template and made a lot of progress. Karen Tracey worked on some recent test failures related to importing solutions from the database, while Rebecca Muraya began the massive task of updating some of our client-based projects to Python 3.

Hunter MacDermut continued building the game he started last ShipIt Day, an HTML5 game using the Phaser framework. He added logic and other game-like elements to make a travelable board with the goal of destroying opponents. He also added animated sprites, including animations for an attack, giving each character their own unique moves. The result was a lot of fun to watch!

Dmitriy Chukhin and Caleb Smith developed a YouTube listening queue using ReactJS, with jQuery for the data layer. They loved the tag functions inherent in ReactJS as well as the speed.

Victor Rocha wrote a new admin action that enables a user to export models as a CSV file. He even found time to open source his work.

Vinod Kurup spent his day fixing RapidSMS bugs, creating two new pull requests. You can find them here and here. Once reviewed, they will be incorporated in the next RapidSMS release.

Neil Ashton worked through three chapters of experiments from The Foundations of Statistics: a Simulation-based Approach using IPython Notebook. He subsequently fell in love with IPython Notebook. An interactive computational environment, the IPython notebook seems the perfect platform for Neil’s love of data visualization and interactive experimentation. The IPython notebook ultimately allows the user to combine code execution, rich text, mathematics, plots, and other rich media.

Ross Pike spent the day exploring Font Awesome, the open-source library for scalable vector icons. He also took several tutorials in Sketch, an application for designing websites, interfaces, icons, and pretty much anything else.

Tobias McNulty spent some time working on the next release of django-cache-machine, a 3rd party Django app that adds caching and automatic invalidation to your Django models on a per-model basis. This ShipIt Day he worked on adding Python 3 support (with help from Vinod) and added a feature to support invalidation of queries when new model instances are created.

Finally, inspired by the open data apps built by Code for Durham, Rebecca Conley used D3 to build visualizations of data on North Carolina’s public schools. Eventually, she wants to test more complex bar graph visualizations as well as learn data visualization in D3 beyond the bar graph.

We had a number of people on vacation this ShipIt Day and several administrators and team members who couldn’t put away their typical workload this time around. But no matter; there is always the next ShipIt Day!

Astro Code SchoolVideo - Why go to code school?

In this video Astro Lead Instructor Caleb Smith answers the question, "Why go to code school?". A major point is the laser focus. Code schools allow you to learn precisely the skills needed to perform a highly technical job. Watch this video for other parts of his answer. This is the first in a series of question and answer videos. More answers to real questions you may have about going to code school are on the way.

Don't forget to subscribe to the Astro Code School YouTube channel.

Astro Code SchoolDjango Bootcamp in Baltimore at Betamore

On August 7, 2015 Caleb Smith will be teaching a Django Bootcamp in Baltimore, Maryland! This short class is from 9am to 3pm and will be held at Betamore, a coworking space, incubator and campus for technology and entrepreneurship. At the class students will learn how to build a simple Django app. This bootcamp is targeted at beginners but everyone is welcome. The prerequisites are a laptop with Python and pip installed. Sign up now and reserve your spot.

Big thanks to the super kind folks at Betamore for hosting this and working with us to get this class going.

Caktus GroupLAN Party at Caktus

This past weekend, our wonderful Technology Support Specialist Scott Morningstar hosted a Local Area Network (LAN) party at Caktus HQ. Held twice a year since 2008, the event allows geeks, gamers, and retro technology lovers to relive the nostalgia of multiplayer gaming in the early days of dial-up internet. In other words, everyone brings their own computer, and uses the LAN to play online games in the company of others. These parties are a lot of fun and add a more personal social element to the online gaming community.

This year, participants played Terraria, an action game with creative world-building elements, Artemis, a spaceship bridge simulator game, and Counter Strike: Global Offensive, a team-based modern warfare game. Not only was it wonderful to see our space filled with enthusiastic gamers, but it was doubly exciting that participants joined gameplay remotely from LAN party events in Boston, Massachusetts, Farmville, Virginia, and Minneapolis, Minnesota.

We love being able to support and host such a wide variety of technology-related events in our community meeting space! For information on other functions held in our downtown Durham headquarters, or in our Astro Code School space, be sure to check out the Events page on our website.

Og MacielBooks - June 2015

Books - June 2015

Those of you who know me know that I am a huge book reader and spend most of my free time reading several books at the same time. One could say that reading is one of my passions, and having wasted so many years after high school completely ignoring this passion (in exchange for spending most of my time trying to learn about Linux, get an education, a job and, let's be frank, chasing after girls), I decided that something had to be done about it, and starting around 2008 I 'forced' myself to dedicate at least one solid hour of reading for fun every day.

I find it funny to say that I had to force myself, but this statement is very much true. Being so used to spending all of my time sitting in front of a computer and getting flooded with information every single minute of the day (IRC, Twitter, Facebook, commit emails, RSS feeds, etc), I found it difficult to 'unplug' and spend time doing nothing but focusing on only one thing. I was so used to multitasking and being constantly bombarded with lots of information that sitting quietly and reading didn't feel very productive to me... sad but true.

Anyhow, after several 'agonizing' months of getting up from my desk and making a point of turning off my cell phone and finding a quiet place somewhere in the building (or at home during the weekends), I finally got into the habit of reading for pleasure. I actually looked forward to these reading periods (imagine that, huh?) and eventually I realized that if I skipped this 'ritual' even one day, my days felt like they got longer and I felt stressed out and irritable for the remainder of the day. Reading became not only a good habit but my mechanism for relaxing and recharging my energies during the day!

Well, this passion and appetite for reading has only gotten bigger, and with time I have to say that it has become a pretty big part of who I am today! In a way I am happy that it took me this long to get back into the habit of reading... I mean, I feel that getting older was an important part of preparing myself so that I could really appreciate John Steinbeck, Ray Bradbury and the likes of them! Would I have truly appreciated The Grapes of Wrath when I was younger? Perhaps... but it took me around 40 years to get to it and I'm happy that when it did I was able to appreciate this amazing piece of art!

These last few months I decided that I wanted to start tracking all the books that I read, buy or receive as a gift every month (see my reading progress on GoodReads and add me as a friend), and jot down some of my impressions and motives for reading or buying them. Those familiar with Nick Hornby will probably associate this post (and hopefully others that will surely come) with the work he has done writing for the Believer Magazine ... and this would be correct. My intention is not to copy his style or anything like that, but I thought that the format he chose to report on his own reading 'adventures' would fit in quite nicely with what I wanted to get across to my readers... and I'm sticking with the format as long as it works for me :)

Astro Code SchoolVideo - Interview with Lead Instructor Caleb Smith

In this Astro interview video I talk with our Lead Instructor Caleb Smith. We learn about Caleb's formal education, a connection between music and computer programming, and why teaching excites him. Caleb wrote the curriculum and teaches the Astro Code School Python & Django Web Development class.

Don't forget to subscribe to the Astro Code School YouTube channel. We have a lot more videos in the works.

Caktus GroupLightning Talk Lunch: Two Useful Organizational Tools

Monthly, we organize short Lightning Talks that take place during the lunch hour here at Caktus. Not only does this allow us a wonderful excuse to have lunch delivered from one of our many local foodie options, but it’s an excellent chance to expand our knowledge on a variety of topics. Past talks have included everything from an introduction to synthesizers and other forms of electronic music, to bug fixing, to the design inspiration behind our PyCon 2015 site.

This month, we had two talks on organizational tools for project management and resource sorting. Developer Dan Poirier gave a brief talk on Pinboard, or, as he fondly refers to it, “social bookmarking for introverts.” Essentially, Pinboard is a database for storing, organizing, and sharing links and bookmarks to articles and pages on the web. Though lacking in sharp design or beautiful layout, Pinboard is useful, highly functional, and extremely intuitive. Dan was a wonderful guide in walking us through how he uses Pinboard to store development tips and articles, as well as information related to his various projects for Caktus. He even built his own front-end for the site to help organize his finds for daily use and to share with other Caktus developers.

Our second talk came from Game Designers Edward and Lucas Rowe, who are currently finishing up the work on our Epic Allies app. Before this project, Caktus wanted to try out a new management tool for development; Epic Allies turned out to be a good fit for testing JIRA, the issue and project tracking software from Atlassian. In their talk, Lucas and Edward took us on a tour of JIRA, discussed its functionality for development projects, and showed us how Epic Allies specifically used this highly customizable platform.

All in all it was an informative day, and Dan, Edward, and Lucas may have all won a few converts to their favorite organizational tools. Now we can’t wait to see what’s in the pipeline for our next set of Lightning Talks!

Astro Code SchoolAnnouncing Caktus Scholarships for Astro Code School

We’re very pleased to announce that Caktus Group will be sponsoring up to $20,000 worth of scholarships annually for Astro Code School students. There will be twenty $1,000 scholarships. We hope that these scholarships help increase access to code schools and the wider tech industry:

Caktus Group Diversity & Veterans Scholarship

This scholarship aims to support the careers of underrepresented groups in technology, specifically women, people of color, military veterans, and people with disabilities. For classrooms and the tech industry to be the best they can be, they require ideas from diverse groups of people.

Caktus Group North Carolinians Scholarship

Anyone who lives in North Carolina is eligible to receive this scholarship. Caktus was founded in North Carolina and we’ve benefited from the great talent here. We want tech growth in our area to include those that live here.

You can find more information about our scholarships on the financial aid page.

Caktus GroupAnnouncing Caktus Scholarships for Astro Code School

We’re very pleased to announce that Caktus Group will be sponsoring up to $20,000 worth of scholarships for Astro Code School students per year. There will be twenty $1,000 scholarships. We hope that these scholarships help increase access to code schools and the wider tech industry:

Caktus Group Diversity & Veterans Scholarship

This scholarship aims to support the careers of underrepresented groups in technology, specifically women, people of color, military veterans, and people with disabilities. For classrooms and the tech industry to be the best they can be, they require ideas from diverse groups of people.

Caktus Group North Carolinians Scholarship

Anyone who lives in North Carolina is eligible to receive this scholarship. Caktus was founded in North Carolina and we’ve benefited from the great talent here. We want tech growth in our area to include those that live here.

Astro Code SchoolWhat I Learned Teaching at UNC

This spring semester, I had the honor of teaching JOMC-583 "Multimedia Programming and Production" for the University of North Carolina at Chapel Hill School of Journalism and Mass Communication. The course requires university permission and two prior multimedia programming courses that focus on frontend web development. It was a wonderful opportunity to partner with the university, especially with a department that has shown leadership in recent years with adopting innovative programs and coursework for students interested in the data-driven area of journalism.

The subject matter of the course centered around backend web development with Python and Django and also included other technologies such as git, SQL, and the Unix command line. As a rough outline, the lecture topics were:

  1. Unix command line

  2. Git and Github

  3. Python

  4. Introductory Django

  5. Django views and templates

  6. Django models and data modeling

  7. Frontend development inside a Django project

  8. Miscellaneous topics

  9. Group project time

The course materials were based on Steven King's curriculum for the course from the year prior and are available at https://github.com/calebsmith/j583

At a high-level, the first half of the course was a mixture of lecture and individual assignments while the second half of the course was spent on two projects. The first development project was completed individually and was small in scale. The second and final project was more ambitious and required collaboration using Github. This served as a nice progression from focusing on concrete skills in isolation to applying those skills and developing further experientially.

One of the group projects was deployed successfully to Heroku and is visible here: http://rackfind.herokuapp.com/

While I think the course was a major learning experience for the students, it certainly was for me as well. It was particularly interesting to see the subject areas that students picked up easily or struggled with and how this often differed from my expectations. In particular, some areas that students picked up quickly were:

  1. The essential Unix command line tools such as: pwd, ls, cd, and so on

  2. Python basics

  3. Python packaging and setup, especially pip and virtualenv

  4. Using Git as a sole contributor

  5. Creating a data model

The students were much quicker to learn these concepts than I anticipated. For instance, we spent two lecture periods focusing on developing skills for the command line, but the first class was enough for most tasks. In the future, I would likely plan on needing only one lecture for that topic.

Some topics that required more reinforcement than anticipated were:

  1. Why writing a custom backend is desirable as opposed to a static HTML site

  2. The semantics of Django URL routing.

  3. How to glue JavaScript code into Django templates

I think the fundamental reason students struggled with these topics more than anticipated is that they arrived at backend programming from a background of frontend web development.

This was a great experience for me and it was rewarding to see my students succeed in programming with Python and Django. I'm very much looking forward to more opportunities to teach web development in the future.

Astro Code SchoolPython Beginner’s Night at Astro

Last night we held the first TriPython Python Beginner’s Night. About twenty-three people interested in Python attended. Many of them were very experienced developers who answered all kinds of questions, from the very basic to the advanced.

A big thanks to all the Caktus Group folks who attended. You helped a lot of people! Thanks also to the other volunteers who attended. It's really cool to live in a city with so many people who enjoy helping others.

The next free Python Beginner's Night is Monday July 6, 2015 from 6pm to 8pm here at Astro Code School (map). We'll be here on the first Monday of each month with free pizza and Python experts. If you can join us please RSVP on the Meetup page. See you soon!

Caktus GroupEpic Allies Featured at mHealth at Duke 2015 Conference

At this year’s mHealth at Duke 2015 Conference, Dr. Lisa Hightow-Weidman discussed her current mHealth projects for HIV prevention. Chief among these projects is her work with Caktus Group on Epic Allies, a mobile gaming app that utilizes social media and mini-games to increase adherence to prescribed medication amongst HIV-positive men who have sex with men (MSM).

Why this particular population? According to research, MSM account for two-thirds of all new HIV infections. In fact, they are the only risk group experiencing an increase in incidence, especially in the southern United States. With 83% of young adults using smartphones, a mobile solution is ideal for targeting at-risk youth in this particular population.

Enter Epic Allies, an adherence intervention that seeks to make taking medication fun while providing social, community support. The app combines gaming, anonymous social interactions, medication reminders, and healthy habit rewards systems to encourage adherence to treatment. The development of the app is the result of a Small Business Innovation Research Grant endowed by the National Institute of Health and was built by Caktus Group in partnership with the UNC Institute for Global Health and Infectious Diseases and the Duke Global Health Institute.


Astro Code SchoolLearn About Astro Code School Info Session

Join us online at 10am EDT on Thursday, June 25, 2015 for a Google Hangout information session. Caleb and I will host the hangout, talk a little bit about Astro, and then answer any questions you might have. Please share this post and RSVP on the Hangout page.

Caktus GroupStanford Social Innovation Review Highlights Caktus' Work in Libya

The Stanford Social Innovation Review recently featured Caktus in “Text the Vote” in Suzie Boss’ “What’s Next: New Approaches to Social Change” column. It describes how our team of developers built the world’s first SMS voter registration system in Libya using RapidSMS.

Article excerpt

In a classic leapfrogging initiative, Libya has enabled its citizens to complete voter registration via digital messaging technology.

In late 2013, soon after Vinod Kurup joined Caktus Group, an open source software firm based in Durham, N.C., he became the lead developer for a new app. The client was the government of Libya, and the purpose of the app would be to support voter registration for the 2014 national elections in that country. Bomb threats and protests in Libya made in-person registration risky. “I realized right away that this wasn’t your standard tech project,” says Kurup.

As a result of that project, Libya became the first country in the world where citizens can register to vote via SMS text messaging. By the end of 2014, 1.5 million people—nearly half of all eligible voters in Libya— had taken advantage of the Caktus-designed app during two national elections. “This never would have happened in a country like the United States, where we have established systems in place [for registering voters],” says Tobias McNulty, co-founder and CEO of Caktus. “Libya was perfect for it. They didn’t have an infrastructure. They were looking for something that could be built and deployed fast.”

To read the rest of article, visit the Stanford Social Innovation Review online.

Caktus GroupRobots Robots Ra Ra Ra!!! (PyCon 2015 Must-See Talk: 6/6)

Part six of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

I've had an interest in robotics since high school, but always thought it would be expensive and time consuming to actually do. Over the past few years, though, I've observed the rise of open hardware such as the Arduino and the Raspberry Pi, and modules and kits built on top of them, that make this type of project more affordable and accessible to the casual hobbyist. I was excited by Katherine's talk because Robot Operating System (ROS) seems to do for the software side what Arduino and such do for the hardware side.

ROS is a framework that can be used to control a wide range of robots and hardware. It abstracts away the hard work, allowing for a publish-subscribe method of communicating with your robot's subsystems. A plus side is that you can use higher-level programming languages such as Python or Lisp, not just C and C++, and there is an active and vibrant open source community built up around it already. Katherine did multiple demonstrations with a robot arm she'd brought to the talk, doing a lot with a relatively small amount of easily understandable code. She showed that it was even easy to hook in OpenCV and do such things as finding a red bottle cap in the robot's field of vision.


More in the PyCon 2015 Must-See Talks Series.

Caktus GroupTesting Client-Side Applications with Django Post Mortem

I had the opportunity to give a webcast for O’Reilly Media, during which I encountered a presenter’s nightmare: a broken demo. Worse than that, it was a test failure in a presentation about testing. Is there any way to salvage such an epic failure?

What Happened

It was my second webcast and I chose to use the same format for both. I started with some brief introductory slides, but most of the time was spent as a screen share, going through the code as well as running some commands in the terminal. Since this webcast was about testing, this was mostly writing more tests and then running them. I had git branches set up for each phase of the process, and for the first forty minutes this was going along great. Then it came to the grand finale: integrate the server and client tests all together and run one last time. And it failed.

Test Failure

I quickly abandoned the idea of attempting to live-debug this error, and since I was near the end anyway, I just went into my wrap-up. Completely humbled and embarrassed, I tried to answer the questions from the audience as gracefully as I could while inside I wanted to just curl up and hide.

Tracing the Error

The webcast was the end of the working day for me so when I was done I packed up and headed home. I had dinner with my family and tried not to obsess about what had just happened. The next morning with a clearer head I decided to dig into the problem. I had done much of the setup on my personal laptop but ran the webcast on my work laptop. Maybe there was something different about the machine setups. I ran the test again on my personal laptop. Still failed. I was sure I had tested this. Was I losing my mind?

I looked through my terminal history. There it was and I ran it again.
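
Running a single test in isolation on a Django project looks something like the line below; the dotted path here is a hypothetical stand-in for the real one.

$ python manage.py test myapp.tests.QunitTests.test_qunit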

Single Test Passing

It passed! I’m not crazy! But what does that mean? I had run the test in isolation and it passed but when run in the full suite it failed. This points to some global shared state between tests. I took another look at the test.

import os

from django.conf import settings
from django.contrib.staticfiles.testing import StaticLiveServerTestCase
from django.test.utils import override_settings

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait


@override_settings(STATICFILES_DIRS=(
    os.path.join(os.path.dirname(__file__), 'static'), ))
class QunitTests(StaticLiveServerTestCase):
    """Iteractive tests with selenium."""

    @classmethod
    def setUpClass(cls):
        cls.browser = webdriver.PhantomJS()
        super().setUpClass()

    @classmethod
    def tearDownClass(cls):
        cls.browser.quit()
        super().tearDownClass()

    def test_qunit(self):
        """Load the QUnit tests and check for failures."""

        self.browser.get(self.live_server_url + settings.STATIC_URL + 'index.html')
        results = WebDriverWait(self.browser, 5).until(
            expected_conditions.visibility_of_element_located(
                (By.ID, 'qunit-testresult')))
        total = int(results.find_element_by_class_name('total').text)
        failed = int(results.find_element_by_class_name('failed').text)
        self.assertTrue(total and not failed, results.text)

It seemed pretty isolated to me. The test gets its own webdriver instance. There is no file system manipulation. There is no interaction with the database, and even if there were, Django runs each test in its own transaction and rolls it back. Maybe this shared state wasn’t in my code.

Finding a Fix

I’ll admit that when people on IRC or Stack Overflow claim to have found a bug in Django, my first instinct is to laugh. However, Django does have some shared state in its settings configuration. The test is using the override_settings decorator, but perhaps there was something preventing it from working. I started to dig into the staticfiles code and that’s where I found it. Django was using the lru_cache decorator for the construction of the staticfiles finders. This means they were being cached after their first access. Since this test was running last in the suite, the change to STATICFILES_DIRS was not taking effect. Fixing my test meant that I simply needed to bust this cache at the start of the test.

...
from django.contrib.staticfiles import finders, storage
...
from django.utils.functional import empty
...
class QunitTests(StaticLiveServerTestCase):
...
    def setUp(self):
        # Clear the cache versions of the staticfiles finders and storage
        # See https://code.djangoproject.com/ticket/24197
        storage.staticfiles_storage._wrapped = empty
        finders.get_finder.cache_clear()

All Tests Passing

Fixing at the Source

Digging into this problem, it became clear that this wasn’t just a problem with the STATICFILES_DIRS setting but was a problem with using override_settings with most of the contrib.staticfiles related settings. In fact I found the easiest fix for my test case by looking at Django’s own test suite. I decided this really needed to be fixed in Django so that this issue wouldn’t bite any other developers. I opened a ticket and a few days later I created a pull request with the fix. After some helpful review from Tim Graham it was merged and was included in the recent 1.8 release.

What’s Next

Having a test which passes alone and fails when run in the suite is a very frustrating problem. It wasn’t something that I planned to demonstrate when I started with this webcast, but that’s where I ended up. The problem I experienced was entirely preventable if I had prepared for the webcast better. However, my own failing led to a great example of tracking down global state in a test suite and ultimately helped to improve my favorite web framework in just the slightest amount. Altogether I think it makes the webcast better than I could have planned it.

Caktus GroupTech Community Yoga Now Offered at Caktus

The Caktus office is now home to a weekly yoga class for the tech community of Durham. Via our employee suggestion box, Lead Designer Ross Pike recommended a Caktus yoga class. Through team effort that suggestion will come to fruition next week. Starting Thursday, June 11th, we will be offering a yoga class taught by professional instructor Christina Conley. The class will be open to the public at large and will be held in our community meeting space at our offices in downtown Durham.

If you are interested in joining the yoga class, you can sign up here ($8 per session): http://www.eventbrite.com/e/tech-community-yoga-class-tickets-17261719267

Also, be on the lookout for a Caktus run club in the next few weeks. Here’s to more great ideas from the suggestion box!

Caktus GroupPyLadies RDU and Astro Code School Team Up for an Intro to Django Workshop

This past Saturday, Caktus developer Rebecca Conley taught a 4-hour introductory level workshop in Django hosted by PyLadies RDU. PyLadies RDU is the local chapter of an international mentorship group for women who love coding in Python. Their main focus is to empower women to become more active participants and leaders in the Python open-source community.

The workshop was held in our Astro Code School space and sponsored by Windsor Circle, Astro Code School, and Caktus Group. Leslie Ray, the local organizer of PyLadies, is always looking for new opportunities “to create a supportive atmosphere for women to learn and teach Python.” With a strong interest in building projects in Django herself, Leslie thought an introductory workshop was the perfect offering for those looking to expand their knowledge in Python as well as a great platform from which Rebecca could solidify her own skills in the language.

“Django is practical,” explains Rebecca, “and it’s the logical next step for those with experience in Python looking to expand their toolkit.”

The event was extremely successful, with a total of thirty students in attendance. Rebecca was impressed with the students, who were “enthusiastic and willing to work cooperatively,” which is always key in workshop environments. The class attracted everyone from undergraduates, to PhD students, to those looking into mid-career changes. In addition, she was glad to team up with PyLadies for the workshop, appreciating the group’s goal to provide a free and friendly environment for those wishing to improve and expand on their skills.

“It’s important to create new channels for individuals to explore programming. Unfortunately, the lack of diversity in tech is an indication not of who is interested in programming or technology, but of the lack of entryways into that industry. So any opportunity to widen that gateway, or to create more gateways, or to give more people the power to program is to be valued, and diversity will ultimately make the field better. It’s up to those of us already in the field to open more doors and actively welcome and support people when they come in.”

For more information on PyLadies and their local programming, be sure to join their Meetup page, follow them on Twitter, or check out the international PyLadies group page. Other local groups that provide opportunities to code and that we’re proud sponsors of include Girl Develop It! RDU, TriPython, and Code for Durham. For women in tech seeking career support, Caktus also founded Durham Women in Tech.

Astro Code SchoolVideo - Conditionals in Python

In Caleb Smith's third video in our series about beginning Python he shows you comparison operators, input(), print(), indentation and if statements in Python. Use http://repl.it/languages/Python3 to follow along in the browser.

Don't forget to subscribe to the Astro Code School YouTube channel. We have a lot more videos in the works.

Astro Code SchoolVideo - Using repl.it with Python 3

This is Caleb Smith's second video in our series about beginning Python. It shows you how to use the web based Python shell and text editor repl.it. Use http://repl.it/languages/Python3 to follow along in the browser.

Don't forget to subscribe to the Astro Code School YouTube channel. We have a lot more videos in the works.

Astro Code SchoolVideo - Very First Steps with Python

This is Caleb Smith's first video in our series about beginning Python. It introduces some fundamentals of programming in Python. Topics for this video include data values, types, basic operators and variables.

Don't forget to subscribe to the Astro Code School YouTube channel. We have a lot more videos in the works.

Caktus GroupCreating and Using Open Source: A Guide for ICT4D Managers

Choosing an open source product or platform upon which to build an ICT4D service is hard. Creating a sustainable, volunteer-driven open source project is even harder. There is a proliferation of open source tools in the world, but the messaging used to describe a given project does not always line up with the underlying technology. For example, the project may make claims about modularity or pluggability that, upon further investigation, prove to be exaggerations at best. Similarly, managers of ICT4D projects may be attracted to Open Source because of the promise of a “free” product, but as we’ve learned through trial and error at Caktus, it’s not always less costly to adapt an existing open source project than it would be to engineer a quality system from the ground up.

In this post I will go over some of the criteria we look at when evaluating a new open source project, from a developer’s perspective, in the hopes that it helps managers of ICT4D projects make educated decisions about when it makes sense to adopt a pre-existing open source solution. For those ICT4D managers looking to release a new open source platform, what follows may also prove helpful when deciding how best to allocate resources to the initial release and ongoing management of an open source product or platform. To that end, I’ll provide a high level overview of what matters most: licensing, code quality assessments, automated testing, development workflow, documentation, release frequency, and community engagement.

The three things that are most important to ICT4D projects, I would argue, are quick iteration, replicability, and scalability. Quick iteration is required in order to get early drafts of solutions out in front of beneficiaries to pilot as quickly as possible. Replicability is important when a pilot project is ready to be tested in multiple locations. Similarly, once a pilot has been shown to be successful, the ability to quickly scale up that project to meet regional, national, or even international demand is critical.

The problem is that these three success factors often place competing demands on the project. Doing things the quick and dirty way may be perceived as shortening the time to a working solution, but it also means the solution might not work in other contexts. Similarly, the project might hit a technical barrier when it comes time to scale up. With proper planning and execution, however, I believe all three of these — quick iteration, replicability, and scalability — can be achieved in a way that requires neither compromises nor starting over from scratch when it comes time to replicate or scale an ICT4D project. Furthermore, we believe strongly at Caktus that doing things the right way the first time minimizes both risk and the time to develop a software project, even for quick, iterative pilots.

Selecting permissive licenses lowers the barrier to entry

There are many types and subtypes of open source licensing, and trying to select a project based on a license can easily get confusing. Generally speaking, we opt for the more permissive BSD- or MIT-style licenses at Caktus when we have the choice. The main thing to consider when using software with more restrictive licenses such as the GPL or AGPL is that they tend to be less business- or donor-friendly and hence may attract a smaller overall community than they would have otherwise. They can also add requirements that your project might not otherwise have had, such as the obligation to release your own code under the same open source license.

Creating code readable by humans improves scalability

Code quality is something that is easy to forget about early in a project. ICT4D pilots are often like startups: the drive is to get features out the door as quickly as possible to test and prove the minimum viable product (MVP). We believe you can produce work that is both speedily deployed and later easy to scale by focusing on code quality from the start. In software development there is a concept of “technical debt:” Moving quickly without concern for quality creates “debt” that must be paid back, with interest accruing over time.

Code quality includes creating code that is readable to fellow developers. As with any language, clarity matters for the other people who have to read it. At Caktus our preference generally tends to be for the Python programming language because it is well known for being highly readable and easy to learn.

For those ICT4D program managers starting new projects, regardless of the programming language, it’s helpful to build in time for the development team to add automated checks to the project that enforce a code formatting standard. For those evaluating a new open source solution, apart from reviewing the code itself, ICT4D program managers can check for the existence of documented coding standards. The end goal is for all developers on a project to write code that is indistinguishable from another developer’s code; you should not be able to tell from looking at a piece of code who wrote it. This makes it easier both to bring new people into the project and for a developer to jump into a part of the code he or she didn’t write, in case the person who wrote it happens to be inaccessible at the time an urgent change is needed. The code should be the product of the team, not a set of disparate individuals, and having code formatting standards in place helps encourage that. At Caktus, we typically use flake8 (run via Travis CI) to check the format of our code automatically each time a developer makes a commit or submits a pull request.

Automated code testing ensures reliability

Automated code testing is both best practice and necessary to avoid software failures, but we have seen it dismissed in the rush to deploy. The key question for ICT4D program managers to consider in the planning process is what kind of automated testing developers are using. Automated testing includes both “unit” and “integration” testing. “Unit tests” are pieces of code that individually test discrete parts of the overall code base to ensure they continue to work as expected as changes are made to the system. “Integration tests,” similarly, verify that the different components function when combined into a complete system. The end goal of both types of tests is the same: to ensure that the existing software does not break as features are added or changed or bugs are fixed. Absent automated tests, it’s all too easy for something as small as a bug fix to introduce one or more new, unanticipated bugs in other parts of the system.

At Caktus we primarily use Django’s testing framework, which is based on Python’s built-in unittest framework. We also set up continuous integration to run tests on every set of changes automatically and email the developers when tests fail, so the team is always aware when the tests aren’t passing. When evaluating whether or not a project relies heavily on automated testing, two things to look for are (a) whether or not the project advertises test coverage (as a percentage, at least 85-90% is preferred), and (b) whether or not the development process requires new features to come bundled with unit tests. As with code quality, if automated tests are left out of a project, I would argue that the time to develop the project will actually increase rather than decrease because the development team will end up spending time tracking down bugs that would have been caught by the testing framework, time that could have been spent developing features.
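As a minimal sketch of what a unit test looks like with Django's test framework (the Invoice model and its total() method are hypothetical, used only for illustration):

from django.test import TestCase

from billing.models import Invoice  # hypothetical app and model


class InvoiceTotalTests(TestCase):
    def test_total_includes_tax(self):
        # A unit test exercises one small piece of behavior in isolation
        invoice = Invoice.objects.create(subtotal=100, tax_rate=0.05)
        self.assertEqual(invoice.total(), 105)

A CI server can then run the project's full test suite (for a Django project, python manage.py test) on every commit and report failures back to the team.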

A documented development workflow streamlines new contributions

The development workflow is another important part of any software project, in particular open source projects. Open source projects should have a clearly documented, community supported method for (a) proposing and discussing potential features or other changes, (b) developing those changes, (c) having those changes reviewed and approved by other developers, (d) merging those changes into the main branch(es), and (e) releasing sets of those changes as numbered releases (e.g., v1.2). Whether a project has these things documented can usually be discovered easily by searching for a “developer manual” or “contributors guide,” as well as reviewing the content of the project’s developer mailing list to see evidence of how contributions work in practice. This documentation acts as a clear entry point for both users and developers without which open source projects wither.

At Caktus we typically use a variant of the GitHub Flow model that includes one additional “staging” or “develop” branch that is used to deploy the code to an intermediary “staging” server. This allows code to be tested before being deployed to the production server. A key part of this workflow is the peer code review, a process by which a fellow developer reviews every new change. Not only does the process help detect potential issues early, it also broadens overall knowledge of the code base. Code reviews can’t be done intermittently or when it’s convenient, but should be done for every change being made to the project. We believe creating a culture of code reviews allows individual developers to forgo ego in favor of a drive towards system integrity. One can evaluate whether a project does code reviews by checking a number of places, including the project developer mailing list, the GitHub or BitBucket “pull requests” feature which allows line-by-line reviews, or simply by reviewing the commit log to see if changes are made directly to the “master” or “default” branch or if they’re made to separate “feature” branches first.

Clear documentation helps create sustainable open source projects

Good documentation is fundamental to any successful open source project. Perhaps counterintuitively, it’s just as easy to have too much documentation as it is to have too little. Signs that an open source project takes documentation seriously include things like how often the documentation is referenced on the project’s mailing list(s), where the documentation is stored, how the documentation is edited, and how easy the documentation makes it, both for new users and developers of the project, to come on board. While not always the case, documentation that is automatically generated by the code can be a case of “too much” rather than “good” documentation. Jacob Kaplan-Moss of the Django project wrote a great blog post back in 2009 on writing good technical documentation that is worth a read for anyone putting together documentation for an open source project.

At Caktus we generally have a preference for storing developer-written documentation in the code repository itself; this allows the team to quickly update documentation when code changes are made, and also makes it easy to spot discrepancies between code changes and documentation changes when doing code reviews. While wikis may be easy to update, they tend to fall out of sync with the code because updating them happens as part of a different process. Hosting documentation in a wiki also makes it harder to refer back to older versions of the documentation if you have a system that’s been running for a few years and have not been able to upgrade the underlying platform.

Regular releases and recent “commits” help ensure continuity

One of the first things we tend to look at (in part because it’s one of the easiest) is to check how recently the project we’re evaluating released a new version and/or how recently someone committed new changes to the code. While it’s not always a bad sign if there hasn’t been a release in a year or two, it’s generally better to find projects that have regular releases of at least 2-3 times a year. It can also be a bad sign, for example, if there are lots of frequent commits to the code repository, but the last “released” (numbered) version is many months or years old. This may mean that the release management has fallen off track, and the project is targeting only internal users rather than the larger open source / ICT4D community.

Developer community engagement is necessary to leverage the power of open source

Community engagement and openness are two more important factors to consider when selecting an open source project as the foundation for (or to add to) an ICT4D solution. Community engagement matters because projects without a community of users and contributors tend not to be maintained over the long run. Engagement of the community can be evaluated by reviewing traffic on the project’s mailing list(s) and bug tracker (for both users and developers) and determining the prevailing character of project communications. Key events to look for include the usual response when someone enters a bug report, submits a suggested change or pull request, or proposes a discussion around the project’s development workflow. While reasonable demands can (and should) be placed on new users for following protocol, a high number of rejected changes or disgruntled first-time users tends to be an indicator of poor community relations. These are some of the reasons why we’re big proponents of the Django framework: the community is almost always warm and welcoming and is quick to enforce this culture. In addition to communications, other positive attributes to look for include documentation around adding new members to the core development team as well as codes of conduct or other policies that set forth in a public way the desire to create an inclusive community for all. These things matter because developers are people, and communication -- as in any discipline -- is critical.

Conclusion

While by no means an all-inclusive list, these are some of the factors I think are important to consider when selecting a new open source product to use for an ICT4D solution. I hope to have provided useful insight into the developer’s perspective, one that I think ICT4D program managers should consider when evaluating open source projects. I realize selecting projects that hold themselves to the highest standard on all of these points may be a difficult task, so as with many things, deficiencies in one area may be made up for with excellence in others. Similarly, implementing all of the above points on an open source project you release will not result in a sudden wave of contributions from volunteer developers, but the more you can do, the more you’ll lower the barrier to entry for developers and facilitate community growth.

I hope to update this post from time to time with new ideas and approaches for evaluating open source projects for use in ICT4D, so if you have any questions, comments, or suggested additions, please leave them in the comments section below. I look forward to your feedback!

Caktus GroupDurham Women in Tech (D-WiT) Starts Strong

This past Tuesday we held our very first gathering for the new Durham Women in Tech (D-WiT) Meetup group. There was a huge turnout and a lot of enthusiasm for the community we’re seeking to support and build. It was particularly wonderful to see our recently opened Astro Code School space full of people.

We began with a short mingling period. I loved hearing everyone’s stories as to why they had come. I met a wide variety of women involved or interested in tech, from students just learning to code and looking for more support in that arena, to professionals with long careers hoping to learn effective methods for shaping a more inclusive culture within the tech industry.

Hao delivered a short presentation on the evening’s topic: imposter syndrome, or the feeling that you’ve flown in under the radar and are about to be found out. The feelings of incompetency and anxiety it evokes can be triggered by doing something new, a tendency towards perfectionism, or being different from those around you. For women—and especially for women of color—being different is often a de facto situation in a male-dominated field.

More important than the discussion of what imposter syndrome is, was the discussion of how to combat it. Attendees split into four groups to offer their own personal experiences with imposter syndrome as well as the tools and methods they’ve developed for resisting it. It was such a rewarding experience to walk away with viable solutions and methods for learning to internalize one’s achievements.

Our next meeting will be in July, and I don’t think I’m alone in my excitement to meet again with this new circle of support within the local tech community.

Astro Code SchoolAstro Launches in Durham

Astro Code School Director Brian Russell tells Durham Mayor Bill Bell about the school

On Friday, May 1, we held our launch party. A lot of people showed up to welcome Astro Code School to Durham and learn about what we do. I had a great time telling our story to guests. Plus it was fun to meet Mayor Bell!

As a resident of the City of Durham I love working Downtown. It's close to where I live, convenient to a lot of great food and drink, and a great place to run into cool people all the time. I feel as if I'm part of something really awesome at a cool time in Durham history.

Astro's mission to educate people really fits well with a community that's committed to serving others. I first learned about this awesome attribute of Durhamites from friends who work at local non-profits. Inspired by them, I joined AmeriCorps in 2004 as a technology VISTA at the Durham Literacy Center. This experience gave me quite an education and was a big influence on me.

A giant thanks to all the people at Caktus Consulting Group who helped organize the event. Without them it wouldn’t have been possible.

Caktus CTO Colin Copeland, Durham Mayor Bill Bell, and Caktus CBO Alex Leman

We’re right in downtown Durham at 108 Morris Street. I hope that when you have a moment you'll stop in and say hi.

Caktus GroupCakti at CRS ICT4D 2015

This is Caktus’ first year taking part in the Catholic Relief Services’ (CRS) Information and Communication Technologies for Development (ICT4D) conference. The theme of this year’s conference is increasing the impact of aid and development tools through innovation. We’re especially looking forward to all of the speakers from organizations like the International Rescue Committee, USAID, World Vision, and the American Red Cross. In fact, the offerings are so vast, we thought we would provide a little cheat sheet to help you find Cakti throughout this year’s conference.

Wednesday, May 27th

How SMS Powers Democracy in Libya Vinod Kurup will explain how Caktus used RapidSMS, a Django-based SMS framework, to build the world’s first SMS voter registration system in Libya.

Commodity Tracking System (CTS): Tracking Distribution of Commodities Jack Byrne, the International Rescue Committee’s (IRC) Syria Response Director, will present on the Caktus-built system IRC uses to track humanitarian aid for Syrian refugees.

Friday, May 29th

Before the Pilot: Planning for Scale Caktus’ CTO Colin Copeland will be part of a panel discussion on what technology concepts matter most at the start of a project and the various challenges of pilot programs. Also on the panel will be Jake Watson of IRC and Jeff Wishnie of MercyCorps. Hao Nguyen, Caktus’ Strategy Director, will moderate.

Leveraging the Open Source Community for Truly Sustainable ICT4D CEO Tobias McNulty will provide his insider’s perspective on the open source community and how to best use that community in the development of ICT4D tools and solutions.

Wednesday, Thursday, and Friday

Throughout the conference you can stop by the Caktus booth to read more about our ICT4D projects and services, meet Cakti, or play one of the mini-games from our Epic Allies app.

Not attending the conference? You can follow @caktusgroup and #ICT4D2015 for live updates!

Caktus GroupPyPy.js: What? How? Why? by Ryan Kelly (PyCon 2015 Must-See Talk: 5/6)

Part five of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

From Ryan Kelly's talk I learned that it is actually possible, today, to run Python in a web browser (not something that interprets Python-like syntax and translates it into JavaScript, but an actual Python interpreter!). PyPy.js combines two technologies, PyPy (the Python interpreter written in Python) and Emscripten (an LLVM-to-JavaScript converter, typically used for getting games running in the browser), to run PyPy in the browser. This talk is a must-see for anyone who's longed before to write client-side Python instead of JavaScript for a web app. While realistically being able to do this in production may still be a ways off, at least in part due to the multiple megabytes of JavaScript one needs to download to get it working, I enjoyed the view Ryan's talk provided into the internals of this project. PyPy itself is always fascinating, and this talk made it even more so.


More in the PyCon 2015 Must-See Talks Series.

Caktus GroupAnnouncing the New Durham Women in Tech (DWiT) Meetup

We’re pleased to officially announce the launch of a new meetup: Durham Women in Tech (DWiT). Through group discussions, lectures, panels, and social gatherings, we hope to provide a safe space for women in small and medium-sized Durham tech firms to share challenges, ideas, and solutions. We especially want to support women on the business side in roles such as operations, marketing, business development, finance, and project management.

A small group of us at Caktus decided to start DWiT after being unable to find a local group for those in similar positions to us: we work on the business side and, as part of a growing company, wear many hats. Our roles often include implementing new processes and policies, tasks that influence culture and corporate direction. We have a seat at the table, but it’s not always clear how to help our companies move forward. How do we work towards removing the barriers women face in the tech industry within our roles? How do we help ourselves and our teams when faced with gendered challenges?

By pulling together a group of similar women, we hope to pool everyone’s experiences into a shared resource. We’ve seen the power of communities for female developers through the organizations Caktus supports internationally and locally with mentors and sponsorship, including, amongst others, Girl Develop It RDU, PyLadies RDU, DjangoGirls, and Pearl Hacks. We’re looking forward to strengthening the resources for women in technology in Durham.

Our inaugural meeting is on Tuesday, May 26th at 6 pm. We will be discussing imposter syndrome, a name given for those unfortunate moments where one feels like an imposter, despite external evidence to the contrary. RSVP by joining our meetup group.

Caktus GroupKeynote by Catherine Bracy (PyCon 2015 Must-See Talk: 4/6)

Part four of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

My recommendation would be Catherine Bracy's Keynote about Code for America. Cakti should be familiar with Code for America. Colin Copeland, Caktus CTO, is the founder of Code for Durham and many of us are members. Her talk made it clear how important this work is. She was funny, straight-talking, and inspirational. For a long time before I joined Caktus, I was a "hobbyist" programmer. I often had time to program, but wasn't sure what to build or make. Code for America is a great opportunity for people to contribute to something that will benefit all of us. I have joined Code for America and hope to contribute locally soon through Code for Durham.


More in the PyCon 2015 Must-See Talks Series.

Caktus GroupQ2 2015 ShipIt Day ReCap

Last Friday everyone at Caktus set aside their regular client projects for our quarterly ShipIt Day, a chance for Caktus employees to take some time for personal development and independent projects. People work individually or in groups to flex their creativity, tackle interesting problems, or expand their personal knowledge. This quarter’s ShipIt Day saw everything from game development to Bokeh data visualization, Lego robots to superhero animation. Read more about the various projects from our Q2 2015 ShipIt Day.


Victor worked on our version of Ultimate Tic Tac Toe, a hit at PyCon 2015. He added in Jeff Bradbury’s artificial intelligence component. Now you can play against the computer! Victor also cleaned up the code and open sourced the project, now available here: github.com/caktus/ultimatetictactoe.

Philip dove into @total_ordering, a Python class decorator that fills in the missing comparison methods for classes that define only some of them. Philip was curious as to why @total_ordering is necessary, and what might be the consequences of NOT using it. He discovered that though it is helpful in defining orderable classes, it is not as helpful as one would expect. In fact, rather than speeding things up, adding @total_ordering actually slows things down. But, he concluded, you should still use it to cover certain edge cases.
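For context, @total_ordering lives in functools: you define __eq__ plus one ordering method, and the decorator fills in the rest. A minimal sketch:

from functools import total_ordering


@total_ordering
class Version:
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def __eq__(self, other):
        return (self.major, self.minor) == (other.major, other.minor)

    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)


# total_ordering supplies __le__, __gt__, and __ge__ automatically
assert Version(1, 4) > Version(1, 2)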

Karen updated our project template, the foundation for nearly all Caktus projects. The features she worked on will save us all a lot of time and daily annoyance. These included pulling DB from deployed environments, refreshing the staging environment from production, and more.

Erin explored Bokeh, a Python interactive data visualization library. She initially learned about building visualizations without JavaScript during PyCon (check out the video she recommended by Sarah Bird). She used Bokeh and the Google API to display data points on a map of Africa for potential use in one of our social impact projects.

Jeff B worked on a Lisp implementation in Python. PyPy is written in a restricted version of Python (called RPython) and compiled down into highly efficient C or machine code. By implementing a toy version of Lisp on top of PyPy machinery, Jeff learned about how PyPy works.

Calvin and Colin built the beginnings of a live style guide into Caktus’ Django-project-template. The plan was loosely inspired by Mail Chimp's public style guide. They hope to eventually have a comprehensive guide of front-end elements to work with. Caktus will then be able to plug these elements in when building new client projects. This kind of design library should help things run smoothly between developers and the design team for front-end development.

Neil experimented with Mercury hoping the speed of the language would be a good addition to the Caktus toolkit. He then transitioned to building a project in Elm. He was able to develop some great looking hexagonal data visualizations. Most memorable was probably the final line of his presentation: “I was hoping to do more, but it turns out that teaching yourself a new programming language in six hours is really hard.” All Cakti developers nodded and smiled knowingly.

Caleb used Erlang and cowboy to build a small REST API. With more time, he hopes to provide a REST API that will provide geospatial searches for points of interest. This involves creating spatial indexes in Erlang’s built-in Mnesia database using geohashes.

Mark explored some of the issues raised in the Django-project-template and developed various fixes for them, including the way secrets are managed. Now anything that needs to be encrypted is encrypted with a public key generated when you bring up the SALT master. This fixes a very practical problem in the development workflow. He also developed a Django-project-template Heroku-style deploy, setting up a proof of concept project with a “git push” to deploy workflow.

Vinod took the time to read fellow developer Mark Lavin’s book Lightweight Django while I took up DRiVE by Daniel H. Pink to read about what motivates people to do good work or even complete rote tasks.

Scott worked with Dan to compare Salt states to Ansible playbooks. In addition, Dan took a look at Ember, working with the new framework as a potential for front-end app development. He built two simple apps, one for organizing albums in a playlist, and one for to-do lists. He had a lot of fun experimenting and working with the new framework.

Edward and Lucas built a minigame for our Epic Allies app. It was a fun, multi-slot, pinball machine game built with Unity3D.

Hunter built an HTML5 game using Phaser.js. Though he didn’t have the time to make a fully fledged video game, he did develop a fun looking boardgame with different characters, abilities, and animations.

NC developed several animations depicting running and jumping to be used to animate the superheroes in our Epic Allies app. She loved learning about human movement, creating realistic animations, and outputting the files in ways that will be useful to the rest of the Epic Allies team.

Wray showed us an ongoing project of his: a front-end framework called sassless, “the smallest CSS framework available.” It consists of front-end elements that allow you to set up a page in fractions so that they stay in position when resizing a browser window (to a point) rather than the elements stacking. In other words, you can build a responsive layout with a very lightweight CSS framework.

One of the most entertaining projects of the day was the collaboration between Rebecca C and Rob, who programmed Lego-bots to dance in a synced routine using the Lego NXT software. Aside from it being a lot of fun to watch robots (and coworkers) dance, the presence of programmable Lego-bots prompted a very welcome visit from Calvin’s son Caelan, who at the age of 9 is already learning to code!

Caktus GroupInteractive Data for the Web by Sarah Bird (PyCon 2015 Must-See Talk: 3/6)

Part three of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

Sarah Bird's talk made me excited to try the Bokeh tutorials. The Bokeh library has very approachable methods for creating data visualizations inside of Canvas elements all via Python. No JavaScript necessary. Who should see this talk? Python developers who want to add a beautiful data visualization to their websites without writing any JavaScript. Also, Django developers who would like to use QuerySets to create data visualizations should watch the entire video, and then rewind to minute 8:50 for instructions on how to use Django QuerySets with a couple of lines of code.

After the talk, I wanted to build my own data visualization map of the world with plot points for one of my current Caktus projects. I followed up with one of the friendly developers from Continuum Analytics to find out that you do not need to spin up a separate Bokeh server to get your data visualizations running via Bokeh.
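For anyone curious what "no JavaScript" looks like in practice, here is a minimal standalone sketch using the bokeh.plotting interface (the data is made up); it writes a self-contained HTML file, no Bokeh server required:

from bokeh.plotting import figure, output_file, show

# Made-up sample data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("scatter.html")  # a standalone HTML page

p = figure(title="Example scatter plot", x_axis_label="x", y_axis_label="y")
p.circle(x, y, size=10)

show(p)  # opens the generated page in a browser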

Astro Code SchoolFall Registration Now Open

Registration for the fall Python & Django Web Engineering class is open. You can fill out the application form on the Apply page and get more details on the application Process page. The deadline for applying is August 24, 2015. You can find a full syllabus for this class over on its page, be102.

This class is twelve weeks long and full time Monday to Friday from 9 AM – 5 PM. It'll be taught here at the Astro Code School at 108 Morris Street, Suite 1b, Durham, NC.

Python and Django make a powerful team to build maintainable web applications quickly. When you take this course you will build your own web application during lab time with assistance from your teacher and professional Django developers. You’ll also receive help preparing your portfolio and resume to find a job using the skills you’ve learned.

Please contact me if you have any questions.

Caktus GroupCakti Comment on Django's Class-based Views

After PyCon 2015, we were surprised to realize how many of the Cakti who attended had been asked about Django's class-based views (CBVs). We talked about why this might be, and this is a summary of what we came up with.

Lead Front-End Developer Calvin Spealman has noticed that there are many more tutorials on how to use CBVs than on how to decide whether to use them.

Astro Code School Lead Instructor Caleb Smith reminded us that while "less code" is sometimes given as an advantage of using CBVs, it really depends on what you're doing. Each case is different.

I pointed out that there seem to be some common misconceptions about CBVs.

Misconception: Functional views are deprecated and we're all supposed to be writing class-based views now.

Fact: Functional views are fully supported and not going anywhere. In many cases, they're a good choice.

Misconception: Using CBVs means using the generic class-based views that Django provides.

Fact: You can use as much or as little of Django's generic views as you like, and still be using class-based views. I like Vanilla Views as a simpler, easier to understand alternative to Django's generic views that still gives all the advantages of class-based views.

So, when to use class-based views? We decided the most common reason is if you want to reuse code across views. This is common, for example, when building APIs.

Caktus Technical Director Mark Lavin has a simple answer: "I default to writing functions and refactor to classes when needed writing Python. That doesn't change just because it's a Django view."

On the other hand, Developer Rebecca Muraya and I tend to just start with CBVs, since if the view will ever need to be refactored that will be a lot easier if it was split up into smaller bits from the beginning. And so many views fall into the standard patterns of Browse, Read, Edit, Add, and Delete that you can often implement them very quickly by taking advantage of a library of common CBVs. But I'll fall back to Mark's system of starting with a functional view when I'm building something that has pretty unique behavior.
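As a concrete illustration of that reuse argument, here is the same listing view written both ways; the Article model and template name are hypothetical, not from any particular Caktus project:

from django.shortcuts import render
from django.views.generic import ListView

from news.models import Article  # hypothetical app and model


# Function-based view: explicit and easy to follow
def article_list(request):
    articles = Article.objects.order_by("-published")
    return render(request, "news/article_list.html", {"articles": articles})


# Class-based view: the same behavior, but easy to extend via subclassing or mixins
class ArticleListView(ListView):
    queryset = Article.objects.order_by("-published")
    template_name = "news/article_list.html"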

Tim HopperHow I Became a Data Scientist Despite Having Been a Math Major

Caution: the following post is laden with qualitative extrapolation of anecdotes and impressions. Perhaps ironically (though perhaps not), it is not a data driven approach to measuring the efficacy of math majors as data scientists. If you have a differing opinion, I would greatly appreciate it if you would carefully articulate it and share it with the world.

I recently started my third "real" job since finishing school; at my first and third jobs I have been a "data scientist". I was a math major in college (and pretty good at it) and spent a year in the math Ph.D. program at the University of Virginia (and performed well there as well). These two facts alone would not have equipped me for a career in data science. In fact, it remains unclear to me that those two facts alone would have prepared me for any career (with the possible exception of teaching) without significantly more training.

When I was in college Business Week published an article declaring "There has never been a better time to be a mathematician." At the time, I saw an enormous disconnect between the piece and what I was being taught in math classes (and thus what I considered to be a "mathematician"). I have come across other pieces lauding this as the age of the mathematicians, and more often than not, I've wondered if the author knew what students actually studied in math departments.

The math courses I had as an undergraduate were:

  • Linear algebra
  • Discrete math
  • Differential equations (ODEs and numerical)
  • Theory of statistics 1
  • Numerical analysis 1 (numerical linear algebra) and 2 (quadrature)
  • Abstract algebra
  • Number theory
  • Real analysis
  • Complex analysis
  • Intermediate analysis (point set topology)

My program also required a one semester intro to C++ and two semesters of freshman physics. In my year as a math Ph.D. student, I took analysis, algebra, and topology classes; had I stayed in the program, my future coursework would have been similar: pure math where homework problems consisted almost exclusively of proofs done with pen and paper (or in LaTeX).

Though my current position occasionally requires mathematical proof, I suspect that is rare among data scientists. While the "data science" demarcation problem is challenging (and I will not seek to solve it here), it seems evident that my curriculum lacked preparation in many essential areas of data science. Chief among these are programming skill, knowledge of experimental statistics, and experience with math modeling.

Few would argue that programming ability is not a key skill of data science. As Drew Conway has argued, a data scientist need not have a degree in computer science, but "Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker." Many of my undergrad peers, having briefly seen C++ freshman year and occasionally used Mathematica to solve ODEs for homework assignments, would have been unaware that manipulation of a file from the command-line was even possible, much less have been able to write a simple sed script; my grad school classmates were little different.

Many data science positions require even more than the ability to solve problems with code. As Trey Causey has recently explained, many positions require understanding of software engineering skills and tools such as writing reusable code, using version control, software testing, and logging. Though I gained a fair bit of programming skill in college, these skills, now essential in my daily work, remained foreign to me until years later.

My math training was also light on statistics courses. Though my brief exposure to mathematical statistics has been valuable in picking up machine learning, experimental statistics was missing altogether. Many data science teams are interested in questions of causal inference and design and analysis of experiments; some would make these essential skills for a data scientist. I learned nothing about these topics in math departments. Moreover, machine learning, also a cornerstone of data science, is not a subject I could have even defined until after I was finished with my math coursework; at the end of college, I would have said artificial intelligence was mostly about rule-based systems in Lisp and Prolog.

Yet even if statistics had played a more prominent role in my coursework, those who have studied statistics know there is often a gulf between understanding textbook statistics and being able to effectively apply statistical models and methods to real world problems. This is only an aspect of a bigger issue: mathematical (including statistical) modeling is an extraordinarily challenging problem, but instruction on how to effectively model real world problems is absent from many math programs. To this day, defining my problem in mathematical terms is one of the hardest problems I face; I am certain that I am not alone on this. Though I am now armed with a wide variety of mathematical models, it is rarely clear exactly which model can or should be applied in a given situation.

I suspect that many people, even technical people, are uncertain as to what academic math is beyond undergraduate calculus. Mathematicians mostly work in the logical manipulation of abstractly defined structures. These structures rarely bear any necessary relationship to physical entities or data sets outside the abstractly defined domain of discourse. Though some might argue I am speaking only of "pure" mathematics, this is often true of what is formally known as "applied mathematics". John D. Cook has made similar observations about the limitations of pure and applied math (as proper disciplines) in dubbing himself a "very applied mathematician". Very applied mathematics is "an interest in the grubby work required to see the math actually used and a willingness to carry it out. This involves not just math but also computing, consulting, managing, marketing, etc." These skills are conspicuously absent from most math curricula I am familiar with.

Given this description of how my schooling left me woefully unprepared for a career in data science, one might ask how I have had two jobs with that title. I can think of several (though probably not all) reasons.

First, the academic study of mathematics provides much of the theoretical underpinnings of data science. Mathematics underlies the study of machine learning, statistics, optimization, data structures, analysis of algorithms, computer architecture, and other important aspects of data science. Knowledge of mathematics (potentially) allows the learner to more quickly grasp each of these fields. For example, learning how principal component analysis—a math model that can be applied and interpreted by someone without formal mathematical training—works will be significantly easier for someone with earlier exposure to linear algebra. On a meta-level, training in mathematics forces students to think carefully and solve hard problems; these skills are valuable in many fields, including data science.

My second reason is connected to the first: I unwittingly took a number of courses that later played important roles in my data science toolkit. For example, my current work in Bayesian inference has been made possible by my knowledge of linear algebra, numerical analysis, stochastic processes, measure theory, and mathematical statistics.

Third, I did a minor in computer science as an undergraduate. That provided a solid foundation for me when I decided to get serious about building programming skill in 2010. Though my academic exposure to computer science lacked any software engineering skills, I left college with a solid grasp of basic data structures, analysis of algorithms, complexity theory, and a handful of programming languages.

Fourth, I did a masters degree in operations research (after my year as a math PhD student convinced me pure math wasn't for me). This provided me with experience in math modeling, a broad knowledge of mathematical optimization (central to machine learning), and the opportunity to take graduate-level machine learning classes.1

Fifth, my insatiable curiosity in computers and problem solving has played a key role in my career success. Eager to learn something about computer programming, I taught myself PHP and SQL as a high school student (to make Tolkien fan sites, incidentally). Having been given small Mathematica-based homework assignments in freshman differential equations, I bought and read a book on programming Mathematica. Throughout college and grad school, I often tried—and sometimes succeeded—to write programs to solve homework problems that professors expected to be solved by hand. This curiosity has proven valuable time and time again as I've been required to learn new skills and solve technical problems of all varieties. I'm comfortable jumping in to solve a new problem at work, because I've been doing that on my own time for fifteen years.

Sixth, I have been fortunate enough to have employers who have patiently taught me and given me the freedom to learn on my own. I have learned an enormous amount in my two and a half year professional career, and I don't anticipate slowing down any time soon. As Mat Kelcey has said: always be sure you're not the smartest one in the room. I am very thankful for three jobs where I've been surrounded by smart people who have taught me a lot, and for supervisors who trust me enough to let me learn on my own.

Finally,4 it would be hard for me to overvalue the four and a half years of participation in the data science community on Twitter. Through Twitter, I have the ear of some of data science's brightest minds (most of whom I've never met in person), and I've built a peer network that has helped me find my current and last job. However, I mostly want to emphasize the pedagogical value of Twitter. Every day, I'm updated on the release of new software tools for data science, the best new blog posts for our field, and the musings of some of my data science heroes. Of course, I don't read every blog post or learn every software tool. But Twitter helps me to recognize which posts are most worth my time, and because of Twitter, I know something instead of nothing about Theano, Scalding, and dplyr.2

I don't know to what extent my experience generalizes3, in either the limitations of my education or my analysis of my success, but I am obviously not going to let that stop me from drawing some general conclusions.

For those hiring data scientists, recognize that mathematics as taught might not be the same mathematics you need from your team. Plenty of people with PhDs in mathematics would be unable to define linear regression or bloom filters. At the same time, recognize that math majors are taught to think well and solve hard problems; these skills shouldn't be undervalued. Math majors are also experienced in reading and learning math! They may be able to read academic papers and understand difficult (even if new) mathematical ideas more quickly than a computer scientist or social scientist. Given enough practice and training, they would probably be excellent programmers.

For those studying math, recognize that the field you love, in its formal sense, may be keeping you away from enjoyable and lucrative careers. Most of your math professors have spent their adult lives solving math problems on paper or on a chalkboard. They are inexperienced and, possibly, unknowledgeable about very applied mathematics. A successful career in pure mathematics will be very hard and will require you to be very good. While there seem to be lots of jobs in teaching, they will rarely pay well. If you're still a student, you have a great opportunity to take control of your career path. Consider taking computer science classes (e.g. data structures, algorithms, software engineering, machine learning) and statistics classes (e.g. experimental design, data analysis, data mining). For both students and graduates, recognize that your math knowledge becomes very marketable when combined with skills such as programming and machine learning; there are a wealth of good books, MOOCs, and blog posts that can help you learn these things. Moreover, the barrier to entry for getting started with production quality tools has never been lower. Don't let your coursework be the extent of your education. There is so much more to learn!5


  1. At the same time, my academic training in operations research failed me, in some aspects, for a successful career in operations research. For example, practical math modeling was not sufficiently emphasized and the skills of computer programming and software development were undervalued. 

  2. I have successfully answered more than one interview question by regurgitating knowledge gleaned from tweets. 

  3. Among other reasons, I didn't really plan to get where I am today. I changed majors no fewer than three times in college (physics, CS, and math) and essentially dropped out of two PhD programs! 

  4. Of course, I have plenty of data science skills left to learn. My knowledge of experimental design is still pretty fuzzy. I still struggle with effective mathematical modeling. I haven't deployed a large scale machine learning system to production. I suck at software logging. I have no idea how deep learning works. 

  5. For example, install Anaconda and start playing with some of these IPython notebooks. 

Tim HopperPublishing a Static Site Generator from iOS

A few weeks ago, I set up Travis CI so this Pelican-based blog will publish itself when I commit a new post to Github.

At the time, I asked on Twitter if there were any good Git clients that would allow me to push new posts from my iPad; I didn't get any promising replies.

However, I just found out about an app called Working Copy, "a powerful Git client for iOS 8 that clones, edits, commits, pushes, and more."

I just cloned my Stigler Diet repo on my iPad, and I'm composing this post from the Whole Foods cafe on my iPad. If you're reading this post, it's because I successfully published it from here!

Astro Code SchoolVideo - Tips for Using Generators in Python

Here's the third screencast video in Caleb Smith's series about functional programming in Python. This one describes generators, iterators and iterables in Python with some tips on how to implement generators.
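As a quick refresher on the distinction the video draws, a generator function uses yield to produce values lazily, and the generator it returns is itself an iterator:

def countdown(n):
    # Calling a generator function returns a generator object;
    # the body only runs as values are requested
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
print(next(gen))   # 3
print(list(gen))   # [2, 1]

# Any iterable, including a generator, works in a for loop
for value in countdown(2):
    print(value)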

Don't forget to subscribe to the Astro Code School YouTube channel. Lots more educational screencasts to come.

Caktus GroupBeyond PEP 8 by Raymond Hettinger (PyCon 2015 Must-See Talk: 2/6)

Part two of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

I think everyone who codes in any language and uses any automated PEP-8 or linter sort of code checker should watch this talk. Unfortunately to go into any detail on what I learned (or really was reminded of) would ruin the effect of actually watching the talk. I'd encourage everyone to watch it. I came away from the talk wanting to figure out a way to incorporate its lesson into our Caktus development practices.

Frank WierzbickiJython 2.7.0 final released!

On behalf of the Jython development team, I'm pleased to announce that the final release of Jython 2.7.0 is available! It's been a long road to get to 2.7, and it's finally here! I'd like to thank Amobee for sponsoring my work on Jython. I'd also like to thank the many contributors to Jython, including - but not limited to - bug reports, patches, pull requests, documentation changes, support emails, and fantastic conversation on Freenode at #jython.

Along with language and runtime compatibility with CPython 2.7.0, Jython 2.7 provides substantial support of the Python ecosystem. This includes built-in support of pip/setuptools (you can use it with bin/pip) and a native launcher for Windows (bin/jython.exe), with the implication that you can finally install Jython scripts on Windows.

Jim Baker presented a talk at PyCon 2015 about Jython 2.7, including demos of new features.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at Maven Central. There are three main distributions. To see all of the files available, including checksums, go here and navigate to the appropriate distribution and version.

Astro Code SchoolVideo - Implementing Decorators in Python

This screencast provides some insights into implementing decorators in Python using functional programming concepts and demonstrates some instances where decorators can be useful.

In the video, I reference the blog post Python Decorators in 12 Steps by Simeon Franklin for further reading.
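For readers who want a quick taste before watching: a decorator is just a callable that takes a function and returns a wrapped version of it. A common minimal pattern (not necessarily the exact example used in the video) looks like this:

from functools import wraps


def shout(func):
    # A decorator: take a function, return a wrapped version of it
    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return "hello, {}".format(name)


print(greet("world"))  # HELLO, WORLD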
