A planet of blogs from our members...

Tim Hopper: Backyard Macro Videography

I've been munching on sunflower seeds while working on my back patio, and some tiny ants (Monomorium minimum, I think) have been enjoying the leftovers.

I pulled out my camera, 50mm lens, and extension tube to experiment with macro videography. The result is quite fun!

Here's what the recording setup looked like.

Caktus Group: 3 Reasons to Upgrade to the Latest Version of Django

When considering a website upgrade, many business stakeholders probably think about the frontend, i.e., how the website looks or the features users interact with. Perhaps less often considered is the importance of upgrading the backend; that is, the databases, applications, and servers powering all the behind-the-scenes activity. Infrastructure support and upgrades are necessary but often performed as a separate project from any improvements to design or user experience, rather than as part of a holistic update project.

With that in mind, it helps to have an understanding of why upgrading the backend should be considered a necessary part of any website upgrade project. We offer 3 reasons, focusing on our specialty of Django-based websites. Upgrading:

  • increases security,
  • reduces development and maintenance costs, and
  • ensures support for future growth.

Read on for details about each of these factors, or get in touch to speak with us about them.

Increase the security of your site

The Django framework is continually being improved and certain releases are designated as “Long Term Support” (LTS) versions. LTS versions receive security updates and bug fixes for a three-year period, as opposed to the usual 18 months. When your website uses an unsupported version of Django, newly uncovered bugs are not being fixed, patched, or supported by the Django and Open Source communities. No new security fixes are planned for retired versions, a situation that carries a number of risks.

These risks come in the form of vulnerabilities - weaknesses that leave your site open to attack. Attacks could potentially cause servers to go down, data to be leaked or stolen, or features to stop working. If a vulnerability is taken advantage of, it could lead to a loss of reputation and potentially a loss of revenue or legal ramifications. With high consumer expectations and increasing requirements from international data protection laws, this could prove disastrous for organizations or web applications without stringent upgrade plans in place.

If your site is using an older version of Django, a security patch may not be released for your version of Django. This means that a fix for the vulnerability would have to be authored and implemented by your development team, which, over time, is less cost effective than upgrading to the LTS version.

Upgrading to an LTS release offers significant benefits, including security updates as needed. Fixes for security issues and vulnerabilities are implemented quickly. There is no need to implement fixes yourself (or hire out expensive custom work). Taking proactive steps to upgrade reduces risk and can save you the trouble of expensive, reactive steps in the event of a cyberattack.

Reduce development and maintenance costs

In addition to improving security and ensuring support for future growth, upgrading also offers productivity benefits for development teams. Many extra lines of code may be required in order to continue to backport fixes for your website or app as issues occur or features are added. Adding all this code and continuing to use old versions of Django will eventually lead to technical debt, where the short-term fixes and outdated code end up creating extra work to patch and maintain a project in the long run.

Custom fixes and patches also introduce a large learning curve for new developers or contractors. The issue here is two-fold: Onboarding new developers is more time consuming than it needs to be, and if key personnel leave, you may lose knowledge which is integral to maintaining or updating the project.

Upgrading your version of Django reduces technical debt by eliminating old, no-longer-needed code. It also allows your development team to reduce the time and money spent on addressing security issues and bug fixes, freeing up time for them to work on website improvements or revenue-generating work.

Ensure support for future growth

Extensibility is the practice of keeping future growth in mind when working on a development project. We often hear from potential clients who built a website or web app in the early days of their business, when releasing features quickly took precedence over planning for future growth. Whether that growth is in the form of more data, more users, or more functionality, planning for it impacts current design and development decisions. When growth isn’t considered, scaling up the project and adding new features requires a disproportionate amount of work. If the original development was not intended to support the changes being made, custom workarounds must be introduced.

Where does this leave your web project? Technologically out of date, unnecessarily clunky, and less able to deliver a quality experience to site visitors.

Upgrading Django from an out-of-date version to a more recent LTS version not only provides access to software that is constantly receiving bug and security fixes; it also simplifies the upgrade process when a new version of Django is released with a feature needed by your project. If your project is two, three, even four releases behind, upgrading all at once could be cost-prohibitive. By regularly upgrading, you gain near-immediate access to new features in Django if and when needed. In other words, you can depend on a highly-engaged developer community actively working to add features rather than reinventing the wheel by developing them yourself.

Next steps

The wider open source development community is producing great tools and enhancements every day and the community continues to grow in size and support. Your project may find itself left behind if Django is left unsupported - or growing along with the community if you upgrade.

So where to get started? For clients considering an upgrade, we generally advise moving up to the most recent LTS release. While the latest version of Django offers the newest features, the LTS version represents a version of Django that will be more cost efficient to maintain given the community’s three-year commitment to releasing updates for it.

As Django specialists, Caktus developers have experience upgrading and maintaining Django-based websites. We have successfully completed upgrades for numerous clients and offer an upgrade protection plan that ensures your site will be maintained year to year as well as updated to the most recent LTS version of Django. Sound good? Get in touch to start the process of upgrading and securing your website, or take a look at some of our other services if you’ve got a larger project in mind.

Tim Hopper: Adversarial Learning: Stories of Degradation and Humiliation

My friends Andrew and Joel were kind enough to have me back on their podcast Adversarial Learning. We shared our tales of bad data science interviews. Enjoy!

Caktus Group: Uniqueness is an Advantage

Back in March, the organizers from the Women in Tech summit asked if I’d like to collaborate on a panel on diversity in technology at their Philadelphia summit. Back in October, I had created a panel on “Staying a Women in Tech” and was excited for the opportunity to speak on such a significant topic in my hometown of Philadelphia. I was introduced to Brigitte Daniel, who had submitted this new panel, and we began putting the panel plans together. Brigitte would moderate the discussion and I would be a panelist along with three other dynamic women in tech: Elise Wei, Jumoke Dada and Gulrukh Ahanger.

Brigitte, founder of Mogulette, a program centered “around educating, mentoring, and empowering women, with a focus on women of color who are interested in careers in business and technology”, thought it would be a great ice breaker to begin the panel with a clip of Bozoma Saint John at WWDC 2016. Saint John’s ability to connect with her audience on a massive scale set an example for other women of color in the tech industry looking for ways to become more visible and make a name for themselves.

After panel introductions, with panelists ranging from developers and managers to founders and chapter leaders of non-profit organizations that help women learn to code, we began chatting about how we could use our uniqueness to our advantage on our respective teams. Here are some of the highlights captured by our audience on Twitter:

Gulrukh Ahanger

Jumoke Dada

Elise Wei

Erin Mullaney

Overall, I was incredibly happy with how our panel turned out. I definitely heard things from other panelists that I could take back with me and think about. It’s an incredible feeling to be with a group of smart women who are there to help lift each other up.

Gif via Libby VanderPloeg on Giphy

See what other events Cakti have participated in or check out these talks.

Caktus Group: Caktus Consulting Group is an Official AWS Consulting Partner

We’re proud to announce that Caktus has become a certified Amazon Web Services (AWS) Consulting Partner in recognition of the depth and breadth of our AWS expertise. Since AWS became an option for fast, flexible, and low cost infrastructure, we’ve used it to build scalable web or cloud apps for our clients. We’ve used AWS services for computing, networking, storage, databases, security, and application services for 10 clients over the last few years (and that’s not including the projects we do for fun or as part of ShipIt Day projects).

In addition to our client experience, we have 7 individual AWS certifications amongst our staff. AWS Certification is industry-recognized and demonstrates a thorough, tested knowledge of Amazon Web Services.

We’re looking forward to building more apps with AWS as our top pick for cloud computing services. Joining the Amazon Web Services Partner Network puts us in good company and grants us access to a special range of tools that we can put to work for our clients. To learn more about how we use AWS to deliver highly scalable apps, please contact us.

Or, check out a few of our top blog posts on working with AWS:

Caktus Group: Celebrating 10 Years of Building Web Apps the Right Way

This year marks 10 years of building sharp web apps at Caktus Group. We’re honored by the trust our clients have put in us; it has enabled Caktus to grow from a team of 3 Python developers to an organization of 31 people and supported our efforts to give back to the local and open source communities.

What do Caktus staff have to say about this milestone?

Looking back

Looking back on her 7 years of work at Caktus, Karen says, “It’s been fun to go from 6 people to 30, be a part of that growth and work on communication and defining roles.” She cites her enjoyment of building long-term customer relationships and being able to nurture projects from their early development to completion and improvement over time as her main reason for sticking with Caktus.

Mark, also with Caktus for 7 years, is proud of how his work here and the support of his colleagues provided opportunities to speak at community events and co-author Lightweight Django. He adds, “I feel like I've grown with this company and it's grown with me. I'm proud to say I work here. I remember my first DjangoCon when we were 5-6 people. I told someone I worked at Caktus and they said ‘Oh, I've heard of you guys.’ I knew then that we were doing something right.”

Caktus Top 10s

Top 10 Caktus GitHub contributions

  1. django-project-template
  2. django-scribbler
  3. django-treenav
  4. django-pagelets
  5. fabulaws
  6. django-email-bandit
  7. margarita
  8. django-file-picker
  9. django-jsx
  10. django-comps

Top 10 blog posts in the last year

As part of our commitment to giving back to the development community, we maintain a technical blog with tips for Django, Python, UX, and more. The collection has gotten pretty big over the years! Our 10 most popular blog posts, in order of views, are:

  1. Using Amazon S3 to Store Your Django Site's Static and Media Files
  2. Getting Started Scheduling Tasks with Celery
  3. Migrating to a Custom User Model in Django
  4. Getting Started using Python in Eclipse
  5. Configuring a Jenkins Slave
  6. Custom JOINs with Django's query.join()
  7. Best Python Libraries
  8. Celery in Production
  9. Django Logging Configuration: How the Default Settings Interfere with Yours
  10. Writing Unit Tests for Django Migrations

Building Apps the Right Way

Even with those top 10s, one of the things we’re most proud of is our dedication to building web apps the right way. Caktus CEO Tobias McNulty says, “Doing things right is an ethos that extends to all areas of the business - not just app development. We’ve spent the last 10 years working to implement processes and systems that ensure we continually deliver excellent work to our clients and treat our staff with fairness and respect.”

Caktus' core values guide both our internal and external interactions and have played a key part in growing Caktus to where it is today. We’re confident they will also continue to drive our future growth.

We’ve encountered an immense variety of technical challenges in our 10-year mission to build web applications the right way. It is always a delight to bring that experience to bear on a new project. If you have a project that might benefit from Caktus’ approach, don’t hesitate to get in touch.

Caktus Group: Using Tokens During Sprint Planning to Allocate Time

In January of 2016, Caktus transitioned from a general Agile development environment to a more focused Scrum environment. Part of this transition entailed moving from a targeted budget allocation approach per project, to a self-organizing, goal-based team structure with no obvious provision for tight, consistent control over project budgets.

If managing budgets is part of your job, you can appreciate how much our project managers struggled with this. We shifted to working in 2-week-long goal-based sprints, but still had to pay attention to budget constraints. We searched for a way to still effectively manage our budgets, but to do so without exercising unseemly amounts of un-Scrum-like “command and control”.

We also noticed that the development team members were having their share of budget-related struggles:

  • If we didn’t discuss hourly budgets in sprint planning, it was difficult for team members to gauge whether the stories they were committing to aligned both with the sprint goals and the project budgets.
  • However, if we did discuss hourly budgets in sprint planning, the teams tended to feel a lack of agency, which inhibited self-organization. This also tended to introduce an unwanted comparison of hours to story points.
  • It was common for the team to commit to stories that appeared to meet the sprint goals, and not realize until the end of the sprint that multiple team members had over-focused on the same project during the sprint. This could lead to overspending on some projects, while underspending on others.
  • The teams realized that without increased transparency regarding budgets, it was entirely possible for them to deliver sprint goals and satisfy client needs, but still come in over- or under-budget.

So how do we maintain our project budgets, empower the team, be truly Agile, and still deliver a working product at the end of each sprint that meets the goals?


Here’s our solution:

  1. Acquire supplies: colored tokens, large pads of paper, markers.
  2. Create a grid on a large piece of paper with enough boxes for your team’s budget sources. These can be projects, sprint activities, time off, etc., but should reflect the main buckets of time that your team members allocate to during sprints. Label each box.
  3. Designate a budget for each token to represent. Our tokens each represent 1 half day (or 4 hours).
  4. During sprint planning, each team member selects a color and takes the amount of tokens equal to their availability during the sprint. Full-time team members at Caktus get 20 tokens; part-time members take fewer as appropriate.
  5. Team members then allocate their tokens in the boxes as they see fit. At this time, the Product Owner (PO) can communicate any budget limitations for specific projects. The team resolves allocation conflicts amongst themselves.
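The steps above can be sketched in a few lines of Python. This is only an illustration, not a tool the post describes: the member names, project names, and budget figures are invented, and the helper names `tally` and `conflicts` are hypothetical. Only the 4 hours per token and 20 tokens per full-time member come from the text.

```python
from collections import defaultdict

HOURS_PER_TOKEN = 4    # from the post: one token is a half day
FULL_TIME_TOKENS = 20  # from the post: a full-time member's sprint


def tally(allocations):
    """Sum hours per box and record who contributed to each box.

    `allocations` maps a (hypothetical) member name to {box: tokens}.
    """
    hours = defaultdict(int)
    contributors = defaultdict(set)
    for member, boxes in allocations.items():
        for box, tokens in boxes.items():
            hours[box] += tokens * HOURS_PER_TOKEN
            contributors[box].add(member)
    return dict(hours), dict(contributors)


def conflicts(allocations, budgets):
    """Return warnings for two of the conflicts the exercise surfaces."""
    hours, contributors = tally(allocations)
    warnings = []
    for box, budget in budgets.items():
        if hours.get(box, 0) > budget:
            warnings.append(f"{box}: {hours[box]}h allocated, budget is {budget}h")
        if len(contributors.get(box, ())) == 1:
            warnings.append(f"{box}: only one contributor, no reviewer available")
    return warnings


# Invented example data: each member places all 20 of their tokens.
allocations = {
    "alex": {"Project A": 14, "Time off": 6},
    "sam": {"Project A": 12, "Project B": 8},
}
budgets = {"Project A": 80, "Project B": 40}  # hours, as communicated by the PO

for warning in conflicts(allocations, budgets):
    print(warning)
```

In practice the team resolves these conflicts by talking and moving tokens around the grid; the sketch only shows the arithmetic behind that conversation.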

This exercise helps the team identify and resolve these frequent conflicts:

  • Sprint goals not achievable within the budget
  • Over- or under-allocation by individual team members, due to PTO, enthusiasm, or other commitments
  • Project favoritism (everyone wants to work on one project)
  • Project only gets time from a single team member, leaving no space for pull request review or quality assurance from the rest of the team

Typically, our project managers (playing the role of Product Owner) go over the sprint goals prior to this exercise. Once the initial allocation is complete, the team progresses to planning their sprint, starting with the token exercise. This usually takes no more than 5-10 minutes. The grid with the tokens is left in play through the end of the sprint, allowing team members to reference and adjust it as needed. After sprint planning, the Scrum Master posts the initial allocations to the team’s Basecamp.

The budget allocations per person that come out of this exercise are not communicated to the team again during the sprint by the PO, Scrum Master, or other stakeholders. The team can choose to reference the allocations or not, as they see fit. If desired, the PO can compare the initial allocation data to the actual expenditures after the sprint ends. This type of comparison over multiple sprints can be useful in identifying trends that the PO can act upon. For example, if a project is consistently allocated more time in sprint planning than is actually spent on it, yet the goals are always completed, this could be an indication that the goals and/or stories for that project are too small and can be increased in scope.

All three of our Scrum teams are choosing to use this exercise for now. We’ve found that the token exercise provides budget transparency for the development teams, a mechanism for hourly budget management (without command and control) for the POs, and a starting point for team conversations about resource allocation. It also starts sprint planning with a hands-on activity that gets the team thinking and moving around.

Looking for more info about using Scrum and Agile in web development? Read about how we implemented Scrum in a client-services organization, or check out this post about using priority in Scrum to reduce team anxiety.

Caktus Group: ShipIt Day Recap Q2 2017

Once per quarter, Caktus employees have the opportunity to take a day away from client work to focus on learning or refreshing skills, testing out ideas, or working on open source contributions. The Q2 2017 ShipIt Day work included building apps, updating open source projects, trying out new tools, and more. Keep reading for the details.

PostgreSQL Performance

Erin used ShipIt Day to watch a tutorial on Postgres performance by Craig Kerstiens and test the Caktus website with some of the things she learned. She used the free pgAdmin III tool to try out some of Craig’s suggested database queries for performance monitoring. While drilling down into our website, she explored cache and index hit rates, reviewed query performance on our blog, and tested with pg_stat_statements to find the most expensive queries in aggregate across the database. Erin plans to use her findings to inform decisions impacting website performance.

GitHub Pull Requests Tool

Dan built a tool to help with GitHub pull requests. The tool watches a pull request until it’s ready to merge, then merges it for him. He built it from scratch using the requests library and GitHub API. The tool works by reloading the page occasionally to see if the request is ready to merge, and has tab title changes to make it easy to keep an eye on the status of the request.

Book Club Voting App

Charlotte M built an app to help the Caktus book club vote on their next book. Members can view and add books to the book list for the next election, then vote using an election interface. Once the books are selected, members vote by dragging and dropping the titles in their preferred order to submit votes.

As part of her project, Charlotte researched real-life voting systems and settled on the Borda Count method, preferring a consensus-based system over a majoritarian one.
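For readers unfamiliar with the Borda Count, here is a minimal sketch of one common variant. It is not Charlotte's code: the ballot format, the function name `borda_count`, and the book titles are assumptions made for illustration. With n candidates on a ballot, first place earns n - 1 points and last place earns 0.

```python
from collections import Counter


def borda_count(ballots):
    """Score ranked ballots; each ballot lists every candidate, best first."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - position
    return scores


# Invented ballots: every title gets exactly one first-place vote.
ballots = [
    ["Dune", "Emma", "Ulysses"],
    ["Emma", "Dune", "Ulysses"],
    ["Ulysses", "Emma", "Dune"],
]
scores = borda_count(ballots)
winner, points = scores.most_common(1)[0]
print(winner, points)
```

Each title receives one first-place vote here, so a plurality count would tie. The Borda Count picks the title that is ranked first or second on every ballot, which is the consensus-seeking behavior described above.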

Open Source Projects

Mark reviewed open source projects and worked on maintenance for the Sick Muse project, a front end for collectd. He wanted to make it work on Python 3 and ensure it works on the latest version of Tornado. While the back end worked, he found that the JavaScript/Bower-based front end broke and plans to remove Bower in future. As part of maintenance, he also worked to improve test coverage from 50% to 88%.

Test Case Management Tool Research

Gerald researched test case management tools that would integrate with JIRA, aiming to find something that would mesh with JIRA as well as sharing a similar visual style. He looked at qTest by QAsymphony, Xray for JIRA, and Zephyr for JIRA, settling on Zephyr for testing. Although it required him to create a few workarounds, Gerald got Zephyr up and running, demonstrating a few user stories and test cases.

For the next ShipIt Day, Gerald plans to look at QA metrics and reporting.

Python for Data Visualization

NC’s grad school coursework requires at least one semester of Python and data visualization. She spent this ShipIt Day building a media library, working on its exit function and on queries by media type, which would allow the user to get a list of all of the records that fit a given query.

User Stories for Agile Development

UX designer Basia read User Stories Applied: For Agile Software Development by Mike Cohn to brush up on her skills in writing user stories as a way to enhance the user story mapping techniques she leverages in discovery workshops. She shared a review of what a user story is and what it conveys, noting that each must be accompanied by acceptance criteria that will validate developed functionality.

She also walked through why user stories should be used in software development, the importance of working as a team to verbally communicate them, the usefulness of user stories in helping to defer details until the team is sure they’re needed, and how they discourage teams from pretending they know up-front everything there is to be known about the project. Most importantly for the Agile developer, she explained how user stories encourage iterative work.

Hello Ansible

Working on an Ansible project for ShipIt Day.

Neil, Dmitriy, and Jeff B worked as a team to start a bare-bones “hello world” project using Ansible and wrote an Ansible playbook for its deployment using nginx, gunicorn, Django, Postgres, and memcached.

As a newcomer to devops, Neil liked using Ansible and learned that it’s not as scary as it seems. Dmitriy liked working through the different steps and generally learning about Ansible. Jeff was on board as an advisor, and sees areas where more documentation can be written.

Project Tequila

Vinod worked together with the ‘Hello Ansible’ team on a similar project. Caktus hosts many client projects, all of which were initially created with varying deployment recipes. Many of these use Margarita, Caktus’s homegrown library of Salt recipes. Vinod decided to take one of these older projects (the Libya SMS project) and investigate how it could be migrated from Margarita to Tequila, our internally-developed library for Ansible. This worked surprisingly well (thanks to help from Jeff B, the primary author of Tequila) and by Friday afternoon, we had a single app server deployed successfully to a Vagrant box.


Sarah, Charlotte F, Elizabeth and Gannon created and reviewed posts for the Caktus blog, with topics including sprint planning, conference recaps, and project management. Keep an eye out for those in the next several weeks.

Triangulated Hearts

Kia Lam demonstrates her project for ShipIt Day Q2 2017.

Kia revisited a project built a year and a half ago using the Processing library for animation. The project uses the Triangulate and Minim libraries. The animation of a heart reacts to sound, including voice or a song, by changing color and shifting geometric lines.

For the next version Kia would like to make adjustments to functions built into the library. It’s currently audio reactive but not beat reactive, something she intends to work on.

Or, maybe she’ll animate the Caktus logo for our 10th anniversary party!

Until next time

As you can see, Cakti have been busy on a range of projects. Want to join us and work on sharp web apps? Check out the Caktus careers page for current openings.

Caktus Group: Building a Custom Block Template Tag

Building custom tags for Django templates has gotten much easier over the years, with decorators provided that do most of the work when building common, simple kinds of tags.

One area that isn't covered is block tags, the kind of tags that have an opening and ending tag, with content inside that might also need processing by the template engine. (Confusingly, there's a block tag named "block", but I'm talking about block tags in general.)

A block tag can do pretty much anything, which is probably why there's not a simple decorator to help write them. In this post, I'm going to walk through building an example block tag that takes arguments that can control its logic.

Django Documentation

There are a couple of pages in the Django documentation that you should at least scan before continuing, and will likely want to consult while reading:

What our example tag will do

Let's write a tag that can make simple changes to its content, changing occurrences of one string to another. We'll call it replace, and usage might look like this:

{% replace old="dog" new="cat" %}
My dog is great.  I love dogs.
{% endreplace %}

which would end up rendered as My cat is great.  I love cats..

We'll also have an optional numeric argument to limit how many times we do the replacement:

{% replace 1 old="dog" new="cat" %}
My dog is great.  I love dogs.
{% endreplace %}

which we'll want to render as My cat is great. I love dogs..

Parsing the template

The first thing we'll write is the compilation function, which Django will call when it's parsing a template and comes across our tag. Conventionally, such functions are called do_<tagname>. We tell Django about our new tag by registering it:

from django import template

register = template.Library()

def do_replace(parser, token):
    pass  # placeholder; the body is developed below

register.tag('replace', do_replace)

We'll be passed two arguments, parser which is the state of parsing of the template, and token which represents the most recently parsed token in the template - in our case, the contents of our opening template tag. For example, if a template contains {% replace 1 2 foo='bar' %}, then token will contain "replace 1 2 foo='bar'".

To parse that token, I ended up writing the following method as a general-purpose template tag argument parser:

from django.template.base import FilterExpression, kwarg_re

def parse_tag(token, parser):
    """
    Generic template tag parser.

    Returns a three-tuple: (tag_name, args, kwargs)

    tag_name is a string, the name of the tag.

    args is a list of FilterExpressions, from all the arguments that didn't look like kwargs,
    in the order they occurred, including any that were mingled amongst kwargs.

    kwargs is a dictionary mapping kwarg names to FilterExpressions, for all the arguments that
    looked like kwargs, including any that were mingled amongst args.

    (At rendering time, a FilterExpression f can be evaluated by calling f.resolve(context).)
    """

    # Split the tag content into words, respecting quoted strings.
    bits = token.split_contents()

    # Pull out the tag name.
    tag_name = bits.pop(0)

    # Parse the rest of the args, and build FilterExpressions from them so that
    # we can evaluate them later.
    args = []
    kwargs = {}
    for bit in bits:
        # Is this a kwarg or an arg?
        match = kwarg_re.match(bit)
        kwarg_format = match and match.group(1)
        if kwarg_format:
            key, value = match.groups()
            kwargs[key] = FilterExpression(value, parser)
        else:
            args.append(FilterExpression(bit, parser))

    return (tag_name, args, kwargs)

Let's work through what that does.

Calling split_contents() on the token is like calling .split(), but it's smart about quoted parameters and will keep them intact. We get back bits, a list of strings representing the parts of the template tag invocation, very much like sys.argv gives us for running a program, except that no quotation marks have been stripped away.

The first element in bits is our template tag name itself. We remove it because we don't really need it for parsing the args, but save it for generality.

Next we work through the arguments, using the same regular expression as Django's template library to decide which arguments are positional and which are keyword arguments.

The regular expression for keyword arguments also splits on the =, so we can extract the keyword and the value.
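To see how that split behaves, here is a small stand-alone sketch that does not need Django installed. The pattern is an assumption: it is written to mirror what Django exposes as django.template.base.kwarg_re, so check the source of your Django version before relying on it. The helper name `classify` is made up for this illustration, and it keeps plain strings where the real parser builds FilterExpressions.

```python
import re

# Assumption: this mirrors django.template.base.kwarg_re.
# Group 1 is the optional keyword, group 2 is the value.
KWARG_RE = re.compile(r"(?:(\w+)=)?(.+)")


def classify(bits):
    """Split tag arguments into positional and keyword strings."""
    args, kwargs = [], {}
    for bit in bits:
        key, value = KWARG_RE.match(bit).groups()
        if key:
            kwargs[key] = value
        else:
            args.append(bit)
    return args, kwargs


# The arguments of {% replace 1 old="dog" new="cat" %}, minus the tag name.
args, kwargs = classify(["1", 'old="dog"', 'new="cat"'])
print(args)
print(kwargs)
```

Note that the values still carry their quotation marks at this stage; turning them into usable values is the job of the FilterExpression step.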

We'd like our argument values to support literal values, variables, and even applying filters. We can't actually evaluate our arguments yet, since we're just parsing the template and don't have any particular template context yet where we could look for things like variables. What we do instead is construct a FilterExpression for each one, which parses the syntax of the value, and uses the parser state to find any filters that are referred to.

When all that is done, this method returns a three-tuple: (<tagname>, <args>, <kwargs>).

Our replace tag has two required kwargs and an optional arg. We can check that now:

from django.template import TemplateSyntaxError

# ...

def do_replace(parser, token):
    tag_name, args, kwargs = parse_tag(token, parser)

    usage = '{{% {tag_name} [limit] old="fromstring" new="tostring" %}} ... {{% end{tag_name} %}}'.format(tag_name=tag_name)
    if len(args) > 1 or set(kwargs.keys()) != {'old', 'new'}:
        raise TemplateSyntaxError("Usage: %s" % usage)

Note again how we haven't hardcoded the tag name.

Let's pull our limit argument out of the args list:

if args:
    limit = args[0]
else:
    limit = FilterExpression('-1', parser)

If no limit was supplied, we default to -1, which will indicate later that there's no limit. We wrap it in a FilterExpression so we can just call limit.resolve(context) without having to check whether limit is a FilterExpression or not.
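The choice of -1 lines up with Python's own str.replace, whose optional count argument treats a negative value as "replace every occurrence". Whether the finished tag ends up delegating to str.replace is an assumption here, since the rendering code has not been written yet at this point, but the sketch shows why -1 is a convenient sentinel:

```python
text = "My dog is great.  I love dogs."

# A negative count means "no limit" for str.replace.
unlimited = text.replace("dog", "cat", -1)

# A positive count caps the number of replacements, working left to right.
limited = text.replace("dog", "cat", 1)

print(unlimited)
print(limited)
```

With this convention the resolved limit can be passed straight through as the count, with no special case for "no limit supplied".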

We can't check the values here. They might depend on the context, so we'll have to check them at rendering time.

This is all similar to what we might do if we were writing a non-block tag without using any of the helpful decorators that hide some of this detail. But now we need to deal with some unknown amount of template following our opening tag, up to our closing tag. We need to ask the template parser to process everything in the template until we get to our closing tag:

nodelist = parser.parse(('endreplace',))

We get back a NodeList object (django.template.NodeList), which represents a list of template "nodes" representing the parsed part of the template, up to but not including our end tag.

We tell the parser to just ignore our end tag, which is the next token:

parser.delete_first_token()


Now we're done parsing the part of the template from our opening tag to our closing tag. We have the arguments to our tag in limit and kwargs, and the parsed template between our tags in nodelist.

Django expects our function to return a new node object that stores that information for us to use later when the template is rendered. We haven't written the code for our node object yet, but here's how our parsing function will end:

return ReplaceNode(nodelist, limit=limit, old=kwargs['old'], new=kwargs['new'])

Reviewing what we've done so far

Each time Django comes across {% replace ... %} while parsing a template, it calls do_replace(). We parse all the text from {% replace ... %} to {% endreplace %} and store the result in an instance of ReplaceNode. Later, whenever Django renders the parsed template using a particular context, we'll be able to use that information to render this part of the template.

The node

Let's start coding our template node. All we need it to do so far is to store the information we got from parsing part of the template:

from django import template

class ReplaceNode(template.Node):
    def __init__(self, nodelist, limit, old, new):
        self.nodelist = nodelist
        self.limit = limit
        self.old = old
        self.new = new


As we've seen, the result of parsing a Django template is a NodeList containing a list of node objects. Whenever Django needs to render a template with a particular context, it calls each node object, passing the context, and asks the node object to render itself. It gets back some text from each node, concatenates all the returned pieces of text, and that's the result.
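
The rendering loop described above can be sketched without Django at all. The class names below (TextNode, VariableNode, ToyNodeList) are made up for illustration and a plain dict stands in for the template context; Django's real classes do considerably more.

```python
class TextNode:
    """A node that always renders the same literal text."""

    def __init__(self, text):
        self.text = text

    def render(self, context):
        return self.text


class VariableNode:
    """A node that looks its value up in the context at render time."""

    def __init__(self, name):
        self.name = name

    def render(self, context):
        # Fail silently: a missing variable renders as an empty string.
        return str(context.get(self.name, ""))


class ToyNodeList(list):
    """A list of nodes that renders by concatenating each node's output."""

    def render(self, context):
        return "".join(node.render(context) for node in self)


nodes = ToyNodeList([TextNode("Hello, "), VariableNode("name"), TextNode("!")])
print(nodes.render({"name": "world"}))  # Hello, world!
```

The same parsed list can be rendered again and again with different contexts, which is why parsing and rendering are kept separate.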

Our node needs its own render method to do this. We can start with a stub:

class ReplaceNode(template.Node):
    def render(self, context):
        return "result"

Now, let's look at those arguments again. We've mentioned that we couldn't validate their values before, because we wouldn't know them until we had a context to evaluate them in.

When we code this, we need to keep in mind Django's policy that in general, render() should fail silently. So we program defensively:

from django.utils.html import conditional_escape

class ReplaceNode(template.Node):
    def render(self, context):
        # Evaluate the arguments in the current context
        try:
            limit = int(self.limit.resolve(context))
        except (ValueError, TypeError):
            # Fail silently: a bad limit means "no limit"
            limit = -1

        from_string = self.old.resolve(context)
        to_string = conditional_escape(self.new.resolve(context))
        # Those should be checked for stringness. Left as an exercise.

Also note that we conditionally escape the replacement string. That might have come from user input, and can't be trusted to be blindly inserted into a web page due to the risk of Cross Site Scripting.
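
Both defensive steps can be tried outside Django. In this rough sketch, coerce_limit is a hypothetical helper name, and the standard library's html.escape stands in for Django's conditional_escape (which, unlike html.escape, leaves strings already marked safe alone).

```python
from html import escape


def coerce_limit(value):
    """Return value as an int, or -1 (meaning "no limit") if it can't be converted."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return -1


print(coerce_limit("3"))     # 3
print(coerce_limit("many"))  # -1
print(coerce_limit(None))    # -1

# Untrusted replacement text is neutralized before it reaches the page.
print(escape("<script>alert(1)</script>"))
# &lt;script&gt;alert(1)&lt;/script&gt;
```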

Now we'll render whatever was between our template tags, getting back a string:

content = self.nodelist.render(context)

Finally, do the replacement and return the result:

# mark_safe is imported from django.utils.safestring
content = mark_safe(content.replace(from_string, to_string, limit))
return content

We've escaped our own input, and the block contents we got from the template parser should already be escaped too, so we mark the result safe so it won't get double-escaped by accident later.
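
The -1 default works because of how Python's str.replace treats its optional third argument: a negative count means "replace every occurrence". A quick check in plain Python, with a sample string invented for the example:

```python
text = "one fish, two fish, red fish"

no_limit = text.replace("fish", "cat", -1)  # negative count: replace all
first_only = text.replace("fish", "cat", 1)  # replace just the first match
none_at_all = text.replace("fish", "cat", 0)  # zero count: replace nothing

print(no_limit)     # one cat, two cat, red cat
print(first_only)   # one cat, two fish, red fish
print(none_at_all)  # one fish, two fish, red fish
```

Note that a limit of 0 replaces nothing at all, so it is not a synonym for "no limit".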


We've seen, step by step, how to build a custom Django template tag that accepts arguments and works on whole blocks of a template. This example does something pretty simple, but with this foundation, you can create tags that do anything you want with the contents of a block.

If you found this post useful, we have more posts about Django, Python, and many other interesting topics.

Caktus GroupCaktus Activities at PyCon 2017

It’s almost time for PyCon and the team here at Caktus is ready to meet other attendees. Where and how can you find us?

At the booth

Keep an eye out for the Caktus booth in space 232 at the conference expo from May 18-20. Members of our team look forward to welcoming you there, including Colin, David, Julie, and Whitney.

For those interested in learning how Caktus delivers web apps faster, we’ll have copies of our Shipping Faster white paper on hand for review. There will be a prize draw for a Polaroid Cube+ mini action camera, so stop by the booth and let us scan your badge to be entered. The winner will be announced on May 20. Follow Caktus on Twitter for live updates.

Caktus giveaway prizes for PyCon 2017

We’ll also be doing an early giveaway for wireless headphones on May 18 during the opening reception at the expo. Here’s how you can win:

  1. Take a picture of yourself at the Caktus booth
  2. Tweet it to us @CaktusGroup with hashtag #CaktusPyCon

The winner will be randomly selected and announced around 8pm the same evening.

In addition to the prizes we’ll have Caktus swag. Be among the first to get one of our special edition 10th anniversary water bottles!

Caktus swag for PyCon 2017

At the talks and events

Several members of our development team are attending as well, so look for them at different talks and events. While some of them are repeat attendees, Dmitriy is looking forward to attending for the first time and listening to talks like Immutable Programming - Writing Functional Python.

Developer Erin is interested in listening to Cython as Secret Weapon for Efficiency. She’s also looking forward to volunteering as a TA at Young Coders: Outside In because of how much she’s enjoyed being a Django Girls coach, and is excited to share her enthusiasm for code with them.

Sarah went to PyCon last year and was very impressed by the inclusiveness of the community. She said, “Everyone is so willing to teach, help, and share with other attendees. I'm looking forward to the testing-centered talks.”

If you’re planning on heading out for the 5k fun run, look out for Mark. He’s also interested in several talks, including Library UX: Using abstraction towards friendlier APIs.

After hours

We’ll be hosting an after hours event on Friday, May 19. Join us for an exclusive happy hour gathering to enjoy some light refreshments. Stop by our booth to get on the invite list.

At the job fair

Caktus is hiring! We’re looking for a Django Web Developer, so stop by table 24 on Sunday, May 21 to talk to members of the team about what it’s like to work with Caktus. We have offices in Durham, NC and Baltimore, MD.

See you soon!

There’s less than a month to go and we can’t wait to meet you. Be sure to contact us to let us know that you’re interested in speaking with us about something in particular, and we’ll be sure to set up a time.

Tim HopperLike most great mathematicians, he expects universal precision

From the Autobiography of Benjamin Franklin:

Thomas Godfrey, a self-taught mathematician, great in his way, and afterward inventor of what is now called Hadley's Quadrant. But he knew little out of his way, and was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in every-thing said, or was for ever denying or distinguishing upon trifles, to the disturbance of all conversation.

I'm a recovering Godfrey Precisionist.

Philip SemanchukThanks, Science!

I took part in the Raleigh March for Science last Saturday. For the opportunity to learn about it, participate in it, photograph it, share it with you — oh, and also for, you know, being alive today — thanks, science!

Caktus GroupCalling all Cat Herders: New Meetup for Digital Project Managers

When I first became a digital project manager (DPM), I struggled to find relevant resources. A ton of information was available on traditional project management, but not much specifically on digital project management. Eventually, I connected with another DPM in my organization and we quickly became friends and confidants. She opened my eyes to the Digital PM Summit, a new conference targeted at DPMs, which was ultimately the inspiration for my new Meetup group.

I attended the Digital PM Summit three times (and even presented last year!). The conference opened up a whole new world to me, one full of practical resources and friendly, helpful contacts. Ever since I first attended the conference, I wanted to replicate some piece of it to connect with DPMs in the Triangle area. It took a few years, but the Triangle Digital Project Managers Meetup has finally come to fruition, thanks in large part to Caktus Group.

The new Meetup is focused on providing opportunities for DPMs in the Research Triangle Area (and beyond) to network, share knowledge, and support each other. No certifications are required to join, and it doesn’t matter which process (or lack thereof) you use -- Waterfall, Agile, Scrum, or your own Special Secret Sauce. Some of our meetings will be based on a professional topic, while other meetings will be more social. Our goal is to meet at least once every two months.

Project Management is a Team Effort

Over the past few years, I’ve found that many DPMs work solo, or have few cohorts within their organization. At Caktus, we have three full-time project managers and one full-time project management director. PMs at Caktus also serve in the more defined Scrum role of product owner. I’ve been a part of the Caktus team since September 2016, and it’s the first time I’ve been lucky enough to work with a team of DPMs.

When I worked alone, it was often intimidating and even frustrating, because I didn’t have anyone to bounce ideas off of and there was no PM precedent or process already in place. I felt like I was constantly reinventing the wheel. Working in a silo makes connecting with a group of similar professionals even more important, in order to share ideas, stay current, and grow your skills. For these reasons, my team supported my idea to create a Meetup and Caktus, which believes in supporting community involvement, provided me with the time and resources needed to do so.

The Triangle DPM kickoff meeting was held in the Caktus Tech Space at the end of February. The group was small, but passionate and experienced. One attendee even joined remotely via web conference. It was perfect, exactly the kind of inclusive, friendly group that I wanted to bring together. A group where even if you can’t attend in person, you can join remotely and still be home with your kids. A group where everyone is welcome, regardless of industry or job title -- and DPMs have a variety of titles like Digital Producer, Product Owner, or even Account Director! If you manage anything digital, from the Django web development that we do at Caktus to online marketing campaigns or even video games, you’re welcome to join the Meetup.

Come join fellow cat herders at a future meeting. Details will be posted on the Triangle DPM Meetup page and the Caktus events page.

Caktus GroupProduct Discovery Part 2: From User Contexts to Solutions

In the first installment of this two-part series, I introduced product discovery as the process of building a shared understanding about the product between stakeholders and the product team, which helps you make better decisions about what to build. I also suggested that we look at product discovery as a four-step process:

  1. Frame the problem
  2. Identify the users
  3. Map out user actions, tasks, and workflows
  4. Sketch out ideas

Having previously discussed how to frame the problem and identify the users, let’s move on to mapping out user tasks and workflows, and sketching out solutions.

Map out user tasks and workflows

User-centered software design and development arose from the recognition that we must account for human capabilities and characteristics when we build systems and technologies. That’s why so much emphasis is placed on understanding users through research and on empathizing with them by employing tools such as personas and proto-personas.

In order to understand the users, you identify their demographic, psychological, and behavioral characteristics, as well as their goals, needs, pain points, and possible solutions to their challenges. And to place that information in context, you build a narrative within which your users function as they use your product.

User task flowchart

At the very least, when building a product you will create a user flowchart to capture tasks the product should support, decisions the user will be making within the system, user inputs, and system outputs.

While a user flowchart is a useful and succinct way to diagram tasks that need to be supported by an application, there are other methods of capturing user actions that are more story-based and thereby help build a richer representation of user behaviors.

Agile user story mapping

Agile user story mapping is a visualization technique introduced by Jeff Patton that depicts a user’s path through an application. It can also be used to map out user workflows outside of the application.

In Agile software development, a user story is a brief description of a desired feature that is written from the perspective of an end-user, and that captures user outcomes that the feature is meant to support.

Mapping user stories is a group activity in which teams build a narrative about how users engage with software. Using sticky notes, stakeholders and product teams map out user workflows, tasks, task variations, and sub-tasks in chronological order from left to right, and in order of priority or detail from top to bottom.

The resulting user story map is an artifact that offers a quick look at the application’s big picture while preserving a level of detail that can be leveraged to create a backlog. It also shows feature prioritization and can assist in the estimation process.

A fragment of a user story map done at Caktus for an animal rescue

User experience mapping

User experience mapping (or user journey mapping) is the process of capturing and communicating complex user interactions across various channels through which the user comes in contact with your product and/or company. It helps build an understanding of user actions, feelings, thoughts, pain and satisfaction points that go beyond the realm of the application itself. The resulting experience (or journey) map provides an omnichannel representation of user experience with touch points at which the user interacts with your product and opportunities to create new or better experiences.

A fragment of a user experience (journey) map created at Caktus for an animal rescue

Narrative arc story mapping

I recently learned about yet another story mapping technique. While I have not had a chance to try it out in my workflow, I found it intriguing and thought it worth sharing.

This approach to story mapping, popularized by Donna Lichaw, relies on a narrative arc as a framework to develop three types of stories — the concept story (the big picture story), the origin story (how your product will be discovered by users), and the usage story (how people use your product). The concept and origin stories are perfect tools for discovering new products, while the usage story can be leveraged to understand the current use and engagement patterns of your product and to identify opportunities for improvements.

Sketch out solutions

By this time in the process, you should know what problem you’re solving and for whom you’re solving it. You should also have a pretty good idea of what workflows and experiences your application should support. Now it’s time to start discussing specific solutions.

Whether in a session involving stakeholders or with your internal team, you can conduct activities that will help you home in on the right idea. Here are a few suggestions:

  • Ideation: Work individually to produce as many ideas as possible (either by sketching or describing them in writing), then as a team select a small list of solutions that best address the problem at hand.
  • Brainstorming: Work as a team to generate a wealth of ideas by finding inspiration in each other’s concepts, then work together at identifying the best ones.
  • Prototyping: Build paper prototypes, sketch wireframes (on paper or digitally), or code a simple interface to start validating your ideas. You can test your prototypes with team members or recruit a few users and conduct lightweight usability testing.

Get the job done

Each project is unique and product discovery should be tailored to the needs of each project. Whether you develop personas or proto-personas, draw user flowcharts, map user stories, or create an omnichannel user experience map depends on the product you’re building, the resources you have, and the project management paradigm your team follows.

For Agile teams, the lean proto-persona strategy combined with small scale user research and agile story mapping can build a strong foundation of product discovery. But Agile teams will also benefit from the omnichannel perspective of a user journey map that places the product in the context of a broader ecosystem.

At Caktus we often kick off a project with a discovery workshop. The workshop is an opportunity for our team and the client team to get together and build a shared understanding of the product to be built. Working off of existing data or making assumptions where data are lacking, we frame the problem, identify user types, and build personas or proto-personas. In the process, we also identify knowledge gaps and may recommend small scale user research as appropriate. On projects where the problem is well framed and users understood, we work with stakeholders to map out user actions, tasks, and workflows using a technique that best fits the needs and the budget of a given project.

We come out of a workshop with a summary of product goals, identified target user types, and a list of the most valuable content or most valuable product features. If the workshop includes Agile user story mapping, the added benefit is an artifact that can easily be translated into a prioritized backlog. If, during the workshop, we map users’ entire and omnichannel experience, we gain a breadth of understanding of the user journey that goes beyond the application itself, and can support the development of current and future projects.

By establishing in this way a solid, shared understanding of stakeholders’ and users’ needs before any code is committed, we increase our chances of making the right decisions about what to build. And by doing so, we reduce the long-term cost because we reduce risks and decrease the need for rework down the road.

Don't forget to read part 1 of this blog post to learn about how to get started with product discovery.

Caktus GroupProduct Discovery Part 1: Getting Started

When setting out to build a new website or web application, it is a good idea to build a shared understanding of the product between stakeholders and the product team. Through research and collaborative activities that aim to answer questions about the product, its goals, and its users’ needs, the stakeholders and product team discover the full breadth and depth of the application to be built, as well as contexts and implications that need to be considered for the product to be successful. We call this process product discovery.

A study conducted by the Institute of Electrical and Electronics Engineers (IEEE) found that software development projects fail when they do not address stakeholders’ needs adequately. It has also been shown that 50% of programmers’ time is spent on avoidable rework. By devoting resources upfront to build a solid, shared understanding of project goals, users, and user contexts, you can ensure that you will be building the right solution and minimizing waste.

Product discovery can be approached as a four-step process:

  1. Frame the problem
  2. Identify the users
  3. Map out user actions, tasks, and workflows
  4. Sketch out ideas

Steps 1 and 2, framing the problem and identifying the users, get you started with understanding the ramifications of your product. Step 3, mapping out user tasks and workflows, is a way to define user contexts and begin exploring solutions. Finally step 4, sketching out ideas, is a step toward articulating a solution.

In this article, I will focus on steps 1 and 2. Steps 3 and 4 will be covered in the second installment of this two-part series.

Frame the problem

When framing the problem(s), you are striving to answer the following questions:

  • What problem(s) am I trying to solve?
  • For whom am I solving this problem?
  • Why am I solving this problem?
  • What does success mean and how can I measure it?
  • What constraints do I need to accommodate?

Answers to these questions may be drawn from your business analytics and existing user or customer research. Data that inform your answers may come from:

  • Competitive audit
  • SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis
  • User or customer interviews or surveys that reveal pain points, needs, and goals
  • Existing use and engagement patterns
  • Points of drop-off or failure, etc.

If data in those areas are lacking, you may start out by making assumptions and stating hypotheses that you will later put to test.

Identify the users

You can identify your users by asking questions such as:

  • What are the demographic, psychological, and behavioral characteristics of the users?
  • What are users’ goals, needs, and pain points?
  • What user outcomes do I need to support?
  • What are the workflows my users employ?
  • How do users interface with my product?
  • How do users leverage technology in their life and/or work?
  • What types of solutions would best serve the users?
  • Other questions about your users’ lives and work, and their interactions with products similar to yours.

You can gain answers to these questions by conducting user research including:

  • Usability testing (observing people using an existing product or a competitor’s product)
  • User interviews (talking to users directly about their workflows, goals, needs, and pain points)
  • User surveys (having users answer questions, usually online)
  • Contextual inquiry (observing users in the context in which they use or would use your product)

Armed with the data, you can then develop user profiles called personas. Personas are tools that allow you to consolidate information about your users into a succinct format, and, perhaps more importantly, give your users a human face. Personas are documented user profiles. But they are also a device that helps you identify with your users and develop empathy for them. It is particularly true in the case of so-called proto-personas — user profiles not based on actual data, but rather on assumptions and guesses you make about your users.

Sample persona created at Caktus for an animal rescue project.

A sample proto-persona done at Caktus for an animal rescue

Personas (or proto-personas) include information grouped into categories, and there are multiple suggestions about good categories to use.

In Lean UX, Jeff Gothelf recommends grouping information about users into:

  • Sketch and name
  • Behavioral and demographic information
  • Pain points and needs
  • Potential solutions

Ladies that UX suggest the following information categories to build a persona:

  • Bio and demographics
  • Emotions and behaviors
  • Goals
  • Solutions

In User Story Mapping, Jeff Patton shares a persona template that includes:

  • User type and role
  • Name and sketch
  • Context
  • About
  • Implications

Next, strive for deeper understanding and explore solutions

Once you’ve gained an understanding of the problem you are solving and the characteristics of the users, you’re ready to dive deeper into user contexts and to start considering solutions. In the next blog post, I will discuss techniques that can be leveraged to explore user contexts and ways to start identifying solutions. Stay tuned!

This blog post continues with Part 2: From User Contexts to Solutions, to be released later this week.

Tim HopperMetawork is more interesting than work

This Software Engineering Radio interview with Neal Ford on Success Skills for Architects is full of gems about building effective software.

He talks a lot about how coders love to solve problems, and that love can lead them to invent interesting, but unnecessary, problems to solve. This is true.

Metawork is more interesting than work. It's so hard to get back to simplicity, because we love complicated little puzzles to solve, so we keep overengineering everything.

Anyone who's developing software would benefit from listening.

Tim HopperTowards Reducing Distractions while Working

Staying focused while working in front of a computer and within reach of a smartphone is hard.

In 2017, teaching people to focus is becoming an industry.

I've been trying to rethink distractions in my own life, particularly in my work environment. Here are some things that have helped:

Working from Home

Working in an office, especially an open-floor plan office, is disastrous for staying focused. DeMarco and Lister wrote about this in Peopleware 30 years ago, and yet open offices are the norm for startups today.

I'm much more productive by working from home in my quiet office or on my back patio. I'm finally able to spend my time thinking about hard problems rather than ways of silencing Constant Throat Clearer or Perpetual Annoying Laugher.


Every app and website these days wants to send you notifications. I'm aggressive about reducing notifications down to those that I need to see, and I let almost nothing notify me with sound. I use Do Not Disturb mode on my phone and Mac whenever I need to stop notifications altogether.


Slack has become the new normal for company communication. Some would say Slack itself is ruining our focus, but having it regularly available has been essential for my own work.

I've come up with a few ways to take control of Slack:

  1. Only show "My unread, along with everything I've starred" in the sidebar. See Michael Lopp's excellent post on Slack for more here.
  2. Enable notifications selectively.
  3. Sign out of distracting avocational Slacks.

Social Media

I've started using an app called Focus to block distracting websites (including Facebook and Twitter.com) and apps on my work computer from 9 AM to 5:30 PM. I use Focus's scheduling feature so blocking isn't optional for me.

I've decided not to block Tweetbot. Though it can be distracting, Twitter is an invaluable way for me to learn from my professional colleagues, bounce ideas off of them, and have a good laugh.

On my iPhone, iPad, and personal laptop, I've started using Freedom to block all social media during the day. This has stopped me from instinctively checking Instagram every time I walk to the bathroom or get stuck on a hard problem. I highly recommend it.1

I also use Freedom to block social media for the first hour I'm up in the morning and before I go to bed.


I have two main tactics to keep email from being distracting.

  • I aggressively unsubscribe from mailing lists and ads.
  • I use Sanebox to filter low priority messages out of my inbox.

When emails only need a brief reply, I tend to write responses as soon as possible. At the moment, I'm trying to break people of the expectation that I'll respond quickly. A service like Boomerang, which lets me write emails now and have them sent later, helps here.


Long-form reading at the computer is terrible for comprehension. As Doug Lemov has argued, you have to get away from your computer and other devices to read deeply. I do this by printing articles or reading on my iPad with Freedom blocking enabled. I take my printouts or iPad and walk away from my desk to read.

Todo Items

I'm a firm believer in the Getting Things Done principle of reducing the cognitive overhead of tracking to-do items in my head. I use Omnifocus for task management. Mail Drop and this Alfred workflow help me to quickly add tasks to my Omnifocus inbox. When I think of something I need to take care of outside of work, I drop that thought into Omnifocus; this keeps those personal to-do items from distracting me while I'm working.

Staying focused is hard. I'm still learning how to do it well, and I'm sure I'm not the only one struggling to improve here. If you have any tips to share, I'd love to hear them!

  1. I can't use Freedom on my work computer, because it acts as a VPN which conflicts with my work VPN. 

Philip SemanchukHow to Measure Anglo-Saxonicity – With a Ruler or Yardstick?

Summary (Nutshell)

This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.

Background (Milieu)

English is a Germanic language with Scandinavian influence, with a big layer of Old French poured on top. That Old French (Anglo Norman French, to be specific) was principally derived from Latin, so English is a hybrid between two major Indo-European language groups. Those mongrel origins are a big part of why English is messy and rich.

French was introduced to English as the language of conquerors and nobility. French was also the language of some European royalty in the 18th and 19th century, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated; more on that last pair in a moment).

For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.

William of Normandy in 1067, addressing his English subjects. (Image: a still of the French taunting King Arthur from Monty Python and the Holy Grail.)

Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1)  which is all Anglo-Saxon.

Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPR, Gemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.

The Project (The Work)

I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.

I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.
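
To make the three-way categorization concrete, here is a minimal sketch. It is not the tool described in this post: the lookup table is a toy built only from word pairs mentioned above, the names TOY_ETYMOLOGY and categorize are invented, and the tokenizer is deliberately naive (it would split "don't" into two pieces).

```python
import re
from collections import Counter

# Toy lookup built from word pairs mentioned in this post.
# A real etymological database would be far larger.
TOY_ETYMOLOGY = {
    "sea": "Anglo-Saxon", "ocean": "non-Anglo-Saxon",
    "sweat": "Anglo-Saxon", "perspire": "non-Anglo-Saxon",
    "free": "Anglo-Saxon", "emancipated": "non-Anglo-Saxon",
    "make": "Anglo-Saxon", "create": "non-Anglo-Saxon",
}


def categorize(text):
    """Count words per etymological category; unlisted words are 'unknown'."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(TOY_ETYMOLOGY.get(word, "unknown") for word in words)


counts = categorize("Free the sea! Emancipated ocean, perspire and sweat.")
total = sum(counts.values())
for category in ("Anglo-Saxon", "non-Anglo-Saxon", "unknown"):
    print(category, counts[category], "of", total)
# Anglo-Saxon 3 of 8
# non-Anglo-Saxon 3 of 8
# unknown 2 of 8
```

Even in this toy, common function words like "the" and "and" land in "unknown" simply because they are missing from the lookup, which shows how much the results depend on database coverage.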

For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?

All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.

The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.

A graphical representation of how the etymological ratio of Alice in Wonderland changes as one progresses through the book

This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.

A graphical representation of the counts of words by parts of speech and etymology in Alice in Wonderland

The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.

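|            | Total | %age of All Words | Anglo-Saxon | non-Anglo-Saxon | Unknown    |
|------------|-------|-------------------|-------------|-----------------|------------|
| All Words  | 26624 | 100%              | 18233 (68%) | 3812 (14%)      | 4579 (17%) |
| Unique     | 3528  | 13%               | 1354 (38%)  | 899 (25%)       | 1275 (36%) |
| Nouns      | 8522  | 32%               | 4521 (53%)  | 2354 (27%)      | 1647 (19%) |
| Verbs      | 5479  | 20%               | 2994 (54%)  | 565 (10%)       | 1920 (35%) |
| Adjectives | 1639  | 6%                | 896 (54%)   | 375 (22%)       | 368 (22%)  |
| Adverbs    | 1974  | 7%                | 1348 (68%)  | 420 (21%)       | 206 (10%)  |
| Other      | 9010  | 33%               | 8474 (94%)  | 98 (1%)         | 438 (4%)   |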

Observations (What I See)

There are some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.

For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2)  For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.

We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.

At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.

Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)

As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.

Future (What’s to Come)

As I said in the introduction, this is an early look at a work in progress. Here are some of the things I’d like to add:

  • Better etymological data
  • Large scale comparisons of text to look for trends (across authors, genres, etc.)
  • More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.
  • Open sourcing
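The numeric summary mentioned in the list could look something like the sketch below. It makes several assumptions: the text is split into fixed-size chunks, the `ANGLO_SAXON` set is a hypothetical stand-in for real etymological data, and the population standard deviation (`pstdev`) is used. A sample standard deviation would be an equally reasonable choice.

```python
from statistics import mean, pstdev

# Hypothetical stand-in for a real etymological database.
ANGLO_SAXON = {"free", "at", "last"}


def chunks(words, size):
    """Split a list of words into consecutive chunks of at most `size` words."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


def anglo_saxon_share(words):
    """Percentage of the given words found in the Anglo-Saxon set."""
    return 100 * sum(word in ANGLO_SAXON for word in words) / len(words)


words = ["free", "at", "last", "free",
         "emancipation", "proclamation", "at", "last"]

# One percentage per chunk, then a two-number summary of the whole text.
shares = [anglo_saxon_share(chunk) for chunk in chunks(words, 4)]
average = mean(shares)
spread = pstdev(shares)
```

Two texts could then be compared by their `(average, spread)` pairs rather than by eye.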

If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam, which is yet another Viking influence on England.


Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while  “note” comes from Old French/Latin.

1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.

2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.

Caktus GroupLearning to ask the right questions, or people

I ask a lot of questions as a developer. Some of them have been more basic, like ‘How do I import a Python function from one file into another?’, and some more complex, like ‘How should we take an API request and return a dynamically-generated PDF as a response?’

As I have continued to learn, a couple things have been particularly beneficial:

  1. Learning to Google the question to find the answer
  2. Finding more advanced developers to answer my questions and guide my thinking

As I have grown as a developer, I have improved at knowing when to use each resource, and each remains an important part of my growth. I’ve gathered some points here that have been helpful to me during my learning, as well as some suggestions on how to help others become sharper developers.

At the beginning:

When learning to develop we oftentimes have direct questions that can be answered by another person, or by searching the internet. I remember having questions such as ‘What is a terminal and a shell?’ or ‘How do I know if something is a Python file?’. An experienced developer can answer these questions quickly, but can also point a beginner in the direction of how to find these answers on the internet. Some important things I learned at this point:

  • The answer to my question is most likely on the internet (Stack Overflow, Stack Exchange, etc.) and I should get better at finding it
  • Asking a more experienced developer may be faster, but figuring out the answer on my own can be more useful for knowing how to find answers successfully in the future
  • It is helpful to have a person help me think through the implications of my questions. Thinking critically through my question and what I’m trying to solve is a lot more important than the specific answer

For people new to development, I recommend trying to Google your questions first. It may take some time to figure out how to look things up on Google or Stack Overflow, but these are useful skills that even experienced developers use every day. I also recommend finding an experienced programmer to provide any further clarification and direction - look out for meetups, or, if possible, ask a friend. For some Python meetups around Durham, NC, see TriPython.

For experienced programmers fielding such questions, remember that it’s a lot more important to become better at thinking through issues than to receive a quick answer.

With some experience:

As we gain development experience, we become better able to answer our own questions by doing internet searches, but we also encounter more complex questions, like ‘What happened to cause this timeout error?’ or ‘Is it possible to build an app that does...?’. Again, it’s important to do Google searches to see if other people have asked similar questions, or to ask these questions to more advanced developers. Some things I learned at this point:

  • Other people have probably attempted to use this feature, and either written about it, or documented it
  • There are multiple ways to accomplish what I am trying to do
  • Some features/libraries are better than others
  • It’s always helpful to ask myself, ‘What am I trying to solve here?’
  • More experienced developers oftentimes know how to solve my issue better, so I should ask their opinion

For people asking these questions, I recommend searching the internet not only to find what other people have asked or answered about specific features or libraries but also to ask questions about what a particular feature or library means for the project: What do we want to accomplish with this feature or library? How would this feature affect the user?

The answers to these questions can be beneficial in framing questions about the feature or library to more experienced developers.

With more experience:

Greater experience means that we can usually answer our more basic questions on our own. However, it also leads to different questions, oftentimes changing ‘Can I’ and ‘How does one’ questions into ‘Should I’ and ‘Why would one’ questions. An example may be considering a library and researching its issues, maintenance, and other people’s experience with it, or considering a feature for a project and whether it will lead to our own maintenance issues. Generally, the questions we ask can aid in deciding between the tradeoffs of different options, and this is something that a more experienced developer can help with.

Some things I learned at this point:

  • It is almost always possible to have a feature to do
  • Every feature has tradeoffs
  • I usually haven’t thought of all the tradeoffs or the historical reasons for using certain libraries and not others, but a more experienced developer may have

I encourage people with more complex development questions to come up with some options for solving an issue or adding a feature. Use a Google search or another developer’s experience to research these, and then ask a more experienced developer what each option might mean for the project.

A caveat:

When searching online, it is always possible to find outdated answers, especially for rapidly-changing topics like JavaScript, or when researching new libraries. While someone’s answer may have worked 5 years ago, the library may have changed and no longer supports the previous method, or the community may have moved along to use different development guidelines or frameworks. To combat this, I recommend limiting Google searches to just the past year or two, or checking whether older answers have newer comments with more accurate information. Another possibility is finding options on the internet and asking a more experienced developer which one is a better solution.

Further thoughts:

As I have continued learning as a developer, I have become better at knowing what to search for on the internet, and also at asking questions of more experienced developers. I still have lots of basic questions, and now also some more complex ones. While practice makes us better at searching for answers on the internet, I am particularly grateful to be able to ask questions of more experienced developers and to allow them to guide my thinking about certain topics. In fact, I have really relied on the expertise of more experienced developers to guide the way that I approach technical issues and figure out the best way to resolve them. I encourage any tech firm to support these kinds of interactions between developers, whether in a formal way (mentorship meetings, hosting meetups, etc.) or informally. I know it has made my time at Caktus both more enjoyable and more efficient.

Becoming a better developer is an ongoing process. Check out how to plan for mistakes as a developer to continue the journey.

Caktus GroupDigging Into Django QuerySets

Object-relational mappers (or ORMs for short), such as the one that comes built-in with Django, make it easy for even new developers to become productive without needing to have a large body of knowledge about how to make use of relational databases. They abstract away the details of database access, replacing tables with declarative model classes and queries with chains of method calls. Since this is all done in standard Python, developers can build on top of it further, adding instance methods to a model to wrap reusable pieces of logic. However, the abstraction provided by ORMs is not perfect. There are pitfalls lurking for unwary developers, such as the N + 1 problem. On the bright side, it is not difficult to explore and gain a better understanding of Django's ORM. Taking the time and effort to do so will help you become a better Django developer.

In this article I'll be setting up a simple example app, consisting of nothing more than a few models, and then making use of the Django shell to perform various queries and examine the results. You don't have to follow along, but it is recommended that you do so.

First, create a clean virtualenv. Here I'll be using Python 3 all of the way, but there should be little difference with Python 2.

$ mkvirtualenv -p $(which python3) querysets
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/jrb/.virtualenvs/querysets/bin/python3
Also creating executable in /home/jrb/.virtualenvs/querysets/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.

Next, install Django and IPython,

(querysets) $ pip install django ipython

Create the new project.

(querysets) $ django-admin.py startproject querysets
(querysets) $ cd querysets/
(querysets) $ ./manage.py startapp qs

Update querysets/settings.py to add 'qs', to the end of the INSTALLED_APPS list. Then, edit qs/models.py to add the simple models we will be dealing with

from django.db import models

class OnlyOne(models.Model):
    name = models.CharField(max_length=16)

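class MainModel(models.Model):
    name = models.CharField(max_length=16)
    # on_delete is optional in Django 1.10 but required from 2.0 onwards;
    # CASCADE matches the old default behavior.
    one = models.ForeignKey(OnlyOne, on_delete=models.CASCADE)

class RelatedModel(models.Model):
    name = models.CharField(max_length=16)
    main = models.ForeignKey(MainModel, related_name='many',
                             on_delete=models.CASCADE)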

Finally, set up the database.

(querysets) jrb@caktus025:~/caktus/querysets$ ./manage.py makemigrations qs
Migrations for 'qs':
    - Create model MainModel
    - Create model OnlyOne
    - Create model RelatedModel
    - Add field one to mainmodel
(querysets) jrb@caktus025:~/caktus/querysets$ ./manage.py migrate

Running python manage.py shell should now pull up an IPython session.

Now that we have a working project set up, we'll need some means of keeping track of the quantity and the raw SQL of any queries sent to the database. Django's TransactionTestCase class provides an assertNumQueries method, which is interesting but too specific and too tied to the test suite for our needs. However, examining its implementation, we can see that it ultimately makes use of a context manager called CaptureQueriesContext, from the django.test.utils module. This context manager will cause a database connection to capture all of the SQL queries sent, even if such is currently turned off (i.e. if DEBUG = False is set), and make those queries available on the context object. I find this a useful tool to use in debugging to track down code that is issuing too many queries to the database, in situations where Django Debug Toolbar won't help.

At the time of writing, the most recent released version of Django is 1.10.6. I've copied the code for CaptureQueriesContext for this version below, with a few irrelevancies redacted.

class CaptureQueriesContext(object):
    def __init__(self, connection):
        self.connection = connection

    @property
    def captured_queries(self):
        return self.connection.queries[self.initial_queries:self.final_queries]

    def __enter__(self):
        self.force_debug_cursor = self.connection.force_debug_cursor
        self.connection.force_debug_cursor = True
        self.initial_queries = len(self.connection.queries_log)
        self.final_queries = None
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.connection.force_debug_cursor = self.force_debug_cursor
        if exc_type is not None:
            return
        self.final_queries = len(self.connection.queries_log)

So here we can see several things of interest to us. The context manager keeps a reference to the database connection (as self.connection), it sets and then unsets a flag on the connection (self.connection.force_debug_cursor) which tells the connection to do the captures, it stores the number of queries at the start and at the end (self.initial_queries and self.final_queries), and finally, it provides a slice of the actual queries captured as the property captured_queries. Nothing here restricts its use to the test suite, so we'll be making use of it throughout in our IPython session.
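The same pattern (switch capturing on when the block is entered, switch it off on exit, and expose what was recorded) can be tried without Django at all. The following is only a rough analogue built on the standard library's sqlite3 module and its `set_trace_callback` hook. The `CaptureSQL` class and its `captured` attribute are invented for this sketch and are not part of Django.

```python
import sqlite3


class CaptureSQL:
    """Record every SQL statement run on a sqlite3 connection inside a with block."""

    def __init__(self, connection):
        self.connection = connection
        self.captured = []

    def __enter__(self):
        self.captured = []
        # Switch capturing on: sqlite3 reports each statement it executes.
        self.connection.set_trace_callback(self.captured.append)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Switch capturing off again so later statements are not recorded.
        self.connection.set_trace_callback(None)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (id INTEGER PRIMARY KEY, name TEXT)")

with CaptureSQL(conn) as context:
    conn.execute("SELECT id, name FROM main").fetchall()

# Runs outside the block, so it is not captured.
conn.execute("SELECT name FROM main").fetchall()
```

Unlike Django's version, this sketch stores only the SQL text, not the execution time.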

Let's try it out now.

In [1]: from django.test.utils import CaptureQueriesContext

In [2]: from django.db import connection

In [3]: from qs import models

In [4]: with CaptureQueriesContext(connection) as context:
   ...:     print(models.MainModel.objects.all())
<QuerySet []>

In [5]: print(context.initial_queries, context.final_queries)
0 1

So we can see that there were no queries to start out with, and that a query was issued to the database by our code. Let's see what that looks like,

In [6]: print(context.captured_queries)
[{'time': '0.001', 'sql': 'SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "q
s_mainmodel"."one_id" FROM "qs_mainmodel" LIMIT 21'}]

This shows us that the captured_queries property gives us a list of dicts, and each dict contains the raw SQL and the time it took to execute. In the above query, note the LIMIT 21. This is there because the repr() of a QuerySet limits itself to showing no more than 20 of the items it contains. The additional twenty-first item is captured so that it knows whether or not to add an ellipsis at the end to indicate that there are more items available.
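The "fetch one more than you show" trick is easy to reproduce in plain Python. This is a sketch of the idea rather than Django's own code: the `preview` function and the `<Preview ...>` label are made up for the example, while the limit of 20 is taken from the description above.

```python
from itertools import islice

REPR_OUTPUT_SIZE = 20  # show at most this many items


def preview(items):
    """Return a repr-like string showing at most REPR_OUTPUT_SIZE items."""
    # Ask for one item more than will be shown. If it arrives, there is
    # more data than fits, so it is replaced by a truncation marker.
    data = list(islice(items, REPR_OUTPUT_SIZE + 1))
    if len(data) > REPR_OUTPUT_SIZE:
        data[-1] = "...(remaining elements truncated)..."
    return "<Preview %r>" % data


short = preview(range(5))
long = preview(range(1000))
```

Only 21 items are ever pulled from the source, however large it is, which is the role `LIMIT 21` plays in the SQL.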

Let's create some data. First up, we need a quick and dirty way of populating the name fields

In [7]: import random

In [8]: import string

In [9]: def random_name():
   ...:     return ''.join(random.choice(string.ascii_letters) for i in range(16))

In [10]: random_name()
Out[10]: 'nRtybzKaSZWjHOBZ'

Now the objects

In [11]: with CaptureQueriesContext(connection) as context:
    ...:     models.OnlyOne.objects.bulk_create([
    ...:         models.OnlyOne(name=random_name())
    ...:         for i in range(5)
    ...:     ])
    ...:     models.MainModel.objects.bulk_create([
    ...:         models.MainModel(name=random_name(), one_id=i + 1)
    ...:         for i in range(5)
    ...:     ])
    ...:     models.RelatedModel.objects.bulk_create([
    ...:         models.RelatedModel(name=random_name(), main_id=i + 1)
    ...:         for i in range(5)
    ...:         for x in range(7)
    ...:     ])

In [12]: print(context.final_queries - context.initial_queries)
6

In [13]: print(context.captured_queries)
[{'sql': 'BEGIN', 'time': '0.000'}, {'sql': 'INSERT INTO "qs_onlyone" ("name") S
GTwPkXTUSpZYBWCT\'', 'time': '0.000'}, {'sql': 'BEGIN', 'time': '0.000'}, {'sql'
: 'INSERT INTO "qs_mainmodel" ("name", "one_id") SELECT \'fsekHOfSJxdiGiqp\', 1
, 5', 'time': '0.000'}, {'sql': 'BEGIN', 'time': '0.000'}, {'sql': 'INSERT INTO
"qs_relatedmodel" ("name", "main_id") SELECT \'tMOCzPRjKZHbwBLb\', 1 UNION ALL S
'BKgXGwdXJQBMQGJM\', 5', 'time': '0.000'}]

This looks pretty ugly, but we can see that each .bulk_create() results in two queries, a BEGIN starting the transaction, and an INSERT INTO with a crazy set of SELECT and UNION ALL clauses following it.

Ok, now that we are finally all set up, let's explore. What happens if we just create a QuerySet and set it in a variable?

In [14]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()

In [15]: print(context.final_queries - context.initial_queries)
0

In [16]: print(context.captured_queries)
[]

No queries were sent to the database! This is because a Django QuerySet is a lazy object. It contains all of the information it needs to populate itself from the database, but will not actually do so until the information is needed. Similarly, .filter(), .exclude(), and the other QuerySet-returning methods will not, by themselves, trigger a query sent to the database.

In [17]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.filter(name='foo')
    ...: print(context.final_queries - context.initial_queries)
0

In [18]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.filter(name='foo')
    ...:     qs2 = qs.filter(name='bar')
    ...: print(context.final_queries - context.initial_queries)
0

Here we see that even chaining a filtered QuerySet off of another QuerySet is insufficient to cause a database access. However, non-QuerySet-returning methods such as .count() will result in a query sent to the database.

In [19]: with CaptureQueriesContext(connection) as context:
    ...:     count = models.MainModel.objects.count()
    ...: print(context.final_queries - context.initial_queries)
1

So, when will a QuerySet result in a round-trip to the database? Basically, this happens any time concrete results are needed from the QuerySet, such as looping explicitly or implicitly. Here are some of the more typical ones

In [20]: with CaptureQueriesContext(connection) as context:
    ...:     for m in models.MainModel.objects.all():
    ...:         obj = m
    ...:     r = repr(models.OnlyOne.objects.all())
    ...:     l = len(models.RelatedModel.objects.all())
    ...:     list_main = list(models.MainModel.objects.all())
    ...:     b = bool(models.OnlyOne.objects.all())
    ...: print(context.final_queries - context.initial_queries)
5

Note that each of these triggers its own query. The Django docs have a full list of the things that cause a QuerySet to trigger a query.
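The lazy behavior seen so far can be imitated in a few lines of plain Python. This is a toy model under stated assumptions, not how Django implements it: `LazyQuery`, its `log` list and the in-memory `rows` are all invented here, with the log standing in for queries sent to a database.

```python
class LazyQuery:
    """A chainable description of a query; nothing runs until results are needed."""

    def __init__(self, rows, predicates=(), log=None):
        self._rows = rows
        self._predicates = tuple(predicates)
        self.log = log if log is not None else []

    def filter(self, predicate):
        # Build and return a new description; the data is not touched.
        return LazyQuery(self._rows, self._predicates + (predicate,), self.log)

    def _execute(self):
        # Stand-in for a round-trip to the database.
        self.log.append("query")
        return [row for row in self._rows
                if all(predicate(row) for predicate in self._predicates)]

    def __iter__(self):
        # Looping (explicitly or via list()) needs concrete results.
        return iter(self._execute())

    def count(self):
        # A method that returns a value rather than a new LazyQuery.
        return len(self._execute())


log = []
everything = LazyQuery(["alice", "bob", "barbara"], log=log)
only_b = everything.filter(lambda name: name.startswith("b"))
queries_before = len(log)   # chaining alone has run nothing

names = list(only_b)        # iteration triggers the one and only "query"
```

Note that this toy version has no result cache, so iterating `only_b` a second time would run the "query" again.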

We've now seen that simply instantiating a QuerySet doesn't send a query to the database, and that obtaining data out of it does. The next most obvious question is, will a QuerySet ask for data from the database multiple times? Let's find out

In [21]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     L2 = list(qs)
    ...: print(context.final_queries - context.initial_queries)
1

Terrific! Just as we would hope, the QuerySet somehow reuses its previous data when we ask for it again. Keep in mind, though, if we attempt to further refine a QuerySet,

In [22]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     qs2 = qs.filter(name__startswith='b')
    ...:     L2 = list(qs2)
    ...: print(context.final_queries - context.initial_queries)
2

it does not re-use the data. So how does this work? The implementation of QuerySet can be found in django.db.models.query, but in particular, let's look at the implementation of the relevant methods

def __iter__(self):
    self._fetch_all()
    return iter(self._result_cache)

def _fetch_all(self):
    if self._result_cache is None:
        self._result_cache = list(self.iterator())
    if self._prefetch_related_lookups and not self._prefetch_done:
        self._prefetch_related_objects()

def iterator(self):
    return iter(self._iterable_class(self))

So we can see that iterating over a QuerySet will check to see if a cache at ._result_cache is populated yet, and if not, populates it with a list of objects. This list, then, is what will be iterated over. Subsequent iterations will then get the cache, so no further queries are issued. Doing a chained .filter() call, though, results in a new QuerySet that does not share the cache of the previous one.
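The caching behavior can be mimicked with a small stand-alone class. This is a simplified sketch, not Django's implementation: `CachedQuery` and the `fetch` callable (which plays the part of the database) are hypothetical, and only the `_result_cache` / `_fetch_all` naming is borrowed from the code above.

```python
class CachedQuery:
    """Fetch results on first iteration, then serve them from a cache."""

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical callable standing in for the database
        self._result_cache = None

    def _fetch_all(self):
        # Only go to the "database" if the cache has never been filled.
        if self._result_cache is None:
            self._result_cache = list(self._fetch())

    def __iter__(self):
        self._fetch_all()
        return iter(self._result_cache)

    def filter(self, predicate):
        # A refined query is a brand-new object with its own, empty cache.
        return CachedQuery(
            lambda: (row for row in self._fetch() if predicate(row))
        )


calls = []


def fetch():
    calls.append("query")
    return ["alpha", "beta", "bravo"]


qs = CachedQuery(fetch)
first = list(qs)
second = list(qs)                 # served from the cache
calls_after_reuse = len(calls)

refined = list(qs.filter(lambda name: name.startswith("b")))
calls_after_filter = len(calls)   # the refined query had to fetch again
```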

The iterator() method used above is a documented public method, which returns an iterator over a configurable iterable class of model instances. Note that it does not involve the cache, so subsequent calls will result in a new query to the database. So why is this a public method? Under what circumstances would it be useful to not populate the cache? The iterator() method is most useful when you have memory concerns when iterating over a particularly large QuerySet, or one that has a large amount of data stored in the fields, especially if it is known that the QuerySet will only be used once and then thrown away.
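The memory trade-off is the familiar one between a generator and a list. As a loose illustration only, here is the streaming side using plain sqlite3 rather than Django's iterator(). The `numbers` table and the `stream` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)",
                 [(i,) for i in range(1000)])


def stream(connection, sql):
    """Yield rows one at a time; no list of all rows is ever built."""
    for row in connection.execute(sql):
        yield row


# Each row is consumed and discarded as the sum is computed.
total = sum(n for (n,) in stream(conn, "SELECT n FROM numbers"))
```

Calling `stream()` a second time runs the query again, which is the price paid for not keeping the rows around.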

Interestingly, certain non-QuerySet-returning methods such as .count(),

In [23]: with CaptureQueriesContext(connection) as context:
    ...:     qs = models.MainModel.objects.all()
    ...:     L = list(qs)
    ...:     c = qs.count()
    ...: print(context.final_queries - context.initial_queries)
1

can also make use of an already filled cache.

A common pattern that you will see is iterating over a QuerySet in a template, and rendering information about each item, which may involve access of related objects. To simulate this, let's loop and set the name of each item's OnlyOne into a variable.

In [24]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.all():
    ...:         name = item.one.name
    ...: print(context.final_queries - context.initial_queries)
6

Six queries! What could possibly be going on here?

In [25]: for q in context.captured_queries:
    ...:     print(q['sql'])
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM "qs_mainmodel"
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 1
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 2
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 3
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 4
SELECT "qs_onlyone"."id", "qs_onlyone"."name" FROM "qs_onlyone" WHERE "qs_onlyone"."id" = 5

As we can see, we have one query which populates the main QuerySet, but then as each item gets processed, each sends an additional query to get the item's associated OnlyOne object. This is referred to as the N + 1 Problem. But how can we fix it? It turns out that Django comes with a QuerySet method for just this purpose: select_related(). If we adjust our code like this,

In [26]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.select_related('one').all():
    ...:         name = item.one.name
    ...: print(context.final_queries - context.initial_queries)
1

we drop back down to only one query again

In [27]: for q in context.captured_queries:
    ...:     print(q['sql'])
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id", "qs_
onlyone"."id", "qs_onlyone"."name" FROM "qs_mainmodel" INNER JOIN "qs_onlyone" O
N ("qs_mainmodel"."one_id" = "qs_onlyone"."id")

So .select_related('one') tells Django to do an INNER JOIN across the foreign key, and make use of that information when instantiating the objects in Python. Great! The select_related() method is capable of taking multiple arguments and will do a join for each of them. You can also join multiple tables deep by using Django's double-underscore syntax, for example .select_related('foo__bar') would join our main model's table with the table for 'foo', and then further join with the table for 'bar'. Note that other things that would cause a join in the SQL, such as filtering on a field on the related object, will not cause that related object to be made available as a Python object; you still need to specify your .select_related() fields explicitly.
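For readers who want to see the difference at the SQL level without the ORM in the way, here is a rough stand-alone comparison using sqlite3. The table names, the three sample rows and the `select_count` helper are assumptions made for this sketch. It imitates the two query shapes and is not what Django executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE onlyone (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE mainmodel (id INTEGER PRIMARY KEY, name TEXT,
                            one_id INTEGER REFERENCES onlyone(id));
    INSERT INTO onlyone (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO mainmodel (id, name, one_id)
        VALUES (1, 'x', 1), (2, 'y', 2), (3, 'z', 3);
""")

statements = []
conn.set_trace_callback(statements.append)   # record every statement run


def select_count():
    """Number of recorded statements that are SELECTs."""
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


# N + 1: one query for the main rows, then one more per row.
names_n_plus_1 = []
main_rows = conn.execute(
    "SELECT id, name, one_id FROM mainmodel ORDER BY id").fetchall()
for main_id, main_name, one_id in main_rows:
    (one_name,) = conn.execute(
        "SELECT name FROM onlyone WHERE id = ?", (one_id,)).fetchone()
    names_n_plus_1.append(one_name)
n_plus_1_queries = select_count()

# JOIN: the related names arrive together with the main rows.
statements.clear()
names_joined = [row[0] for row in conn.execute(
    "SELECT onlyone.name FROM mainmodel "
    "INNER JOIN onlyone ON mainmodel.one_id = onlyone.id "
    "ORDER BY mainmodel.id")]
join_queries = select_count()
```

With three main rows the loop issues four SELECTs where the join issues one, and the gap grows linearly with the number of rows.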

This all works if the model we are querying has a foreign key to the other model. What if the relationship runs the other way, resulting in a one-to-many relationship?

In [29]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.all():
    ...:         for related in item.many.all():
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
6

In [30]: for q in context.captured_queries:
    ...:     print(q['sql'])
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 1
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 2
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 3
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 4
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" = 5

As before, we get 6 queries. However, if we were to try to use .select_related('many'), we would get a FieldError. For this situation, Django provides a different method to mitigate the problem: prefetch_related.

In [31]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.prefetch_related('many').all():
    ...:         for related in item.many.all():
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
2

Two queries, that's at least better. What's going on here, though, why two? If we take a look at the queries generated, we see

In [32]: for q in context.captured_queries:
    ...:     print(q['sql'])
SELECT "qs_mainmodel"."id", "qs_mainmodel"."name", "qs_mainmodel"."one_id" FROM
SELECT "qs_relatedmodel"."id", "qs_relatedmodel"."name", "qs_relatedmodel"."main
_id" FROM "qs_relatedmodel" WHERE "qs_relatedmodel"."main_id" IN (1, 2, 3, 4, 5)

So it turns out that Django first loads up the QuerySet for MainModel, then it determines what primary key values it received, and then does a second query on RelatedModel, filtering on those that have a foreign key to one of those values.
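The two-step strategy described above can be written out by hand. The sketch below uses sqlite3 with made-up tables and rows, and the grouping into a dict is my own simplification of what happens on the Python side. It shows the general technique, not Django's internals.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mainmodel (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE relatedmodel (id INTEGER PRIMARY KEY, name TEXT,
                               main_id INTEGER REFERENCES mainmodel(id));
    INSERT INTO mainmodel (id, name) VALUES (1, 'x'), (2, 'y');
    INSERT INTO relatedmodel (name, main_id)
        VALUES ('a', 1), ('b', 1), ('c', 2);
""")

# Query 1: load the main rows and note their primary keys.
mains = conn.execute("SELECT id, name FROM mainmodel ORDER BY id").fetchall()
main_ids = [main_id for main_id, _ in mains]

# Query 2: load every related row pointing at one of those keys.
placeholders = ", ".join("?" for _ in main_ids)
related_rows = conn.execute(
    "SELECT name, main_id FROM relatedmodel "
    "WHERE main_id IN (%s) ORDER BY id" % placeholders,
    main_ids,
).fetchall()

# Stitch the two result sets together in Python.
related_by_main = defaultdict(list)
for related_name, main_id in related_rows:
    related_by_main[main_id].append(related_name)

result = {main_name: related_by_main[main_id] for main_id, main_name in mains}
```

The number of queries stays at two no matter how many main rows there are. The cost moves to the size of the IN list and the matching done in Python.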

There is one thing that you should be aware of when prefetching one-to-many relationships in this manner. A fairly typical thing to do is to make use of Django models' object-oriented nature, and write instance methods that do some non-trivial computation, sometimes involving looping or filtering on one-to-many or many-to-many relationships. We'll simulate that here by just using a .filter() call in the inner loop

In [33]: with CaptureQueriesContext(connection) as context:
    ...:     for item in models.MainModel.objects.prefetch_related('many').all():
    ...:         for related in item.many.filter(name__startswith='b'):
    ...:             name = related.name
    ...: print(context.final_queries - context.initial_queries)
7

And now we find that we're back up to seven queries, despite the use of .prefetch_related(). What's going on here is that the prefetch is making item.many.all() act exactly like an already iterated-over QuerySet, like from earlier in this article, by filling its cache for later re-use. However, as in those earlier cases, if you do any further refinement of the QuerySet it does not share the cache with the new QuerySet. In many cases, it would simply be better to iterate over the relationship and filter using Python directly. Additionally, Django starting with version 1.7 introduced a Prefetch object, which allows more control over the query used in the prefetch_related() call. I advise using tools such as Django Debug Toolbar, using real data, to determine what makes the most sense for your use.

There is another thing that you should be aware of when encapsulating queries involving one-to-many or many-to-many relationships. You may see code like this

def some_expensive_calculation(self):
    related_objs = RelatedModel.objects.filter(main=self)

This code, as we should now be able to see, is an anti-pattern that will always issue a query when called from a MainModel item, regardless of whatever optimizations have been used on the QuerySet which obtained the MainModel in the first place. It would be better to do this instead

def some_expensive_calculation(self):
    related_objs = self.many.all()

That way, if we have calling code that does this

for item in models.MainModel.objects.prefetch_related('many'):
    result = item.some_expensive_calculation()

we should only get the two queries we expect, not one for the main set plus one each for however many items are in that set.

So now we've seen that the QuerySets that you use in your apps can have significant real-world performance implications. However, with some care and understanding of the simple concepts behind Django's QuerySets, you can improve your code and become a better Django developer. But more than that, I hope that you take away from this article the realization that you shouldn't be afraid to read Django's source code to see how something works, or to build minimal working examples or simple tools to explore problems within the Django shell.

Read more Django posts on the Caktus blog.

Tim HopperWeb Development and Design for the Backend Developer

I've been tinkering with websites for nearly 20 years. My friend Hunter and I were big into making terrible Angelfire sites as pre-teens. In high school, my dad paid me to make him a webpage for his doctor's office (I used Frontpage). A year or two after that, I read Kevin Yank's "Build Your Own Database Driven Website Using PHP & MySQL" and hacked together a PHP back-end for a Lord of the Rings fan site.

In recent years, I've put together this blog, shouldigetaphd.com, and a few other simple web-based side projects. However, I haven't kept up with modern web development, and my projects have been hacked together from boilerplate or templates. Although I've programmed professionally since 2011, I've spent very little of that time writing anything close to graphical user interfaces.

I have a number of other side projects that I'd like to do at some point, and most of them would require some sort of graphical interface. While I could work on app development, I think web-based implementations would be a great starting place.

A few months back, I decided to stop watching Netflix on the treadmill and instead use those 45 minutes each morning to learn; in particular, I've been trying to learn more about modern(ish) web design and development. My work has a subscription to Safari Books Online which gives me access to copious technical books and video tutorials.

The number of resources available on Safari (along with YouTube, blog posts, etc.) is astounding. I started many video tutorials on Safari that I quickly realized weren't going to be useful. Yet there are many gems to be found, which I share here with you.

What follows is an overview of the technologies I've realized I need to learn more about and links to the resources I've found valuable in learning about them. If you think there are gaps I haven't yet filled or better resources than I've listed below, I'd love your feedback.

What I Knew Going In

I've been a professional software developer and data scientist since 2012. I mostly write Python, but I've programmed in a number of different languages.

I have a pretty good grasp on how HTML and CSS work. I've used enough Javascript over the years to be dangerous; I understood how it runs in the browsers. I understand what a DOM is and how it relates to the page source.

I've used the Python Flask web framework for several projects. I understand how to respond to HTTP requests with server-generated content. I had some idea of how to run my own web server on AWS.

I've used Jekyll, Hugo, and Pelican to create statically generated sites.

I understood DNS at a high level, but never really learned what all the different DNS types were, and I didn't understand why name server changes take so long to propagate.

I had some idea of what node.js and npm are.

I'm a committed Sublime Text user.

A Meta Tutorial on Web Development

A great place to start is Andrew Montalenti's lengthy tutorial on using Python, Flask, Bootstrap, and Mongo to rapidly prototype a website. The tutorial is out of date, but the principles still stand.

Another great resource is Cody Lindley's free Front-End Developer's Handbook. This is a substantial meta-resource that organizes links for learning all angles of front-end development. "It is specifically written with the intention of being a professional resource for potential and currently practicing front-end developers to equip themselves with learning materials and development tools."

Chrome Developer Tools

One of the most important tools for me in learning more about web development has been the Chrome Developer Tools. You can live edit the DOM elements and style sheets and watch how a website changes. I've mostly learned Developer Tools through exploring it myself, but there are lots of tutorials for it on Youtube.

HTML, CSS, and Bootstrap

Many modern websites are responsive: they automatically adapt to various size screens and devices, from phones to desktops. Writing responsive websites from scratch requires deep knowledge of HTML, CSS, Javascript, and browsers. Unless you're doing this professionally, you probably don't want to write a responsive site from scratch.

For several projects, I've used the lightweight Skeleton project to create simple, responsive pages.

Recently, I decided to dive deep into the more robust Bootstrap framework originally developed at Twitter.

I watched Brock Nunn's Building a Responsive Website with Bootstrap (Safari), a two hour tutorial on getting started with Bootstrap. The documentation for Bootstrap is clear (if terse) and worth reading through.

Once you have a basic idea of how Bootstrap works, the best thing you can do is start playing with it. Since I was familiar with the Pelican static site generator, I decided to switch this blog to a Bootstrap theme, starting with pelican-bootstrap3.

I've worked with Bootstrap 3 until now. Bootstrap 4 is about to come out. Bootstrap 4 moves the style sheets from LESS to SASS and adds Flexbox functionality. Unless you understand what those mean (more below), you'd be fine using version 3.

I wanted to get a better grasp on CSS selectors, so I read Eric Meyer's brief Selectors, Specificity, and the Cascade: Applying CSS3 to Documents (Safari).

I watched Marty Hall's JavaScript, jQuery, and jQuery UI tutorial (Safari). I was able to skip big chunks where I already understood certain parts, but it helped me fill in lots of gaps.

Advanced Stylesheets (LESS, SASS, and Flexbox)

There are several alternatives to writing raw CSS. Two popular ones are Less and SASS. These "preprocessors" allow you to write CSS-like stylesheets but with constructs such as variables, nesting, inheritance, and mathematical operators.

I found this brief tutorial on Less (Safari) helpful, and I've enjoyed Less a lot. I haven't used SASS yet, but it's very similar. I'll probably switch to SASS when I start using Bootstrap 4.

Another modern innovation is the Flexbox layout model for CSS. Stone River Learning has a great tutorial on Flexbox (Safari). It seems that Flexbox is the future of CSS-based layouts, and it's worth learning about.

Advanced JavaScript (Elm, React, Angular, Backbone, Ember)

The JavaScript web framework space has exploded. Many of these are implementations of the Model, View, Controller pattern, including React, Angular, and Ember. These tools allow the creation of complex web apps (as well as mobile apps).

Web Server Operations and DNS

I learned a ton from Linux Web Operations (Safari) by Ben Whaley. "The videos discuss the relationship between web and application servers, load balancers, and databases and introduce configuration management, monitoring, containers, cryptography, and DNS."

I've struggled with DNS configuration over the years, so I watched Cricket Liu's Learning DNS series (Safari). I still wouldn't want to be responsible for a company's complex DNS infrastructure, but I can now configure my own sites' DNS with a little more understanding.

Development Automation

Package Managers

It's likely that any modern web project will have some external Javascript dependencies. Package managers (analogous to PyPI or Anaconda.org for Python) have emerged to help support this. Node.js comes with the npm package manager, but Bower seems to make more sense for front-end development.1 Cody Lindley has a nice introduction to npm and Bower. Bower is well documented and easy to start using. There is a nice Flask extension to help you integrate Bower with your Python project.

Task Automation

Web development comes with lots of build-style tasks that have to happen repeatedly. For example, before you can render a webpage in the browser, you might need to convert the Less to CSS and start a local web server. Before deploying to production, you might want to also run tests and minify your Javascript.

There's a GUI application called Codekit that can do a lot of these tasks. You can also do it through a Node.js program called Grunt. I haven't used it yet, but it looks like following the documentation would be the best way to get started.

Gulp is a popular alternative to Grunt.


Visual Design

Design has never been my strong point. One way to compensate for that is to rely on the work of others. There are copious Bootstrap themes available, and some are even free.

I enjoyed Software Engineering Daily's interview with Tracy Osborn on Design for Non-designers. She has some blog posts on the topic. Tracy recommends COLOURLovers for color ideas and Font Pair for selecting fonts from Google Fonts.

User Experience Design

On the topic of UX, I finally read Steve Krug's classic Don't Make Me Think (Safari); it's great. Ginny Redish's Letting Go of Words (Safari) is similarly excellent.


I've learned a lot in the past few months. I've filled in some gaps about how CSS works. I've gotten a better grasp on the Javascript prototype model. I've learned that I can start with higher level tools (e.g. Bootstrap and JQuery) to rapidly build my side projects with some amount of visual appeal. I'm learning how to use available tools to reduce the boilerplate I have to write, automate tedious tasks, and reduce my personal technical debt.

I still have a lot of learning and a lot of practicing ahead of me, but I'm starting to feel confident that I could make headway on some of my projects. The modern frontend development landscape is massive, varied, and ever changing, but that shouldn't prohibit you from diving in if you want to.

  1. The recent buzz in package management has been about Yarn, a replacement for npm. 

Caktus GroupCome Visit Us at PyCon 2017

PyCon 2017 is fast approaching, and we’re excited to support the event this year as sponsors once again. It’s a great opportunity to meet new friends, exchange ideas and interact with the community at large.

Caktus has attended PyCon since 2010 and our developers are always excited to learn from the variety of talks scheduled for the conference. We’ll be represented by a team of 10 attendees who can’t wait to get to Portland.

Sales director Julie White and last year's prize winner

We look forward to welcoming visitors to our booth from May 18-20 to chat about building Django apps, Python best practices, or industry trends. Interested in working with Caktus or speaking with us about what it’s like to work here? Stop by during the job fair to learn more about joining Caktus as a Django Web Developer. And of course, we’ll be doing giveaways - but you’ll have to stop by and say hi to win!

In the meantime, follow Caktus on Twitter for a sneak preview of our giveaways and an early chance to win (more on this soon, so keep an eye on our feed).

Our booth is always busy, so be sure to contact us in advance to ensure dedicated time with our team.

Are you coming to PyCon this year? Let us know in the comments what talks, workshops, or events you’re most excited about.

Caktus GroupHosting Django Sites on Amazon Elastic Beanstalk


Elastic Beanstalk, part of Amazon Web Services (AWS), is a service that bundles up a number of AWS's lower-level services to manage many details for you when deploying a site. We particularly like it for deploys and autoscaling.

We were first introduced to Elastic Beanstalk when taking over an existing project that used it. It's not without its shortcomings, but we've generally been happy enough with it to stick with it for the project and consider it for others.


Elastic Beanstalk can handle a number of technologies that people use to build web sites, including Python (2.7 and 3.4), Java, PHP, .Net, Node.js, and Ruby. More are probably in the works.

You can also deploy containers, with whatever you want in them.

To add a site to Elastic Beanstalk, you create a new Elastic Beanstalk application, set some configuration, and upload the source for your application. Then Elastic Beanstalk will provision the necessary underlying resources, such as EC2 (virtual machine) instances, load balancers, autoscaling groups, DNS, etc, and install your application appropriately.

Elastic Beanstalk can monitor the load and automatically scale underlying resources as needed.

When your application needs updating, you upload the updated source, and Elastic Beanstalk updates the underlying resources. You can choose from multiple update strategies.

For example, on our staging server, we have Elastic Beanstalk update the application on our existing web servers, one at a time. In production, we have Elastic Beanstalk start setting up a new set of servers, check their health, and only start directing traffic to them as they're up and healthy. Then it shuts down the previous servers.

Elastic Beanstalk and Python

As you'd expect, when deploying a Python application, Elastic Beanstalk will create a virtual environment and install whatever's in your requirements.txt file.

A lot of other configuration can be done. Here are some example configuration file snippets from Amazon's documentation on Elastic Beanstalk for Python:

    DJANGO_SETTINGS_MODULE: production.settings
    "/images/": "staticimages/"
    WSGIPath: ebdjango/wsgi.py
    NumProcesses: 3
    NumThreads: 20

    libmemcached-devel: '0.31'

    command: "django-admin.py collectstatic --noinput"
    command: "django-admin.py syncdb --noinput"
    leader_only: true
    command: "django-admin.py migrate"
    leader_only: true

Here are some things we can note:

  • We can define environment variables, like DJANGO_SETTINGS_MODULE.
  • We can tell Elastic Beanstalk to serve static files for us.
  • We ask Elastic Beanstalk to run our Django application using WSGI.
  • We can install additional packages, like memcached.
  • We can run commands during deploys.
  • Using leader_only, we can tell Elastic Beanstalk that some commands only need to be run on one instance during the deploy.

This is also where we could set parameters for autoscaling, configure the deployment strategy, set up notifications, and many other things.


The leader_only feature is great for doing something on only one of your servers during a deploy. But don't make the mistake we made of trying to use that to configure one server differently from the others, for example to run some background task periodically. Once the deploy is done, there's nothing special about the server that ran the leader_only commands during the deploy, and that server is as likely as any other to be terminated during autoscaling.

Right now, Elastic Beanstalk doesn't provide any way to readily differentiate servers so you can, for example, run things on only one server at a time. You'll have to do that yourself.

Our solution for the situation where we ran into this was to use select_for_update to "lock" the records we were updating.


You can have Elastic Beanstalk manage RDS (Amazon's hosted database service) for you, but so far we've preferred to set up RDS outside of Elastic Beanstalk. If Elastic Beanstalk was managing it, then Elastic Beanstalk would provide pointers to the database server in environment variables, which would be more convenient than keeping track of it ourselves. On the other hand, with our database outside of Elastic Beanstalk, we know our data is safe, even if we make some terrible mistake in our Elastic Beanstalk configuration.


As always, you have to think carefully when deploying an update that includes migrations. Both of the deploy strategies we mentioned earlier will result in the migrations running on the database while some servers are still running the previous code, so you need to be sure that'll work. But that's a problem anytime you have multiple web servers and deserves a blog post of its own.

Time to deploy

One thing we're not happy with is the time it takes for an update deploy - over 20 minutes for our staging environment, and over 40 minutes for our production environment with its more conservative deploy strategy.

Some of that time is under our control. For example, it still takes longer than we'd like to set up the environment on each server, including building new environments for both Python and Node.js. We've already made some speedups there, and will continue working on that.

Other parts of the time are not under our control, especially some parts unique to production deploys. It takes AWS several minutes to provision and start a new EC2 instance, even before starting to set up our application's specific environment. It does that for one new server, waits until it's completely ready and tests healthy (several more minutes), and then starts the process all over again for the rest of the new servers it needs. (Those are done in parallel.)

When all the traffic is going to the new servers, it starts terminating the previous servers at 3-minute intervals, and waits for all of those to finish before declaring the deploy complete.

There are clear advantages to doing things this way: Elastic Beanstalk won't publish a completely broken version of our application, and the site never has any downtime during a production deploy. There are CLI tools that help you manage your deploys, plus a nice web interface to monitor what’s going on. We just wish the individual steps didn't take so long. Hopefully Elastic Beanstalk will find ways over time to improve things.


Despite some of its current shortcomings, we're quite happy to have Elastic Beanstalk in our deployment and hosting toolbox. We previously built our own tool for deploying and managing Django projects in an AWS autoscaling environment (before many of the more recent additions to the AWS suite, such as RDS for Postgres), and we know how much work it is to design, build, and maintain such a platform.

Tim HopperAutomating Python with Ansible

I wrote a few months back about how data scientists need more automation. In particular, I suggested that data scientists would be wise to learn more about automated system configuration and automated deployments.

In an attempt to take my own advice, I've finally been making myself learn Ansible. It turns out that a great way to learn it is to sit down and read through the docs, front to back; I commend that tactic to you. I also put together this tutorial to walk through a practical example of how a working data scientist might use this powerful tool.

What follows is an Ansible guide that will take you from installing Ansible to automatically deploying a long-running Python application to a remote machine and running it in a Conda environment using supervisord. It presumes your development machine is on OS X and the remote machine is Debian-like; however, it shouldn't require too many changes to run it on other systems.

I wrote this post in a Jupyter notebook with a Bash kernel. You can find the notebook, Ansible files, and installation directions on my Github.


Ansible provides "human readable automation" for "app deployment" and "configuration management". Unlike tools like Chef, it doesn't require an agent to be running on remote machines. In short, it translates declarative YAML files into shell commands and runs them on your machines over SSH.

Ansible is backed by Red Hat and has a great website.

Installing Ansible with Homebrew

First, you'll need to install Ansible. On a Mac, I recommend doing this with Homebrew.

In [2]:
brew install ansible
Warning: ansible- already installed
Warning: You are using OS X 10.12.
We do not provide support for this pre-release version.
You may encounter build failures or other breakages.


Soon, I'll show you how to write an Ansible YAML file. However, Ansible also allows you to specify tasks from the command line.

Here's how we could use Ansible to ping our local host:

In [3]:
ansible -i 'localhost,' -c local -m ping all
localhost | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}

This command calls ansible and tells it:

  • To use localhost as its inventory (-i). Inventory is Ansible speak for the machine or machines you want to be able to run commands on.
  • To connect (-c) locally (local) instead of over SSH.
  • To run the ping module (-m) to test the connection.
  • To run the command on all hosts in the inventory (in this case, our inventory is just the localhost).

Michael Booth has a post that goes into more detail about this command.

Behind the scenes, Ansible is turning this -m ping command into shell commands. (Try running with the -vvv flag to see what's happening behind the scenes.) It can also execute arbitrary commands; by default, it'll use the Bourne shell sh.

In [4]:
ansible all -i 'localhost,' -c local -a "/bin/echo hello"

Setting up an Ansible Inventory

Instead of specifying our inventory with the -i flag each time, we should specify an Ansible inventory file. This file is a text file specifying machines you have SSH access to; you can also group machines under bracketed headings. For example:




Ansible has to be able to connect to these machines over SSH, so you will likely need to have relevant entries in your .ssh/config file.

By default, the Ansible CLI will look for a system-wide Ansible inventory file in /etc/ansible/hosts. You can also specify an alternative path for an inventory file with the -i flag.
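To make the bracketed-group layout of an inventory file concrete, here is a small Python sketch that reads a simple INI-style inventory with the standard library's configparser. The group names and hostnames are hypothetical, and this is not how Ansible itself parses inventories (real inventories support features such as host variables and ranges that this sketch ignores).

```python
import configparser

# Hypothetical inventory: two groups of machines we can reach over SSH.
INVENTORY = """\
[webservers]
web1.example.com
web2.example.com

[databases]
db1.example.com
"""


def parse_inventory(text):
    """Return a mapping of group name to the list of hosts in that group."""
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.optionxform = str  # keep hostnames exactly as written
    parser.read_string(text)
    return {group: list(parser[group]) for group in parser.sections()}


groups = parse_inventory(INVENTORY)
for group, hosts in groups.items():
    print(group, hosts)
```

Each bracketed heading becomes a group, and a pattern such as all would cover every host listed under any heading.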

For this tutorial, I'd like to have an inventory file specific to the project directory without having to specify it each time we call Ansible. We can do this by creating a file called ./ansible.cfg and setting the name of our local inventory file:

In [5]:
cat ./ansible.cfg
[defaults]
inventory = ./hosts

You can check that Ansible is picking up your config file by running ansible --version.

In [6]:
ansible --version
  config file = /Users/tdhopper/repos/automating_python/ansible.cfg
  configured module search path = Default w/o overrides

For this example, I just have one host, a Digital Ocean VPS. To run the examples below, you should create a VPS instance on Digital Ocean, Amazon, or elsewhere; you'll want to configure it for passwordless authentication. I have an entry like this in my ~/.ssh/config file:

Host digitalocean
  HostName 45.55.395.23
  User root
  Port 22
  IdentityFile /Users/tdhopper/.ssh/id_rsa
  ForwardAgent yes

and my inventory file (./hosts) is just

digitalocean

Before trying ansible, you should ensure that you can connect to this host:

In [7]:
ssh digitalocean echo 1

Now I can verify that Ansible can connect to my machine by running the ping command.

In [8]:
ansible all -m ping
digitalocean | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}

We told Ansible to run this command on all specified hosts in the inventory. It found our inventory by loading the ansible.cfg which specified ./hosts as the inventory file.

It's possible that this will fail for you even if you can SSH into the machine. If the error is something like /bin/sh: 1: /usr/bin/python: not found, this is because your VPS doesn't have Python installed on it. You can install it with Ansible, but you may just want to manually run sudo apt-get -y install python on the VPS to get started.

Writing our first Playbook

While ad hoc commands will often be useful, the real power of Ansible comes from creating repeatable sets of instructions called Playbooks.

A playbook contains a list of "plays". Each play specifies a set of tasks to be run and which hosts to run them on. A "task" is a call to an Ansible module, like the "ping" module we've already seen. Ansible comes packaged with about 1000 modules for all sorts of use cases. You can also extend it with your own modules and roles.
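One way to picture that nesting is as plain data: a playbook is a list of plays, and each play names its hosts and carries a list of tasks. The Python sketch below is only a mental model under that assumption; the describe helper and the RESERVED set are hypothetical and do not reflect how Ansible loads or runs playbooks.

```python
# A playbook pictured as data: one play, targeting all hosts, with one task
# that calls the ping module with no arguments.
playbook = [
    {
        "hosts": "all",
        "tasks": [
            {"name": "ping all hosts", "ping": {}},
        ],
    },
]

# Keys that describe a task rather than naming the module it calls
# (an illustrative subset, not Ansible's full list of task keywords).
RESERVED = {"name", "become", "when", "tags"}


def describe(plays):
    """List which module each task would call, and on which hosts."""
    lines = []
    for play in plays:
        for task in play["tasks"]:
            module = next(key for key in task if key not in RESERVED)
            lines.append(f"{play['hosts']}: {task['name']} -> {module}")
    return lines


for line in describe(playbook):
    print(line)  # all: ping all hosts -> ping
```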

Our first playbook will just execute the ping module on all our hosts. It's a playbook with a single play comprised of a single task.

In [9]:
cat ping.yml
- hosts: all
  tasks:
  - name: ping all hosts
    ping:

We can run our playbook with the ansible-playbook command.

In [10]:
ansible-playbook ping.yml
< PLAY [all] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

< TASK [setup] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

ok: [digitalocean]
< TASK [ping all hosts] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

ok: [digitalocean]
< PLAY RECAP >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

digitalocean               : ok=2    changed=0    unreachable=0    failed=0   

You might wonder why there are cows on your screen. You can find out here. However, the important thing is that our task was executed and returned successfully.

We can override the hosts list for the play with the -i flag to see what the output looks like when Ansible fails to run the play because it can't find the host.

Let's work now on installing the dependencies for our Python project.

Installing supervisord

"Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems." We'll use it to run and monitor our Python process.

On a Debian-like system, we can install it with APT. In the Ansible DSL that's just:

- name: Install supervisord
  sudo: yes
  apt:
    name: supervisor
    state: present
    update_cache: yes

You can read more about the apt module here.

Once we have it installed, we can start it with this task:

- name: Start supervisord
  sudo: yes
  service:
    name: "supervisor"
    state: running
    enabled: yes

This uses the service module.

We could add these tasks to a playbook file (like ping.yml), but what if we want to share them among multiple playbooks? For this, Ansible has a construct called Roles. A role is a collection of "variable values, certain tasks, and certain handlers – or just one or more of these things". (You can learn more about variables and handlers in the Ansible docs.)

Roles are organized as subfolders of a folder called "roles" in the working directory. The rapid proliferation of folders in Ansible organization can be overwhelming, but a very simple role is just a file called main.yml nestled several folders deep. In our case, it's in ./roles/supervisor/tasks/main.yml.

Check out the docs to learn more about role organization.
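As a rough sketch of that layout, the following Python snippet creates the minimal folder structure for a role in a temporary directory. The scaffold_role helper is hypothetical and only mirrors the path convention described above; Ansible's own ansible-galaxy init command generates a fuller skeleton.

```python
import tempfile
from pathlib import Path


def scaffold_role(project_dir, role_name):
    """Create roles/<role_name>/tasks/main.yml under project_dir if missing."""
    tasks_file = Path(project_dir) / "roles" / role_name / "tasks" / "main.yml"
    tasks_file.parent.mkdir(parents=True, exist_ok=True)
    if not tasks_file.exists():
        tasks_file.write_text("# tasks for this role go here\n")
    return tasks_file


with tempfile.TemporaryDirectory() as project:
    created = scaffold_role(project, "supervisor")
    print(created.relative_to(project).as_posix())
    # roles/supervisor/tasks/main.yml
```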

Here's what our role looks like:

In [11]:
cat ./roles/supervisor/tasks/main.yml
- name: Install supervisord
  become: true
  apt:
    name: supervisor
    state: present
    update_cache: yes
- name: Start supervisord
  become: true
  service:
    name: "supervisor"
    state: running
    enabled: yes

Note that I added tags: to the task definitions. Tags just allow you to run a portion of a playbook instead of the whole thing with the --tags flag for ansible-playbook.

Now that we have the supervisor install encapsulated in a role, we can write a simple playbook to run the role.

In [12]:
cat supervisor.yml
- hosts: digitalocean
  roles:
    - role: supervisor

In [13]:
ansible-playbook supervisor.yml
< PLAY [digitalocean] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

< TASK [setup] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

ok: [digitalocean]
< TASK [supervisor : Install supervisord] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

changed: [digitalocean]
< TASK [supervisor : Start supervisord] >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

changed: [digitalocean]
< PLAY RECAP >
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

digitalocean               : ok=3    changed=2    unreachable=0    failed=0   

Installing Conda with Ansible Galaxy

Next we want to ensure that Conda is installed on our system. We could write our own role to follow the recommended process. However, Ansible has a helpful tool to help us avoid reinventing the wheel by allowing users to share roles; this is called Ansible Galaxy.

You can search the Galaxy website for miniconda and see that a handful of roles for installing Miniconda exist. I liked this one.

We can install the role locally using the ansible-galaxy command line tool.

In [14]:
ansible-galaxy install -f andrewrothstein.miniconda

You can have the role installed wherever you want (run ansible-galaxy install --help to see how), but by default roles go to /usr/local/etc/ansible/roles/.

In [15]:
ls -lh /usr/local/etc/ansible/roles/andrewrothstein.miniconda
total 32
-rw-rw-r--  1 tdhopper  admin   1.1K Jan 16 16:52 LICENSE
-rw-rw-r--  1 tdhopper  admin   666B Jan 16 16:52 README.md
-rw-rw-r--  1 tdhopper  admin   973B Jan 16 16:52 circle.yml
drwxrwxr-x  3 tdhopper  admin   102B Mar 21 11:33 defaults
drwxrwxr-x  3 tdhopper  admin   102B Mar 21 11:33 handlers
drwxrwxr-x  4 tdhopper  admin   136B Mar 21 11:33 meta
drwxrwxr-x  3 tdhopper  admin   102B Mar 21 11:33 tasks
drwxrwxr-x  3 tdhopper  admin   102B Mar 21 11:33 templates
-rw-rw-r--  1 tdhopper  admin    57B Jan 16 16:52 test.yml
drwxrwxr-x  3 tdhopper  admin   102B Mar 21 11:33 vars

You can look at the tasks/main.yml to see the core logic of installing Miniconda. It has tasks to download the installer, run the installer, delete the installer, run conda update conda, and make conda the default system Python.

In [16]:
cat /usr/local/etc/ansible/roles/andrewrothstein.miniconda/tasks/main.yml
# tasks file for miniconda
- name: download installer...
  become: yes
  become_user: root
  get_url:
    url: '{{miniconda_installer_url}}'
    dest: /tmp/{{miniconda_installer_sh}}
    timeout: '{{miniconda_timeout_seconds}}'
    checksum: '{{miniconda_checksum}}'
    mode: '0755'

- name: installing....
  become: yes
  become_user: root
  command: /tmp/{{miniconda_installer_sh}} -b -p {{miniconda_parent_dir}}/{{miniconda_name}}
  args:
    creates: '{{miniconda_parent_dir}}/{{miniconda_name}}'

- name: deleting installer...
  become: yes
  become_user: root
  when: miniconda_cleanup
  file:
    path: /tmp/{{miniconda_installer_sh}}
    state: absent

- name: link miniconda...
  become: yes
  become_user: root
  file:
    dest: '{{miniconda_parent_dir}}/miniconda'
    src: '{{miniconda_parent_dir}}/{{miniconda_name}}'
    state: link

- name: conda updates
  become: yes
  become_user: root
  command: '{{miniconda_parent_dir}}/miniconda/bin/conda update -y --all'

- name: make system default python etc...
  when: miniconda_make_sys_default
  become: yes
  become_user: root
  with_items:
    - etc/profile.d/miniconda.sh
  template:
    src: '{{item}}.j2'
    dest: /{{item}}
    mode: 0644

Overriding Ansible Variables

Once a roll is installed locally, you can add it to a play just like you can with roles you wrote. Installing Miniconda is now as simple as:

  roles:
    - role: andrewrothstein.miniconda

Before we add that to a playbook, I want to customize where Miniconda is installed. If you look back at the main.yml file above, you see a bunch of things surrounded by double curly braces. These are variables (in the Jinja2 template language). From the install task, we can see that Miniconda will be installed at {{miniconda_parent_dir}}/{{miniconda_name}}. The role defines these variables in /andrewrothstein.miniconda/defaults/main.yml. We can override the default variables by specifying them in our play.

A play to install miniconda could look like this:

- hosts: digitalocean
  vars:
    conda_folder_name: miniconda
    conda_root: /root
  roles:
    - role: andrewrothstein.miniconda
      miniconda_parent_dir: "{{ conda_root }}"
      miniconda_name: "{{ conda_folder_name }}"

I added this to playbook.yml.
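To make the override mechanics concrete, here is a minimal Python sketch of how the install path resolves. It is not how Ansible is implemented: the `render` helper is a simplified stand-in for Jinja2 that only handles `{{ name }}` placeholders, and the values in `ROLE_DEFAULTS` are hypothetical placeholders rather than the role's real defaults.

```python
import re

# Hypothetical defaults; the real values live in the role's defaults/main.yml.
ROLE_DEFAULTS = {
    "miniconda_parent_dir": "/usr/local",
    "miniconda_name": "miniconda-default",
}

# Variables set in the play, mirroring the example above.
PLAY_VARS = {"conda_root": "/root", "conda_folder_name": "miniconda"}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace {{ name }} placeholders until none are left to resolve."""
    previous = None
    while previous != template:
        previous = template
        template = PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)
    return template


def install_path(role_params):
    # Later dicts win, so parameters passed to the role override its defaults.
    variables = {**ROLE_DEFAULTS, **PLAY_VARS, **role_params}
    return render("{{miniconda_parent_dir}}/{{miniconda_name}}", variables)


default_path = install_path({})
overridden_path = install_path({
    "miniconda_parent_dir": "{{ conda_root }}",
    "miniconda_name": "{{ conda_folder_name }}",
})
print(default_path)
print(overridden_path)
```

With the play variables above, the overridden path resolves to /root/miniconda, which is the location the later conda tasks rely on.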

We now know how to use Ansible to start and run supervisord and to install Miniconda. Let's see how to use it to deploy and start our application.

Deploy Python Application

There are countless ways to deploy a Python application. We're going to see how to use Ansible to deploy from GitHub.

I created a little project called long_running_python_application. It has a main.py that writes a log line to stdout every 30 seconds; that's it. It also includes a Conda environment file specifying the dependencies and a shell script that activates the environment and runs the program.

We're going to use Ansible to

  1. Clone the repository into our remote machine.
  2. Create a Conda environment based on the environment.yml file.
  3. Create a supervisord file for running the program.
  4. Start the supervisord job.

Clone the repository

Cloning a repository with Ansible is easy. We just use the git module. This play will clone the repo into the specified directory. The update: yes flag tells Ansible to update the repository from the remote if it has already been cloned.

- hosts: digitalocean
  vars:
    project_repo: git://github.com/tdhopper/long_running_python_process.git
    project_location: /srv/long_running_python_process
  tasks:
    - name: Clone project code.
      git:
        repo: "{{ project_repo }}"
        dest: "{{ project_location }}"
        update: yes

Creating the Conda Environment

Since we've now installed conda and cloned the repository with an environment.yml file, we just need to run conda env update from the directory containing the environment spec. Here's a play to do that:

- hosts: digitalocean
  vars:
    project_location: /srv/long_running_python_process
  tasks:
    - name: Create Conda environment from project environment file.
      command: "{{conda_root}}/{{conda_folder_name}}/bin/conda env update"
      args:
        chdir: "{{ project_location }}"

It uses the command module, which simply runs the given command; the chdir argument makes it run from the project directory.

Create a Supervisord File

By default, supervisord will look in /etc/supervisor/conf.d/ for configuration on which programs to run.

We need to put a file in there that tells supervisord to run our run.sh script. Ansible has an integrated way of setting up templates which can be placed on remote machines.

I put a supervisord job template in the ./templates folder.

In [17]:
cat ./templates/run_process.j2
[program:{{ program_name }}]
command=sh run.sh
directory={{ project_location }}
stderr_logfile=/var/log/{{ program_name }}.err.log
stdout_logfile=/var/log/{{ program_name }}.out.log

This is a normal INI-style config file, except it includes Jinja2 variables. We can use the Ansible template module to create a task which fills in the variables with information about our program and copies it into the conf.d folder on the remote machine.

The play for this would look like:

- hosts: digitalocean
  vars:
    project_location: /srv/long_running_python_process
    program_name: long_running_process
    supervisord_configs_path: /etc/supervisor/conf.d
  tasks:
    - name: Copy supervisord job file to remote
      template:
        src: ./templates/run_process.j2
        dest: "{{ supervisord_configs_path }}/run_process.conf"
        owner: root
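As a rough illustration of what the template task produces, the sketch below fills in the job template in plain Python and then parses the result to confirm it is valid INI. The `render` function is a simplified stand-in for Jinja2 (an assumption for illustration, not what Ansible runs), and the variable values are the ones from the play above.

```python
import configparser
import re

# Template text copied from run_process.j2 above.
TEMPLATE = """\
[program:{{ program_name }}]
command=sh run.sh
directory={{ project_location }}
stderr_logfile=/var/log/{{ program_name }}.err.log
stdout_logfile=/var/log/{{ program_name }}.out.log
"""

VARIABLES = {
    "program_name": "long_running_process",
    "project_location": "/srv/long_running_python_process",
}


def render(template, variables):
    """Minimal stand-in for Jinja2: substitute {{ name }} placeholders only."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], template)


rendered = render(TEMPLATE, VARIABLES)

# Parse the rendered text to confirm it is well-formed INI.
config = configparser.ConfigParser(interpolation=None)
config.read_string(rendered)
section = config["program:long_running_process"]
print(rendered)
```

Rendering locally like this can be a quick way to catch a misspelled variable before the file ever reaches the remote machine.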

Start the supervisord job

Finally, we just need to tell supervisord on our remote machine to start the job described by run_process.conf.

Instead of issuing our own shell commands via Ansible, we can use the supervisorctl module. The task is just:

    - name: Start job
      supervisorctl:
        name: "{{ program_name }}"
        state: present

state: present ensures that Ansible calls supervisorctl reread to load a new config. Because our config has autostart=true, supervisor will start it as soon as the task is added.

The Big Playbook!

We can take everything we've described above and put it in one playbook.

This playbook will:

  • Install Miniconda using the role from Ansible Galaxy.
  • Install and start Supervisor using the role we created.
  • Clone the GitHub project we want to run.
  • Create a Conda environment based on the environment.yml file.
  • Create a supervisord file for running the program.
  • Start the supervisord job.

All of this will be done on the host we specify (digitalocean).

In [18]:
cat playbook.yml
- hosts: digitalocean
  vars:
    project_repo: git://github.com/tdhopper/long_running_python_process.git
    project_location: /srv/long_running_python_process
    program_name: long_running_process
    conda_folder_name: miniconda
    conda_root: /root
    supervisord_configs_path: /etc/supervisor/conf.d
  roles:
    - role: andrewrothstein.miniconda
      miniconda_parent_dir: "{{ conda_root }}"
      miniconda_name: "{{ conda_folder_name }}"
    - role: supervisor
  tasks:
    - name: Clone project code.
      git:
        repo: "{{ project_repo }}"
        dest: "{{ project_location }}"
        update: yes
    - name: Create Conda environment from project environment file.
      command: "{{conda_root}}/{{conda_folder_name}}/bin/conda env update"
      args:
        chdir: "{{ project_location }}"
    - name: Copy supervisord job file to remote
      template:
        src: ./templates/run_process.j2
        dest: "{{ supervisord_configs_path }}/run_process.conf"
        owner: root
    - name: Start job
      supervisorctl:
        name: "{{ program_name }}"
        state: present

To configure our machine, we just have to run ansible-playbook playbook.yml.

In [19]:
ANSIBLE_NOCOWS=1 ansible-playbook playbook.yml

PLAY [digitalocean] ************************************************************

TASK [setup] *******************************************************************
ok: [digitalocean]

TASK [andrewrothstein.unarchive-deps : resolve platform specific vars] *********

TASK [andrewrothstein.unarchive-deps : install common pkgs...] *****************
changed: [digitalocean] => (item=[u'tar', u'unzip', u'gzip', u'bzip2'])

TASK [andrewrothstein.bash : install bash] *************************************
ok: [digitalocean]

TASK [andrewrothstein.alpine-glibc-shim : fix alpine] **************************
skipping: [digitalocean]

TASK [andrewrothstein.miniconda : download installer...] ***********************
changed: [digitalocean]

TASK [andrewrothstein.miniconda : installing....] ******************************
changed: [digitalocean]

TASK [andrewrothstein.miniconda : deleting installer...] ***********************
skipping: [digitalocean]

TASK [andrewrothstein.miniconda : link miniconda...] ***************************
changed: [digitalocean]

TASK [andrewrothstein.miniconda : conda updates] *******************************
changed: [digitalocean]

TASK [andrewrothstein.miniconda : make system default python etc...] ***********
skipping: [digitalocean] => (item=etc/profile.d/miniconda.sh) 

TASK [supervisor : Install supervisord] ****************************************
ok: [digitalocean]

TASK [supervisor : Start supervisord] ******************************************
ok: [digitalocean]

TASK [Clone project code.] *****************************************************
changed: [digitalocean]

TASK [Create Conda environment from project environment file.] *****************
changed: [digitalocean]

TASK [Copy supervisord job file to remote] *************************************
changed: [digitalocean]

TASK [Start job] ***************************************************************
changed: [digitalocean]

PLAY RECAP *********************************************************************
digitalocean               : ok=13   changed=9    unreachable=0    failed=0   

See that the PLAY RECAP shows that everything was OK, no systems were unreachable, and no tasks failed.

We can verify that the program is running without error:

In [20]:
ssh digitalocean sudo supervisorctl status
long_running_process             RUNNING   pid 4618, uptime 0:01:34
In [21]:
ssh digitalocean cat /var/log/long_running_process.out.log
INFO:root:Process ran for the 1th time
INFO:root:Process ran for the 2th time
INFO:root:Process ran for the 3th time
INFO:root:Process ran for the 4th time
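The status check above can also be scripted rather than read by eye. This is a small sketch, assuming the output keeps the name / state / detail layout shown above; `parse_status_line` is a hypothetical helper, not part of Supervisor or Ansible.

```python
def parse_status_line(line):
    """Split one line of `supervisorctl status` output into name, state, and detail."""
    parts = line.split(None, 2)
    name, state = parts[0], parts[1]
    detail = parts[2] if len(parts) > 2 else ""
    return {"name": name, "state": state, "detail": detail}


# Sample line copied from the output above.
sample = "long_running_process             RUNNING   pid 4618, uptime 0:01:34"
status = parse_status_line(sample)
print(status)
```

In practice the line would come from running the same ssh command through something like `subprocess.run`, and a deployment script could fail loudly whenever the state is anything other than RUNNING.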

If you're lucky (i.e. your systems and networks were set up sufficiently similarly to mine), you can run this exact same command to configure and start a process on your own system. Moreover, you could use this exact same command to start this program on an arbitrary number of machines by simply adding more hosts to your inventory and play spec!


Ansible is a powerful, customizable tool. Unlike some similar tools, it requires very little setup to start using it. As I've learned more about it, I've seen more and more ways in which I could've used it in copious projects in the past; I intend to make it a regular part of my toolkit. (Historically I've done this kind of thing with hacky combinations of shell scripts and Fabric; Ansible would often be better.)

This tutorial just scratches the surface of the Ansible functionality. If you want to learn more, I again recommend reading through the docs; they're very good. Of course, you should start writing and running your own playbooks as soon as possible! I also liked this tutorial from Server Admin for Programmers. If you want to compare Ansible to alternatives, the Taste Test book by Matt Jaynes looks promising. For more on Supervisor, serversforhackers.com has a nice tutorial, and its docs are thorough.

Caktus GroupA Production-ready Dockerfile for Your Python/Django App

Docker has matured a lot since it was released nearly 4 years ago. We’ve been watching it closely at Caktus, and have been thrilled by the adoption -- both by the community and by service providers. As a team of Python and Django developers, we’re always searching for best of breed deployment tools. Docker is a clear fit for packaging the underlying code for many projects, including the Python and Django apps we build at Caktus.

Technical overview

There are many ways to containerize a Python/Django app, no one of which could be considered “the best.” That being said, I think the following approach provides a good balance of simplicity, configurability, and container size. The specific tools I’ll be using are: Docker (of course), Alpine Linux, and uWSGI.

Alpine Linux is a simple, lightweight Linux distribution based on musl libc and Busybox. Its main claim to fame on the container landscape is that it can create a very small (5MB) Docker image. Typically one’s application will be much larger than that after the code and all dependencies have been included, but the container will still be much smaller than if based on a general-purpose Linux distribution.

There are many WSGI servers available for Python, and we use both Gunicorn and uWSGI at Caktus. A couple of the benefits of uWSGI are that (1) it’s almost entirely configurable through environment variables (which fits well with containers), and (2) it includes native HTTP support, which can circumvent the need for a separate HTTP server like Apache or Nginx, provided static files are hosted on a 3rd-party CDN such as Amazon S3.

The Dockerfile

Without further ado, here’s a production-ready Dockerfile you can use as a starting point for your project (it should be added in your top level project directory, or whichever directory contains the Python package(s) provided by your application):

FROM python:3.5-alpine

# Copy in your requirements file
ADD requirements.txt /requirements.txt

# OR, if you’re using a directory for your requirements, copy everything (comment out the above and uncomment this if so):
# ADD requirements /requirements

# Install build deps, then run `pip install`, then remove unneeded build deps all in a single step. Correct the path to your production requirements file, if needed.
RUN set -ex \
    && apk add --no-cache --virtual .build-deps \
            gcc \
            make \
            libc-dev \
            musl-dev \
            linux-headers \
            pcre-dev \
            postgresql-dev \
    && pyvenv /venv \
    && /venv/bin/pip install -U pip \
    && LIBRARY_PATH=/lib:/usr/lib /bin/sh -c "/venv/bin/pip install --no-cache-dir -r /requirements.txt" \
    && runDeps="$( \
            scanelf --needed --nobanner --recursive /venv \
                    | awk '{ gsub(/,/, "\nso:", $2); print "so:" $2 }' \
                    | sort -u \
                    | xargs -r apk info --installed \
                    | sort -u \
    )" \
    && apk add --virtual .python-rundeps $runDeps \
    && apk del .build-deps

# Copy your application code to the container (make sure you create a .dockerignore file if any large files or directories should be excluded)
RUN mkdir /code/
WORKDIR /code/
ADD . /code/

# uWSGI will listen on this port
EXPOSE 8000

# Add any custom, static environment variables needed by Django or your settings file here:
ENV DJANGO_SETTINGS_MODULE=my_project.settings.deploy

# uWSGI configuration (customize as needed):

# Call collectstatic (customize the following line with the minimal environment variables needed for manage.py to run):
RUN DATABASE_URL=none /venv/bin/python manage.py collectstatic --noinput

# Start uWSGI
CMD ["/venv/bin/uwsgi", "--http-auto-chunked", "--http-keepalive"]

We extend from the Alpine flavor of the official Docker image for Python 3.5, copy our requirements file (or the folder containing our requirements files) to the container, and then, in a single RUN instruction, (a) install the OS dependencies needed, (b) pip install the requirements themselves (edit this line to match the location of your requirements file, if needed), (c) scan our virtual environment for any shared libraries linked to by the requirements we installed, and (d) remove the C compiler and any other OS packages no longer needed, except those identified in step (c) (this approach, using scanelf, is borrowed from the underlying 3.5-alpine Dockerfile). It’s important to keep this all in one instruction so that Docker will cache the entire operation as a single layer.
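The scanelf/awk step in (c) is the least obvious part of that RUN instruction, so here is a Python sketch of what the text transformation does. It only mimics the awk and `sort -u` stages; the sample scanelf output below is hypothetical, and unlike awk the sketch skips lines with fewer than two fields.

```python
def shared_library_names(scanelf_output):
    """Mimic the awk and `sort -u` steps: one so:<library> name per needed library."""
    names = set()
    for line in scanelf_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # nothing to extract from this line
        # The second field is a comma-separated list of needed libraries.
        for library in fields[1].split(","):
            names.add("so:" + library)
    return sorted(names)


# Hypothetical output of `scanelf --needed --nobanner --recursive /venv`.
sample_output = """\
ET_DYN libpq.so.5,libc.musl-x86_64.so.1 /venv/lib/python3.5/site-packages/psycopg2/_psycopg.so
ET_DYN libpcre.so.1,libc.musl-x86_64.so.1 /venv/bin/uwsgi
"""
run_deps = shared_library_names(sample_output)
print(run_deps)
```

In the Dockerfile, the resulting `so:` names are passed to `apk info --installed`, and whatever comes back is installed as the `.python-rundeps` virtual package so that it survives the removal of `.build-deps`.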

You’ll notice I’ve only included a minimal set of OS dependencies here. If this is an established production app, you’ll most likely need to visit https://pkgs.alpinelinux.org/packages, search for the Alpine Linux package names of the OS dependencies you need, including the -dev supplemental packages as needed, and add them to the list above.

Next, we copy our application code to the image, set some default environment variables, and run collectstatic. Be sure to change the values for DJANGO_SETTINGS_MODULE and UWSGI_WSGI_FILE to the correct paths for your application (note that the former requires a Python package path, while the latter requires a file system path). In the event you’re not serving static media directly from the container (e.g., with Whitenoise), the collectstatic command can also be removed.

Finally, the --http-auto-chunked and --http-keepalive options to uWSGI are needed in the event the container will be hosted behind an Amazon Elastic Load Balancer (ELB), because Django doesn’t set a valid Content-Length header by default, unless the ConditionalGetMiddleware is enabled. See the note at the end of the uWSGI documentation on HTTP support for further detail.

Building and testing the container

Now that you have the essentials in place, you can build your Docker image locally as follows:

docker build -t my-app .

This will go through all the commands in your Dockerfile, and if successful, store an image with your local Docker server that you could then run:

docker run -e DATABASE_URL=none -t my-app

This command is merely a smoke test to make sure uWSGI runs, and won’t connect to a database or any other external services.

Running commands during container start-up

As an optional final step, I recommend creating an ENTRYPOINT script to run commands as needed during container start-up. This will let us accomplish any number of things, such as making sure Postgres is available or running migrate or collectstatic during container start-up. Save the following to a file named docker-entrypoint.sh in the same directory as your Dockerfile:

#!/bin/sh
set -e

until psql "$DATABASE_URL" -c '\l'; do
  >&2 echo "Postgres is unavailable - sleeping"
  sleep 1
done

>&2 echo "Postgres is up - continuing"

if [ "x$DJANGO_MANAGEPY_MIGRATE" = 'xon' ]; then
    /venv/bin/python manage.py migrate --noinput
fi

if [ "x$DJANGO_MANAGEPY_COLLECTSTATIC" = 'xon' ]; then
    /venv/bin/python manage.py collectstatic --noinput
fi

exec "$@"

Next, add the following line to your Dockerfile, just above the CMD statement:

ENTRYPOINT ["/code/docker-entrypoint.sh"]

This will (a) make sure a database is available (usually only needed when used with Docker Compose), (b) run outstanding migrations, if any, if the DJANGO_MANAGEPY_MIGRATE is set to on in your environment, and (c) run collectstatic if DJANGO_MANAGEPY_COLLECTSTATIC is set to on in your environment. Even if you add this entrypoint script as-is, you could still choose to run migrate or collectstatic in separate steps in your deployment before releasing the new container. The only reason you might not want to do this is if your application is highly sensitive to container start-up time, or if you want to avoid any database calls as the container starts up (e.g., for local testing). If you do rely on these commands being run during container start-up, be sure to set the relevant variables in your container’s environment.
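For readers who find the control flow easier to follow outside of shell, the entrypoint logic can be sketched in Python as below. This is an illustration, not a replacement for the script: `wait_until` and `startup_commands` are hypothetical helper names, the probe is simulated rather than a real `psql` call, and the retry loop is bounded here whereas the shell loop retries indefinitely.

```python
import time


def wait_until(probe, retries=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True, sleeping between failed attempts."""
    for _ in range(retries):
        if probe():
            return True
        sleep(delay)
    return False


def startup_commands(env):
    """Return the manage.py commands the entrypoint would run for this environment."""
    commands = []
    if env.get("DJANGO_MANAGEPY_MIGRATE") == "on":
        commands.append(["/venv/bin/python", "manage.py", "migrate", "--noinput"])
    if env.get("DJANGO_MANAGEPY_COLLECTSTATIC") == "on":
        commands.append(["/venv/bin/python", "manage.py", "collectstatic", "--noinput"])
    return commands


# Simulated probe: the database becomes available on the third attempt.
attempts = iter([False, False, True])
database_ready = wait_until(lambda: next(attempts), sleep=lambda seconds: None)
planned = startup_commands({"DJANGO_MANAGEPY_MIGRATE": "on"})
print(database_ready, planned)
```

Only variables set exactly to `on` trigger a command, which matches the string comparison in the shell script.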

Creating a production-like environment locally with Docker Compose

To run a complete copy of production services locally, you can use Docker Compose. The following docker-compose.yml will create a barebones, ephemeral, AWS-like container environment with Postgres and Redis for testing your production environment locally.

This is intended for local testing of your production environment only, and will not save data from stateful services like Postgres upon container shutdown.

version: '2'

services:
  db:
    environment:
      POSTGRES_DB: app_db
      POSTGRES_USER: app_user
      POSTGRES_PASSWORD: changeme
    restart: always
    image: postgres:9.6
    expose:
      - "5432"
  redis:
    restart: always
    image: redis:3.0
    expose:
      - "6379"
  web:  # application service (the name itself is arbitrary)
    environment:
      DATABASE_URL: postgres://app_user:changeme@db/app_db
      REDIS_URL: redis://redis
    build:
      context: .
      dockerfile: ./Dockerfile
    links:
      - db:db
      - redis:redis
    ports:
      - "8000:8000"

Copy this into a file named docker-compose.yml in the same directory as your Dockerfile, and then run:

docker-compose up --build -d

This downloads (or builds) and starts the three containers listed above. You can view output from the containers by running:

docker-compose logs

If all services launched successfully, you should now be able to access your application at http://localhost:8000/ in a web browser.
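If the application fails to reach the database, it can help to check how the DATABASE_URL from the Compose file breaks down. The following sketch uses only the Python standard library; how your settings module actually consumes the URL is not covered in this post, so treat the `settings` dictionary as a hypothetical shape.

```python
from urllib.parse import urlsplit

# URL copied from the docker-compose.yml above.
database_url = "postgres://app_user:changeme@db/app_db"
parts = urlsplit(database_url)

settings = {
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,  # "db" is the linked Postgres service
    "port": parts.port,      # None here, since the URL does not name a port
    "name": parts.path.lstrip("/"),
}
print(settings)
```

The host part is the name of the linked Postgres service, which is why the URL works inside the Compose network but not from your host machine.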

Extra: Blocking Invalid HTTP_HOST header errors with uWSGI

To avoid Django’s Invalid HTTP_HOST header errors (and prevent any such spurious requests from taking up any more CPU cycles than absolutely necessary), you can also configure uWSGI to return an HTTP 400 response immediately without ever invoking your application code. This can be accomplished by adding a command line option to uWSGI in your Dockerfile, e.g., --route-host='^(?!www.myapp.com$) break:400' (note, the single quotes are required here, to prevent the shell from attempting to interpret the regular expression). If preferred (for example, in the event you use a different domain for staging and production), you can accomplish the same end by setting an environment variable via your hosting platform: UWSGI_ROUTE_HOST='^(?!www.myapp.com$) break:400'.
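The negative lookahead in that pattern can be tried out locally before deploying. The sketch below uses Python's `re` module as a stand-in for uWSGI's regular expression engine (an assumption that holds for this simple pattern); `is_blocked` is a hypothetical helper for illustration.

```python
import re

# Pattern copied from the --route-host example above.
BLOCK_PATTERN = re.compile(r"^(?!www.myapp.com$)")


def is_blocked(host):
    """True when the pattern matches, i.e. when the rule would answer with a 400."""
    return BLOCK_PATTERN.search(host) is not None


hosts = ("www.myapp.com", "evil.example.com", "wwwXmyapp.com")
results = {host: is_blocked(host) for host in hosts}
print(results)
```

The last host is let through because an unescaped dot matches any character; writing the pattern as `^(?!www\.myapp\.com$)` would tighten that if it matters for your deployment.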

That concludes this high-level introduction to containerizing your Python/Django app for hosting on AWS Elastic Beanstalk (EB), Elastic Container Service (ECS), or elsewhere. Each application and Dockerfile will be slightly different, but I hope this provides a good starting point for your containers. Shameless plug: If you’re looking for a simple (and at least temporarily free) way to test your Docker containers on AWS using an Elastic Beanstalk Multicontainer Docker environment or the Elastic Container Service, check out AWS Container Basics (more on this soon). Good luck!

Update 1 (March 31, 2017): There is no need for depends_on in container definitions that already include links. This has been removed. Thanks Anderson Lima for the tip!

Update 2 (March 31, 2017): Adding --no-cache-dir to the pip install command saves additional disk space, as this prevents pip from caching downloads and wheels locally. Since you won't need to install requirements again after the Docker image has been created, this can be added to the pip install command. The post has been updated. Thanks Hemanth Kumar for the tip!

Tim HopperSome Reflections on Being Turned Down for a Lot of Data Science Jobs

👉 The decision was close, but the team has decided to keep looking for someone who might have more direct neural net experience.

👉 Honestly, I think the way you communicated your thought process and results was confusing for some people in the room.

👉 He's needing someone with an image analysis background for data scientist we're hiring now.

👉 Quite honestly given your questions [about vacation policy] and the fact that you are considering other options, [we] may not be the best choice for you.

These quotes above are some of the reasons I've been given for why I wasn't offered a data science job after interviewing. I've been told a variety of other reasons as well: company decided against hiring remotes after interviewing (I've heard this at least 3 times), company thought I changed jobs too frequently, company decided it didn't have necessary data infrastructure in place for data science work. Multiple companies gave no particular reason; some of these were at least kind enough to notify me they weren't interested. One company hired someone with a Ph.D. from MIT soon after turning me down.

In the last five years, I've clearly interviewed for a lot of data science jobs, and I've also been turned down for a lot of data science jobs. I've spent a good bit of time reflecting on why I wasn't offered this job or that. Several folks have asked me if I had any advice to share on the experience, and I hope to offer that here.

You never really know

I learned this with graduate school applications years ago: you rarely truly know why you were turned down. Maybe my GRE scores weren't high enough, or maybe the reviewer rushed through my application in the 5 minutes before lunch. Maybe my statement of interest was too weak, or maybe the department needed to accept an alumnus's child.

The same goes for companies. I'm fairly skeptical that the reasons I have been given for why I was passed by are the full story, and I suspect you will rarely (if ever) know the real reasons why you weren't offered a job. I try to use the reasons I hear as a way to help me refine my skills and better present myself, but I don't put too much weight in them.

Some advice anyway

That said, here are a few takeaways from interviewing for probably 20 data science jobs since 2012.

  • Companies often use interviews as a time to figure out what they're really looking for. I suspect this is rarely intentional. But actually interviewing candidates forces a team to talk through what they're actually looking for, and they often realize they had differing perspectives prior to the interview.
  • Companies where "data science" is a new addition need your help in understanding what data science can do for them. As much as possible, use the interview to sell your vision for what data science can offer at the company, how you'll get it off the ground, and what the ROI might be.
  • Being the wrong fit for what a company needs is not ideal. I've come to appreciate a company trying to ensure my abilities align with their needs. You'd hope this was always the case, but I've been hired when it wasn't. That said, I hesitate to say you should always look for this: if you need a job, and someone offers you a job, you should feel free to take it!
  • Data infrastructure is important and many companies are lacking it. Many data scientists can attest to being hired at a company only to discover the data they needed wasn't available, and they spent months or years building the tools required for them to start their analysis. Many companies are naive about how much engineering effort is required for effective data science. Don't assume that a company with a grand vision for data science necessarily knows what it will take to accomplish that vision.
  • Many companies are still uneasy about data science being done remotely. I think this is silly, but I'm biased.
  • There's little consistency as to what you might be asked in a data science interview. I've been asked about Java design patterns, how to solve combinatorics problems, to describe my favorite machine learning model, to explain the SMO algorithm, my opinions about the TensorFlow API, how I do software testing, to analyze a never-before-seen dataset and prepare a presentation in a 4 hour window, the list goes on. I spent a flight to the west coast reading up on the statistics of A/B testing only to be asked largely soft-skills type questions for an entire interview. I've largely given up attempting any special preparation for interviews.
  • Networking is still king. Hiring is hard, and interviewing is hard; having a prior relationship with an applicant is attractive and reduces hiring uncertainty. In my own experience, my friendships and connections with the data science community on Twitter have shaped my career. Don't downplay the benefits of networking.


So how do you get a data science job? I don't know.

I've been unbelievably fortunate to be continuously employed since college, but I'm not sure how to tell you to repeat that. The best I have to offer is to reiterate the conclusion of my recent talk about data science as a career. Learn and know the hard stuff: linear algebra, probability, statistics, machine learning, math modeling, data structures, algorithms, distributed systems, etc. You probably won't use this knowledge every day in your job, but interviewers love to ask about it anyway.

At the same time, don't forget about the even harder skills: communication, careful thought, prose writing skill, software writing skill, software engineering, tenacity, Stack Overflow. You will use these every day in your job, and they'll help you present yourself well in an interview.[1]

Further Reading

  1. With the exception of Stack Overflow. Using Stack Overflow in an interview is strangely taboo. 

Philip SemanchukThanks to PYPTUG!

Thanks to PYPTUG, the Python Piedmont Triad Users Group, for inviting me to speak at their monthly meeting last night!

I gave a slightly expanded version of the talk I gave at PyData Carolinas 2016 about connecting Python to compiled languages like C, C++, and Fortran. (Slides from that talk are here.) I appreciate the time and attention of everyone who attended last night, especially Francois Dion for organizing and reminding us of some of the interesting new things in Python 3.5 and 3.6.

Last night’s talk wasn’t recorded, but you can see the version of the talk I gave at PyData at https://www.youtube.com/watch?v=aUSokzzsEko.


Caktus GroupOpening External Links: Same Tab or New?

The Debate

My teammates and I recently engaged in a spirited debate over whether outbound links (links to external websites) should open in the same or in a new tab. “Same tab” was a default behavior for a set of external links on a project we were working on. A suggestion had been made, however, that the behavior be changed.

Two main arguments emerged: some of us felt that opening outbound links in a new tab was a behavior so well established that most users expected it, and it was therefore a better experience. Others suggested that users who prefer a “new tab” behavior can always opt for it (by using a right-click or a keyboard shortcut), but that forcing this behavior on all users constituted hijacking their browsing experience.

In Search of Answers on WWW

Reminding myself that my own experience is not the experience of the users for whom we build web applications, I set out to search for information available on the topic online, and then conducted guerrilla usability testing with a small group of users similar to the user base of the website around which the debate had started.

My online research revealed the following:

  • Marketing communities advocate for opening external links in a new tab. The primary argument is to prevent users from leaving the website. An increased page bounce rate can negatively affect page ranking, hence allowing people to leave a page is considered bad SEO practice.
  • User experience and web development communities advocate for opening external links in the same tab. The arguments include:
    • Anything that takes control away from the user is bad experience; users should be able to decide whether or not they want to open a link in a new tab;
    • If users do not know that a given link will take them to an external website, they may get disoriented when the website opens in a new tab;
    • Opening any link in a new tab may present an accessibility issue (it breaks the workflow for users who browse the web leveraging assistive technologies);
    • Opening a new tab on a smartphone makes it difficult for the user to return to the original website.
  • User opinions I have found on the Internet included preferences for either behavior:
    • Some users prefer the “new tab” behavior. That is usually the case when external links serve as reference material in support of the main article a user may be reading. In those cases opening external links in a new tab makes it easier for the user to return to the original article and to continue reading. It becomes even more important if the user continues browsing deeper by following links from external websites they have already opened. Opening external links in the same tab may, in those cases, lead to a so-called “back-button fatigue” if the user is forced to click the browser's back button multiple times in order to return to the website where their browsing began.
    • Some users prefer the “same tab” behavior. They find the experience to be seamless when they can browse back and forth between external websites and the website of origin by clicking the browser’s back button. They also argue that with the default behavior set to “same tab,” users who prefer “new tab” behavior can still achieve it through a keyboard shortcut or a right-click. Users who prefer “same tab” behavior, however, have no recourse if the “new tab” behavior is imposed on them.
  • Within UX and accessibility communities, some proponents of the “same tab” behavior are willing to make an exception and allow for external links to open in a new tab as long as outbound links are clearly marked as external (for example, through a use of text or an appropriate icon).
  • There is a security concern related to opening links in a new tab. The target="_blank" attribute may leave users open to a phishing attack unless it is accompanied by a rel="noopener" attribute.
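For teams that do open links in a new tab, the security point in the last item can be checked mechanically. Here is a minimal sketch using Python's standard `html.parser`; the class name and the sample markup are hypothetical, and a real audit would run against your rendered pages.

```python
from html.parser import HTMLParser


class UnsafeBlankLinkFinder(HTMLParser):
    """Collect hrefs of <a target="_blank"> links whose rel lacks "noopener"."""

    def __init__(self):
        super().__init__()
        self.unsafe_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        opens_new_tab = (attributes.get("target") or "").lower() == "_blank"
        if opens_new_tab and "noopener" not in rel_tokens:
            self.unsafe_links.append(attributes.get("href"))


# Hypothetical markup with one unsafe link, one safe link, and one same-tab link.
sample_html = """
<a href="https://example.com/a" target="_blank">unsafe</a>
<a href="https://example.com/b" target="_blank" rel="noopener noreferrer">safe</a>
<a href="https://example.com/c">same tab</a>
"""
finder = UnsafeBlankLinkFinder()
finder.feed(sample_html)
print(finder.unsafe_links)
```

Only the first link is reported, since the second carries rel="noopener" and the third opens in the same tab.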

Informal Poll Results

In addition to researching information published online, I also polled my peers in UX online communities. While I found a strong, collective, professional advocacy for the “same tab” behavior, individual preferences of UX professionals as users were split between the two behaviors. People cited the same arguments for and against each behavior that I had summarized from my Internet search.

Guerrilla User Testing

In the end, for us UX-ers, it’s all about the user. Given the range of opinions on the matter, I thought a quick and dirty usability test would help answer the question about what makes sense to the type of user that visits the website my teammates and I were working on.

I tested five users. Three out of five either expressed an expectation of or demonstrated a preference for external web pages getting opened in a new tab. Two of them explicitly chose to open outbound links in a new tab once they realized that by default the links opened pages in the same tab. None of these users were bothered by the “same tab” behavior. They did, however, note that on many other websites external links open pages in a new tab by default. The remaining two users reported not thinking about or even noticing that the pages opened in the same tab. All five users browsed seamlessly from linked pages back to the original page using the browser’s back button.


The qualitative results of my inquiry suggest that user preferences are split between the “same tab” and “new tab” behaviors. From the user experience perspective, the strongest arguments (in my opinion) in favor of opening outbound links in the same tab lie in accessibility and mobile use considerations. For most stakeholders, I suspect, the SEO argument outweighs any reasoning that stems from user experience. However, as the push for accessible websites grows, along with the share of users who access content on mobile devices, stakeholders may find themselves between a rock and a hard place, trying to strike a balance between attracting a new audience through SEO efforts and retaining existing users by tending to their accessibility and mobile browsing needs.

Finally, marketing as well as UX and web development communities may consider giving up the struggle for a final answer. A decision about opening external links in the same or in a new tab may have to be made on a project by project basis by finding the right balance between business and user value-add.

External resources

(Websites linked below will open in a new tab.)

Caktus GroupPython type annotations

When it comes to programming, I have a belt and suspenders philosophy. Anything that can help me avoid errors early is worth looking into.

The type annotation support that's been gradually added to Python is a good example. Here's how it works and how it can be helpful.


The first important point is that the new type annotation support has no effect at runtime. Adding type annotations in your code has no risk of causing new runtime errors: Python is not going to do any additional type-checking while running.
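
As a minimal sketch of that point (the function greet is a made-up example, not from any library), Python stores the annotations on the function but does not enforce them when the function is called:

```python
def greet(name: str) -> str:
    # str() is used so the body itself works for any argument type
    return "Hello, " + str(name)

# The annotation says "str", but passing an int raises nothing at runtime.
result = greet(42)

# The annotations are only recorded as metadata on the function object.
annotations = greet.__annotations__
print(result)
print(annotations)
```

A static checker such as mypy would flag the call greet(42); the interpreter itself does not.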

Instead, you'll be running separate tools to type-check your programs statically during development. I say "separate tools" because there's no official Python type checking tool, but there are several third-party tools available.

So, if you choose to use the mypy tool, you might run:

$ mypy my_code.py

and it might warn you that a function that was annotated as expecting string arguments was going to be called with an integer.

Of course, for this to work, you have to be able to add information to your code to let the tools know what types are expected. We do this by adding "annotations" to our code.

One approach is to put the annotations in specially-formatted comments. The obvious advantage is that you can do this in any version of Python, since it doesn't require any changes to the Python syntax. The disadvantages are the difficulties in writing these things correctly, and the coincident difficulties in parsing them for the tools.

To help with this, Python 3.0 added support for adding annotations to functions (PEP-3107), though without specifying any semantics for the annotations. Python 3.6 adds support for annotations on variables (PEP-526).
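
For illustration, here is a rough sketch of the two syntaxes side by side, assuming Python 3.6 or later. The names scale and Point are hypothetical examples:

```python
# Function annotations (PEP 3107): parameters and return value
def scale(value: float, factor: float) -> float:
    return value * factor

# Variable annotations (PEP 526): attributes declared in a class body
class Point:
    x: float          # annotated, but no value is assigned
    y: float = 0.0    # annotated and given a default value

print(scale.__annotations__)
print(Point.__annotations__)
```

Note that annotating x without assigning a value does not create the attribute; it only records the intended type.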

Two additional PEPs, PEP-483 and PEP-484, define how annotations can be used for type-checking.

Since I try to write all new code in Python 3, I won't say any more about putting annotations in comments.

Getting started

Enough background, let's see what all this looks like.

Python 3.6 was just released, so I’ll be using it. I'll start with a new virtual environment, and install the type-checking tool mypy (whose package name is mypy-lang):

$ virtualenv -p $(which python3.6) try_types
$ . try_types/bin/activate
$ pip install mypy-lang

Let's see how we might use this when writing some basic string functions. Suppose we're looking for a substring inside a longer string. We might start with:

def search_for(needle, haystack):
    offset = haystack.find(needle)
    return offset

If we were to call this with anything that's not text, we'd consider it an error. To help us avoid that, let's annotate the arguments:

def search_for(needle: str, haystack: str):
    offset = haystack.find(needle)
    return offset

Does Python care about this?

$ python search1.py

Python is happy with it. There's not much yet for mypy to check, but let's try it:

$ mypy search1.py

In both cases, no output means everything is okay.

(Aside: mypy uses information from the files and directories on its command line plus all packages they import, but it only does type-checking on the files and directories on its command line.)

So far, so good. Now, let's call our function with a bad argument by adding this at the end:

search_for(12, "my string")

If we tried to run this, it wouldn't work:

$ python search2.py
Traceback (most recent call last):
    File "search2.py", line 4, in <module>
        search_for(12, "my string")
    File "search2.py", line 2, in search_for
        offset = haystack.find(needle)
TypeError: must be str, not int

In a more complicated program, we might not have run that line of code until sometime when it would be a real problem, and so wouldn't have known it was going to fail. Instead, let's check the code immediately:

$ mypy search2.py
search2.py:4: error: Argument 1 to "search_for" has incompatible type "int"; expected "str"

Mypy spotted the problem for us and explained exactly what was wrong and where.

We can also indicate the return type of our function:

def search_for(needle: str, haystack: str) -> str:
    offset = haystack.find(needle)
    return offset

and ask mypy to check it:

$ mypy search3.py
search3.py: note: In function "search_for":
search3.py:3: error: Incompatible return value type (got "int", expected "str")

Oops, we're actually returning an integer but we said we were going to return a string, and mypy was smart enough to work that out. Let's fix that:

def search_for(needle: str, haystack: str) -> int:
    offset = haystack.find(needle)
    return offset

And see if it checks out:

$ mypy search4.py

Now, maybe later on we forget just how our function works, and try to use the return value as a string:

x = len(search_for('the', 'in the string'))

Mypy will catch this for us:

$ mypy search5.py
search5.py:5: error: Argument 1 to "len" has incompatible type "int"; expected "Sized"

We can't call len() on an integer. Mypy wants something of type Sized -- what's that?

More complicated types

The built-in types will only take us so far, so Python 3.5 added the typing module, which both gives us a bunch of new names for types, and tools to build our own types.

In this case, typing.Sized represents anything with a __len__ method, which is the only kind of thing we can call len() on.
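
To make that concrete, here is a small sketch. The Deck class and size_of function are hypothetical names invented for this example. Any class that defines __len__ counts as Sized, while an int does not:

```python
from typing import Sized

class Deck:
    """A made-up container that supports len()."""
    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):
        return len(self._cards)

def size_of(thing: Sized) -> int:
    # Accepts anything with a __len__ method
    return len(thing)

print(size_of(Deck(["ace", "king", "queen"])))
print(isinstance(12, Sized))
```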

Let's write a new function that'll return a list of the offsets of all of the instances of some string in another string. Here it is:

from typing import List

def multisearch(needle: str, haystack: str) -> List[int]:
    # Not necessarily the most efficient implementation
    offset = haystack.find(needle)
    if offset == -1:
        return []
    return [offset] + multisearch(needle, haystack[offset+1:])

Look at the return type: List[int]. You can define a new type, a list of a particular type of elements, by saying List and then adding the element type in square brackets.

There are a number of these - e.g. Dict[keytype, valuetype] - but I'll let you read the documentation to find these as you need them.
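
As one more illustration, a Dict annotation might look like the sketch below. The count_words function is a hypothetical example, not something from the typing documentation:

```python
from typing import Dict

def count_words(text: str) -> Dict[str, int]:
    # Keys are words (str), values are how often each word appears (int)
    counts: Dict[str, int] = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words("the cat and the hat"))
```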

mypy passed the code above, but suppose we had accidentally had it return None when there were no matches:

def multisearch(needle: str, haystack: str) -> List[int]:
    # Not necessarily the most efficient implementation
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return [offset] + multisearch(needle, haystack[offset+1:])

mypy should spot that there's a case where we don't return a list of integers, like this:

$ mypy search6.py

Uh-oh - why didn't it spot the problem here? It turns out that by default, mypy considers None compatible with everything. To my mind, that's wrong, but luckily there's an option to change that behavior:

$ mypy --strict-optional search6.py
search6.py: note: In function "multisearch":
search6.py:7: error: Incompatible return value type (got None, expected List[int])

I shouldn't have to remember to add that to the command line every time, though, so let's put it in a configuration file just once. Create mypy.ini in the current directory and put in:

[mypy]
strict_optional = True

And now:

$ mypy search6.py
search6.py: note: In function "multisearch":
search6.py:7: error: Incompatible return value type (got None, expected List[int])

But speaking of None, it's not uncommon to have functions that can either return a value or None. We might change our search_for method to return None if it doesn't find the string, instead of -1:

def search_for(needle: str, haystack: str) -> int:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return offset

But now we don't always return an int and mypy will rightly complain:

$ mypy search7.py
search7.py: note: In function "search_for":
search7.py:4: error: Incompatible return value type (got None, expected "int")

When a method can return different types, we can annotate it with a Union type:

from typing import Union

def search_for(needle: str, haystack: str) -> Union[int, None]:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return offset

There's also a shortcut, Optional, for the common case of a value being either some type or None:

from typing import Optional

def search_for(needle: str, haystack: str) -> Optional[int]:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return offset
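
On the calling side, an Optional return value usually means checking for None before using the result. A minimal sketch follows; the describe_match function is a hypothetical caller added for illustration:

```python
from typing import Optional

def search_for(needle: str, haystack: str) -> Optional[int]:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return offset

def describe_match(needle: str, haystack: str) -> str:
    offset = search_for(needle, haystack)
    # Handle the None case first; after this check the value is an int
    if offset is None:
        return "not found"
    return "found at offset {}".format(offset)

print(describe_match("the", "in the string"))
print(describe_match("xyz", "in the string"))
```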

Wrapping up

I've barely touched the surface, but you get the idea.

One nice thing is that the Python libraries are all annotated for us already. You might have noticed above that mypy knew that calling find on a str returns an int - that's because str.find is already annotated. So you can get some benefit just by calling mypy on your code without annotating anything at all -- mypy might spot some misuses of the libraries for you.

For more reading:

Tim HopperLogistic Regression Rules Everything Around Me

Fred Benenson spent 6 years doing data science at Kickstarter. When he left last year, he wrote a fantastic recap of his experience.

His "list of things I've discovered over the years" is particularly good. Here are a few of the things that resonated with me:

  • The more you can work with someone to help refine their question the easier it will be to answer
  • Conducting a randomized controlled experiment via an A/B test is always better than analyzing historical data
  • Metrics are crucial to the story a company tells itself; it is essential to honestly and rigorously define them
  • Good experimental design is difficult; don't allow a great testing framework to let you get lazy with it
  • Data science (A/B testing, etc.) can help you how to optimize for a particular outcome, but it will never tell you which particular outcome to optimize for
  • Always seek to record and attain data in its rawest form, whether you're instrumenting something yourself or retrieving it from an API
I highly recommend reading the whole post.

    Philip SemanchukPandas Surprise


    Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.


    A photo of the many shapes and colors in Lucky Charms cereal. Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.

    I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –

           circle square star
    blue        8     41   18
    orange      5     33   25
    red        53     64   58

    At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –

    import collections
    SHAPES = ('square', 'circle', 'star', )
    frequencies = {shape: collections.Counter() for shape in SHAPES}

    Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).

    for shape, color in all_my_objects:
        frequencies[shape][color] += 1

    So far, so good.
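
For anyone who wants to run the counting step on its own, here is a self-contained sketch. The sample list all_my_objects is invented data for illustration:

```python
import collections

SHAPES = ('square', 'circle', 'star')

# Hypothetical sample data: 2-tuples of (shape, color)
all_my_objects = [
    ('square', 'red'),
    ('circle', 'blue'),
    ('square', 'red'),
    ('star', 'orange'),
]

frequencies = {shape: collections.Counter() for shape in SHAPES}
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

# A Counter reports 0 for keys it has never seen, so no KeyError here
print(frequencies['square']['red'])
print(frequencies['circle']['orange'])
```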

    Enter the Pandas

    This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

    It was especially easy to try out a DataFrame because my counting loop ( for...all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

    import pandas as pd
    frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
    for shape, color in all_my_objects:
        frequencies[shape][color] += 1

    It Works, But…

    Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

    How Slow is it?

    To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

    First, I timed how long it takes to increment a simple variable, just to get a baseline.

    Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.

    Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

    Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.
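
A rough sketch of this kind of comparison, limited to the two standard-library cases, might look like the following. The helper normalized_timings is a hypothetical name, and the number argument is set well below timeit's default so that it finishes quickly:

```python
import timeit

def normalized_timings(statements, number=10000):
    """Time each (statement, setup) pair and scale so the fastest equals 1."""
    raw = {
        label: timeit.timeit(stmt, setup=setup, number=number)
        for label, (stmt, setup) in statements.items()
    }
    fastest = min(raw.values())
    return {label: seconds / fastest for label, seconds in raw.items()}

results = normalized_timings({
    "Simple variable": ("data += 1", "data = 0"),
    "Dict + Counter": (
        "data[0][0] += 1",
        "from collections import Counter; data = {0: Counter()}",
    ),
})

for label, ratio in results.items():
    print(label, round(ratio, 3))
```

The absolute numbers depend on the machine, so only the ratios are worth comparing.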

    Raw Benchmark Results

    Here’s what timeit showed me. Sorry for the cramped formatting.

    $ python
     Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
     [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import timeit
     >>> timeit.timeit('data += 1', setup='data=0')
     >>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
     >>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
     >>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')

    Benchmark Results Summary

    Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.

                        Actual (seconds)   Normalized
    Simple variable                0.092        1
    Dict + Counter                 0.683        7.398
    Numpy 2D array                 0.890        9.639
    Pandas DataFrame             157.564     1704.784

    As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

    The DataFrame, however, is roughly 175 to 230 times slower than either of those two methods. Ouch!

    I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

    Here’s a bar chart of the first three methods –

    A bar chart of the first three methods in the preceding table

    Here’s a bar chart of all four –

    A bar chart of all four methods in the preceding table

    Why Is My DataFrame Access So Slow?

    One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –

    >>> SHAPES = ('square', 'circle', 'star', )
    >>> COLORS = ('red', 'blue', 'orange')
    >>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
            square  circle  star
    red          0       0     0
    blue         0       0     0
    orange       0       0     0

    Then frequencies['square']['orange'] is a valid reference.

    Not only that, DataFrames support a variety of indexing and slicing options including –

    • A single label, e.g. 5 or 'a'
    • A list or array of labels ['a', 'b', 'c']
    • A slice object with labels 'a':'f'
    • A boolean array
    • A callable function with one argument

    Here are those techniques applied in order to the frequencies DataFrame so you can see how they work –

    >>> frequencies['star']
    red       0
    blue      0
    orange    0
    Name: star, dtype: int64
    >>> frequencies[['square', 'star']]
            square  star
    red          0     0
    blue         0     0
    orange       0     0
    >>> frequencies['red':'blue']
          square  circle  star
    red        0       0     0
    blue       0       0     0
    >>> frequencies[[True, False, True]]
            square  circle  star
    red          0       0     0
    orange       0       0     0
    >>> frequencies[lambda x: 'star']
    red       0
    blue      0
    orange    0
    Name: star, dtype: int64

    This flexibility has a price. Slicing (which is what is invoked by the square brackets) calls an object’s __getitem__() method. The parameter to __getitem__() is whatever was inside the square brackets. A DataFrame’s __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.

    If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there are two slice operations, so __getitem__() gets invoked twice each time I increment my counter.
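
To illustrate the idea only, here is a toy container whose __getitem__() has to inspect its argument before it can do any work. This is not how Pandas is implemented; the class LabeledValues is a hypothetical stand-in written for this sketch:

```python
class LabeledValues:
    """Toy labeled container; every lookup pays for the type checks below."""

    def __init__(self, labels, values):
        self._labels = list(labels)
        self._values = list(values)

    def __getitem__(self, key):
        if callable(key):
            # A callable is invoked and its result is looked up again
            return self[key(self)]
        if isinstance(key, list) and key and all(isinstance(k, bool) for k in key):
            # A boolean mask keeps the values where the mask is True
            return [v for v, keep in zip(self._values, key) if keep]
        if isinstance(key, list):
            # A list of labels returns the matching values in order
            return [self._values[self._labels.index(k)] for k in key]
        # Otherwise treat the key as a single label
        return self._values[self._labels.index(key)]

row = LabeledValues(['red', 'blue', 'orange'], [53, 8, 5])
print(row['blue'])
print(row[['red', 'orange']])
print(row[[True, False, True]])
print(row[lambda r: 'red'])
```

Even the simplest lookup, row['blue'], runs through three checks before reaching the line that does the work.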

    This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.

    Better DataFrame Indexing?

    DataFrames support a few accessors that exist explicitly to support “fast” getting and setting of scalars. Those accessors are .at (for label lookups) and .iat (for integer-based index lookups), both used with square brackets. DataFrames also provide get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).

    “Fast” is how the Pandas documentation describes these accessors. Let’s use timeit to get some hard data. I’ll try .at and .iat; I’ll also try get_value()/set_value() even though they’re deprecated.

    >>> timeit.timeit("data.at['red','square']+=1",setup="import pandas as pd;data=pd.DataFrame(columns=('square','circle','star'),index=('red','blue','orange'),data=0,dtype='int')")
    >>> timeit.timeit('data.iat[0,0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
    >>> timeit.timeit('data.set_value(0,0,data.get_value(0,0)+1)',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')

    These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].

                           Actual (seconds)   Normalized
    Dict + Counter                    0.683        1
    Numpy 2D array                    0.890        1.302
    DataFrame (get/set)              15.050       22.009
    DataFrame (at)                   36.331       53.130
    DataFrame (iat)                  42.015       61.441
    DataFrame (naïve)               157.564      230.417

    The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.


    I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.

    I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.

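    import collections
    import pandas as pd

    SHAPES = ('square', 'circle', 'star', )
    # Count with plain dicts and Counters first...
    frequencies = {shape: collections.Counter() for shape in SHAPES}
    for shape, color in all_my_objects:
        frequencies[shape][color] += 1
    # ...then hand the finished tallies to a DataFrame in one step
    frequencies = pd.DataFrame(data=frequencies)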

    The frequencies DataFrame now looks something like this –

             circle square star
     blue         8     41   18
     orange       5     33   25
     red         53     64   58

    The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.

    There’s one more detail to be aware of. If a particular (shape, color) combination doesn’t appear in my data, it will be represented by NaN in the DataFrame. They’re easy to set to 0 with frequencies.fillna(0).


    What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but it made my counting loop roughly 230 times slower than the Dict + Counter version (and ~1700 times slower than incrementing a plain variable). Since I had thousands of items to process, the difference was hard to overlook!

    Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.

    Caktus GroupCaktus Attends Wagtail CMS Sprint in Reykjavik

    Caktus CEO Tobias McNulty and Sales Engineer David Ray recently had the opportunity to attend a development sprint for the Wagtail Content Management System (CMS) in Reykjavik, Iceland. The two-day software development sprint attracted 15 attendees hailing from a total of 5 countries across North America and Europe.

    Wagtail sprinters in Reykjavik

    Wagtail was originally built for the Royal College of Art by UK firm Torchbox and is now one of the fastest-growing open source CMSs available. Being longtime champions of the Django framework, we’re also thrilled that Wagtail is Django-based. This makes Wagtail a natural fit for content-heavy sites that might still benefit from the customization made possible through the CMS’ Django roots.

    Tobias & Tom in Reykjavik

    The team worked on a wide variety of projects, including caching optimizations, an improved content model, a new React-based page explorer, the integration of a new rich-text editor (Draft.js), performance enhancements, other new features, and bug fixes.

    David & Scot in Reykjavik

    Team Wagtail Bakery stole the show with a brand-new demo site that’s visually appealing and better demonstrates the level of customization afforded by the Wagtail CMS. The new demo site, which is still in development as of the time of this post, can be found at wagtail/bakerydemo on GitHub.

    Wagtail Bakery on laptop screen

    After the sprint was over, our hosts at Overcast Software were kind enough to take us on a personalized tour of the countryside around Reykjavik. We left Iceland with significant progress on a number of Wagtail pull requests, new friends, and a new appreciation for the country's magical landscapes.

    Wagtail sprinters on road trip, in front of waterfall

    We were thrilled to attend and are delighted to be a part of the growing Wagtail community. If you're interested in participating in the next Wagtail sprint, it is not far away. Wagtail Space is taking place in Arnhem, The Netherlands, March 21st-25th and is being organized to accommodate both local and remote sprinters. We hope to connect with you then!

    Caktus GroupHow to write a bug report

    Here are some brief thoughts on writing good bug reports in general.

    Main elements

    There are four crucial elements when writing a bug report:

    • What did you do
    • What did you see
    • What did you expect to see
    • Why did you expect to see that

    What did you do

    This is sometimes called "Steps to reproduce".

    The purpose of this part is so the person trying to fix the bug can reproduce it. If they can't reproduce it, they probably can't fix it.

    The most common problem here is not enough detail.

    To help avoid that, it's a good idea to write this as though the person reading it knows nothing about the application or site that you ran into the problem on.

    Use words like "typed" and "clicked", not "I did such-and-such task".

    Say what you did, not what you meant. Use words like "typed" and "clicked", not "chose" or "selected" or "tried to".

    Good starting points: operating system name and version, browser name and version, what URLs you visited (exactly), what you typed and clicked. Pretend you're walking someone through what you did.


    I'm running Ubuntu 16.04.1 with Gnome desktop, using Chrome 54.0.2840.90 (64-bit).
    I recreated this in an incognito window (no extensions).
    I typed "https://www.example.com" into the address bar.
    Then I clicked the "Help" link in the top right.

    What did you see

    This is the obvious bit to include.

    Again, more detail is better. In particular, the exact wording of any messages - copy and paste if you can. A message that sounds generic to you might mean something very important to a developer trying to figure out the bug, if only they know the exact message you saw.

    Focus here on exactly what you saw, and not on interpreting it. If it's relevant, provide screenshots, labeling them to match the steps for reproducing the problem. If the problem is an observed behavior, still try to describe it in terms of what you saw in each step, and not how you interpret what you saw. Or provide a video of the bug happening.

    And if it doesn’t happen every time, be sure to say so, with a rough idea of how frequently you see it when you do the same things.


    After clicking "Help", a page loaded with URL
    "https://www.example.com/about" and the title "About WidgetCo".
    When I click the “Close” button, about one time in ten, the window doesn’t close.

    What did you expect to see

    This is often overlooked. It's surprising how often I get a report "the site did X" and my reaction is "well, the site is supposed to do X, what's the problem?" Then we have to go back and forth trying to figure out why the person submitted the bug report.

    It's much better to include this explicitly, even if it seems obvious to you.


    I expected to see a page with a title like "Help" or
    "Using this web site".

    Why did you expect to see that?

    This is the other often overlooked part.

    This can help save a lot of wasted time when it turns out that there's a typo in the documentation, or the user is missing some other part of the requirements, etc.

    A documentation error or unclear requirements are just as much bugs as broken code, but it's nice to zero in on what the problem is sooner rather than later.

    Did you expect to see "Y" because requirement 1.2.3 said it should happen? Was it on page N of version 2.1 of the user manual? Did the site help at URL xxxxx say something that led you to expect "Y"? Or maybe your officemate told you it worked that way.

    Another benefit is that in trying to find an authority on what was supposed to happen, you might discover that you misunderstood something and what you're seeing isn't a bug at all. Or you realize that what you thought was an authority really isn't.


    In my experience with other sites, links named "Help" go to
    pages with help information for using the site, not pages
    with information about the company.

    Other comments (optional)

    You can offer other information that you think would be helpful, but please do it separately from the previous elements - keep the facts separate from the opinions - and keep it concise.


    This looks to me like it's probably just the wrong link.
    Let me know if I can help test a fix or anything.
    My wife’s cousin’s girlfriend says it might be the frangistan coupling.

    Exceptions to the rules

    None of this is carved in stone. For example, if starting the application caused my laptop to hang so hard that I had to power it down, I can probably omit describing what I expected to see and why.

    More detail is generally better, but keep in mind that developers are human too, and probably won’t read the whole bug report carefully if it looks overly long.

    General tips

    Be very clear whether you are describing unexpected behavior, or asking for a change in behavior. Surprisingly, in some "bug reports", you can't really tell which the user means.

    Some phrases that should probably never appear in a bug report:

    • XXX didn't work
    • XXX doesn't work
    • XXX needs fixing
    • XXX should do YYY
    • XXX looks wrong

    Many applications and tools, and less often websites, have specific instructions on how they'd like bug reports to be submitted, what information is most helpful to include, etc. Look for and follow those instructions.

    Don't be emotional. If you're really annoyed about some behavior that's blocking your work, that's perfectly understandable. But it'll be more productive to take some time to cool down, then stick to the facts in your bug report.

    If you know that this behavior has changed - maybe this exact function worked for you in version 1.23 - then mention it in your comments. That kind of information is extremely helpful.

    If you haven't tried this on the most recent version of things, try it there. It might already be fixed.


    The most distinctive part of what I've described here is making sure to say why you expected what you expected. I know I saw that in a "how to write a good bug report" somewhere before, probably on the web, but it's been a long, long time. If anyone recognizes where that came from, please let me know in the comments.

    Meanwhile, here are some other pages that seem particularly good and go into more detail about several of these points.

    Tim HopperHow I Quit My Ph.D. and Learned to Love Data Science

    I recently gave a talk to the Duke Big Data Initiative entitled Dr. Hopper, or How I Quit My Ph.D. and Learned to Love Data Science. The talk was well received, and my slides seemed to resonate in the Twitter data science community.

    I've started a long-form blog post with the same message, but it's not done yet. In the meantime, I wanted to share the slides that went along with the talk.

    Philip SemanchukCoercing Objects to Integer, Revisited


    I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.


    My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.

    Here’s the code I suggested –

    def int_or_else(value, else_value=None):
        """Given a value, returns the value as an int if possible.
        If not, returns else_value which defaults to None.
        """
        try:
            return int(value)
        # I don't like catch-all excepts, but since objects can raise arbitrary
        # exceptions when executing __int__(), then any exception is
        # possible here, even if only TypeError and ValueError are
        # really likely.
        except Exception:
            return else_value

    Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

    The Two Parts of Exception Handling

    In Python, there are two parts to exception handling: what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

    The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.

    Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That other blog post – that was not it. In fact, it was the opposite. I gave the topic of processing the trapped exception as little attention as possible so as not to detract attention from what I wanted to be the main topic (what exceptions need to be trapped).

    That backfired.


    My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

    In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

    Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

    Caktus GroupHow to make a jQuery

    Learn to live without jQuery by learning how to clone it

    jQuery is one of the earliest libraries every web developer learns, and it is often someone's first experience with programming of any sort. It provides a very safe cushion between a developer and the rough edges of web development. But it can also obscure learning Javascript itself and learning what web APIs are capable of without the abstraction that jQuery adds over them.

    jQuery came about at a time when it was very much needed. Limitations in browsers and differences between them created enormous hardships for developers. These days, the landscape is very different, and everyone should consider holding off on adding jQuery to their projects until absolutely necessary. Forgoing it encourages you to learn the Javascript language on its own, not just as a tool to employ one massive library. You will be exposed to the native APIs of the web, to better understand the things you’re doing. This improved understanding gives you a chance to be more directly exposed to new and changing web standards.

    Let’s learn to recreate the most helpful parts of jQuery piece by piece. This exercise will help you learn what it actually does under the hood. When you know how those features work, you can find ways to avoid using jQuery for the most common cases.

    Selecting Elements on the Page

    The first and most prominent jQuery feature is selecting one or several elements from the page based on CSS selectors. When jQuery was first dropped on our laps, this was a mildly revolutionary ability to easily locate a piece of your page by a reliable, understandable address. In jQuery, selection looks like this:


    The power of the simple jQuery selection has since been adapted into a standard pair of document methods: querySelector() and querySelectorAll(). These take CSS selectors like jQuery does and give you either the first matching element or an array-like list of all of them, but that list isn’t as powerful as a jQuery set, so let’s replicate what jQuery does by smartening up the results a bit.

    Simply wrapping querySelectorAll() is trivial. We'll call our little jQuery clone njq(), short for "Not jQuery", and use it the way you would use $().

    function njq(selector) {
        return document.querySelectorAll(selector)
    }

    And now we can use njq() just like jQuery for selections.


    But, of course, jQuery gives us a lot more than this, so a simple wrapper won't do. To really match its power we need to add a few things:

    • Default to an empty set of elements
    • Wrap an original HTML element object if we're given one
    • Wrap the results such that we can attach behaviors to them

    These simple additions give us a more robust example of what jQuery can do.

    var empty = $() // getting an empty set
    var html = $('<h2>test</h2>') // from an HTML snippet
    var wrapped = $(an_html_element) // wrapping an HTML Element object
    wrapped.hide() // using attached behaviors, in this case calling hide()

    So let's add these abilities. We'll implement the empty set version, wrapping Element objects, accepting arrays of elements, and attaching extra methods. We'll start by adding one of the most useful jQuery methods: the each() method used to loop over all the elements it holds.

    function njq(arg) {
        let results
        if (typeof arg === 'undefined') {
            results = []
        } else if (arg instanceof Element) {
            results = [arg]
        } else if (typeof arg === 'string') {
            // If the argument looks like HTML, parse it into a DOM fragment
            if (arg.startsWith('<')) {
                let fragment = document.createRange().createContextualFragment(arg)
                results = [fragment]
            } else {
                // Convert the NodeList from querySelectorAll into a proper Array
                results = Array.prototype.slice.call(document.querySelectorAll(arg))
            }
        } else {
            // Assume an array-like argument and convert to an actual array
            results = Array.prototype.slice.call(arg)
        }
        // Swap the prototype so the set picks up our extra methods
        results.__proto__ = njq.methods
        return results
    }
    // Inherit from Array.prototype so sets keep the standard array methods
    njq.methods = Object.create(Array.prototype)
    njq.methods.each = function(func) {
        Array.prototype.forEach.call(this, func)
    }
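
    The last step of njq(), swapping the prototype of a plain array, is what turns a list of elements into a set with its own methods. Here is a minimal sketch of that idea on its own, using strings instead of elements so it runs outside a browser. The names wrap and methods are hypothetical stand-ins for njq() and njq.methods, and the sketch assumes the shared object inherits from Array.prototype so the usual array methods stay available.

```javascript
// Hypothetical stand-in for njq.methods. Inheriting from Array.prototype
// keeps push(), includes(), forEach() and friends available on every set.
const methods = Object.create(Array.prototype)

methods.each = function (func) {
    Array.prototype.forEach.call(this, func)
    return this // returning the set allows chaining
}

// Hypothetical stand-in for njq(): copy any array-like into a real array,
// then swap its prototype so the extra methods become available.
function wrap(items) {
    const results = Array.prototype.slice.call(items)
    Object.setPrototypeOf(results, methods)
    return results
}

const letters = wrap(['a', 'b', 'c'])
const visited = []
letters.each((item) => visited.push(item.toUpperCase()))

console.log(visited)                 // [ 'A', 'B', 'C' ]
console.log(letters.length)          // 3
console.log(Array.isArray(letters))  // true
console.log(letters.includes('b'))   // true
```

    Because the set is still a real array, length, indexing, and Array.isArray() keep working, which is why the later helpers can treat result sets like arrays.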

    This is a good foundation, but jQuery selection has a few other helpers we need before we can consider our version even close to complete: helpers for searching both up and down the HTML tree from the elements in a result set.

    Walking down the tree is done with the find() method, which selects within the descendants of the results. Here we learn a second form of querySelectorAll(), which is called on an individual element, not an entire document, and only selects within that element's descendants. Like so:

    var list = $('ul')
    var items = list.find('li')

    The only extra work we have left to do is to ensure we don't add any duplicates to the result set, by tracking which elements we've already added as we call querySelectorAll() on each element in the original elements and combine all their results together.

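    njq.methods.find = function(selector) {
        var seen = new Set()
        var results = njq()
        this.each((el) => {
            Array.prototype.forEach.call(el.querySelectorAll(selector), (child) => {
                // Only add elements we haven't already collected
                if (!seen.has(child)) {
                    seen.add(child)
                    Array.prototype.push.call(results, child)
                }
            })
        })
        return results
    }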

    Now we can use find() in our own version:

    var list = njq('ul')
    var items = list.find('li')
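
    Duplicates show up whenever the starting elements are nested inside one another, such as an outer and an inner <ul>. The sketch below reproduces that case with plain objects in place of DOM nodes, so it can run anywhere; descendants() and findAll() are hypothetical helpers for illustration, not part of njq.

```javascript
// Plain objects standing in for DOM nodes (an assumption for this sketch).
const innerItem = { tag: 'li', children: [] }
const innerList = { tag: 'ul', children: [innerItem] }
const outerItem = { tag: 'li', children: [innerList] }
const outerList = { tag: 'ul', children: [outerItem] }

// Hypothetical helper: every node below `node`, in document order.
function descendants(node) {
    const found = []
    for (const child of node.children) {
        found.push(child, ...descendants(child))
    }
    return found
}

// Hypothetical helper mirroring find(): search below each root and
// use a Set to skip nodes that an earlier root already produced.
function findAll(roots, tag) {
    const seen = new Set()
    const results = []
    for (const root of roots) {
        for (const node of descendants(root)) {
            if (node.tag === tag && !seen.has(node)) {
                seen.add(node)
                results.push(node)
            }
        }
    }
    return results
}

// Both lists contain innerItem, but it is reported only once.
const items = findAll([outerList, innerList], 'li')
console.log(items.length) // 2
```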

    Searching down the HTML tree was useful and straightforward, but we aren't complete if we can't do the reverse: searching up the tree from the original selection. This is where we'll clone jQuery's closest() method.

    In jQuery, closest() helps when you already have an element and want to find something up the tree from it. In this example, we find all the bold text in a page and then find which paragraphs it comes from:

    var paragraphs_with_bold = $('b').closest('p')

    Of course, several of the elements we have may share the same ancestors, so we need to handle duplicate results in this method, as we did before. We won't get much help from the DOM directly, so we walk up the chain of parent elements one at a time, looking for matches. The only help the DOM gives us here is Element.matches(selector), which tells us if a given element matches a CSS selector we're looking for. When we find matches we add them to our results. We stop searching at each element's first match, because we're only looking for the "closest", after all.

    njq.methods.closest = function(selector) {
        var closest = new Set()
        this.each((el) => {
            let curEl = el
            while (curEl.parentElement && !curEl.parentElement.matches(selector)) {
                curEl = curEl.parentElement
            }
            if (curEl.parentElement) {
                closest.add(curEl.parentElement)
            }
        })
        // Convert the Set to an array, since njq() expects an array-like
        return njq(Array.from(closest))
    }

    We've put the basic pieces of selection in place now. We can query the page for elements, and we can query those results to drill down or walk up the HTML tree for related elements. All of this is useful, and we can walk over our results with the each() method we started with.

    var paragraphs_with_bold = njq('b').closest('p')
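
    The same walk can be tried without a page. This is a rough sketch that uses plain objects with a parent link in place of elements and compares tag names instead of calling matches(); closestByTag() is a hypothetical helper. Like our closest(), it starts looking at each element's parent.

```javascript
// Plain objects standing in for DOM elements (an assumption for this sketch).
const paragraph = { tag: 'p', parent: null }
const emphasis = { tag: 'em', parent: paragraph }
const boldOne = { tag: 'b', parent: paragraph }
const boldTwo = { tag: 'b', parent: emphasis }
const strayBold = { tag: 'b', parent: null } // has no <p> ancestor

// Hypothetical helper mirroring closest(): climb from each element's
// parent until a node with the wanted tag turns up, or the chain ends.
function closestByTag(elements, tag) {
    const matches = new Set()
    for (const el of elements) {
        let current = el.parent
        while (current && current.tag !== tag) {
            current = current.parent
        }
        if (current) {
            matches.add(current)
        }
    }
    // A Set has no length or indexes, so spread it into an array.
    return [...matches]
}

const result = closestByTag([boldOne, boldTwo, strayBold], 'p')
console.log(result.length)           // 1
console.log(result[0] === paragraph) // true
```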

    Basic Manipulations

    We can't do very much with the results, yet, so let's add some of the first manipulation helpers everyone learned with jQuery: manipulating classes.

    Manipulating classes means you can turn a class on or off for a whole set of elements, changing their styles, and often hiding or showing entire bits of the page. Here are our simple class helpers: addClass() and removeClass() will add or remove a single class from all the elements in the result set, while toggleClass() will add the class to all the elements that don't already have it and remove it from all the elements that do.

    The jQuery methods we're reimplementing work like this:


    Thankfully, the DOM's native APIs make all of these very simple. We'll use our existing each() method to walk over all the results, but manipulating the class in each of them is a simple call to methods on the elements' classList interface, a specialized array just for managing element classes.

    njq.methods.toggleClass = function(className) {
        this.each((el) => el.classList.toggle(className))
    }
    njq.methods.addClass = function(className) {
        this.each((el) => el.classList.add(className))
    }
    njq.methods.removeClass = function(className) {
        this.each((el) => el.classList.remove(className))
    }

    Now we have a very simple jQuery clone that can walk around the DOM tree and do basic manipulations of classes to change the styling. This, by itself, has enough parts to be useful, but sometimes just adding or removing classes isn't enough. Sometimes you need to manipulate styles and other properties directly, so we're going to add a few more small manipulation utilities:

    • We want to change the text in elements
    • We want to swap out entire HTML bodies of elements
    • We want to inspect and change attributes on elements
    • We want to inspect and change CSS styles on elements

    These are all simple operations with jQuery.


    Changing the contents of an element directly, whether text or HTML, is as simple as setting a single property, which we'll wrap with our helpers: text() and html() wrap the innerText and innerHTML properties, respectively. Like nearly all of our methods, these build on each() to apply the operation to the whole set.

    njq.methods.text = function(t) {
        this.each((el) => el.innerText = t)
    }
    njq.methods.html = function(t) {
        this.each((el) => el.innerHTML = t)
    }

    Now we'll start to get into methods that need to do multiple things. Setting the text or HTML is useful, but often reading it is useful, too. Many of our methods will follow this same pattern, so if a new value isn't provided, then instead we want to return the current value. Copying jQuery, when we read things we'll only read them from the first element in a set. If you need to read them from multiple elements, you can walk over them with each() to do that on your own.

    var msg_text = $('#message').text()

    These two methods are easily enhanced to add read versions:

    njq.methods.text = function(t) {
        if (arguments.length === 0) {
            return this[0].innerText
        } else {
            this.each((el) => el.innerText = t)
        }
    }
    njq.methods.html = function(t) {
        if (arguments.length === 0) {
            return this[0].innerHTML
        } else {
            this.each((el) => el.innerHTML = t)
        }
    }
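
    The read-or-write pattern is easy to try in isolation. A small sketch, assuming plain objects with an innerText property in place of real elements and a hypothetical standalone text() function:

```javascript
// Plain objects standing in for DOM elements (an assumption for this sketch).
const elements = [{ innerText: 'first' }, { innerText: 'second' }]

// Hypothetical standalone version of text(): with only the set it reads
// from the first element, with a value it writes to every element.
function text(set, value) {
    if (arguments.length === 1) {
        return set[0].innerText
    }
    set.forEach((el) => { el.innerText = value })
    return set
}

const before = text(elements)   // read: only the first element is consulted
text(elements, '')              // write: an empty string is still a value
const after = elements.map((el) => el.innerText)

console.log(before) // first
console.log(after)  // [ '', '' ]
```

    Checking arguments.length rather than testing the value itself means an empty string, or even an explicit undefined, is treated as a write.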

    Next, all elements have attributes and styles and we want helpers to read and manipulate those in our result sets. In jQuery, these are the attr() and css() helpers, and that's what we'll replicate in our version. First, the attribute helper.

    $("img#greatphoto").attr("title", "Photo by Kelly Clark");

    Just like our text() and html() helpers, we read the value from the first element in our set, but set the new value for all of them.

    njq.methods.attr = function(name, value) {
        if (typeof value === 'undefined') {
            return this[0].getAttribute(name)
        } else {
            this.each((el) => el.setAttribute(name, value))
        }
    }

    Working with styles, we allow three different versions of the css() helper.

    First, we allow reading the CSS property from the first element. Easy.

    var fontSize = parseInt(njq('#message').css('font-size'))

    njq.methods.css = function(style) {
        if (typeof style === 'string') {
            return getComputedStyle(this[0])[style]
        }
    }

    Second, we change the value if we get a new value passed as a second argument.

    var fontSize = parseInt(njq('#message').css('font-size'))
    if (fontSize > 20) {
        njq('#message').css('font-size', '20px')
    }

    njq.methods.css = function(style, value) {
        if (typeof style === 'string') {
            if (typeof value === 'undefined') {
                return getComputedStyle(this[0])[style]
            } else {
                this.each((el) => el.style[style] = value)
            }
        }
    }

    Finally, because it's very common to want to change several CSS properties at the same time, the css() helper will accept an object mapping property names to new property values and set them all at once:

    njq('#message').css({
        'background-color': 'navy',
        'color': 'white',
        'font-size': '40px',
    })

    njq.methods.css = function(style, value) {
        if (typeof style === 'string') {
            if (typeof value === 'undefined') {
                return getComputedStyle(this[0])[style]
            } else {
                this.each((el) => el.style[style] = value)
            }
        } else {
            this.each((el) => Object.assign(el.style, style))
        }
    }

    Our jQuery clone is really shaping up. With it, we've replicated all these things jQuery does for us:

    • Selecting elements across a page
    • Selecting either descendants or ancestors of elements
    • Toggling, adding, or removing classes across a set of elements
    • Reading and modifying the attributes an element has
    • Reading and modifying the CSS properties an element has
    • Reading and changing the text contents of an element
    • Reading and changing the HTML contents of an element

    That's a lot of helpful DOM manipulation! If we stopped here, this would already be useful.

    Of course, we're going to continue adding more features to our little jQuery clone. Eventually we'll add more ways to manipulate the HTML in the page, but before we come back to manipulation, let's start adding support for events to let a user interact with the page.

    Event Handling

    Events in Javascript can come from a lot of sources. The kinds of events we're interested in are user interface events. The first event you probably care about is the click event, but we'll handle it just like any other.

    $("#dataTable tbody tr").on("click", function() {
        console.log( $( this ).text() )
    })

    Like some of our other helpers, we're wrapping what is now a standard facility in the APIs the web defines to interact with a page. We're wrapping addEventListener(), the standard DOM API available on all elements to bind a function to be called when an event happens on that element. For example, if you bind a function to the click event of an image, the function will be called each time the image is clicked.

    We might need some information about the event, so our callback will be triggered with this bound to the element you were listening to, and it will receive the Event object, which describes the event in question, as a parameter.

    njq.methods.on = function(event, cb) {
        this.each((el) => {
            // addEventListener will invoke our callback with
            // `this` bound to the element the listener is attached
            // to and one parameter: the event object itself.
            el.addEventListener(event, cb)
        })
    }

    This is a useful start, but events can do so much more. First, before we make our event listening more powerful, let's make sure we can hit the undo button by adding a way to remove them.

    var $test = njq("#test");
    function handler1() {
        console.log("handler1");
        $test.off("click", handler2)
    }
    function handler2() {
        console.log("handler2");
    }
    $test.on("click", handler1);
    $test.on("click", handler2);

    The standard addEventListener() comes paired with removeEventListener(), which we can use since our event binding was simple:

    njq.methods.off = function(event, cb) {
        this.each((el) => {
            el.removeEventListener(event, cb)
        })
    }
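
    addEventListener() and removeEventListener() come from the EventTarget interface that every element implements, so the pairing can be sketched without a page. This example assumes an environment where EventTarget and Event are available as globals, which is the case in browsers and recent versions of Node.js:

```javascript
// A bare EventTarget stands in for a DOM element (an assumption for
// this sketch); elements expose the same two methods.
const target = new EventTarget()
const calls = []

function onClick(ev) {
    calls.push(ev.type)
}

target.addEventListener('click', onClick)
target.dispatchEvent(new Event('click'))   // onClick runs

// Removal only works with the same function reference that was added.
target.removeEventListener('click', onClick)
target.dispatchEvent(new Event('click'))   // nothing is listening now

console.log(calls) // [ 'click' ]
```

    Note that removal depends on passing the same function object that was added, which is why off() needs the original callback.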

    Event Delegation

    When your page is changing through interactions it can be difficult to maintain event bindings on the right elements, especially when those elements could move around, be removed, or even replaced. Delegation is a very useful way to bind event handlers not to a specific element, but to a query of elements that changes with the contents of your page.

    For example, you might want to let any <li> elements that get added to a list be removed when you click on them, but you want this to happen even when new items are added to the list after your event binding code ran.

    <h3>Grocery List</h3>
    <ul>
        <li>Peanut Butter</li>
    </ul>

    njq('ul').on('click', 'li', function(ev) {
        ev.target.remove()
    })

    This is very useful, but it complicates our event binding a bit. Let's dive into adding this feature.

    First, we have to accept on() being called with either 2 or 3 arguments, with the 3 argument version accepting a delegation selector as the second argument. We can use Javascript's special arguments variable to make this straightforward.

    njq.methods.on = function(event) {
        let delegate, cb
        // When called with 2 args, accept 2nd arg as callback
        if (arguments.length === 2) {
            cb = arguments[1]
        // When called with 3 args, accept 2nd arg as delegate selector,
        // 3rd arg as callback
        } else {
            delegate = arguments[1]
            cb = arguments[2]
        }
        this.each((el) => {
            el.addEventListener(event, cb)
        })
    }

    Our event handler is still being invoked for every instance of the event. In order to implement delegation properly, we want to block the handler when the event didn't come from the right child element matching the delegation selector.

    njq.methods.on = function(event) {
        let delegate, cb
        // When called with 2 args, accept 2nd arg as callback
        if (arguments.length === 2) {
            cb = arguments[1]
        // When called with 3 args, accept 2nd arg as delegate selector,
        // 3rd arg as callback
        } else {
            delegate = arguments[1]
            cb = arguments[2]
        }
        this.each((el) => {
            el.addEventListener(event, function(ev) {
                // If this was a delegate event binding,
                // skip the event if the event target is not inside
                // the delegate selection.
                if (typeof delegate !== 'undefined') {
                    const matches = njq(el).find(delegate)
                    if (!Array.prototype.includes.call(matches, ev.target)) {
                        return
                    }
                }
                // Invoke the event handler with the event arguments
                cb.apply(this, arguments)
            }, false)
        })
    }

    We've wrapped our event listener in a helper function, where we check the event target each time the event is triggered and only invoke our callback when it matches.
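
    Stripped of the DOM, the wrapper is just a function that decides whether to pass the event along. Here is a minimal sketch, assuming plain objects in place of elements and events; delegated() and findMatches are hypothetical names. Because the lookup runs on every event, items added after binding are still handled.

```javascript
// Hypothetical helper: wrap `cb` so it only runs when the event's target
// is one of the elements returned by `findMatches()`.
function delegated(findMatches, cb) {
    return function (ev) {
        if (!findMatches().includes(ev.target)) {
            return // the event came from somewhere else; ignore it
        }
        return cb.call(this, ev)
    }
}

// Plain objects standing in for <li> elements and events (assumptions).
const items = [{ name: 'Peanut Butter' }]
const clicked = []
const listener = delegated(() => items, (ev) => clicked.push(ev.target.name))

listener({ target: items[0] })        // matches, so the callback runs
listener({ target: { name: 'h3' } })  // not a list item, skipped

const jelly = { name: 'Jelly' }
items.push(jelly)                     // added after the handler was created
listener({ target: jelly })           // still handled

console.log(clicked) // [ 'Peanut Butter', 'Jelly' ]
```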

    Advanced Manipulations

    We have a good foundation now. We can find the elements we need in the structure of our page, modify properties of those elements like attributes and CSS styles, and respond to events from the user on the page.

    Now that we've got that in place, we could start making larger manipulations of the page. We could start adding new elements, moving them around, or cloning them. These advanced manipulations will be the final set of helpers we add to our library.


    One of the most useful operations is adding a new element to the end of another. You might use this to add a new <li> to the end of a list, or add a new paragraph of text to an existing page.

    There are a few ways we want to allow appending, and we'll add each one at a time.

    First, we'll allow simply appending some text.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        }
    }

    Then, we'll allow adding elements. These might come from queries our library has done on the page.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        } else if (content instanceof Element) {
            this.each((el) => el.appendChild(content.cloneNode(true)))
        }
    }

    Finally, to make it easier to select elements and append them somewhere else, we'll accept an array of elements in addition to just one. Remember, our njq query objects are themselves arrays of elements.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        } else if (content instanceof Element) {
            this.each((el) => el.appendChild(content.cloneNode(true)))
        } else if (Array.isArray(content)) {
            // Array.isArray() also recognizes njq sets, whatever their prototype
            Array.prototype.forEach.call(content, (each) => this.append(each))
        }
    }


    As long as we are adding to the end of elements, we'll want to add to the beginning as well. This is nearly identical to the append() version.

    njq.methods.prepend = function(content) {
        if (typeof content === 'string') {
            // We add the new text to the start of the element's inner HTML
            this.each((el) => el.innerHTML = content + el.innerHTML)
        } else if (content instanceof Element) {
            // We use insertBefore here instead of appendChild, placing
            // the copy ahead of the element's current first child
            this.each((el) => el.insertBefore(content.cloneNode(true), el.firstChild))
        } else if (Array.isArray(content)) {
            Array.prototype.forEach.call(content, (each) => this.prepend(each))
        }
    }


    jQuery offers two replacement methods, which work in opposite directions.


    The first, replaceAll(), takes the elements from $('.left') and uses them to replace everything matched by $('.right'). If you had this HTML:

    <h2>Some Uninteresting Header Text</h2>
    <p>A very important story to tell.</p>

    you could run this to replace the tag entirely, not just its contents:

    $('<h1>Some Exciting Header Text</h1>').replaceAll('h2')

    and your HTML would now look like this:

    <h1>Some Exciting Header Text</h1>
    <p>A very important story to tell.</p>

    The second, replaceWith(), does the opposite by using elements from $('.right') to replace everything in $('.left'):

    $('h2').replaceWith('<h1>Some Exciting Header Text</h1>')

    So let's add these to "Not jQuery".

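    njq.methods.replaceWith = function(replacement) {
        let $replacement = njq(replacement)
        let combinedHTML = []
        $replacement.each((el) => {
            combinedHTML.push(el.outerHTML)
        })
        // Join the pieces explicitly; passing the array itself would insert commas
        let fragment = document.createRange().createContextualFragment(combinedHTML.join(''))
        this.each((el) => {
            // Insert a copy each time, because inserting a fragment empties it
            el.parentNode.replaceChild(fragment.cloneNode(true), el)
        })
    }
    njq.methods.replaceAll = function(target) {
        njq(target).replaceWith(this)
    }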

    Since this is a little more complex than some of our simpler methods, let's step through how it works. First, notice that we only really implemented one of them, and the second simply re-uses that with the parameters reversed. Now, both of these replacement methods will replace the target with all of the elements from the source, so the first thing we do is extract a combined HTML of all those.

    let $replacement = njq(replacement)
    let combinedHTML = []
    $replacement.each((el) => {
        combinedHTML.push(el.outerHTML)
    })
    let fragment = document.createRange().createContextualFragment(combinedHTML.join(''))

    Now we can replace all the target elements with this new fragment that contains our new content. To replace an element, we have to ask its parent to replace the correct child, using replaceChild():

    this.each((el) => {
        // Insert a copy each time, because inserting a fragment empties it
        el.parentNode.replaceChild(fragment.cloneNode(true), el)
    })


    The last helper, and the easiest to implement, is a clone() method. Allowing us to copy elements, and all their children, makes other helpers more powerful by letting us either move or copy elements. It can be combined with helpers we've already added, so that you control whether prepend and append operations move or copy the elements they work with.

    njq.methods.clone = function() {
        return njq(Array.prototype.map.call(this, (el) => el.cloneNode(true)))
    }

    Now You Made a jQuery

    We've replicated a lot of the things jQuery gives us out of the box. Our little jQuery clone is able to select elements on a page and relative to other elements via find() and closest(). Events can be handled with simple on(event, callback) bindings and more complex on(event, selector, callback) delegations. The contents, attributes, and styles of elements can be read and manipulated with text(), html(), attr(), and css(). We can even manipulate whole segments of the DOM tree with append(), prepend(), replaceAll() and replaceWith().

    jQuery certainly offers a much deeper and wider toolbox of goodies. We weren't aiming to create a call-for-call 100% compatible replacement, just to understand what happens under the hood. If you learned anything from this exercise, learn that the tools you use are all transparent and can be learned from. They're all layers of an onion that you can peel back and learn from.

    Philip SemanchukA Postcard of Tunisia

    Earlier on this blog I briefly mentioned working with some Libyans in Tunis, the capital of Tunisia. We chose to meet at that location because it’s close to Libya but much safer than Tripoli. Now that I’ve been back for a while and had a chance to catch up, I wanted to write more about my experience.

    [Photo: the translator translating live] English-to-Arabic translation on the fly!


    I was there with Tobias McNulty of Caktus Group. We (Tobias and I) trained the Libyan employees of Libya’s High National Election Commission (HNEC) in the maintenance and use of the HNEC-commissioned SMS-based voter registration system that I had helped to develop while working with Caktus. The system has been open sourced as Smart Elect.

    If the big picture was promoting democracy, the medium picture was training system admins and developers. And the very small picture was working together on the nitty gritty of features and bug fixes, like figuring out that if a @property method raises an exception when invoked by hasattr(), the exception isn’t propagated under Python 2.7.

    The admin training consisted of a comprehensive review of the system, including the obscure corners and edge case handling. The developers were eager to get their hands dirty, so after some organizational review, we dove into fixing bugs and implementing some new features that HNEC wanted.

    [Photo: a trainee] Abdullah (Photo by Tobias McNulty)

    Tobias and I worked with the developers as both mentors and peers. Grinding through bugs from start to finish was really valuable. Our trainees have good development experience, but working in groups with us allowed them to participate in our approach to debugging, problem reporting, development, and test. It seemed a little different from what they were used to. We were very methodical about creating an issue in our tracker, creating a branch for that issue, reviewing one another’s code, documenting the fix, etc. “It’s a lot of process,” said one trainee after working through one particular bug with us. He’s right. I wish I had thought to ask if Libyan culture has a proverb similar to “For want of a nail…”. I could have said, “For want of filing an issue in the tracker, a voter was disenfranchised,” but it doesn’t have the same ring to it.

    [Photo: Tobias and a trainee] Tobias and Ahmed

    This was my first trip to Africa, and, grand notions aside, what stood out to me was how mundane much of the experience was. The guys we worked with would have fit right in at any coding meetup I’ve been to. They had opinions about laptops. They were distracted by their phones. Everyone enjoyed a successful bug hunt. I remember one trainee being tired at 5PM, saying he had no more left in him, and seeing him there grinning 2 hours later when we finally solved the problem we’d been working on.

    Outside of the training, I especially enjoyed the dinners at Sakura/Pasta Cosy and Chez Zina (my favorites, in that order).

    We also ate at Le Bon Vieux Temps, where the handwritten chalkboard menu is carted from table to table on a charming-but-impractical frame. Tunisia is principally French speaking, with Arabic on an almost equal footing. At Le Bon Vieux Temps (“The Good Old Times”), the menu was all in French, and my vestigial French came in handy for translating the menu into English for the Libyans who in turn peppered the waiter with questions in Arabic. (That night in the restaurant began and ended my career as a French-to-English translator.)

    On the weekends we rested, walked in the city, and paid a visit to the Bardo National Museum. The Bardo was famously attacked in 2015, and has since sprouted a razor wire fence around the entire property. Bored soldiers sat on a truck by the gate and motioned us to enter. It’s a nice museum, and I’m glad I went.

    A photo of my entrance pass to the Bardo Museum

    Inside the classroom and out, I got to know and really like our Libyan colleagues. They were generous with their good humor and kindness. If they lacked anything, it was a willingness to complain.

    Libya is a difficult place to live at the moment. I think we all know that in an abstract sense, but talking to my Libyan friends made it more concrete for me. Banks don’t have enough cash. Electricity isn’t reliable. People they know have been kidnapped. My friends have a lot on their minds, and yet they found room to squeeze in opinions about good software development practices.

    A photo of a trainee: Munir

    I’m glad I got the chance to go, and to get to know the people I did. In addition to working with Tobias and the Libyans, I had a lot of non-work experiences I’ll remember for a long time. I walked among ruins in Carthage that are over 2000 years old. I drove solo (and lost) through rush hour traffic in Tunis and survived. I saw a Tunisian wedding, and got to use the word “ululating” for the first time outside of Scrabble or Bananagrams. I swam in the Mediterranean. I saw flocks of flamingoes (many, many thanks to Hichem and Claudia of Les Amis des Oiseaux).

    HNEC is now better positioned than ever to use the Smart Elect system, and I hope they do so again soon. That’s partly for egotistical reasons — I like to see my work get used. Who doesn’t? But more importantly, if it gets used, that means Libyans are voting to determine their own future.

    Caktus GroupCaktus at PyCaribbean

    For the first time, Caktus will be gold sponsors at PyCaribbean February 18-19th in Bayamon, Puerto Rico. We’re pleased to announce two speakers from our team.

    Erin Mullaney, Django developer, will give a talk on RapidPro, the open source SMS system backed by UNICEF. Kia Lam, UI Developer, will talk about how women can navigate the seas of the tech industry with a few guiding principles and new perspectives. Erin and Kia join fantastic speakers from organizations like 18F, the Python Software Foundation, IBM, and Red Hat.

    We hope you can join us, but if you can’t, there’ll be videos!

    Caktus GroupPlan for mistakes as a developer

    I Am Not Perfect.

    I've been programming professionally for 25 years, and the most important thing I have learned is this:

    • I am fallible.
    • I am very fallible.
    • In fact, I make mistakes all the time.

    I'm not unique in this. All humans are fallible.

    So, how do we still get our jobs done, knowing that we're likely to make mistakes in anything we try to do? We look for ways to compensate.

    Pilots use checklists, and have for decades. No matter how many times they've done a pre-flight check on a plane, they review their checklist to make sure they haven't missed anything, because they know it's important, people make mistakes, and the consequences of a mistake can be horrendous.

    The practice of medical care is moving in the same direction. There's a great book, The Checklist Manifesto by Atul Gawande, that I highly recommend if you haven't come across it before. It talks about the kind of mistakes that happen in medicine, and how adding checklists for even basic procedures had amazing results.

    I'm a big fan of checklists. I'm always pushing to get deploy and release processes, for example, nailed down in project documentation to help us make sure not to miss an important step.

    But my point is not just to use checklists, it's the reason behind the use of checklists: acknowledging that people make mistakes, and looking for ways to do things right regardless of that.

    For me, I try to find ways to do things that I'm less likely to get wrong now, and that make it harder for future me to screw them up. I know that future me will have forgotten a lot about the project by the time he looks at it again, maybe under pressure to fix a production bug.

    One of my favorite quotations about programming is by Brian Kernighan:

    Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

    ("The Elements of Programming Style", 2nd edition, chapter 2)

    So I work hard to avoid mistakes, both now and in the future.

    • I try to keep things straightforward
    • I use features and tools like strict typing, lint, flake8, eslint, etc.
    • I try to make sure knowledge is recorded somewhere more reliable than my memory

    I also try to detect mistakes before they can cause bad things to happen. I'm a huge fan of

    • unit tests
    • parameter checking
    • error handling
    • QA testing
    • code reviews
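As a small illustration of a few of these habits working together, here is a minimal sketch of parameter checking backed by unit tests. The function `average_latency` and its tests are hypothetical, invented only for this example.

```python
import unittest


def average_latency(samples_ms):
    """Return the mean of a non-empty sequence of latencies in milliseconds."""
    # Parameter checking: fail loudly and early instead of returning nonsense.
    if not samples_ms:
        raise ValueError("samples_ms must contain at least one value")
    if any(s < 0 for s in samples_ms):
        raise ValueError("latencies cannot be negative")
    return sum(samples_ms) / len(samples_ms)


class AverageLatencyTests(unittest.TestCase):
    def test_typical_input(self):
        self.assertEqual(average_latency([10, 20, 30]), 20)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            average_latency([])

    def test_negative_input_is_rejected(self):
        with self.assertRaises(ValueError):
            average_latency([5, -1])
```

The tests record the edge cases somewhere more reliable than memory, so a future change that breaks them is caught before it ships.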

    To sum all this up:

    Expect to make mistakes. You will anyhow.

    Plan for them.

    And don't beat yourself up for it.

    Tim HopperYour Old Tweets from This Day

    A while ago, I published a Bash script that will open a Twitter search page to show your old tweets from this day of the year. I have enjoyed using it to see what I was thinking about in days gone by.

    So I turned this into a Twitter account.

    If you follow @your_old_tweets, it'll tweet a link at you each day that will show you your old tweets from the day. It attempts to send it in the morning (assuming you have your timezone set).

    This runs on Amazon Lambda. The code is here.

    Caktus GroupShip It Day Q1 2017

    Last Friday, Caktus set aside client projects for our regular quarterly ShipIt Day. From gerrymandered districts to RPython and meetup planning, the team started off 2017 with another great ShipIt.

    Books for the Caktus Library

    Liza uses Delicious Library to track books in the Caktus Library. However, the tracking of books isn't visible to the team, so Scott used the FTP export feature of Delicious Library to serve the content on our local network. Scott dockerized Caddy and deployed it to our local Dokku PaaS platform and serves it over HTTPS, allowing the team to see the status of the Caktus Library.

    Property-based testing with Hypothesis

    Vinod researched using property-based testing in Python. Traditionally it's used more with functional programming languages, but Hypothesis brings the concept to Python. He also learned about new Django features, including the testing optimizations introduced with setUpTestData.

    Caktus Wagtail Demo with Docker and AWS

    David looked into migrating a Heroku-based Wagtail deployment to a container-driven deployment using Amazon Web Services (AWS) and Docker. Utilizing Tobias' AWS Container Basics isolated Elastic Container Service stack, David created a Dockerfile for Wagtail and deployed it to AWS. Down the road, he'd like to more easily debug performance issues and integrate it with GitLab CI.

    Local Docker Development

    During Code for Durham Hack Nights, Victor noticed that local development setup was a barrier to entry for new team members. To help mitigate this issue, he researched using Docker for local development with the Durham School Navigator project. In the end, he used Docker Compose to run a multi-container Docker application with PostgreSQL, NGINX, and Django.

    Caktus Costa Rica

    Daryl, Nicole, and Sarah really like the idea of opening a branch Caktus office in Costa Rica and drafted a business plan to do so! Including everything from an executive summary, to operational and financial plans, the team researched what it would take to run a team from Playa Hermosa in Central America. Primary criteria included short distances to an airport, hospital, and of course, a beach. They even found an office with our name, the Cactus House. Relocation would be voluntary!

    Improving the GUI test runner: Cricket

    Charlotte M. likes to use Cricket to see test results in real time and have the ability to easily re-run specific tests, which is useful for quickly verifying fixes. However, she encountered a problem causing the application to crash sometimes when tests failed. So she investigated the problem and submitted a fix via a pull request back to the project. She also looked into adding coverage support.

    Color your own NC Congressional District

    Erin, Mark, Basia, Neil, and Dmitriy worked on an app that visualizes and teaches you about gerrymandered districts. The team ran a mini workshop to define goals and personas, and to help prioritize the day's tasks by using agile user story mapping. The app provides background information on gerrymandering and uses data from NC State Board of Elections to illustrate how slight changes to districts can vastly impact the election of state representatives. The site builds its visualizations with D3, which is an excellent utility for rendering GeoJSON geospatial data. In the future they hope to add features to compare districts and overlay demographic data.

    Releasing django_tinypng

    Dmitriy worked on testing and documenting django_tinypng, a simple Django library that allows optimization of images by using TinyPNG. He published the app to PyPI so it's easily installable via pip.

    Learning Django: The Django Girls Tutorial

    Gerald and Graham wanted to sharpen their Django skills by following the Django Girls Tutorial. Gerald learned a lot from the tutorial and enjoyed the format, including how it steps through blocks of code describing the syntax. He also learned about how the Django Admin is configured. Graham knew that following tutorials can sometimes be a rocky process, so he worked together with Gerald so they could talk through problems, and Graham was able to learn by reviewing and helping.

    Planning a new meetup for Digital Project Management

    When Elizabeth first entered the Digital Project Management field several years ago, there were not a lot of resources available specifically for digital project managers. Most information was related to more traditional project management, or the PMP. She attended the 2nd Digital PM Summit with her friend Jillian, and loved the general tone of openness and knowledge sharing (they also met Daryl and Ben there!). The Summit was a wonderful resource. Elizabeth wanted to bring the spirit of the Summit back to the Triangle, so during Ship It Day, she started planning for a new meetup, including potential topics and meeting locations. One goal is to allow remote attendance through Google Hangouts, to encourage openness and sharing without having to commute across the Triangle. Elizabeth and Jillian hope to hold their first meetup in February.

    Kanban: Research + Talk

    Charlotte F. researched Kanban to prepare for a longer talk to illustrate how Kanban works in development and how it differs from Scrum. Originally designed by Toyota to improve manufacturing plants, Kanban focuses on visualizing workflows to help reveal and address bottlenecks. Picking the right tool for the job is important, and one is not necessarily better than the other, so Charlotte focused on outlining when to use one over the other.

    Identifying Code for Cleanup

    Calvin created redundant, a tool for identifying technical debt. Last ShipIt he was able to locate completely identical files, but he wanted to improve on that. Now the tool can identify functions that are almost the same and/or might be generalizable. It searches for patterns and generates a report of your codebase. He's looking for codebases to test it on!

    RPython Lisp Implementation, Revisited

    Jeff B. continued exploring how to create a Lisp implementation in RPython, the framework behind the PyPy project. RPython is a restricted subset of the Python language. In addition to learning about RPython, he wanted to better understand how PyPy is capable of performance enhancements over CPython. Jeff also converted his parser to use Alex Gaynor's RPLY project.

    Streamlined Time Tracking

    At Caktus, time tracking is important, and we've used a variety of tools over the years. Currently we use Harvest, but it can be tedious to use when switching between projects a lot. Dan would like a tool to make this process more efficient. He looked into Project Hamster, but settled on building a new tool. His implementation makes it easy to switch between projects with a single click. It also allows users to sync daily entries to Harvest.

    Tim HopperTop Ten Favorite Photos of 2016

    I spent a lot of time with my camera in 2016. Here are some of the results.

    2016 Top Ten

    Caktus GroupNew year, new Python: Python 3.6

    Python 3.6 was released in the tail end of 2016. Read on for a few highlights from this release.

    New module: secrets

    Python 3.6 introduces a new module in the standard library called secrets. While the random module has long existed to provide us with pseudo-random numbers suitable for applications like modeling and simulation, these were not "cryptographically random" and not suitable for use in cryptography. secrets fills this gap, providing a cryptographically strong method to, for instance, create a new, random password or a secure token.
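A minimal sketch of what that might look like in practice. The token sizes, the password length, and the alphabet are arbitrary choices made for illustration.

```python
import secrets
import string

# A URL-safe text token built from 32 random bytes,
# e.g. for a password-reset link.
token = secrets.token_urlsafe(32)

# A 16-character password drawn from letters and digits.
alphabet = string.ascii_letters + string.digits
password = "".join(secrets.choice(alphabet) for _ in range(16))

# 16 random bytes rendered as 32 hexadecimal characters.
hex_token = secrets.token_hex(16)
```

The calls mirror familiar ones from random (secrets.choice versus random.choice), so switching existing code over is usually a small change.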

    New method for string interpolation

    Python previously had several methods for string interpolation, but the most commonly used was str.format(). Let’s look at how this used to be done. Assuming 2 existing variables, name and cookies_eaten, str.format() could look like this:

    "{0} ate {1} cookies".format(name, cookies_eaten)

    Or this:

    "{name} ate {cookies_eaten} cookies".format(name=name, cookies_eaten=cookies_eaten)

    Now, with the new f-strings, the variable names can be placed right into the string without the extra length of the format parameters:

    f"{name} ate {cookies_eaten} cookies"

    This provides a much more pythonic way of formatting strings, making the resulting code both simpler and more readable.
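F-strings also accept arbitrary expressions and the usual format specifiers. A short sketch, reusing the name and cookies_eaten variables from above with made-up values, plus a hypothetical price_each:

```python
name = "Cookie Monster"
cookies_eaten = 12
price_each = 0.35

# Any expression can go inside the braces, not just a bare name.
summary = f"{name} ate {cookies_eaten} cookies, or {cookies_eaten / 12:.1f} dozen"

# Format specifiers follow a colon, exactly as with str.format().
bill = f"Total: ${cookies_eaten * price_each:.2f}"

print(summary)  # Cookie Monster ate 12 cookies, or 1.0 dozen
print(bill)     # Total: $4.20
```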

    Underscores in numerals

    While it doesn’t come up often, it has long been a pain point that long numbers could be difficult to read in the code, allowing bugs to creep in. For instance, suppose I need to multiply an input by 1 billion before I process the value. I might say:

    bill_val = input_val * 1000000000

    Can you tell at a glance if that number has the right number of zeroes? I can’t. Python 3.6 allows us to make this clearer:

    bill_val = input_val * 1_000_000_000

    It’s a small thing, but anything that reduces the chance I’ll introduce a new bug is great in my book!
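A few more cases, as a quick sketch. The variable names here are only for illustration.

```python
# The underscores are purely visual; both literals are the same int.
assert 1_000_000_000 == 1000000000

# They also work in hexadecimal and binary literals, and in floats.
mask = 0xFF_FF
flags = 0b1010_0101
tiny = 1_000.000_1

# The "_" format option, also new in 3.6, groups digits on output.
grouped = f"{1000000000:_}"
print(grouped)  # 1_000_000_000
```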

    Variable type annotations

    One key characteristic of Python has always been its flexible variable typing, but that isn’t always a good thing. Sometimes, it can help you catch mistakes earlier if you know what type you are expecting to be passed as parameters, or returned as the results of a function. There have previously been ways to annotate types within comments, but the 3.6 release of Python is the first to bring these annotations into official Python syntax. This is a completely optional aspect of the language, since the annotations have no effect at runtime, but this feature makes it easier to inspect your code for variable type inconsistencies before finalizing it.
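A brief sketch of the new syntax. The names below (retries, hostnames, count_words) are hypothetical examples, not part of any library.

```python
from typing import Dict, List, Optional

# Variable annotations, new in Python 3.6.
retries: int = 3
hostnames: List[str] = []
last_error: Optional[str] = None


def count_words(lines: List[str]) -> Dict[str, int]:
    """Count how often each whitespace-separated word appears."""
    # Annotating parameters and return values predates 3.6;
    # annotating a local variable like this one is the new part.
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# Nothing is enforced at runtime: this assignment runs without error,
# but a static checker such as mypy would report the mismatch.
retries = "three"
```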

    And much more…

    In addition to the changes mentioned above, there have been improvements made to several modules in the standard library, as well as to the CPython implementation. To read about all of the updates this new release includes, take a look at the official notes.

    Tim HopperCompare RSA Key with Fingerprint in Github

    When you add an SSH key to your Github account, Github shows you the hexadecimal form of the MD5 hash of your public key.

    If you ever need to compare that against a key file on your computer, you can run:

    ssh-keygen -E md5 -lf ~/.ssh/id_rsa.pub

    I learned this from StackOverflow.
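For the curious, the same fingerprint can be computed by hand. This is only a sketch of the calculation, assuming the usual one-line OpenSSH public key format; md5_fingerprint is a hypothetical helper name.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line):
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key.

    public_key_line is one line of a .pub file:
    "<type> <base64-blob> [comment]".
    """
    # The fingerprint is the MD5 digest of the decoded key blob,
    # not of the text line itself.
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.md5(blob).hexdigest()
    # Insert a colon between each pair of hex digits.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Passing it the contents of ~/.ssh/id_rsa.pub should give the same colon-separated hex string that ssh-keygen prints after the "MD5:" prefix.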

    Caktus GroupResponsive web design

    What is responsive web design?

    Responsive web design is an approach to web design and development whereby websites and web applications respond to the screen size of the device on which they’re being accessed. The response includes layout changes, rearrangement of content, and in some cases selective display or hiding of content elements. Using a responsive web design approach you can optimize web pages to achieve a great user experience on a range of devices, from smartphones to desktop.

    Responsive web design is typically accomplished by writing a set of styling rules (CSS media queries) that define how page layout should be rendered between breakpoints. Breakpoints are the pixel values at which rendition of a layout in the browser changes (or breaks); they correspond to screen widths of different devices on which web pages can be accessed.
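In a real site the browser makes this choice by evaluating CSS media queries. Purely to model the selection logic, here is a small Python sketch; the breakpoint values and layout names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical breakpoints (in pixels) and the layout used in each range.
BREAKPOINTS = [768, 1024]
LAYOUTS = ["single-column", "two-column", "three-column"]


def layout_for(viewport_width):
    """Return the layout that applies at the given viewport width."""
    # bisect_right counts how many breakpoints are <= the width,
    # which is the index of the layout for that range.
    return LAYOUTS[bisect_right(BREAKPOINTS, viewport_width)]


print(layout_for(320))   # single-column
print(layout_for(768))   # two-column
print(layout_for(1400))  # three-column
```

Every width between two breakpoints maps to the same layout, which is why the content itself still has to reflow gracefully across the whole range.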

    Why choose responsive web design?

    There is a clear advantage in leveraging responsive web design. With a responsive website, the same HTML, with all static assets such as CSS, JavaScript, and images, is served to the browser on any device. The width of the viewport in which the website is being viewed is detected by the browser and the appropriate styling rules are used to render the layout accordingly. You only write and maintain one codebase, and any code edits over time only have to be made once for the changes to be reflected on all devices. Long-term, the cost of maintenance is greatly reduced.

    In adaptive web design, on the other hand, you develop different versions of the layout, each optimized for a different screen size. A script on the server detects the device used to access the website, and the appropriate version of HTML, CSS, JavaScript, and images is served to the browser. In the adaptive approach, edits to the codebase have to be made in each version of the website separately, which means higher long-term maintenance cost.

    There is also an option of building a native application for iOS, Android, or other mobile operating systems. While native applications often offer better functionality, unless the core of the business for which you build is mobile, a responsive website is a great alternative to consider. Building native applications is a lot more expensive, especially if you need to support multiple operating systems. Additionally, responsive websites are more discoverable by search engines since their content can be crawled, indexed, and ranked.

    Why traditional mockups hinder responsive design and drain resources

    A common approach is to design three sets of high fidelity mockups: for the smartphone (screen width of 320px), for the tablet (screen width of 768px), and for the desktop (screen width of 1024px). Sometimes four or six sets of mockups are designed to account for portrait and landscape orientations of mobile devices and for high definition desktop screens. But even the latter approach leaves out a number of viewport widths and disregards the fact that mobile web is not a collection of discrete breakpoints set apart by hundreds of pixels; it is a continuum.

    Delivering high fidelity mockups for each of the target breakpoints often drains resources and results in disappointment. Mockups themselves have to go through a cycle of design, edits, and approval, a process that is effort-heavy and leads to a false sense of satisfaction that a design has been perfected. As soon as the translation of the high fidelity mockup into code begins, you discover that page elements do not behave in the perfect way the mockups would suggest.

    At the cross-roads of the two realities--the perfection of a high fidelity mockup and the practicalities of living code and a browser--you can take one of two paths:

    • Adjust the design to align it with a page behavior in a browser
    • Write a lot of extra code to force the page into the behavior dictated by mockups

    The latter is what happens most of the time, because by this stage in the process so much effort has already gone into the design, and so much has been invested both in terms of resources and commitment to the design, that it is very hard to make any major design concessions.

    Getting smarter about designing for responsive web

    Let’s start out by stating the obvious. Any design is constrained by the medium in which it is executed and by the context in which it will live. When working with interiors, designers must take into account the space and its shape, lighting conditions, even elements of the exterior environment in order to execute a successful design. An architect must take into account the land and the surroundings in which a building will stand. An industrial designer must consider the properties of the material that will be used to produce an object she is designing.

    The same rigor applies to designing for responsive web. You’re missing the constraints of the medium and the context in which your design will live if you do not acknowledge at the outset that the perfect layout of the page you conceive of will break in the browser as the user accesses the page on a range of devices or simply resizes the browser window.

    Short of designing directly in code, there is no perfect method that would allow a designer to work with and to convey the continuous nature of responsive pages, and to anticipate how content will reflow as the width of the viewport changes incrementally. But there are ways to approach designing for responsive web that help make the transition from a static design to a responsive web page somewhat easier:

    • Low fidelity wireframes and prototypes. The longer you work with low fidelity wireframes and prototypes the better chance you have of identifying places where the page layout breaks in the browser before a major commitment to a high fidelity design is undertaken. At Caktus, we favor the approach of moving on to code early, well before the design reaches high fidelity. That allows us to shape the design to work with the medium, rather than to force it into the medium.
    • Mobile first. Designing for smaller screens first encourages you to think about content in terms of priorities. It’s an opportunity to take a hard look at all elements of a page and to decide which ones are essential and which ones are not. If you prioritize content for smaller screens first to create a great experience, you will have a much easier time translating that experience for desktop screen sizes.
    • Atomic design. Instead of thinking about a website as a collection of pages, start thinking about it as a system of components. Design components that can be adjusted and rearranged across viewports; then make a plan for how those components should reflow as the width of the viewport changes.
    • Style guides. Building a style guide alongside components of the website helps achieve consistency of user interface, user experience, and code. Establishing a style guide is a step that supports the atomic design approach to web design. It is also an important design tool of lean UX.
    • Digital prototyping tools that help convey responsive layouts. With the growing number of prototyping tools, two are worth mentioning for their ability to simulate responsive layouts: UXPin and Axure. They both come with features that allow you to set breakpoints and to mock up layouts for each breakpoint range. Using these tools does not get around the issue of designing for discrete viewport widths rather than for a continuum. However, they offer an ability to create multiple breakpoints within a single mockup, and to preview that mockup in a browser, simulating responsive behavior. This encourages the designer to focus on planning for a changing layout instead of thinking about discrete viewport widths in isolation.


    Responsive web design is an economical long-term approach to building and maintaining a mobile website. When compared to the adaptive approach, responsive web design is less expensive to maintain over a long period of time. When compared to native applications (iOS, Android, etc.), it is a less costly alternative to develop and it results in a web presence that’s easier to discover by search engines. That’s why responsive web design is an approach we favor at Caktus Group.

    In order for responsive web design to truly deliver on the promise of higher ROI, it must be done right. Finalizing high fidelity design mockups ahead of the development process runs a risk of draining resources and may result in disappointment. For that reason at Caktus we prefer to begin the development process while the design is still in its low fidelity stage. That allows us to identify problems early and to pivot to optimize the design as needed.

    Philip SemanchukHow Best to Coerce Python Objects to Integers?


    In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

    The Problem

    Suppose you had to write a Python function to convert string values representing temperatures, like those in this list, to integers —

    ['22', '24', '24', '24', '23', '27']

    The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.

    Let’s Explore Some Solutions

    You might write a function like this —

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except ValueError:
            return None

    Here’s that function in action at the Python prompt —

    >>> print(force_to_int('42'))
    42
    >>> print(force_to_int('oops'))
    None

    That works! However, it’s not as robust as it could be.

    Suppose this function gets input that’s even more unexpected, like None

    >>> print(force_to_int(None))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in force_to_int
    TypeError: int() argument must be a string or a number, not 'NoneType'

    Hmmm, let’s write a better version that catches TypeError in addition to ValueError

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except (ValueError, TypeError):
            return None

    Let’s give that a try at the Python prompt —

    >>> print(force_to_int(None))
    None

    Aha! Now we’re getting somewhere. Let’s try some other types —

    >>> import datetime
    >>> print(force_to_int(datetime.datetime.now()))
    None
    >>> print(force_to_int({}))
    None
    >>> print(force_to_int(complex(3,3)))
    None
    >>> print(force_to_int(ValueError))
    None

    OK, looks good! Time to pop open a cold one and…

    Wait, I can still feed input to this function that will break it. Watch this —

    >>> class Unintable():
    ...     def __int__(self):
    ...         raise ArithmeticError
    ...
    >>> trouble = Unintable()
    >>> print(force_to_int(trouble))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in force_to_int
      File "<stdin>", line 3, in __int__
    ArithmeticError


    While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.
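To make that scenario concrete, here is a sketch of such a wrapper. PressureSensor and its attributes are hypothetical; the point is only that a handler for ValueError and TypeError never sees the custom error.

```python
class SensorNotReadyError(Exception):
    """Raised when a reading is requested before the sensor has warmed up."""


class PressureSensor:
    """Hypothetical wrapper around an industrial pressure sensor."""

    def __init__(self):
        self.ready = False
        self.reading = 0

    def __int__(self):
        # int(sensor) lands here, so this class decides what gets raised.
        if not self.ready:
            raise SensorNotReadyError("sensor is still warming up")
        return self.reading


sensor = PressureSensor()
try:
    int(sensor)
except (ValueError, TypeError):
    outcome = "caught"  # never reached for this class
except SensorNotReadyError:
    outcome = "slipped past ValueError and TypeError"
```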

    And Finally, the Naked Except

    Since any exception is possible when calling int(), our code has to accommodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. They give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice —

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except:
            return None

    At the Python prompt —

    >>> print(force_to_int(trouble))
    None

    Now the bones of the function are complete.

    Complete, Except For One Exception

    Graham Dumpleton’s comment below pointed out that there’s a difference between what I call a ‘naked’ except —

    except:

    And this —

    except Exception:

    The former traps even SystemExit which you don’t want to trap without good reason. From the Python documentation for SystemExit —

    It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

    The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.
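The difference is easy to see in a few lines. This is a small demonstration sketch; the helper names are made up for the example.

```python
# SystemExit and KeyboardInterrupt sit outside the Exception hierarchy.
assert issubclass(SystemExit, BaseException)
assert not issubclass(SystemExit, Exception)
assert not issubclass(KeyboardInterrupt, Exception)


def exit_now():
    raise SystemExit(1)


def with_bare_except():
    try:
        exit_now()
    except:  # deliberately bare: this traps SystemExit too
        return "trapped"


def with_except_exception():
    try:
        exit_now()
    except Exception:  # does not match SystemExit
        return "trapped"


bare_result = with_bare_except()
try:
    exception_result = with_except_exception()
except SystemExit:
    exception_result = "propagated"

print(bare_result)       # trapped
print(exception_result)  # propagated
```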

    The Final Version

    We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

    def int_or_else(value, else_value=None):
        """Given a value, returns the value as an int if possible.
        If not, returns else_value which defaults to None.
        """
        try:
            return int(value)
        # I don't like catch-all excepts, but since objects can raise arbitrary
        # exceptions when executing __int__(), then any exception is
        # possible here, even if only TypeError and ValueError are
        # really likely.
        except Exception:
            return else_value

    At the Python prompt —

    >>> print(int_or_else(trouble))
    None
    >>> print(int_or_else(trouble, 'spaghetti'))
    spaghetti

    So there you have it. I’m happy with this function. It feels bulletproof. It contains an (almost) naked except, but that only covers one simple line of code that’s unlikely to hide anything nasty.
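Going back to the original problem, here is a sketch of how the function might be applied to a hand-typed list of temperatures. The function is repeated so the snippet runs on its own, and the sample readings are invented.

```python
def int_or_else(value, else_value=None):
    """Return value as an int if possible, otherwise else_value."""
    try:
        return int(value)
    except Exception:
        return else_value


readings = ['22', '24', '25C', '', None, ' 23 ', '27']
temperatures = [int_or_else(r) for r in readings]

# Drop the values that could not be converted.
valid = [t for t in temperatures if t is not None]

print(temperatures)  # [22, 24, None, None, None, 23, 27]
print(valid)         # [22, 24, 23, 27]
```

Note that int() already tolerates surrounding whitespace, so ' 23 ' converts without any extra cleanup.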

    You might also want to read a post I made about the exception handling choices in this post.

    I release this code into the public domain, and I’ll even throw in the valuable Unintable class for free!

    The image in this post is public domain and comes to us courtesy of Wikimedia Commons.

    Tim HopperQuerying data on S3 with Amazon Athena

    Athena Setup and Quick Start

    Last week, I needed to retrieve a subset of some log files stored in S3. This seemed like a good opportunity to try Amazon's new Athena service. According to Amazon:

    Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

    Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis.

    Athena uses Presto in the background to allow you to run SQL queries against data in S3. On paper, this seemed equivalent to and easier than mounting the data as Hive tables in an EMR cluster.

    The Athena user interface is similar to Hue and even includes an interactive tutorial where it helps you mount and query publicly available data. It was easy for me to mount my private data using the same CREATE statement I'd run in Hive:

    CREATE EXTERNAL TABLE default.logs (
        -- SCHEMA HERE
    )
    LOCATION 's3://bucket/path/';

    At this point, I could write SQL queries against default.logs. Queries run from the Athena UI run in the background; even if you close the browser window, the query continues to run. Up to 5 queries can be run simultaneously.

    Query results can be downloaded from the UI as CSV files. Results are also written as a CSV file to an S3 bucket; by default, results go to s3://aws-athena-query-results-<account-id>-region/. You can change the bucket by clicking Settings in the Athena UI.

    Up to this point, I was thrilled with the Athena experience. However, after this, I started to uncover the limitations.

    Athena Limitations

    First, Athena doesn't allow you to create an external table on S3 and then write to it with INSERT INTO or INSERT OVERWRITE. Thus, you can't script where your output files are placed. More unsupported SQL statements are listed here.

    Next, the Athena UI only allowed one statement to be run at a time. Because I wanted to load partitioned data, I had to run a bunch of statements of the form `ALTER TABLE default.logs ADD partition (d = numeric-date) LOCATION 's3://bucket/path/numeric-date/'`; using the Athena UI would've required me to run these one day at a time. Thankfully, I was able to run them all at once in SQL Workbench.
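    One way to produce a batch of such statements is to generate them with a short script and paste the output into a SQL client. The sketch below is only an illustration: the function name, the date range, and the assumption that "numeric-date" means a YYYYMMDD stamp are mine, not details from the original setup.

```python
from datetime import date, timedelta


def partition_statements(start, end, table="default.logs",
                         prefix="s3://bucket/path/"):
    """Return one ALTER TABLE ... ADD PARTITION statement per day in [start, end]."""
    statements = []
    day = start
    while day <= end:
        # Assumption: the partition key and folder name are YYYYMMDD stamps.
        stamp = day.strftime("%Y%m%d")
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (d = {stamp}) "
            f"LOCATION '{prefix}{stamp}/';"
        )
        day += timedelta(days=1)
    return statements


if __name__ == "__main__":
    # Hypothetical date range, chosen only for the example.
    for statement in partition_statements(date(2016, 11, 28), date(2016, 12, 2)):
        print(statement)
```

    Each statement is printed on its own line, so the output can be run as a single script in any client that accepts multiple statements.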

    Third, Athena's output format is highly limited. It strictly outputs CSV files where every field is quoted. This was particularly problematic for me because I hoped to later load my data into Impala, and Impala can't extract text data from quoted fields! I was told by Athena support "We do plan to make improvements in this area but I don’t have an ETA yet."
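    If the result files are small enough to post-process, the quoting can be stripped outside of Athena. This is a minimal sketch using Python's standard csv module; the function name and the choice of a tab delimiter with backslash escaping are assumptions for illustration, and whether the downstream loader accepts that escaping would need checking.

```python
import csv
import io


def unquote_csv(src, dst, delimiter="\t"):
    """Copy rows from a fully quoted CSV stream to an unquoted delimited stream.

    Returns the number of rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(
        dst,
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # never wrap fields in quotes
        escapechar="\\",         # a delimiter inside a field is backslash-escaped
        lineterminator="\n",
    )
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


if __name__ == "__main__":
    # Made-up sample rows in the "every field quoted" style.
    quoted = io.StringIO('"2016-12-01","GET","/index.html"\n')
    plain = io.StringIO()
    unquote_csv(quoted, plain)
    print(plain.getvalue(), end="")
```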

    Finally, Athena fell flat on its face in the presence of bad records. I'm not sure whether I had bad GZIPs or malformed logs, but when I did, Athena stopped in its tracks. For my application, I needed my query engine to be able to ignore bad files. Adding to the frustration, even when a query failed, Athena would write partial output (up to the failure) to S3, yet the output files didn't provide any indication that they were partial, incomplete output.
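    For anyone wanting to rule out damaged archives before querying, a rough pre-check is to decompress each file and see whether it reads to the end. The sketch below works on local copies only and is not something the workflow above used; the function name and chunk size are placeholders.

```python
import gzip
import zlib


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses to the end without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files do not fill memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers files that are not gzip at all, EOFError covers
        # truncated files, and zlib.error covers corrupt compressed data.
        return False
    return True
```

    Files that fail the check could then be set aside before the remaining data is queried.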


    My first encounter with Athena was a flop. I ended up switching to EMR and filtering my logs with Hive. Until it offers more control over output and better error handling, Athena will be of limited value to me.

    Caktus GroupUsing Priority in Scrum to address team anxiety

    In Scrum, the backlog of tasks is ordered by the Product Owner from highest to lowest business value - not merely prioritized - so that the team knows what the most valuable items are. This helps to prevent Product Owners/Project Managers from being able to say two or more Product Backlog Items (PBIs) are the “same priority.” And this makes sense for the most part. However, there are times when this information is not enough.

    I am the Product Owner of a team, and we are coming to the last few sprints for a project (re-styling an already existing website, with some new features being added for this phase), and there is still a significant number of high business value tickets in the backlog. The team is feeling anxious and overwhelmed by the sheer number of tickets they see sitting there, even though they are aware that the list is refined. In order to assuage this anxiety, and also help make a plan to allow us to hit the deadline, I decided to make use of the priority field in JIRA.

    To keep things simple, I decided to use three priorities - High, Medium and Low. I started by ranking the backlog items on my own:

    • High priority PBIs are a must-have for this website to go live. These are items that I know as the client representative are non-negotiable.
    • Medium priority is for items that I think the client would want if we could get to them, but would probably be ok without for this phase of the project.
    • Low priority is for items that would not likely be missed by the client, the end-user, or our team.

    The PBIs included tasks as well as bugs. While Scrum states that bugs don’t belong in the backlog, that is where my team found it most useful to keep them.

    I then exported this list to be able to see all PBIs prioritized at a glance (PMs love Excel!), and reviewed it with my team to get their sense of whether my priorities matched their expectations. It was especially helpful on PBIs labeled as Technical Debt, since the developers have a better sense of which of these items are absolutely required for launch. It was also invaluable to ensure that our QA analyst had a say in what bugs were not critical for launch, and to ensure any critical bugs were not overlooked in my prioritizing.

    To my delight, a) the team didn’t change many of my priorities, and b) while this exercise obviously did not decrease the amount of work we still have to do, it did quell some of the anxiety around the seemingly endless backlog.

    And to those Agile purists out there, I am still refining the backlog in the “correct” way. But this exercise was valuable in helping align everyone’s priorities, and share with the team a bird’s-eye view of where we are at, and how far we have to go.

    Caktus GroupDjango is Boring, or Why Tech Startups (Should) Use Django

    I recently attended Django Under The Hood in Amsterdam, an annual gathering of Django core team members and developers from around the world. A common theme discussed at the conference this year is that “Django is boring.” While it’s not the first time this has been discussed, it still struck me as odd. Upon further reflection, however, I see Django’s “boringness” as a huge asset to the community and potential adopters of the framework.

    Caktus first began using Django in late 2007. This was well before the release of Django 1.0, in the days when startups and established companies alike ran production web applications using Subversion “trunk” (akin to the Git “master” branch) rather than using a released version of the software. Using Django was definitely not boring, because it required reading each commit merged to see if it added a new feature you could use and to make sure it wasn’t going to break your project. Although Django kept us on our toes in the early days, it was clear that Django was growing into a robust and stable framework with hope for the future.

    With the help of thousands of volunteers from around the world, Django’s progressed a lot since the early days of “tracking trunk.” What does it mean that the people developing Django itself consider it “boring,” and how does that change our outlook for the future of the framework? If you’re a tech startup looking for a web framework, why would you choose the “boring” option? Following are several reasons that Caktus still uses Django for all new custom web/SMS projects, reasons I think apply equally well in the startup environment.

    1. Django has long taken pride in its “batteries included” philosophy.

    Django strives to be a framework that solves common problems in web development in the best way possible. In my original post on the topic nearly 8 years ago, some of the key features included with Django were the built-in admin interface and a strong focus on data integrity, two features missing from Ruby on Rails, the other major web framework at the time.

    Significant features that have arrived in Django since that time include support for aggregates and query expressions in the ORM, a built-in application for geographic applications (django.contrib.gis), a user messages framework, CSRF protection, Python 3 support, a configurable User model, improved database transaction management, support for database migrations, support for full-text search in Postgres, and countless other features, bug fixes, and security updates. The entire time, Django’s emphasis on backwards compatibility and its generous deprecation policy have made it perfectly reasonable to plan to support and grow applications over 10 years or more.

    2. The community around Django continues to grow.

    In the tradition of open source software, users of the framework new and old support each other via the mailing list, IRC channel, blog posts, StackOverflow, and cost-effective conferences around the globe. The ecosystem of reusable apps continues to grow, with 3317 packages available on https://djangopackages.org/ as of the time of this post.

    A common historical pattern has been for apps or features to live external to Django until they’re “proven” in production by a large number of users, after which they might be merged into Django proper. Django also recently adopted the concept of “official” packages, where a third-party app might not make sense to merge into Django proper, but it’s sufficiently important to the wider Django community that the core team agrees to take ownership of its ongoing maintenance.

    The batteries included in Django itself and the wealth of reusable apps not only help new projects get off the ground quickly, they also provide solutions that have undergone rigorous code review by experts in the relevant fields. This is particularly important in startup environments when the focus must be on building business-critical features quickly. The last thing a startup wants to do, for example, is focus on business-critical features at the expense of security or reliability; with Django, one doesn’t have to make this compromise.

    3. Django is written in Python.

    Python is one of the most popular, most taught programming languages in the world. Availability of skilled staff is a key concern for startups hoping to grow their team in the near future, so the prevalence of Python should reassure those teams looking to grow.

    Similarly, Python as a programming language prides itself on readability; one should be able to understand the code one wrote 6-12 months ago. Although this is by no means new nor unique to Django, Python’s straightforward approach to development is another reason some developers might consider it “boring.” Both by necessity and convention, Python espouses the idea of clarity over cleverness in code, as articulated by Brian Kernighan in The Elements of Programming Style. Python’s philosophy about coding style is described in more detail in PEP 20 -- The Zen of Python. Leveraging this philosophy helps increase readability of the code and the bus factor of the project.

    4. The documentation included with Django is stellar.

    Not only does the documentation detail the usage of each and every feature in Django, it also includes detailed release notes, including any backwards-incompatible changes, along with each release. Again, while Django’s rigorous documentation practices aren’t anything new, writing and reading documentation might be considered “boring” by some developers.

    Django’s documentation is important for two key reasons. First, it helps both new and existing users of the framework quickly determine how to use a given feature. Second, it serves as a “contract” for backwards-compatibility in Django; that is, if a feature is documented in Django, the project pledges that it will be supported for at least two additional releases (unless it’s already been deprecated in the release notes). Django’s documentation is helpful both to one-off projects that need to be built quickly, and to projects that need to grow and improve through numerous Django releases.

    5. Last but not least, Django is immensely scalable.

    The framework is used at companies like EventBrite, Disqus, and Instagram to handle web traffic and mobile app API usage on behalf of 500M+ users. Even after being acquired by Facebook, Instagram swapped out their database server but did not abandon Django. Although early startups don’t often have the luxury of worrying about this much traffic, it’s always good to know that one’s web framework can scale to handle dramatic and continuing spikes in demand.

    At Caktus, we’ve engineered solutions for several projects using AWS Auto Scaling that create servers only when they’re needed, thereby maximizing scalability and minimizing hosting costs.

    Django into the future

    Caktus has long been a proponent of the Django framework, and I’m happy to say that remains true today. We established ourselves early on as one of the premier web development companies specializing in Django, we’ve written previously about why we use Django in particular, and Caktus staff are regular contributors not only to Django itself but also to the wider community of open source apps and discussion surrounding the framework.

    Django can be considered a best-of-breed collection of solutions to nearly all the problems common to web development and RESTful, mobile app API development that can be solved in generic ways. This is “boring” because most of the common problems have been solved already; there’s not a lot of low-hanging fruit for new developers to contribute. This is a good thing for startups, because it means there’s less need to build features manually that aren’t specific to the business.

    The risk of adopting any “bleeding edge” technology is that the community behind it will lose interest and move on to something else, leaving the job of maintaining the framework up to the few companies without the budget to switch frameworks. There’s a secondary risk specific to more “fragmented” frameworks as well. Because of Django’s “batteries included” philosophy and focus on backwards compatibility, one can be assured that the features one selects today will continue to work well together in the future, which won’t always be the case with frameworks that rely on third-party packages to perform business-critical functions such as user management.

    These risks couldn’t be any stronger in the world of web development, where the framework chosen must be considered a tried and true partner. A web framework is not a service, like a web server or a database, that can be swapped out for another similar solution with some effort. Switching web frameworks, especially if the programming language changes, may require rewriting the entire application from scratch, so it’s important to make the right choice up front. Django has matured substantially over the last 10 years, and I’m happy to celebrate that it’s now the “boring” option for web development. This means startups choosing Django today can focus more on what makes their projects special, and less on implementing common patterns in web development or struggling to perform a framework upgrade with significant, backwards-incompatible changes. It’s clear we made the right choice, and I can’t wait to see what startups adopt and grow on Django in the future.

    Caktus GroupCSS Grid, not Frameworks, are the Future

    At the 2016 An Event Apart Conference in San Francisco, I peeked under the hood of a new technology that would finally address all the layout woes that we as designers and developers face: CSS Grid Layout Module. At first I was a little skeptical - except for Microsoft Edge, browser support for Grid is currently non-existent - however its official release is actually not that far off. Currently it is enabled behind a flag in Chrome and Firefox, or you can download the latest nightly or developer versions of Firefox or Safari. Here’s my brief synopsis of why I think CSS Grid is going to change the landscape of the web forever, and why I think it’s so important from a design and developer perspective.

    Many website designs today are stuck in what I would call an aesthetic rut. That is, they are all comprised of similar design patterns (similar icons, sections, hero images, etc.) and are structured with common layout patterns. As many speakers at the conference pointed out, this gets boring, fast. The CSS Grid Layout Module is meant to address these concerns by implementing a dynamic method of creating elegant layouts easily, and across two dimensions. Where Flexbox only handled layout in one dimension at a time (either column or row direction), CSS Grid handles layout for columns and rows simultaneously. CSS Grid makes possible what we used to do in traditional print layout: the utilization of white space to create movement and depth, with very little code that is both responsive and easily adaptable to new content.

    CSS Grid involves very little markup. A simple display: grid with its subset of attributes is all it takes. Rather than bore you with examples, check out this nifty guide. What used to comprise hundreds of lines of code wrapped in a framework (Bootstrap, Foundation, Skeleton) is now accomplished with a few lines, and presumably fewer dependencies mean an increase in performance and decrease in page load times. Grid is a great tool to prototype and design with, simply because you can now get up and running with no setup or dependencies - everything is baked into the browser.

    The true power and beauty of Grid is that it allows for both complete control over layout placement, or you can let the browser do the work. You can specify column (or row) spacing, and have CSS Grid decide where to place your content. If you want to leverage more control over where your content goes on the page, you can specify where it goes with grid-column: start line/end line or grid-row: start line/end line, or a combination of both.

    One of the most exciting things about CSS Grid is that we can use it now to prototype and plan for the future. My challenge to you, as designers and developers, is to use it now so that when CSS Grid is released, not only will your project already take advantage of all the new and wonderful possibilities using CSS Grid, you will have also adopted a future-friendly approach for your project. Need inspiration? Check out My reinterpretation of a Japanese Magazine Cover with CSS Grid.

    Caktus GroupDjango Under the Hood 2016 Recap

    Caktus was a proud sponsor of Django Under the Hood (DUTH) 2016 in Amsterdam this year. Organized by Django core developers and community members, DUTH is a highly technical conference that delves deep into Django.

    Django core developers and Caktus Technical Manager Karen Tracey and CEO/Co-founder Tobias McNulty both flew to Amsterdam to spend time with fellow Djangonauts. Since not all of us could go, we wanted to ask them what Django Under the Hood was like.

    Can you tell us more about Django Under the Hood?

    Tobias: This was my first Django Under the Hood. The venue was packed. It’s an in-depth, curated talk series by invite-only speakers. It was impeccably organized. Everything is thought through. They even have little spots where you can pick up a toothbrush and toothpaste.

    Karen: I’ve been to all three. They sell out very quickly. Core developers are all invited, get tickets, and some funding depending on sponsorship. This is the only event where some costs are covered for core developers. DjangoCon EU and US have core devs going, but they attend it however they manage to get funds for it.

    What was your favorite part of Django Under the Hood?

    Tobias: The talks: they’re longer and more detailed than typical conference talks; they’re curated and confined to a single track so the conference has a natural rhythm to it. I really liked the talks, but also being there with the core team. Just being able to meet these people you see on IRC and mailing list, there’s a big value to that. I was able to put people in context. I’d met quite a few of the core team before but not all.

    Karen: I don’t have much time to contribute to Django because of heavy involvement in cat rescue locally and a full time job, but this is a great opportunity to have at least a day to do Django stuff at the sprint and to see a lot of people I don’t otherwise have a chance to see.

    All the talk videos are now online. Which talk do you recommend we watch first?

    Karen: Depends on what you’re interested in. I really enjoyed the Instagram one. As someone who contributed to the Django framework, to see it used and scaled to the size of Instagram, with 500 million plus users, is interesting.

    Tobias: There were humorous insights, like the Justin Bieber effect. Originally they’d sharded their database by user ID, so everybody on the ops team had memorized his user ID to be prepared in case he posted anything. At that scale, maximizing the number of requests they can serve from a single server really matters.

    Karen: All the monitoring was interesting too.

    Tobias: I liked Ana Balica’s testing talk. It included a history of testing in Django, which was educational to me. Django didn’t start with a framework for testing your applications. It was added as a ticket in the low thousands. She also had practical advice on how to treat your test suite as part of the application, like splitting out functional tests and unit tests. She had good strategies to make your unit tests as fast as possible so you can run them as often as needed.

    What was your favorite tip or lesson?

    Tobias: Jennifer Akullian gave a keynote on mental health that had a diagram of how to talk about feelings in a team. You try to dig into what that means. She talked about trying to destigmatize mental health in tech. I think that’s an important topic we should be discussing more.

    Karen: I learned things in each of the talks. I have a hard time picking out one tip that sticks with me. I’d like to look into what Ana Balica said about mutation testing and learn more about it.

    What are some trends you’re seeing in Django?

    Karen: The core developers met for a half-day meeting the first day of the conference. We talked about what’s going on with Django, what’s happened in the past year, what’s the future of Django. The theme was “Django is boring.”

    Tobias: “Django is boring” because it is no longer unknown. It’s an established, common framework now used by big organizations like NASA, Instagram, Pinterest, US Senate, etc. At the start, it was a little known bootstrappy cutting edge web framework. The reasons why we hooked up with Django nine years ago at Caktus, like security and business efficacy, all of those arguments are ever so much stronger today. That can make it seem boring for developers but it’s a good thing for business.

    Karen: It’s been around for a while. Eleven years. A lot of the common challenges in Django have been solved. Not that there aren’t cutting edge web problems. But should you solve some problems elsewhere? For example, in third party, reusable apps like channels, REST framework.

    Tobias: There was also recognition that Django is so much more than the software. It’s the community and all the packages around it. That’s what makes Django great.

    Where do you see Django going in the future?

    Karen: I hate those sorts of questions. I don’t know how to answer that. It’s been fun to see the Django community grow and I expect to see continued growth.

    Tobias: That’s not my favorite question either. But Django has a role in fostering and continuing to grow the community it has. Django can set an example for open source communities on how to operate and fund themselves in sustainable ways. Django is experimenting with funding right now. How do we make open source projects like this sustainable without relying on people with full-time jobs volunteering their nights and weekends? This is definitely not a “solved problem,” and I look forward to seeing the progress Django and other open source communities make in the coming years.

    Thank you to Tobias and Karen for sharing their thoughts.

    Philip SemanchukCreating PDF Documents Using LibreOffice and Python, Part 4

    This is the fourth and final post in a series on creating PDFs using LibreOffice and Python. The first three parts are here:

    They’re all a supplement to a talk I gave at PyOhio 2016.

    This final post is here to point you to a working code example that you can download from my Bitbucket repository. It’s enough to get you started so you can experiment with your own goals in mind.


    One thing I mention in the code that’s worth repeating here is that the code uses ElementTree to manipulate XML. It’s sufficient for this demo, and the fact that it’s part of the Python standard library means you can run the demo without installing any third party libraries. For real world (i.e. non-demo) usage, I recommend lxml as a more robust and helpful alternative to ElementTree.
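    For readers who haven't used ElementTree, here is a minimal sketch of the kind of XML manipulation involved. It is not taken from the Bitbucket repository; the XML fragment, the NAME placeholder, and the function name are invented for illustration and are far simpler than a real LibreOffice document.

```python
import xml.etree.ElementTree as ET

# Invented template fragment; a real document's XML is much larger and namespaced.
TEMPLATE = (
    "<letter>"
    "<greeting>Dear NAME,</greeting>"
    "<body>Thanks for attending.</body>"
    "</letter>"
)


def fill_greeting(xml_text, name):
    """Replace the NAME placeholder in every <greeting> element and return the XML."""
    root = ET.fromstring(xml_text)
    for element in root.iter("greeting"):
        element.text = element.text.replace("NAME", name)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fill_greeting(TEMPLATE, "Ada"))
```

    lxml offers a largely compatible etree interface, so a sketch like this should need few changes if you switch libraries.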

    A Curious Coincidence: Stinkin’ Badges

    The title of my PyOhio talk was “We Don’t Need No Stinkin’ PDF Library: Build PDFs with Python the Lazy Way”. You know the “we don’t need no stinkin’ [whatever]” meme, don’t you? It’s from the Mel Brooks movie Blazing Saddles. (You can find the clip on YouTube.) Did you know that Blazing Saddles is quoting another movie?

    The night before I gave my talk, I walked from my AirBnB to a nearby bar and bottle shop. (It’s simply called “The Bottle Shop”. Ohioans are plain dealers, apparently). I settled in there, happy with a pint of stout. On the big screen they were playing an old black and white Western — The Treasure of the Sierra Madre.

    I didn’t realize until it happened on the screen that this movie is the inspiration for the “We don’t need no stinkin’ badges” quote, although no one ever actually says “We don’t need no stinkin’ badges”. The actual line is “Badges? We ain’t got no badges. We don’t need no badges! I don’t have to show you any stinkin’ badges!”

    It’s pretty close to the line from B. Traven’s novel of the same name.

    I didn’t have time in my talk to mention Blazing Saddles, the mysterious B. Traven, The Treasure of the Sierra Madre, Humphrey Bogart, The Bottle Shop, nor the stout. But I was amused by our brief coincidence in Columbus.

    Caktus GroupOn building relationships - Digital Project Management Summit Recap

    Photo of Elizabeth speaking to DPM 2016 Summit by David Jordan.

    When I first became a digital project manager, I struggled to find professional resources. There was a plethora of information available for traditional project management, but not much specifically for digital project management. Luckily, a colleague recommended the Digital PM Summit, sponsored by the Bureau of Digital.

    It's one of the first, and still one of the only, professional conferences in the United States for digital project managers, and it’s grown every year. I initially attended the Summit three years ago in Austin, TX and it was an eye-opening, informative, and motivational experience. I met many people who did the same work that I did! I don't know where they were hiding before, but I was thankful to finally connect with them. It was such a relief to learn that others had the same challenges that I did, and that I was not alone.

    I’ve attended the Summit every year since, and this year, I was invited to speak. I was one of twenty-two expert speakers and I was thrilled about the opportunity to present on one of the most important aspects of digital project management—relationships. I’ve found that positive working relationships are key not only to project success, but also to my success, my team’s success, and our client’s success. As project managers, we must focus on process and logistics to deliver quality projects on time, and all of that involves people.

    Investing in Relationships is Key to Project Success

    Projects are always about people, no matter where you work, and no matter what the project involves. Building positive working relationships can be challenging, but I’ve found that the best project managers invest in their team, clients, and stakeholders’ success, and not just in the project’s success. While it is possible to launch a project that successfully meets its goals, if the people involved are miserable, was it really a success? After all, the project is not going to pat you on the back, but the people involved would.

    I became a better project manager when I realized the importance of relationships, and when I recognized how much I could impact the people around me. Several years ago, as a brand new PM, I didn’t have the confidence that I do now, and it was difficult for me to take the lead. After a couple years, and thanks in large part to the Digital PM Summit, through which I learned skills that I could apply on the job, I became a more effective PM. I’m more flexible and adaptable, which is key to collaboration, and my interpersonal communication skills have improved.

    The importance of collaboration and communication were key points within my Digital PM Summit presentation, “Think Outside the Project Management Triangle.” The Project Management Triangle, or Iron Triangle, is a widely-known model of the typical constraints of project management that impact project quality—resources (budget and workers), project scope (features and functionality), and schedule (time and prioritization)—these are all components that project managers must consider and work with. In my experience, the Triangle is too limiting and overlooks relationships. The Triangle is a good basic model, but the best PMs think outside of the Triangle to positively leverage relationships in order to balance resources, scope, and schedule.

    The Project Management Triangle, or Iron Triangle

    Approximately fifty project managers attended my talk at the Digital PM Summit on October 13, 2016. By that point, I’d worked at Caktus for four weeks, which impacted my presentation because it was the first time I’ve worked with external clients and the first time I wasn’t a lone PM.

    As an established, full service Django shop, Caktus includes a team of trained PMs who provide professional project management services to clients. Working with other PMs helped me feel more at home at Caktus, and I learned a lot from them in a few weeks. For example, the PM team taught me about the Agile Scrum process, which I was familiar with, but had never practiced before. Scrum includes a product owner who serves as an extension of the client, championing the client’s goals and priorities to the development team. At Caktus, project managers also act as product owners. During the Digital PM Summit, some attendees were curious about how I made the shift from working in-house to working with external clients, and how the Scrum process impacted my transition. I was happy to inform them that while working with external clients is different from working in-house, there are still similarities, and that Scrum had been a refreshing change for me.

    Unity in the Project Management Community Raises our Standards

    It’s not unusual for a PM to be a lone wolf, like I was in my last job where I was connected with only one other digital PM who was in a different department. We quickly became friends and confidants based on our shared experiences. As a new digital PM, support from others is critical to success, and I’m glad the Bureau of Digital, which hosted the Digital PM Summit, provides a platform for project managers to connect and share their knowledge during and after the conference. I was honored to support their mission with my own presentation, and as it turned out, relationship building was a main theme during this year’s Summit.

    The Bureau of Digital’s leaders, Brett Harned, Carl Smith, and Lori Averitt, have increasingly focused on building a supportive community of professional project managers through events like the Summit. This year, the conference brought together 223 talented individuals, and the conversations have not stopped, thanks to Slack, Twitter, and LinkedIn. The attendees are still sharing tips, tools, and strategies with each other, and they’re forming Meetup groups. Since digital project management is still evolving and growing, conversations and collaboration among practitioners and experts are crucial to creating a greater shared understanding of best practices and to raising industry standards as well as recognition, helping all of us to better serve our clients and teams. I’m thrilled to work at a place like Caktus that recognizes the value of digital project management, and supports my engagement within the PM community.

    The relationships I’ve made and the community support that I’ve received via the Digital PM Summit have been integral to my growth and success as a digital project manager. I would not be where I am today, and I certainly would not have presented at the Digital PM Summit, without support. What it comes down to is that no matter who you are or what your job is, none of us work or live in a bubble, and none of us are an island. We depend upon others. Perhaps Carl Smith, one of the conference organizers, said it best: “When you invest in others, they invest in you.”

    Additional Links

    Thinking Outside the Project Management Triangle

    Tim HopperGet Pykafka to work with rdkafka on Linux

    My former colleagues from Parse.ly wrote the fantastic pykafka library, which has an optional C backend using rdkafka. I've had trouble getting it to work, and here are a few things I've learned:

    • The version of rdkafka installable with apt-get was out of date, and pykafka couldn't find the headers it needed. I instead used the simple build instructions in the rdkafka README to build it from head.
    • I was getting the error ImportError: librdkafka.so.1: cannot open shared object file: No such file or directory when trying to use rdkafka from Pykafka. It could be worked around in the short term by setting LD_LIBRARY_PATH=/usr/local/lib. However, I fixed it permanently by running sudo ldconfig after building rdkafka.
    • Pykafka has to be installed after building rdkafka. At the moment, Pykafka tries to build a C-extension to connect to rdkafka, and if that fails, it will install without offering the rdkafka backend. Check the output of pip install pykafka to see if the rdkafka extension built.
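
    One quick way to confirm that last point is to try importing the compiled extension. This is only a minimal sketch: the helper name extension_available is made up for illustration, and the module path pykafka.rdkafka._rd_kafka is an assumption about where Pykafka places its C extension, so check it against your installed version.

```python
import importlib


def extension_available(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing module and a missing shared library
        # such as librdkafka.so.1.
        return False
    return True


# Assumed module path for Pykafka's C extension; adjust if your version differs.
RDKAFKA_EXTENSION = "pykafka.rdkafka._rd_kafka"

if __name__ == "__main__":
    print("rdkafka backend available:", extension_available(RDKAFKA_EXTENSION))
```

    If this reports False right after installing Pykafka, the pip output is the first place to look.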

    Caktus GroupRapidCon 2016: RapidPro Developer's Recap

    Developer Erin Mullaney was just in Amsterdam for RapidCon, a UNICEF-hosted event for developers using RapidPro, an SMS tool built on Django. The teams that have worked on RapidPro and its predecessor RapidSMS have gotten to know each other virtually over the years. This marks the second time they’ve all come from across the globe to share learnings on RapidPro and to discuss its future.

    RapidPro has the potential to transform how field officers build surveys, collect data, and notify populations. It allows users with no technical background to quickly build surveys and message workflows. With over 100% cell phone saturation in many developing regions, SMS presents a cheap, fast means of reaching many quickly.

    Erin worked closely with UNICEF Uganda in the development of a data analytics and reporting tool called TracPro for RapidPro. The organizers invited her to speak about the tool with other RapidPro users.

    How was the conference?

    Erin: The conference was amazing and I was ecstatic to go. Meeting the folks who work at UNICEF for the first time was exciting because we normally only speak via audio over Skype. It was nice to see them in person. We had an evening event, so it was fun to get to know them better in a social atmosphere. It was also a great opportunity to get together with other technical people who are very familiar with RapidPro and to think about ways we could increase usage of this very powerful product.

    What was your talk about?

    Erin: The title of my talk was “TracPro: How it Works and What it Does”. TracPro is an open source dashboard for use with RapidPro. You can use it for activities like real-time monitoring of education surveys. Nyaruka originally built it for UNICEF and it’s now being maintained by Caktus.

    I was one of two developers who worked on TracPro at Caktus. We worked to flesh out the data visualizations including bar charts, line charts over date ranges and maps. We also improved celery tasks and added other features like syncing more detailed contact data from RapidPro.

    What do you hope your listeners came away with?

    Erin: I delved into the code for how we synced data locally via Celery and the RapidPro API and how we did it in a way that is not server-intensive. I also had examples of how to build the visualizations. Both of those features were hopefully helpful for people thinking of building their own dashboards. Building custom dashboards in a short amount of time is really easy and fun. For example, it took me a ShipIt Day to build a custom RapidPro dashboard for PyCon that called the RapidPro API.

    What did you learn at RapidCon?

    Erin: People discussed the tools they were building. UNICEF talked about a new project, eTools, being used for monitoring. That sounds like an interesting project that will grow.

    RapidPro has had exponential usage and growth and Nyaruka and UNICEF are working really hard to manage that. It was interesting to learn about the solutions Nyaruka is looking at to deal with incredibly large data sets from places with a ton of contacts. They’ll be erasing unnecessary data and looking at other ways to minimize these giant databases.

    UNICEF is pretty happy with how RapidPro is working now and doesn’t expect to add too many new features to it. They’re looking ahead to managing dashboard tools like TracPro. So their focus is really on these external dashboards and building them out. The original RapidPro was really not for dashboards.

    What was the best part of RapidCon for you?

    Erin: It was pretty cool to be in a room and selected for this. I was one of only two women. Having them say “You have this knowledge that other developers don’t have” was rewarding. I felt like I had a value-add to this conference based on the past year and a half working on RapidPro-related projects.

    Will you be sharing more of your RapidPro knowledge in the future?

    Erin: So far, it seems we’ve been the only ones giving talks about RapidPro. I gave a RapidPro talk at PyData Carolinas this year with Rebecca Muraya, “Reach More People: SMS Data Collection with RapidPro,” and another during a PyCon 2016 sponsor workshop. I’ve been encouraged to give this talk at more conferences to spread the word about RapidPro further. I plan to submit it to a few 2017 conferences for sure!

    Thank you Erin for sharing your experience with us!

    To see another RapidPro talk Erin gave at PyData Carolinas 2016, watch the video here.

    Tim HopperData Scientists Need More Automation

    Many data scientists aren't lazy enough.

    Whether we are managing production services or running computations on AWS machines, many data scientists are working on computers besides their laptops.

    For me, this often takes the form of SSH-ing into remote boxes[1] and manually configuring the system with a combination of apt installs, Conda environments, and bash scripts.

    To run my service or scripts, I open a tmux window, activate my virtual environment, and start the process.[2]

    When I need to check my logs or see the output, I SSH back into each box, reconnect to tmux (after I remember the name of my session), and tail my logs. When running on multiple boxes, I repeat this process N times. If I need to restart a process, I flip through my tmux tabs until I find the correct process, kill it with a Ctrl-C, and use the up arrow to reload the last run command.

    All of this works, of course. And as we all know, a simple solution that works can be preferable to a fragile solution that requires constant maintenance. That said, I suspect many of us aren't lazy enough. We don't spend enough time automating tasks and processes. Even when we don't save time by doing it, we may save mental overhead.

    I recently introduced several colleagues to some Python-based tools that can help. Fabric is a "library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks." Fabric allows you to encapsulate sequences of commands as you might with a Makefile. Its killer feature is the ease with which it lets you execute those commands on remote machines over SSH. With Fabric, you could tail all the logs on all your nodes with a single command executed in your local terminal. There are a number of talks about Fabric on YouTube if you want to learn more. One of my colleagues reduced his daily workload by writing his system management tasks into a Fabric file.

    Another great tool is Supervisor. If you run long-running processes in tmux/screen/nohup, Supervisor might be for you. It allows you to define the tasks you want to run in an INI file and "provides you with one place to start, stop, and monitor your processes". Supervisor will log the stdout and stderr to a log location of your choice. It can be a little confusing to set up, but will likely make your life easier in the long run.
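
    To give a feel for how small a Supervisor task definition is, here is a rough sketch that writes one program section using Python's standard configparser. The option names (command, autostart, autorestart, stdout_logfile, stderr_logfile) are Supervisor settings; the helper name, the program name "myservice", and all paths are placeholders invented for illustration.

```python
import configparser
import io


def supervisor_program(name, command, log_dir):
    """Return the text of a Supervisor [program:...] section."""
    config = configparser.ConfigParser()
    config[f"program:{name}"] = {
        "command": command,
        "autostart": "true",       # start when supervisord starts
        "autorestart": "true",     # restart if the process exits
        "stdout_logfile": f"{log_dir}/{name}.out.log",
        "stderr_logfile": f"{log_dir}/{name}.err.log",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder command and paths; substitute your own.
ini_text = supervisor_program(
    "myservice",
    "/opt/env/bin/python /opt/app/service.py",
    "/var/log",
)
print(ini_text)
```

    The printed text can be saved as a .conf file wherever your Supervisor installation looks for program definitions.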

    A tool I want to learn but haven't is Ansible, "a free-software platform for configuring and managing computers which combines multi-node software deployment, ad hoc task execution, and configuration management". Unlike Chef and Puppet, Ansible doesn't require an agent on the systems you need to configure; it does all the configuration over SSH. You can use Ansible to configure your systems and install your dependencies, even Supervisor! Ansible is written in Python and, mercifully, doesn't require learning a Ruby-based DSL (as does Chef).

    Recently I've been thinking that Fabric, Supervisor, and Ansible combined become a powerful toolset for management and configuration of data science systems. Each tool is also open source and can be installed in a few minutes. Each tool is well documented and offers helpful tutorials on getting started; however, learning to use them effectively may require some effort.

    I would love to see someone create training materials on these tools (and others!) focused on how data scientists can improve their system management, configuration, and operations. A screencast series may be the perfect thing. Someone please help data scientists be lazier, do less work, and reduce the mental overhead of dealing with computers!

    1. Thankfully I recently started taking better advantage of aliases in my ssh config

    2. When I have to do this on multiple machines, I'm occasionally clever enough to use tmux to broadcast the commands to multiple terminal windows. 

    Caktus GroupCommon web site security vulnerabilities

    I recently decided I wanted to understand better what Cross-Site Scripting and Cross-Site Request Forgery were, and how they compared to that classic vulnerability, SQL Injection.

    I also looked into some ways that sites protect against those attacks.


    SQL Injection

    SQL Injection is a classic vulnerability. It probably dates back almost to punch cards.

    Suppose a program uses data from a user in a database query.

    For example, the company web site lets users enter a name of an employee, free-form, and the site will search for that employee and display their contact information.

    A naive site might build a SQL query as a string using code like this, including whatever the user entered as NAME:

    "SELECT * FROM employees WHERE name LIKE '" + NAME + "'"

    If NAME is "John Doe", then we get:

    SELECT * FROM employees WHERE name LIKE 'John Doe'

    which is fine. But suppose someone types this into the NAME field:

    John Doe'; DROP TABLE EMPLOYEES;

    then the site will end up building this query:

    SELECT * FROM employees WHERE name LIKE 'John Doe'; DROP TABLE EMPLOYEES;'

    which might delete the whole employee directory. It could instead do something less obvious but even more destructive in the long run.

    This is called a SQL Injection attack, because the attacker is able to inject whatever they want into a SQL command that the site then executes.

    Cross-Site Scripting

    Cross-Site Scripting, or XSS, is a similar idea. If an attacker can get their Javascript code embedded into a page on the site, so that it runs whenever someone visits that page, then the attacker's code can do anything on that site using the privileges of the user.

    For example, maybe an attacker posts a comment on a page that looks to users like:

    Great post!

    but what they really put in their comment was:

    Great post!<script> do some nefarious Javascript stuff </script>

    If the site displays comments by just embedding the text of the comment in the page, then whenever a user views the page, the browser will run the Javascript - it has no way to know this particular Javascript on the page was written by an attacker rather than the people running the site.

    This Javascript is running in a page that was served by the site, so it can do pretty much anything the user who is currently logged in can do. It can fetch all their data and send it somewhere else, or if the user is particularly privileged, do something more destructive, or create a new user with similar privileges and send its credentials somewhere the bad guy can retrieve them and use them later, even after the vulnerability has been discovered and fixed.

    So, clearly, a site that accepts data uploaded by users, stores it, and then displays it, needs to be careful of what's in that data.

    But even a site that doesn't store any user data can be vulnerable. Suppose a site lets users search by going to http://example.com/search?q=somethingtosearchfor (Google does something similar to this), and then displays a page showing what the search string was and what the results were. An attacker can embed Javascript into the search term part of that link, put that link somewhere people might click on it, and maybe label it "Cute Kitten Pictures". When a user clicks the link to see the kittens, her browser visits the site and tries the search. It'll probably fail, but if the site embeds the search term in the results page unchanged (which Google doesn't do), the attacker's code will run.

    Why is it called Cross-Site Scripting? Because it allows an attacker to run their script on a site they don't control.


    Cross-Site Request Forgeries

    The essence of a Cross-Site Request Forgery (CSRF) attack is a malicious site making a request to another site, the site under attack, using the current user's permissions.

    That last XSS example could also be considered a CSRF attack.

    As another, extreme example, suppose a site implemented account deletion by having a logged-in user visit (GET) /delete-my-account. Then all a malicious site would have to do is link to yoursite.com/delete-my-account and if a user who was logged into yoursite.com clicked the link, they'd make the /delete-my-account request and their account would be gone.

    In a more sophisticated attack, a malicious site can build a form or make AJAX calls that do a POST or other request to the site under attack when a user visits the malicious site.

    Protecting against vulnerabilities

    Protections in the server and application

    SQL Injection protection

    Django's ORM, and most database interfaces I've seen, provide a way to specify parameters to queries directly, rather than having the programmer build the whole query as a string. Then the database API can do whatever is appropriate to protect against malicious content in the parameters.
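
    As a rough illustration of parameterized queries, here is a sketch using Python's built-in sqlite3 module as a stand-in for whatever database API a site actually uses (it is not Django's ORM). The table contents and the find_employee helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("John Doe", "555-0100"), ("Jane Roe", "555-0101")],
)


def find_employee(conn, name):
    """Look up employees, passing the user's input as a query parameter."""
    # The "?" placeholder keeps the input out of the SQL text entirely,
    # so quotes and semicolons in it are treated as plain data.
    cursor = conn.execute(
        "SELECT name, phone FROM employees WHERE name LIKE ?", (name,)
    )
    return cursor.fetchall()


safe_rows = find_employee(conn, "John Doe")
attack_rows = find_employee(conn, "John Doe'; DROP TABLE EMPLOYEES;")
remaining = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
```

    With the parameter passed separately, the malicious input simply matches no employee, and the table is left untouched.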

    XSS protection

    Django templates apply "escaping" to all embedded content by default. This replaces characters that ordinarily would be special to the browser, like "<", with HTML entities such as "&lt;", so that the browser will just display the "<" instead of interpreting it. That means if content includes "<SCRIPT>...</SCRIPT>", instead of the browser executing the "..." part, the user will just see "<SCRIPT>...</SCRIPT>" on the page.
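
    The effect of escaping can be seen with Python's standard html.escape function. This is only a stand-in to show the idea, not the code Django's template engine runs, and the comment text is made up.

```python
import html

# A comment like the one from the XSS example above.
comment = "Great post!<script>steal()</script>"

# Escaping turns the markup characters into HTML entities, so the
# browser displays them as text instead of interpreting them.
escaped = html.escape(comment)
print(escaped)
```

    The escaped string contains no "<" or ">" characters at all, so there is no script tag left for the browser to run.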

    CSRF protection

    We obviously can't disable links to other sites - that would break the entire web. So to protect against CSRF, we have to make sure that another site cannot build any request to our site that would actually do anything harmful.

    The first level of protection is simply making sure that request methods like GET don't change anything, or display unvalidated data. That blocks the simplest possible attack, where a simple link from another site causes harm when followed.

    A malicious site can still easily build a form or make AJAX calls that do a POST or other request to the site under attack, so how do we protect against that?

    Django's protection is to always include a user-specific, unguessable string as part of such requests, and reject any such request that doesn't include it. This string is called the CSRF token. Any form on a Django site that does a POST (or another data-modifying request) has to include it as one of the submitted parameters. Since the malicious site doesn't know the token, it cannot generate a malicious POST request that the Django site will pay any attention to.
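
    The token idea can be sketched in a few lines of standard-library Python. This is a simplified illustration under my own assumptions, not Django's actual implementation: the session is just a dict, and the function names issue_token and is_valid are hypothetical.

```python
import hmac
import secrets


def issue_token(session):
    """Create an unguessable token and remember it for this user's session."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid(session, submitted):
    """Accept a request only if it carries the token issued for this session."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected.encode(), submitted.encode())


session = {}
token = issue_token(session)   # embedded in the site's own forms
```

    A form served by the site itself submits the token and passes the check, while a request forged by another site has to guess and is rejected.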

    Protections in the browser

    Modern browsers implement a number of protections against these kinds of attacks.

    "But wait", I hear you say. "How can I trust browsers to protect my application, when I have no control over the browser being used?"

    I frequently have to remind myself that browser protections are designed to protect the user sitting in front of the browser, who for these attacks, is the victim, not the attacker. The user doesn't want their account hacked on your site any more than you do, and these browser protections help keep the attacker from doing that to the user, and incidentally to your site.

    Same-origin security policy

    All modern browsers implement a form of Same Origin Policy, which I'll call SOP. In some cases, it prevents a page loaded from one site from accessing resources on other sites, that is, resources that don't have the same origin.

    The most important thing about SOP is that AJAX calls are restricted by default. Since an AJAX call can use POST and other data-modifying HTTP requests, and would send along the user's cookies for the target site, an AJAX call could do anything it wanted using the user's permissions on the target site. So browsers don't allow it.

    What kind of attack does this prevent? Suppose the attacker sets up a site with lots of cute kitten pictures, and gets a user victim to access it. Without SOP, pages on that site could run Javascript that made AJAX calls (in the background) to the user's bank. Such calls would send along whatever cookies the user's browser had stored for the bank site, so the bank would treat them as coming from the user. But with SOP, the user's browser won't let those AJAX calls to another site happen. They can only talk to the attacker's own site, which doesn't do the attacker any good.
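
    Two URLs have the same origin when their scheme, host, and port all match. The comparison can be sketched with Python's urllib.parse; the helper names and example hostnames below are invented, and real browsers handle more cases than this.

```python
from urllib.parse import urlsplit

# Ports implied by the scheme when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a, url_b):
    """Return True if both URLs share scheme, host, and port."""
    return origin(url_a) == origin(url_b)
```

    Under this check, a page on the kitten site and a URL at the bank never compare equal, which is the distinction the browser relies on.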


    Content Security Policy (CSP)

    CSP is a newer mechanism that browsers can use to better protect from these kinds of attacks.

    If a response includes the CSP header, then by default the browser will not allow any inline javascript, CSS, or use of javascript "eval" on the page. This blocks many forms of XSS. Even if an attacker manages to trick the server into including malicious code on the page, the browser will refuse to execute it.

    For example, if someone uploads a comment that includes a <script> tag with some Javascript, and the site includes that in the page, the browser just won't run the Javascript.
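
    For a sense of what the header looks like, here is a small sketch that assembles a Content-Security-Policy value from a dictionary of directives. The helper name csp_header and the image host are placeholders; default-src and img-src are standard CSP directives, and 'self' means "only this site's own origin".

```python
def csp_header(directives):
    """Build a (name, value) pair for a Content-Security-Policy header."""
    value = "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )
    return ("Content-Security-Policy", value)


header = csp_header(
    {
        "default-src": ["'self'"],
        # Placeholder host for images served from elsewhere.
        "img-src": ["'self'", "https://images.example.com"],
    }
)
print(": ".join(header))
```

    A server would send this pair along with each page response.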


    I've barely scratched the surface of these topics here. Any web developer ought to have at least a general knowledge of common vulnerabilities, if only to know what areas might require more research on a given project.

    A reasonable place to start is Django's Security Overview.

    The OWASP Top Ten is a list of ten of the most commonly exploited vulnerabilities, with links to more information about each. The ones I've described here are numbers 1, 3, and 8 on the list, so you can see there are many more to be aware of.

    Tim HopperSpeeding up PyMC3 NUTS Sampler

    I've been trying to use the NUTS sampler in PyMC3.

    However, it was running at 2 iterations per second on my model, while the Metropolis-Hastings sampler ran 450x faster.

    I showed my example to some of the PyMC3 devs on Twitter, and Thomas Wiecki showed me a trick: fit the model with ADVI first, then use the ADVI result to set the scaling for NUTS.

    It resulted in a 25x speedup of the NUTS sampler. The code looks like this:

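    import numpy as np
    import pymc3 as pm

    with pm.Model() as model:
        mu, sds, elbo = pm.variational.advi(n=200000)  # fit ADVI first
        step = pm.NUTS(scaling=np.power(model.dict_to_array(sds), 2))  # scale NUTS by the ADVI variances
        # The original snippet is cut off after "niter,"; step=step and start=mu are an assumed completion.
        trace = pm.sample(niter, step=step, start=mu)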

    Tim HopperFilter by date in a Pandas MultiIndex

    I always forget how to do this.

    The pandas DataFrame.loc method allows for label-based filtering of data frames. The Pandas docs show how it can be used to filter a MultiIndex:

    In [39]: df
                         A         B         C
    first second
    bar   one     0.895717  0.410835 -1.413681
          two     0.805244  0.813850  1.607920
    baz   one    -1.206412  0.132003  1.024180
          two     2.565646 -0.827317  0.569605
    foo   one     1.431256 -0.076467  0.875906
          two     1.340309 -1.187678 -2.211372
    qux   one    -1.170299  1.130127  0.974466
          two    -0.226169 -1.436737 -2.006747
    In [40]: df.loc['bar']
                   A         B         C
    one     0.895717  0.410835 -1.413681
    two     0.805244  0.813850  1.607920
    In [41]: df.loc['bar', 'two']
    A    0.805244
    B    0.813850
    C    1.607920
    Name: (bar, two), dtype: float64

    It turns out you can easily use it to filter a DatetimeIndex level by a single date with df.loc['2016-11-07'] or a range of dates with df.loc['2016-11-07':'2016-11-11']. This applies whether or not it's a MultiIndex.

    If you get an error like KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)', it's because "MultiIndex Slicing requires the index to be fully lexsorted". You may fix your problem by calling df = df.sort_index().
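
    Here is a small sketch of the whole pattern on made-up data, assuming the dates are the first level of the MultiIndex. The level names "date" and "label" and the column name "value" are invented for the example.

```python
import numpy as np
import pandas as pd

# Ten days of data with two labels per day, dates as the first index level.
dates = pd.date_range("2016-11-05", periods=10, freq="D")
index = pd.MultiIndex.from_product([dates, ["a", "b"]], names=["date", "label"])
df = pd.DataFrame({"value": np.arange(len(index))}, index=index)

# Sorting the index first avoids the lexsort depth error described above.
df = df.sort_index()

one_day = df.loc["2016-11-07"]                    # rows for a single date
date_range = df.loc["2016-11-07":"2016-11-11"]    # rows for an inclusive range
```

    Note that both endpoints of the slice are included, unlike ordinary Python slicing.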