A planet of blogs from our members...

Caktus GroupPython type annotations

When it comes to programming, I have a belt and suspenders philosophy. Anything that can help me avoid errors early is worth looking into.

The type annotation support that's been gradually added to Python is a good example. Here's how it works and how it can be helpful.

Introduction

The first important point is that the new type annotation support has no effect at runtime. Adding type annotations in your code has no risk of causing new runtime errors: Python is not going to do any additional type-checking while running.
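
For example (a quick sketch of my own), an annotated function runs exactly as it would without the annotations:

def double(x: str) -> str:
    return x * 2

# No checking happens at runtime: this runs fine and prints 42,
# even though 21 is not a str.
print(double(21))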

Instead, you'll be running separate tools to type-check your programs statically during development. I say "separate tools" because there's no official Python type checking tool, but there are several third-party tools available.

So, if you chose to use the mypy tool, you might run:

$ mypy my_code.py

and it might warn you that a function that was annotated as expecting string arguments was going to be called with an integer.

Of course, for this to work, you have to be able to add information to your code to let the tools know what types are expected. We do this by adding "annotations" to our code.

One approach is to put the annotations in specially-formatted comments. The obvious advantage is that you can do this in any version of Python, since it doesn't require any changes to the Python syntax. The disadvantages are the difficulties in writing these things correctly, and the coincident difficulties in parsing them for the tools.
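
For example, a function type comment in the style that PEP-484 later standardized looks roughly like this:

def greeting(name):
    # type: (str) -> str
    return 'Hello, ' + name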

To help with this, Python 3.0 added support for adding annotations to functions (PEP-3107), though without specifying any semantics for the annotations. Python 3.6 added support for annotations on variables (PEP-526).

Two additional PEPs, PEP-483 and PEP-484, define how annotations can be used for type-checking.

Since I try to write all new code in Python 3, I won't say any more about putting annotations in comments.

Getting started

Enough background, let's see what all this looks like.

Python 3.6 was just released, so I’ll be using it. I'll start with a new virtual environment, and install the type-checking tool mypy (whose package name is mypy-lang):

$ virtualenv -p $(which python3.6) try_types
$ . try_types/bin/activate
$ pip install mypy-lang

Let's see how we might use this when writing some basic string functions. Suppose we're looking for a substring inside a longer string. We might start with:

def search_for(needle, haystack):
    offset = haystack.find(needle)
    return offset

If we were to call this with anything that's not text, we'd consider it an error. To help us avoid that, let's annotate the arguments:

def search_for(needle: str, haystack: str):
    offset = haystack.find(needle)
    return offset

Does Python care about this?

$ python search1.py
$

Python is happy with it. There's not much yet for mypy to check, but let's try it:

$ mypy search1.py
$

In both cases, no output means everything is okay.

(Aside: mypy uses information from the files and directories on its command line plus all packages they import, but it only does type-checking on the files and directories on its command line.)

So far, so good. Now, let's call our function with a bad argument by adding this at the end:

search_for(12, "my string")

If we tried to run this, it wouldn't work:

$ python search2.py
Traceback (most recent call last):
    File "search2.py", line 4, in <module>
        search_for(12, "my string")
    File "search2.py", line 2, in search_for
        offset = haystack.find(needle)
TypeError: must be str, not int

In a more complicated program, we might not have run that line of code until sometime when it would be a real problem, and so wouldn't have known it was going to fail. Instead, let's check the code immediately:

$ mypy search2.py
search2.py:4: error: Argument 1 to "search_for" has incompatible type "int"; expected "str"

Mypy spotted the problem for us and explained exactly what was wrong and where.

We can also indicate the return type of our function:

def search_for(needle: str, haystack: str) -> str:
    offset = haystack.find(needle)
    return offset

and ask mypy to check it:

$ mypy search3.py
search3.py: note: In function "search_for":
search3.py:3: error: Incompatible return value type (got "int", expected "str")

Oops, we're actually returning an integer but we said we were going to return a string, and mypy was smart enough to work that out. Let's fix that:

def search_for(needle: str, haystack: str) -> int:
    offset = haystack.find(needle)
    return offset

And see if it checks out:

$ mypy search4.py
$

Now, maybe later on we forget just how our function works, and try to use the return value as a string:

x = len(search_for('the', 'in the string'))

Mypy will catch this for us:

$ mypy search5.py
search5.py:5: error: Argument 1 to "len" has incompatible type "int"; expected "Sized"

We can't call len() on an integer. Mypy wants something of type Sized -- what's that?

More complicated types

The built-in types will only take us so far, so Python 3.5 added the typing module, which both gives us a bunch of new names for types, and tools to build our own types.

In this case, typing.Sized represents anything with a __len__ method, which is the only kind of thing we can call len() on.
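
For example (a small sketch of my own), a function that only needs len() can ask for a Sized argument:

from typing import Sized

def describe(thing: Sized) -> str:
    # Sized guarantees a __len__ method, so len() is safe here
    return "%d item(s)" % len(thing)

describe([1, 2, 3])   # fine
describe(42)          # mypy should flag this: int is not Sized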

Let's write a new function that'll return a list of the offsets of all of the instances of some string in another string. Here it is:

from typing import List

def multisearch(needle: str, haystack: str) -> List[int]:
    # Not necessarily the most efficient implementation
    offset = haystack.find(needle)
    if offset == -1:
        return []
    return [offset] + multisearch(needle, haystack[offset+1:])

Look at the return type: List[int]. You can define a new type, a list of a particular type of elements, by saying List and then adding the element type in square brackets.

There are a number of these - e.g. Dict[keytype, valuetype] - but I'll let you read the documentation to find these as you need them.
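
For instance (my own sketch, not from the documentation), a Dict return type plus a PEP-526 variable annotation might look like this:

from typing import Dict

def word_counts(text: str) -> Dict[str, int]:
    counts: Dict[str, int] = {}   # a variable annotation (PEP-526)
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts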

mypy passed the multisearch code above, but suppose we had accidentally had it return None when there were no matches:

def multisearch(needle: str, haystack: str) -> List[int]:
    # Not necessarily the most efficient implementation
    offset = haystack.find(needle)
    if offset == -1:
        return None
    return [offset] + multisearch(needle, haystack[offset+1:])

mypy should spot that there's a case where we don't return a list of integers, like this:

$ mypy search6.py
$

Uh-oh - why didn't it spot the problem here? It turns out that by default, mypy considers None compatible with everything. To my mind, that's wrong, but luckily there's an option to change that behavior:

$ mypy --strict-optional search6.py
search6.py: note: In function "multisearch":
search6.py:7: error: Incompatible return value type (got None, expected List[int])

I shouldn't have to remember to add that to the command line every time, though, so let's put it in a configuration file just once. Create mypy.ini in the current directory and put in:

[mypy]
strict_optional = True

And now:

$ mypy search6.py
search6.py: note: In function "multisearch":
search6.py:7: error: Incompatible return value type (got None, expected List[int])

But speaking of None, it's not uncommon to have functions that can either return a value or None. We might change our search_for function to return None if it doesn't find the string, instead of -1:

def search_for(needle: str, haystack: str) -> int:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    else:
        return offset

But now we don't always return an int and mypy will rightly complain:

$ mypy search7.py
search7.py: note: In function "search_for":
search7.py:4: error: Incompatible return value type (got None, expected "int")

When a function can return different types, we can annotate it with a Union type:

from typing import Union

def search_for(needle: str, haystack: str) -> Union[int, None]:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    else:
        return offset

There's also a shortcut, Optional, for the common case of a value being either some type or None:

from typing import Optional

def search_for(needle: str, haystack: str) -> Optional[int]:
    offset = haystack.find(needle)
    if offset == -1:
        return None
    else:
        return offset

Wrapping up

I've barely touched the surface, but you get the idea.

One nice thing is that the Python standard library is already annotated for us (mypy bundles type stubs for it). You might have noticed above that mypy knew that calling find on a str returns an int - that's because str.find is already annotated. So you can get some benefit just by calling mypy on your code without annotating anything at all -- mypy might spot some misuses of the libraries for you.
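
For example (a sketch of my own; the exact error wording may differ):

# unannotated.py -- no annotations of our own at all
offset = "in the string".find("the")
message = "found at: " + offset   # mypy should still flag mixing str and int here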


Tim HopperLogistic Regression Rules Everything Around Me

Fred Benenson spent 6 years doing data science at Kickstarter. When he left last year, he wrote a fantastic recap of his experience.

His "list of things I've discovered over the years" is particularly good. Here are a few of the things that resonated with me:

  • The more you can work with someone to help refine their question the easier it will be to answer
  • Conducting a randomized controlled experiment via an A/B test is always better than analyzing historical data
  • Metrics are crucial to the story a company tells itself; it is essential to honestly and rigorously define them
  • Good experimental design is difficult; don't allow a great testing framework to let you get lazy with it
  • Data science (A/B testing, etc.) can help you figure out how to optimize for a particular outcome, but it will never tell you which particular outcome to optimize for
  • Always seek to record and attain data in its rawest form, whether you're instrumenting something yourself or retrieving it from an API

I highly recommend reading the whole post.

    Philip SemanchukPandas Surprise

    Summary

    Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.

    Background

    A photo of the many shapes and colors in Lucky Charms cereal. Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.

    I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –

           circle square star
    blue        8     41   18
    orange      5     33   25
    red        53     64   58

    At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –

    import collections
    SHAPES = ('square', 'circle', 'star', )
    frequencies = {shape: collections.Counter() for shape in SHAPES}

    Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).

    for shape, color in all_my_objects:
        frequencies[shape][color] += 1

    So far, so good.

    Enter the Pandas

    This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

    It was especially easy to try out a DataFrame because my counting loop ( for...all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

    import pandas as pd
    # COLORS needs to be known up front for this version, e.g.:
    COLORS = ('red', 'blue', 'orange')
    frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
                               dtype='int')
    for shape, color in all_my_objects:
        frequencies[shape][color] += 1

    It Works, But…

    Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

    How Slow is it?

    To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

    First, I timed how long it takes to increment a simple variable, just to get a baseline.

    Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.

    Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

    Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.

    Raw Benchmark Results

    Here’s what timeit showed me. Sorry for the cramped formatting.

    $ python
     Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
     [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import timeit
     >>> timeit.timeit('data += 1', setup='data=0')
     0.09242476700455882
     >>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
     0.6838196019816678
     >>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
     0.8909121589967981
     >>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
     157.56428507200326
     >>>

    Benchmark Results Summary

    Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.

    Method              Actual (seconds)   Normalized (ratio)
    Simple variable                0.092                1
    Dict + Counter                 0.683                7.398
    Numpy 2D array                 0.890                9.639
    Pandas DataFrame             157.564             1704.784

    As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

    The DataFrame, however, is roughly 175 – 230 times slower than either of those two methods. Ouch!

    I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

    Here’s a bar chart of the first three methods –

    A bar chart of the first three methods in the preceding table

    Here’s a bar chart of all four –

    A bar chart of all four methods in the preceding table

    Why Is My DataFrame Access So Slow?

    One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –

    >>> SHAPES = ('square', 'circle', 'star', )
    >>> COLORS = ('red', 'blue', 'orange')
    >>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
            square  circle  star
    red          0       0     0
    blue         0       0     0
    orange       0       0     0
    >>>

    Then frequencies['square']['orange'] is a valid reference.

    Not only that, DataFrames support a variety of indexing and slicing options including –

    • A single label, e.g. 5 or 'a'
    • A list or array of labels ['a', 'b', 'c']
    • A slice object with labels 'a':'f'
    • A boolean array
    • A callable function with one argument

    Here are those techniques applied in order to the frequencies DataFrame so you can see how they work –

    >>> frequencies['star']
    red       0
    blue      0
    orange    0
    Name: star, dtype: int64
    >>> frequencies[['square', 'star']]
            square  star
    red          0     0
    blue         0     0
    orange       0     0
    >>> frequencies['red':'blue']
          square  circle  star
    red        0       0     0
    blue       0       0     0
    >>> frequencies[[True, False, True]]
            square  circle  star
    red          0       0     0
    orange       0       0     0
    >>> frequencies[lambda x: 'star']
    red       0
    blue      0
    orange    0
    Name: star, dtype: int64
    
    

    This flexibility has a price. Slicing (which is what is invoked by the square brackets) calls an object’s __getitem__() method. The parameter to __getitem__() is whatever was inside the square brackets. A DataFrame’s __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.
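
    To make this concrete, here's a tiny illustration of my own (not pandas code) showing that the square brackets hand __getitem__() very different kinds of parameters –

    class Demo:
        def __getitem__(self, key):
            # Square-bracket access passes along whatever was inside the brackets
            print("__getitem__ got:", repr(key))

    d = Demo()
    d['star']           # __getitem__ got: 'star'
    d['red':'blue']     # __getitem__ got: slice('red', 'blue', None)
    d[[True, False]]    # __getitem__ got: [True, False]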

    If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there’s two slice operations, so __getitem__() gets invoked twice each time I increment my counter.

    This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.

    Better DataFrame Indexing?

    DataFrames support a few accessors that exist explicitly to support “fast” getting and setting of scalars. Those are .at (for label lookups) and .iat (for integer-based index lookups). It also provides get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).

    “Fast” is how the Pandas documentation describes these accessors. Let's use timeit to get some hard data. I'll try .at and .iat; I'll also try get_value()/set_value() even though they're deprecated.

    >>> timeit.timeit("data.at['red','square']+=1",setup="import pandas as pd;data=pd.DataFrame(columns=('square','circle','star'),index=('red','blue','orange'),data=0,dtype='int')")
    36.33179204000044
    >>> timeit.timeit('data.iat[0,0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
    42.01523362501757
    >>> timeit.timeit('data.set_value(0,0,data.get_value(0,0)+1)',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
    15.050199927005451
    >>>

    These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].

    Method                 Actual (seconds)   Normalized (ratio)
    Dict + Counter                    0.683                1
    Numpy 2D array                    0.890                1.302
    DataFrame (get/set)              15.050               22.009
    DataFrame (at)                   36.331               53.130
    DataFrame (iat)                  42.015               61.441
    DataFrame (naïve)               157.564              230.417

    The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.

    Workaround

    I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.

    I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.

    import collections
    import pandas as pd

    SHAPES = ('square', 'circle', 'star', )
    frequencies = {shape: collections.Counter() for shape in SHAPES}
    for shape, color in all_my_objects:
        frequencies[shape][color] += 1
    
    frequencies = pd.DataFrame(data=frequencies)

    The frequencies DataFrame now looks something like this –

             circle square star
     blue         8     41   18
     orange       5     33   25
     red         53     64   58

    The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.

    There’s one more detail to be aware of. If a particular (shape, color) combination doesn’t appear in my data, it will be represented by NaN in the DataFrame. They’re easy to set to 0 with frequencies.fillna(0).

    Conclusion

    What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but it slowed it down by a factor of ~1700. Since I had thousands of items to process, the difference was hard to overlook!

    Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.

    Caktus GroupCaktus Attends Wagtail CMS Sprint in Reykjavik

    Caktus CEO Tobias McNulty and Sales Engineer David Ray recently had the opportunity to attend a development sprint for the Wagtail Content Management System (CMS) in Reykjavik, Iceland. The two-day software development sprint attracted 15 attendees hailing from a total of 5 countries across North America and Europe.

    Wagtail sprinters in Reykjavik

    Wagtail was originally built for the Royal College of Art by UK firm Torchbox and is now one of the fastest-growing open source CMSs available. Being longtime champions of the Django framework, we’re also thrilled that Wagtail is Django-based. This makes Wagtail a natural fit for content-heavy sites that might still benefit from the customization made possible through the CMS’ Django roots.

    Tobias & Tom in Reykjavik

    The team worked on a wide variety of projects, including caching optimizations, an improved content model, a new React-based page explorer, the integration of a new rich-text editor (Draft.js), performance enhancements, other new features, and bug fixes.

    David & Scot in Reykjavik

    Team Wagtail Bakery stole the show with a brand-new demo site that’s visually appealing and better demonstrates the level of customization afforded by the Wagtail CMS. The new demo site, which is still in development as of the time of this post, can be found at wagtail/bakerydemo on GitHub.

    Wagtail Bakery on laptop screen

    After the sprint was over, our hosts at Overcast Software were kind enough to take us on a personalized tour of the countryside around Reykjavik. We left Iceland with significant progress on a number of Wagtail pull requests, new friends, and a new appreciation for the country's magical landscapes.

    Wagtail sprinters on road trip, in front of waterfall

    We were thrilled to attend and are delighted to be a part of the growing Wagtail community. If you're interested in participating in the next Wagtail sprint, it is not far away. Wagtail Space is taking place in Arnhem, The Netherlands March 21st-25th and is being organized to accommodate both local and remote sprinters. We hope to connect with you then!

    Caktus GroupHow to write a bug report

    Here are some brief thoughts on writing good bug reports in general.

    Main elements

    There are four crucial elements when writing a bug report:

    • What did you do
    • What did you see
    • What did you expect to see
    • Why did you expect to see that

    What did you do

    This is sometimes called "Steps to reproduce".

    The purpose of this part is so the person trying to fix the bug can reproduce it. If they can't reproduce it, they probably can't fix it.

    The most common problem here is not enough detail.

    To help avoid that, it's a good idea to write this as though the person reading it knows nothing about the application or site that you ran into the problem on.

    Say what you did, not what you meant. Use words like "typed" and "clicked", not "chose" or "selected" or "tried to".

    Good starting points: operating system name and version, browser name and version, what URLs you visited (exactly), what you typed and clicked. Pretend you're walking someone through what you did.

    Example:

    I'm running Ubuntu 16.04.1 with Gnome desktop, using Chrome 54.0.2840.90 (64-bit).
    I recreated this in an incognito window (no extensions).
    I typed "https://www.example.com" into the address bar,
    Then I clicked the "Help" link in the top right.
    

    What did you see

    This is the obvious bit to include.

    Again, more detail is better. In particular, the exact wording of any messages - copy and paste if you can. A message that sounds generic to you might mean something very important to a developer trying to figure out the bug, if only they know the exact message you saw.

    Focus here on exactly what you saw, and not on interpreting it. If it's relevant, provide screenshots, labeling them to match the steps for reproducing the problem. If the problem is an observed behavior, still try to describe it in terms of what you saw in each step, and not how you interpret what you saw. Or provide a video of the bug happening.

    And if it doesn’t happen every time, be sure to say so, with a rough idea of how frequently you see it when you do the same things.

    Examples:

    After clicking "Help", a page loaded with URL
    "https://www.example.com/about" and the title "About WidgetCo".
    
    When I click the “Close” button, about one time in ten, the window doesn’t close.
    

    What did you expect to see

    This is often overlooked. It's surprising how often I get a report "the site did X" and my reaction is "well, the site is supposed to do X, what's the problem?" Then we have to go back and forth trying to figure out why the person submitted the bug report.

    It's much better to include this explicitly, even if it seems obvious to you.

    Example:

    I expected to see a page with a title like "Help" or
    "Using this web site".
    

    Why did you expect to see that?

    This is the other often overlooked part.

    This can help save a lot of wasted time when it turns out that there's a typo in the documentation, or the user is missing some other part of the requirements, etc.

    A documentation error or unclear requirements are just as much bugs as broken code, but it's nice to zero in on what the problem is sooner than later.

    Did you expect to see "Y" because requirement 1.2.3 said it should happen? Was it on page N of version 2.1 of the user manual? Did the site help at URL xxxxx say something that led you to expect "Y"? Or maybe your officemate told you it worked that way.

    Another benefit is that in trying to find an authority on what was supposed to happen, you might discover that you misunderstood something and what you're seeing isn't a bug at all. Or you realize that what you thought was an authority really isn't.

    Example:

    In my experience with other sites, links named "Help" go to
    pages with help information for using the site, not pages
    with information about the company.
    

    Other comments (optional)

    You can offer other information that you think would be helpful, but please do it separately from the previous elements - keep the facts separate from the opinions - and keep it concise.

    Example:

    This looks to me like it's probably just the wrong link.
    
    Let me know if I can help test a fix or anything.
    
    My wife’s cousin’s girlfriend says it might be the frangistan coupling.
    

    Exceptions to the rules

    None of this is carved in stone. For example, if starting the application caused my laptop to hang so hard that I had to power it down, I can probably omit describing what I expected to see and why.

    More detail is generally better, but keep in mind that developers are human too, and probably won’t read the whole bug report carefully if it looks overly long.

    General tips

    Be very clear whether you are describing unexpected behavior, or asking for a change in behavior. Surprisingly, in some "bug reports", you can't really tell which the user means.

    Some phrases that should probably never appear in a bug report:

    • XXX didn't work
    • XXX doesn't work
    • XXX needs fixing
    • XXX should do YYY
    • XXX looks wrong

    Many applications and tools, and less often websites, have specific instructions on how they'd like bug reports to be submitted, what information is most helpful to include, etc. Look for and follow those instructions.

    Don't be emotional. If you're really annoyed about some behavior that's blocking your work, that's perfectly understandable. But it'll be more productive to take some time to cool down, then stick to the facts in your bug report.

    If you know that this behavior has changed - maybe this exact function worked for you in version 1.23 - then mention it in your comments. That kind of information is extremely helpful.

    If you haven't tried this on the most recent version of things, try it there. It might already be fixed.

    More

    The most unique part of what I've described here is making sure to say why you expected what you expected. I know I saw that in a "how to write a good bug report" somewhere before, probably on the web, but it's been a long, long time. If anyone recognizes where that came from, please let me know in the comments.

    Meanwhile, here are some other pages that seem particularly good and go into more detail about several of these points.

    Tim HopperHow I Quit My Ph.D. and Learned to Love Data Science

    I recently gave a talk to the Duke Big Data Initiative entitled Dr. Hopper, or How I Quit My Ph.D. and Learned to Love Data Science. The talk was well received, and my slides seemed to resonate in the Twitter data science community.

    I've started a long-form blog post with the same message, but it's not done yet. In the meantime, I wanted to share the slides that went along with the talk.

    Philip SemanchukCoercing Objects to Integer, Revisited

    Summary

    I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.

    Background

    My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.
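
    For instance, a contrived class of my own shows how int() can end up raising something other than ValueError or TypeError –

    class Cranky:
        def __int__(self):
            # Arbitrary code runs here, so arbitrary exceptions can escape int()
            raise RuntimeError("not today")

    int(Cranky())   # raises RuntimeError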

    Here’s the code I suggested –

    def int_or_else(value, else_value=None):
        """Given a value, returns the value as an int if possible.
        If not, returns else_value which defaults to None.
        """
        try:
            return int(value)
        # I don't like catch-all excepts, but since objects can raise arbitrary
        # exceptions when executing __int__(), then any exception is
        # possible here, even if only TypeError and ValueError are
        # really likely.
        except Exception:
            return else_value

    Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

    The Two Parts of Exception Handling

    In Python, there’s two parts to consider about exception handling — what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

    The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.
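
    As a quick illustration of just one of those reactions – my own sketch, not something from either post – log-and-re-raise might look like this:

    import logging

    def int_or_raise(value):
        try:
            return int(value)
        except Exception:
            # Log the full traceback, then let the exception keep propagating
            logging.exception("could not convert %r to an int", value)
            raise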

    Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That earlier blog post was not it. In fact, it was the opposite: I gave the topic of processing the trapped exception as little attention as possible so as not to distract from what I wanted to be the main topic (what exceptions need to be trapped).

    That backfired.

    Conclusion

    My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

    In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

    Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

    Caktus GroupHow to make a jQuery

    Learn to live without jQuery by learning how to clone it

    jQuery is one of the earliest libraries every web developer learns, and is often someone's first experience with programming of any sort. It provides a very safe cushion between a developer and the rough edges of web development. But, it can also obscure learning Javascript itself and learning what web APIs are capable of without the abstraction over them that jQuery adds.

    jQuery came about at a time when it was very much needed. Limitations in browsers and differences between them created enormous hardships for developers. These days, the landscape is very different, and everyone should consider holding off on adding jQuery to their projects until it's absolutely necessary. Forgoing it encourages you to learn the Javascript language on its own, not just as a tool to employ one massive library. You will be exposed to the native APIs of the web, to better understand the things you're doing. This improved understanding gives you a chance to be more directly exposed to new and changing web standards.

    Let’s learn to recreate the most helpful parts of jQuery piece by piece. This exercise will help you learn what it actually does under the hood. When you know how those features work, you can find ways to avoid using jQuery for the most common cases.

    Selecting Elements on the Page

    The first and most prominent jQuery feature is selecting one or several elements from the page based on CSS selectors. When jQuery was first dropped on our laps, this was a mildly revolutionary ability to easily locate a piece of your page by a reliable, understandable address. In jQuery, selection looks like this:

    $('#some-element')
    

    The power of the simple jQuery selection has since been adapted into a standard new pair of document methods: querySelector() and querySelectorAll(). These take CSS selectors like jQuery does and give you the first matching element or a NodeList of all matching elements, but that NodeList isn't as powerful as a jQuery set, so let's replicate what jQuery does by smartening up the results a bit.

    Simply wrapping querySelectorAll() is trivial. We'll call our little jQuery clone njq(), short for "Not jQuery", and use it the way you would use $().

    function njq(selector) {
        return document.querySelectorAll(selector)
    }
    

    And now we can use njq() just like jQuery for selections.

    njq('#some-element')
    

    But, of course, jQuery gives us a lot more than this, so a simple wrapper won't do. To really match its power we need to add a few things:

    • Default to an empty set of elements
    • Wrap the original HTML element objects if we're given one
    • Wrap the results such that we can attach behaviors to them

    These simple additions give us a more robust example of what jQuery can do.

    var empty = $() // getting an empty set
    var html = $('<h2>test</h2>') // from an HTML snippet
    var wrapped = $(an_html_element) // wrapping an HTML Element object
    wrapped.hide() // using attached behaviors, in this case calling hide()
    

    So let's add these abilities. We'll implement the empty set version, wrapping Element objects, accepting arrays of elements, and attaching extra methods. We'll start by adding one of the most useful jQuery methods: the each() method used to loop over all the elements it holds.

    function njq(arg) {
        let results
        if (typeof arg === 'undefined') {
            results = []
        } else if (arg instanceof Element) {
            results = [arg]
        } else if (typeof arg === 'string') {
            // If the argument looks like HTML, parse it into a DOM fragment
            if (arg.startsWith('<')) {
                let fragment = document.createRange().createContextualFragment(arg)
                results = [fragment]
            } else {
                // Convert the NodeList from querySelectorAll into a proper Array
                results = Array.prototype.slice.call(document.querySelectorAll(arg))
            }
        } else {
            // Assume an iterable or array-like argument (e.g. a NodeList or Set)
            // and convert it to an actual array
            results = Array.from(arg)
        }
        results.__proto__ = njq.methods
        return results
    }
    
    njq.methods = {
        each: function(func) {
            Array.prototype.forEach.call(this, func)
        },
    }
    // Base the shared methods object on Array.prototype so results keep normal
    // array behavior (push, includes, etc.) after we swap in this prototype.
    Object.setPrototypeOf(njq.methods, Array.prototype)
    

    This is a good foundation, but jQuery selection has a few other required helpers we need before we can consider our version even close to complete. To be more complete, we have to add helpers for searching both up and down the HTML tree from the elements in a result set.

    Walking down the tree is done with the find() method that selects within the children of the results. Here we learn a second form of querySelectorAll(), which is called on an individual element, not an entire document, and only selects within its children. Like so:

    var list = $('ul')
    var items = list.find('li')
    

    The only extra work we have left to do is to ensure we don't add any duplicates to the result set, by tracking which elements we've already added as we call querySelectorAll() on each element in the original elements and combine all their results together.

    njq.methods.find = function(selector) {
        var seen = new Set()
        var results = njq()
        this.each((el) => {
            Array.prototype.forEach.call(el.querySelectorAll(selector), (child) => {
                if (!seen.has(child)) {
                    seen.add(child)
                    results.push(child)
                }
            })
        })
        return results
    }
    

    Now we can use find() in our own version:

    var list = njq('ul')
    var items = list.find('li')
    

    Searching down the HTML tree was useful and straightforward, but we aren't complete if we can't do it in reverse: searching up the tree from the original selection. This is where we'll clone jQuery's closest() method.

    In jQuery, closest() helps when you already have an element, and want to find something up the tree from it. In this example, we find all the bold text in a page and then find what paragraph they're from:

    var paragraphs_with_bold = $('b').closest('p')
    

    Of course, multiple elements we have may have the same ancestors, so we need to handle duplicate results in this method, as we did before. We won't get much help from the DOM directly, so we walk up the chain of parent elements one at a time, looking for matches. The only help the DOM gives us here is Element.matches(selector), which tells us if a given element matches a CSS selector we're looking for. When we find matches we add them to our results. We stop searching immediately for each element's first match, because we're only looking for the "closest", after all.

    njq.methods.closest = function(selector) {
        var closest = new Set()
        this.each((el) => {
            let curEl = el
            while (curEl.parentElement && !curEl.parentElement.matches(selector)) {
                curEl = curEl.parentElement
            }
            if (curEl.parentElement) {
                closest.add(curEl.parentElement)
            }
        })
        return njq(closest)
    }
    

    We've put the basic pieces of selection in place now. We can query the page for elements, and we can query those results to drill down or walk up the HTML tree for related elements. All of this is useful, and we can walk over our results with the each() method we started with.

    var paragraphs_with_bold = njq('b').closest('p')
    

    Basic Manipulations

    We can't do very much with the results, yet, so let's add some of the first manipulation helpers everyone learned with jQuery: manipulating classes.

    Manipulating classes means you can turn a class on or off for a whole set of elements, changing its styles, and often hiding or showing entire bits of the page. Here are our simple class helpers: addClass() and removeClass() will add or remove a single class from all the elements in the result set, toggleClass() will add the class to all the elements that don't already have it, while removing it from all the elements which presently do have the class.

    The jQuery methods we're reimplementing work like this:

    $('#submit').addClass('primary')
    $('.message').removeClass('new')
    $('#modal').toggleClass('shown')
    

    Thankfully, the DOM's native APIs make all of these very simple. We'll use our existing each() method to walk over all the results, but manipulating the class in each of them is a simple call to methods on the elements' classList interface, a specialized array just for managing element classes.

    njq.methods.toggleClass = function(className) {
        this.each((el) => {
            el.classList.toggle(className)
        })
    }
    
    njq.methods.addClass = function(className) {
        this.each((el) => {
            el.classList.add(className)
        })
    }
    
    njq.methods.removeClass = function(className) {
        this.each((el) => {
            el.classList.remove(className)
        })
    }
    

    Now we have a very simple jQuery clone that can walk around the DOM tree and do basic manipulations of classes to change the styling. This, by itself, has enough parts to be useful, but sometimes just adding or removing classes isn't enough. Sometimes you need to manipulate styles and other properties directly, so we're going to add a few more small manipulation utilities:

    • We want to change the text in elements
    • We want to swap out entire HTML bodies of elements
    • We want to inspect and change attributes on elements
    • We want to inspect and change CSS styles on elements

    These are all simple operations with jQuery.

    $('#message-box').text(new_message_text)
    $('#page').html(new_content)
    

    Changing the contents of an element directly, whether text or HTML, is as simple as a single attribute we'll wrap with our helpers: text() and html(), wrapping the innerText and innerHTML properties, specifically. Like nearly all of our methods we're building on top of each() to apply these operations to the whole set.

    njq.methods.text = function(t) {
        this.each((el) => el.innerText = t)
    }
    
    njq.methods.html = function(t) {
        this.each((el) => el.innerHTML = t)
    }
    

    Now we'll start to get into methods that need to do multiple things. Setting the text or HTML is useful, but often reading it is useful, too. Many of our methods will follow this same pattern, so if a new value isn't provided, then instead we want to return the current value. Copying jQuery, when we read things we'll only read them from the first element in a set. If you need to read them from multiple elements, you can walk over them with each() to do that on your own.

    var msg_text = $('#message').text()
    

    These two methods are easily enhanced to add read versions:

    njq.methods.text = function(t) {
        if (arguments.length === 0) {
            return this[0].innerText
        } else {
            this.each((el) => el.innerText = t)
        }
    }
    
    njq.methods.html = function(t) {
        if (arguments.length === 0) {
            return this[0].innerHTML
        } else {
            this.each((el) => el.innerHTML = t)
        }
    }
    

    Next, all elements have attributes and styles and we want helpers to read and manipulate those in our result sets. In jQuery, these are the attr() and css() helpers, and that's what we'll replicate in our version. First, the attribute helper.

    $("img#greatphoto").attr("title", "Photo by Kelly Clark");
    

    Just like our text() and html() helpers, we read the value from the first element in our set, but set the new value for all of them.

    njq.methods.attr = function(name, value) {
        if (typeof value === 'undefined') {
            return this[0].getAttribute(name)
        } else {
            this.each((el) => el.setAttribute(name, value))
        }
    }
    

    Working with styles, we allow three different versions of the css() helper.

    First, we allow reading the CSS property from the first element. Easy.

    var fontSize = parseInt(njq('#message').css('font-size'))
    
    njq.methods.css = function(style) {
        if (typeof style === 'string') {
            return getComputedStyle(this[0])[style]
        }
    }
    

    Second, we change the value if we get a new value passed as a second argument.

    var fontSize = parseInt(njq('#message').css('font-size'))
    if (fontSize > 20) {
        njq('#message').css('font-size', '20px')
    }
    
    njq.methods.css = function(style, value) {
        if (typeof style === 'string') {
            if (typeof value === 'undefined') {
                return getComputedStyle(this[0])[style]
            } else {
                this.each((el) => el.style[style] = value)
            }
        }
    }
    

    Finally, because it's very common to want to change multiple CSS properties at the same time, the css() helper will also accept a hash-object mapping property names to new property values and set them all at once:

    njq('.banner').css({
        'background-color': 'navy',
        'color': 'white',
        'font-size': '40px',
    })
    
    njq.methods.css = function(style, value) {
        if (typeof style === 'string') {
            if (typeof value === 'undefined') {
                return getComputedStyle(this[0])[style]
            } else {
                this.each((el) => el.style[style] = value)
            }
        } else {
            this.each((el) => Object.assign(el.style, style))
        }
    }
    

    Our jQuery clone is really shaping up. With it, we've replicated all these things jQuery does for us:

    • Selecting elements across a page
    • Selecting either descendents or ancestors of elements
    • Toggling, adding, or removing classes across a set of elements
    • Reading and modifying the attributes an element has
    • Reading and modifying the CSS properties an element has
    • Reading and changing the text contents of an element
    • Reading and changing the HTML contents of an element

    That's a lot of helpful DOM manipulation! If we stopped here, this would already be useful.

    Of course, we're going to continue adding more features to our little jQuery clone, and eventually we'll add more ways to manipulate the HTML in the page. But before we come back to manipulation, let's add support for events to let a user interact with the page.

    Event Handling

    Events in Javascript can come from a lot of sources. The kinds of events we're interested in are user interface events. The first event you probably care about is the click event, but we'll handle it just like any other.

    $("#dataTable tbody tr").on("click", function() {
        console.log( $( this ).text() )
    })
    

    Like some of our other helpers, we're wrapping what is now a standard facility in the APIs the web defines to interact with a page. We're wrapping addEventListener(), the standard DOM API available on all elements to bind a function to be called when an event happens on that element. For example, if you bind a function to the click event of an image, the function will be called whenever the image is clicked.

    We might need some information about the event, so we're going to trigger our callback with this bound to the element you were listening to, and we'll pass the Event object, which describes everything about the event in question, as a parameter.

    njq.methods.on = function(event, cb) {
        this.each((el) => {
            // addEventListener will invoke our callback with the event
            // object as its only argument, and with `this` bound to the
            // element the listener was attached to.
            el.addEventListener(event, cb)
        })
    }
    

    This is a useful start, but events can do so much more. First, before we make our event listening more powerful, let's make sure we can hit the undo button by adding a way to remove them.

    var $test = njq("#test");
    
    function handler1() {
        console.log("handler1")
        $test.off("click", handler2)
    }
    
    function handler2() {
        console.log("handler2")
    }
    
    $test.on("click", handler1);
    $test.on("click", handler2);
    

    The standard addEventListener() comes paired with removeEventListener(), which we can use since our event binding was simple:

    njq.methods.off = function(event, cb) {
        this.each((el) => {
            el.removeEventListener(event, cb)
        })
    }
    

    Event Delegation

    When your page is changing through interactions it can be difficult to maintain event bindings on the right elements, especially when those elements could move around, be removed, or even replaced. Delegation is a very useful way to bind event handlers not to a specific element, but to a query of elements that changes with the contents of your page.

    For example, you might want to let any <li> elements that get added to a list be removed when you click on them, but you want this to happen even when new items are added to the list after your event binding code ran.

    <h3>Grocery List</h3>
    <ul>
        <li>Apples</li>
        <li>Bread</li>
        <li>Peanut Butter</li>
    </ul>
    
    njq('ul').on('click', 'li', function(ev) {
        njq(ev.target).remove()
    })
    

    This is very useful, but it complicates our event binding a bit. Let's dive into adding this feature.

    First, we have to accept on() being called with either 2 or 3 arguments, with the 3-argument version accepting a delegation selector as the second argument. We can use Javascript's special arguments variable to make this straightforward.

    njq.methods.on = function(event) {
        let delegate, cb
    
        // When called with 2 args, accept 2nd arg as callback
        if (arguments.length === 2) {
            cb = arguments[1]
        // When called with 3 args, accept 2nd arg as delegate selector,
        // 3rd arg as callback
        } else {
            delegate = arguments[1]
            cb = arguments[2]
        }
    
        this.each((el) => {
            el.addEventListener(event, cb)
        })
    }
    

    Our event handler is still being invoked for every instance of the event. In order to implement delegation properly, we want to block the handler when the event didn't come from the right child element matching the delegation selector.

    njq.methods.on = function(event) {
        let delegate, cb
        // Keep a reference to the original selection for the delegate check below
        let root = this
    
        // When called with 2 args, accept 2nd arg as callback
        if (arguments.length === 2) {
            cb = arguments[1]
        // When called with 3 args, accept 2nd arg as delegate selector,
        // 3rd arg as callback
        } else {
            delegate = arguments[1]
            cb = arguments[2]
        }
    
        this.each((el) => {
            el.addEventListener(event, function(ev) {
                // If this was a delegate event binding,
                // skip the event if the event target is not inside
                // the delegate selection.
                if (typeof delegate !== 'undefined') {
                    if (!root.find(delegate).includes(ev.target)) {
                        return
                    }
                }
                // Invoke the event handler with the event arguments
                cb.apply(this, arguments)
            }, false)
        })
    }
    

    We've wrapped our event listener in a helper function, where we check the event target each time the event is triggered and only invoke our callback when it matches.

    Advanced Manipulations

    We have a good foundation now. We can find the elements we need in the structure of our page, modify properties of those elements like attributes and CSS styles, and respond to events from the user on the page.

    Now that we've got that in place, we could start making larger manipulations of the page. We could start adding new elements, moving them around, or cloning them. These advanced manipulations will be the final set of helpers we add to our library.

    Append

    One of the most useful operations is adding a new element to the end of another. You might use this to add a new <li> to the end of a list, or add a new paragraph of text to an existing page.

    There are a few ways we want to allow appending, and we'll add each one at a time.

    First, we'll allow simply appending some text.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        }
    }
    

    Then, we'll allow adding elements. These might come from queries our library has done on the page.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        } else if (content instanceof Element) {
            this.each((el) => el.appendChild(content.cloneNode(true)))
        }
    }
    

    Finally, to make it easier to select elements and append them somewhere else, we'll accept an array of elements in addition to just one. Remember, our njq query objects are themselves arrays of elements.

    njq.methods.append = function(content) {
        if (typeof content === 'string') {
            this.each((el) => el.innerHTML += content)
        } else if (content instanceof Element) {
            this.each((el) => el.appendChild(content.cloneNode(true)))
        } else if (content instanceof Array) {
            content.forEach((each) => this.append(each))
        }
    }
    

    Prepend

    As long as we are adding to the end of elements, we'll want to add to the beginning as well. This is nearly identical to the append() version.

    njq.methods.prepend = function(content) {
        if (typeof content === 'string') {
            // We add the new text to the start of the element's inner HTML
            this.each((el) => el.innerHTML = content + el.innerHTML)
        } else if (content instanceof Element) {
            // We insert before the element's first child instead of using appendChild
            this.each((el) => el.insertBefore(content.cloneNode(true), el.firstChild))
        } else if (content instanceof Array) {
            content.forEach((each) => this.prepend(each))
        }
    }
    

    Replacement

    jQuery offers two replacement methods, which work in opposite directions.

    $('.left').replaceAll('.right')
    $('.left').replaceWith('.right')
    

    The first, replaceAll(), takes the elements from $('.left') and uses them to replace everything matched by $('.right'). If you had this HTML:

    <h2>Some Uninteresting Header Text</h2>
    <p>A very important story to tell.</p>
    

    you could run this to replace the tag entirely, not just its contents:

    $('<h1>Some Exciting Header Text</h1>').replaceAll('h2')
    

    and your HTML would now look like this:

    <h1>Some Exciting Header Text</h1>
    <p>A very important story to tell.</p>
    

    The second, replaceWith(), does the opposite, using elements from $('.right') to replace everything in $('.left'):

    $('h2').replaceWith('<h1>Some Exciting Header Text</h1>')
    

    So let's add these to "Not jQuery".

    njq.methods.replaceWith = function(replacement) {
        let $replacement = njq(replacement)
        let combinedHTML = []
        $replacement.each((el) => {
            // Use the outer HTML so the replacement includes the elements
            // themselves, not just their contents
            combinedHTML.push(el.outerHTML)
        })
        let fragment = document.createRange().createContextualFragment(combinedHTML.join(''))
        this.each((el) => {
            // Each target is swapped for its own copy of the fragment
            el.parentNode.replaceChild(fragment.cloneNode(true), el)
        })
    }
    
    njq.methods.replaceAll = function(target) {
        njq(target).replaceWith(this)
    }
    

    Since this is a little more complex than some of our simpler methods, let's step through how it works. First, notice that we only really implemented one of them; the second simply re-uses it with the parameters reversed. Both replacement methods replace the target with all of the elements from the source, so the first thing we do is extract the combined HTML of all the source elements.

    let $replacement = njq(replacement)
    let combinedHTML = []
    $replacement.each((el) => {
        combinedHTML.push(el.outerHTML)
    })
    let fragment = document.createRange().createContextualFragment(combinedHTML.join(''))
    

    Now we can replace all the target elements with this new fragment that contains our new content. To replace an element, we have to ask its parent to replace the correct child, using replaceChild(). Since a fragment can only be inserted into the document once, each target gets its own copy:

    this.each((el) => {
        el.parentNode.replaceChild(fragment.cloneNode(true), el)
    })
    

    Clone

    The last, and easiest to implement, helper is a clone() method. Allowing us to copy elements, and all their children, makes the other helpers more powerful, because callers can choose whether to move or copy elements. Combined with the helpers we've already added, this gives you control over whether prepend and append operations move the elements they work with or copy them.

    njq.methods.clone = function() {
        return njq(Array.prototype.map.call(this, (el) => el.cloneNode(true)))
    }
    

    Now You Made a jQuery

    We've replicated a lot of the things jQuery gives us out of the box. Our little jQuery clone is able to select elements on a page and relative to other elements via find() and closest(). Events can be handled with simple on(event, callback) bindings and more complex on(event, selector, callback) delegations. The contents, attributes, and styles of elements can be read and manipulated with text(), html(), attr(), and css(). We can even manipulate whole segments of the DOM tree with append(), prepend(), replaceAll() and replaceWith().

    jQuery certainly offers a much deeper and wider toolbox of goodies. We weren't aiming to create a call-for-call, 100% compatible replacement, just to understand what happens under the hood. If you take one thing away from this exercise, let it be that the tools you use are transparent: they're layers of an onion you can peel back and learn from.

    Philip SemanchukA Postcard of Tunisia

    Earlier on this blog I briefly mentioned working with some Libyans in Tunis, the capital of Tunisia. We chose to meet at that location because it’s close to Libya but much safer than Tripoli. Now that I’ve been back for a while and had a chance to catch up, I wanted to write more about my experience.

    A photo of the translator: English-to-Arabic translation on the fly!

    I was there with Tobias McNulty of Caktus Group. We (Tobias and I) trained the Libyan employees of Libya’s High National Election Commission (HNEC) in the maintenance and use of the HNEC-commissioned SMS-based voter registration system that I had helped to develop while working with Caktus. The system has been open sourced as Smart Elect.

    If the big picture was promoting democracy, the medium picture was training system admins and developers. And the very small picture was working together on the nitty gritty of features and bug fixes, like figuring out that if a @property method raises an exception when invoked by hasattr(), the exception isn’t propagated under Python 2.7.
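
    To make that last point concrete, here's a minimal sketch of the pitfall; the class and property names are made up for illustration, not taken from the Smart Elect code. Under Python 2.7, hasattr() swallows whatever exception a property raises and simply reports False, while Python 3 lets anything other than AttributeError propagate.

    class Registration(object):
        @property
        def center_name(self):
            # Imagine a lookup that can fail for reasons unrelated to the
            # attribute existing, e.g. a missing related record.
            raise ValueError("no registration center on file")

    reg = Registration()
    print(hasattr(reg, 'center_name'))
    # Python 2.7: prints False, silently hiding the ValueError
    # Python 3.x: the ValueError propagates instead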

    The admin training consisted of a comprehensive review of the system, including the obscure corners and edge case handling. The developers were eager to get their hands dirty, so after some organizational review, we dove into fixing bugs and implementing some new features that HNEC wanted.

    A photo of a trainee: Abdullah (Photo by Tobias McNulty)

    Tobias and I worked with the developers as both mentors and peers. Grinding through bugs from start to finish was really valuable. Our trainees have good development experience, but working in groups with us allowed them to participate in our approach to debugging, problem reporting, development, and test. It seemed a little different from what they were used to. We were very methodical about creating an issue in our tracker, creating a branch for that issue, reviewing one another’s code, documenting the fix, etc. “It’s a lot of process,” said one trainee after working through one particular bug with us. He’s right. I wish I had thought to ask if Libyan culture has a proverb similar to “For want of a nail…“. I could have said, “For want of filing an issue in the tracker, a voter was disenfranchised,” but it doesn’t have the same ring to it.

    A photo of Tobias and a trainee: Tobias and Ahmed

    This was my first trip to Africa, and, grand notions aside, what stood out to me was how mundane much of the experience was. The guys we worked with would have fit right in at any coding meetup I’ve been to. They had opinions about laptops. They were distracted by their phones. Everyone enjoyed a successful bug hunt. I remember one trainee being tired at 5PM, saying he had no more left in him, and seeing him there grinning 2 hours later when we finally solved the problem we’d been working on.

    Outside of the training, I especially enjoyed the dinners at Sakura/Pasta Cosy and Chez Zina (my favorites, in that order).

    We also ate at Le Bon Vieux Temps, where the handwritten chalkboard menu is carted from table to table on a charming-but-impractical frame. Tunisia is principally French speaking, with Arabic on an almost equal footing. At Le Bon Vieux Temps (“The Good Old Times”), the menu was all in French, and my vestigial French came in handy for translating the menu into English for the Libyans who in turn peppered the waiter with questions in Arabic. (That night in the restaurant began and ended my career as a French-to-English translator.)

    On the weekends we rested, walked in the city, and paid a visit to the Bardo National Museum. The Bardo was famously attacked in 2015, and has since sprouted a razor wire fence around the entire property. Bored soldiers sat on a truck by the gate and motioned us to enter. It’s a nice museum, and I’m glad I went.

    A photo of my entrance pass to the Bardo Museum

    Inside the classroom and out, I got to know and really like our Libyan colleagues. They were generous with their good humor and kindness. If they lacked anything, it was a willingness to complain.

    Libya is a difficult place to live at the moment. I think we all know that in an abstract sense, but talking to my Libyan friends made it more concrete for me. Banks don’t have enough cash. Electricity isn’t reliable. People they know have been kidnapped. My friends have a lot on their minds, and yet they found room to squeeze in opinions about good software development practices.

    A photo of a trainee: Munir

    I’m glad I got the chance to go, and to get to know the people I did. In addition to working with Tobias and the Libyans, I had a lot of non-work experiences I’ll remember for a long time. I walked among ruins in Carthage that are over 2000 years old. I drove solo (and lost) through rush hour traffic in Tunis and survived. I saw a Tunisian wedding, and got to use the word “ululating” for the first time outside of Scrabble or Bananagrams. I swam in the Mediterranean. I saw flocks of flamingoes (many, many thanks to Hichem and Claudia of Les Amis des Oiseaux).

    HNEC is now better positioned than ever to use the Smart Elect system, and I hope they do so again soon. That’s partly for egotistical reasons — I like to see my work get used. Who doesn’t? But more importantly, if it gets used, that means Libyans are voting to determine their own future.

    Caktus GroupCaktus at PyCaribbean

    For the first time, Caktus will be gold sponsors at PyCaribbean February 18-19th in Bayamon, Puerto Rico. We’re pleased to announce two speakers from our team.

    Erin Mullaney, Django developer, will give a talk on RapidPro, the open source SMS system backed by UNICEF. Kia Lam, UI Developer, will talk about how women can navigate the seas of the tech industry with a few guiding principles and new perspectives. Erin and Kia join fantastic speakers from organizations like 18F, the Python Software Foundation, IBM, and Red Hat.

    We hope you can join us, but if you can’t, there’ll be videos!

    Caktus GroupPlan for mistakes as a developer

    I Am Not Perfect.

    I've been programming professionally for 25 years, and the most important thing I have learned is this:

    • I am fallible.
    • I am very fallible.
    • In fact, I make mistakes all the time.

    I'm not unique in this. All humans are fallible.

    So, how do we still get our jobs done, knowing that we're likely to make mistakes in anything we try to do? We look for ways to compensate.

    Pilots use checklists, and have for decades. No matter how many times they've done a pre-flight check on a plane, they review their checklist to make sure they haven't missed anything, because they know it's important, people make mistakes, and the consequences of a mistake can be horrendous.

    The practice of medical care is moving in the same direction. There's a great book, The Checklist Manifesto by Atul Gawande, that I highly recommend if you haven't come across it before. It talks about the kind of mistakes that happen in medicine, and how adding checklists for even basic procedures had amazing results.

    I'm a big fan of checklists. I'm always pushing to get deploy and release processes, for example, nailed down in project documentation to help us make sure not to miss an important step.

    But my point is not just to use checklists, it's the reason behind the use of checklists: acknowledging that people make mistakes, and looking for ways to do things right regardless of that.

    For me, I try to find ways to do things that I'm less likely to get wrong now, and that make it harder for future me to screw them up. I know that future me will have forgotten a lot about the project by the time he looks at it again, maybe under pressure to fix a production bug.

    One of my favorite quotations about programming is by Brian Kernighan:

    Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

    ("The Elements of Programming Style", 2nd edition, chapter 2)

    So I work hard to avoid mistakes, both now and in the future.

    • I try to keep things straightforward
    • I use features and tools like strict typing, lint, flake8, eslint, etc.
    • I try to make sure knowledge is recorded somewhere more reliable than my memory

    I also try to detect mistakes before they can cause bad things to happen. I'm a huge fan of

    • unit tests
    • parameter checking
    • error handling
    • QA testing
    • code reviews

    To sum all this up:

    Expect to make mistakes. You will anyhow.

    Plan for them.

    And don't beat yourself up for it.

    Tim HopperYour Old Tweets from This Day

    A while ago, I published a Bash script that will open a Twitter search page to show your old tweets from this day of the year. I have enjoyed using it to see what I was thinking about in days gone by.

    So I turned this into a Twitter account.

    If you follow @your_old_tweets, it'll tweet a link at you each day that will show you your old tweets from the day. It attempts to send it in the morning (assuming you have your timezone set).

    This runs on AWS Lambda. The code is here.

    Caktus GroupShip It Day Q1 2017

    Last Friday, Caktus set aside client projects for our regular quarterly ShipIt Day. From gerrymandered districts to RPython and meetup planning, the team started off 2017 with another great ShipIt.

    Books for the Caktus Library

    Liza uses Delicious Library to track books in the Caktus Library. However, the tracking of books isn't visible to the team, so Scott used the FTP export feature of Delicious Library to serve the content on our local network. Scott dockerized Caddy and deployed it to our local Dokku PaaS platform and serves it over HTTPS, allowing the team to see the status of the Caktus Library.

    Property-based testing with Hypothesis

    Vinod researched using property-based testing in Python. Traditionally it's been more commonly associated with functional programming languages, but Hypothesis brings the concept to Python. He also learned about new Django features, including testing optimizations introduced with setUpTestData().
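
    For a flavor of what property-based testing looks like, here's a small Hypothesis sketch (the function under test is made up for illustration): instead of hand-picking inputs, you state a property that should hold and let Hypothesis generate many cases, including edge cases like the empty list.

    from hypothesis import given, strategies as st

    def my_sort(items):
        # Toy function under test; imagine it were our own implementation.
        return sorted(items)

    @given(st.lists(st.integers()))
    def test_my_sort_output_is_ordered(items):
        result = my_sort(items)
        # The property: every element is <= the one that follows it.
        assert all(a <= b for a, b in zip(result, result[1:]))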

    Caktus Wagtail Demo with Docker and AWS

    David looked into migrating a Heroku-based Wagtail deployment to a container-driven deployment using Amazon Web Services (AWS) and Docker. Utilizing Tobias' AWS Container Basics (an isolated Elastic Container Service stack), David created a Dockerfile for Wagtail and deployed it to AWS. Down the road, he'd like to more easily debug performance issues and integrate it with GitLab CI.

    Local Docker Development

    During Code for Durham Hack Nights, Victor noticed local development setup was a barrier to entry for new team members. To help mitigate this issue, he researched using Docker for local development with the Durham School Navigator project. In the end, he used Docker Compose to run a multi-container Docker application with PostgreSQL, NGINX, and Django.

    Caktus Costa Rica

    Daryl, Nicole, and Sarah really like the idea of opening a branch Caktus office in Costa Rica and drafted a business plan to do so! Including everything from an executive summary, to operational and financial plans, the team researched what it would take to run a team from Playa Hermosa in Central America. Primary criteria included short distances to an airport, hospital, and of course, a beach. They even found an office with our name, the Cactus House. Relocation would be voluntary!

    Improving the GUI test runner: Cricket

    Charlotte M. likes to use Cricket to see test results in real time and have the ability to easily re-run specific tests, which is useful for quickly verifying fixes. However, she encountered a problem causing the application to crash sometimes when tests failed. So she investigated the problem and submitted a fix via a pull request back to the project. She also looked into adding coverage support.

    Color your own NC Congressional District

    Erin, Mark, Basia, Neil, and Dmitriy worked on an app that visualizes and teaches you about gerrymandered districts. The team ran a mini workshop to define goals and personas, and help the team prioritize the day's tasks by using agile user story mapping. The app provides background information on gerrymandering and uses data from NC State Board of Elections to illustrate how slight changes to districts can vastly impact the election of state representatives. The site uses D3 visualizations, which is an excellent utility for rendering GeoJSON geospatial data. In the future they hope to add features to compare districts and overlay demographic data.

    Releasing django_tinypng

    Dmitriy worked on testing and documenting django_tinypng, a simple Django library that allows optimization of images using TinyPNG. He published the app to PyPI so it's easily installable via pip.

    Learning Django: The Django Girls Tutorial

    Gerald and Graham wanted to sharpen their Django skills by following the Django Girls Tutorial. Gerald learned a lot from the tutorial and enjoyed the format, including how it steps through blocks of code describing the syntax. He also learned about how the Django Admin is configured. Graham knew that following tutorials can sometimes be a rocky process, so he worked together with Gerald so they could talk through problems, and Graham was able to learn by reviewing and helping.

    Planning a new meetup for Digital Project Management

    When Elizabeth first entered the Digital Project Management field several years ago, there were not a lot of resources available specifically for digital project managers. Most information was related to more traditional project management, or the PMP. She attended the 2nd Digital PM Summit with her friend Jillian, and loved the general tone of openness and knowledge sharing (they also met Daryl and Ben there!). The Summit was a wonderful resource. Elizabeth wanted to bring the spirit of the Summit back to the Triangle, so during Ship It Day, she started planning for a new meetup, including potential topics and meeting locations. One goal is to allow remote attendance through Google Hangouts, to encourage openness and sharing without having to commute across the Triangle. Elizabeth and Jillian hope to hold their first meetup in February.

    Kanban: Research + Talk

    Charlotte F. researched Kanban to prepare for a longer talk to illustrate how Kanban works in development and how it differs from Scrum. Originally designed by Toyota to improve manufacturing plants, Kanban focuses on visualizing workflows to help reveal and address bottlenecks. Picking the right tool for the job is important, and one is not necessarily better than the other, so Charlotte focused on outlining when to use one over the other.

    Identifying Code for Cleanup

    Calvin created redundant, a tool for identifying technical debt. Last ShipIt he was able to locate completely identical files, but he wanted to improve on that. Now the tool can identify functions that are almost the same and/or might be generalizable. It searches for patterns and generates a report of your codebase. He's looking for codebases to test it on!

    RPython Lisp Implementation, Revisited

    Jeff B. continued exploring how to create a Lisp implementation in RPython, the framework behind the PyPy project. RPython is a restricted subset of the Python language. In addition to learning about RPython, he wanted to better understand how PyPy is capable of performance enhancements over CPython. Jeff also converted his parser to use Alex Gaynor's RPLY project.

    Streamlined Time Tracking

    At Caktus, time tracking is important, and we've used a variety of tools over the years. Currently we use Harvest, but it can be tedious to use when switching between projects a lot. Dan would like a tool to make this process more efficient. He looked into Project Hamster, but settled on building a new tool. His implementation makes it easy to switch between projects with a single click. It also allows users to sync daily entries to Harvest.

    Tim HopperTop Ten Favorite Photos of 2016

    I spent a lot of time with my camera in 2016. Here are some of the results.

    2016 Top Ten

    Caktus GroupNew year, new Python: Python 3.6

    Python 3.6 was released at the tail end of 2016. Read on for a few highlights from this release.

    New module: secrets

    Python 3.6 introduces a new module in the standard library called secrets. While the random module has long existed to provide us with pseudo-random numbers suitable for applications like modeling and simulation, these were not "cryptographically random" and not suitable for use in cryptography. secrets fills this gap, providing a cryptographically strong method to, for instance, create a new, random password or a secure token.
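
    As a quick illustration, generating a URL-safe token or building a random password with secrets looks like this:

    import secrets
    import string

    # A URL-safe token suitable for password reset links or API keys
    token = secrets.token_urlsafe(16)

    # A random 12-character password drawn from letters and digits
    alphabet = string.ascii_letters + string.digits
    password = ''.join(secrets.choice(alphabet) for _ in range(12))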

    New method for string interpolation

    Python previously had several methods for string interpolation, but the most commonly used was str.format(). Let’s look at how this used to be done. Assuming 2 existing variables, name and cookies_eaten, str.format() could look like this:

    "{0} ate {1} cookies".format(name, cookies_eaten)
    

    Or this:

    "{name} ate {cookies_eaten} cookies".format(name=name, cookies_eaten=cookies_eaten)
    

    Now, with the new f-strings, the variable names can be placed right into the string without the extra length of the format parameters:

    f"{name} ate {cookies_eaten} cookies"
    

    This provides a much more pythonic way of formatting strings, making the resulting code both simpler and more readable.

    Underscores in numerals

    While it doesn’t come up often, it has long been a pain point that long numbers could be difficult to read in the code, allowing bugs to creep in. For instance, suppose I need to multiply an input by 1 billion before I process the value. I might say:

    bill_val = input_val * 1000000000
    

    Can you tell at a glance if that number has the right number of zeroes? I can’t. Python 3.6 allows us to make this clearer:

    bill_val = input_val * 1_000_000_000
    

    It’s a small thing, but anything that reduces the chance I’ll introduce a new bug is great in my book!

    Variable type annotations

    One key characteristic of Python has always been its flexible variable typing, but that isn’t always a good thing. Sometimes, it can help you catch mistakes earlier if you know what type you are expecting to be passed as parameters, or returned as the results of a function. There have previously been ways to annotate types within comments, but the 3.6 release of Python is the first to bring these annotations into official Python syntax. This is a completely optional aspect of the language, since the annotations have no effect at runtime, but this feature makes it easier to inspect your code for variable type inconsistencies before finalizing it.
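
    For example, with the new syntax a variable or class attribute can be annotated directly; Python ignores the annotation at runtime, but a checker such as mypy will flag inconsistent assignments (the values below are just examples):

    from typing import List, Optional

    cookies_eaten: int = 3
    name: str = "Nadia"
    scores: List[int] = []
    nickname: Optional[str] = None  # may be a str or None

    # Python runs this line without complaint, but a type checker like
    # mypy will report that a str is being assigned to an int variable.
    cookies_eaten = "lots"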

    And much more…

    In addition to the changes mentioned above, there have been improvements made to several modules in the standard library, as well as to the CPython implementation. To read about all of the updates this new release includes, take a look at the official notes.

    Caktus GroupResponsive web design

    What is responsive web design?

    Responsive web design is an approach to web design and development whereby websites and web applications respond to the screen size of the device on which they’re being accessed. The response includes layout changes, rearrangement of content, and in some cases selective display or hiding of content elements. Using a responsive web design approach, you can optimize web pages to achieve a great user experience on a range of devices, from smartphones to desktops.

    Responsive web design is typically accomplished by writing a set of styling rules (CSS media queries) that define how page layout should be rendered between breakpoints. Breakpoints are the pixel values at which rendition of a layout in the browser changes (or breaks); they correspond to screen widths of different devices on which web pages can be accessed.

    Why choose responsive web design?

    There is a clear advantage in leveraging responsive web design. With a responsive website, the same HTML, along with all static assets such as CSS, JavaScript, and images, is served in the browser on any device. The browser detects the width of the viewport in which the website is being viewed, and the appropriate styling rules are used to render the layout accordingly. You only write and maintain one codebase, and any code edits over time only have to be made once for the changes to be reflected on all devices. Long-term, the cost of maintenance is greatly reduced.

    In adaptive web design, on the other hand, you develop different versions of the layout, each optimized for a different screen size. A script on the server detects the device used to access the website, and the appropriate version of HTML, CSS, JavaScript, and images is served in the browser. In the adaptive approach, edits to the codebase have to be made in each version of the website separately, which means higher long-term maintenance cost.

    There is also an option of building a native application for iOS, Android, or other mobile operating systems. While native applications often offer better functionality, unless the core of the business for which you build is mobile, a responsive website is a great alternative to consider. Building native applications is a lot more expensive, especially if you need to support multiple operating systems. Additionally, responsive websites are more discoverable by search engines since their content can be crawled, indexed, and ranked.

    Why traditional mockups hinder responsive design and drain resources

    A common approach is to design three sets of high fidelity mockups: for the smartphone (screen width of 320px), for the tablet (screen width of 768px), and for the desktop (screen width of 1024px). Sometimes four or six sets of mockups are designed to account for portrait and landscape orientations of mobile devices and for high definition desktop screens. But even the latter approach leaves out a number of viewport widths and disregards the fact that mobile web is not a collection of discrete breakpoints set apart by hundreds of pixels; it is a continuum.

    Delivering high fidelity mockups for each of the target breakpoints often drains resources and results in disappointment. Mockups themselves have to go through a cycle of design, edits, and approval, a process that is effort-heavy and leads to a false sense of satisfaction that a design has been perfected. As soon as the translation of the high fidelity mockup into code begins, you discover that page elements do not behave in the perfect way the mockups would suggest.

    At the crossroads of the two realities, the perfection of a high fidelity mockup and the practicalities of living code and a browser, you can take one of two paths:

    • Adjust the design to align it with a page behavior in a browser
    • Write a lot of extra code to force the page into the behavior dictated by mockups

    The latter is what happens most of the time, because by this stage in the process so much effort has already gone into the design, and so much has been invested both in terms of resources and commitment to the design, that it is very hard to make any major design concessions.

    Getting smarter about designing for responsive web

    Let’s start out by stating the obvious. Any design is constrained by the medium in which it is executed and by the context in which it will live. When working with interiors, designers must take into account the space and its shape, lighting conditions, even elements of the exterior environment in order to execute a successful design. An architect must take into account the land and the surroundings in which a building will stand. An industrial designer must consider the properties of the material that will be used to produce an object she is designing.

    The same rigor applies to designing for the responsive web. You’re missing the constraints of the medium and the context in which your design will live if you do not acknowledge at the outset that the perfect page layout you conceive of will break in the browser as the user accesses the page on a range of devices or simply resizes the browser window.

    Short of designing directly in code, there is no perfect method that would allow a designer to work with and to convey the continuous nature of responsive pages, and to anticipate how content will reflow as the width of the viewport changes incrementally. But there are ways to approach designing for the responsive web that help make the transition from a static design to a responsive web page somewhat easier:

    • Low fidelity wireframes and prototypes. The longer you work with low fidelity wireframes and prototypes the better chance you have of identifying places where the page layout breaks in the browser before a major commitment to a high fidelity design is undertaken. At Caktus, we favor the approach of moving on to code early, well before the design reaches high fidelity. That allows us to shape the design to work with the medium, rather than to force it into the medium.
    • Mobile first. Designing for smaller screens first encourages you to think about content in terms of priorities. It’s an opportunity to take a hard look at all elements of a page and to decide which ones are essential and which ones are not. If you prioritize content for smaller screens first to create great experience, you will have a much easier time translating that experience for desktop screen sizes.
    • Atomic design. Instead of thinking about a website as a collection of pages, start thinking about it as a system of components. Design components that can be adjusted and rearranged across viewports; then make a plan for how those components should reflow as the width of the viewport changes.
    • Style guides. Building a style guide alongside components of the website helps achieve consistency of user interface, user experience, and code. Establishing a style guide is a step that supports atomic design approach to web design. It is also an important design tool of lean UX.
    • Digital prototyping tools that help convey responsive layouts. With the growing number of prototyping tools, two are worth mentioning for their ability to simulate responsive layouts: UXPin and Axure. They both come with features that allow you to set breakpoints and to mockup layouts for each breakpoint range. Using these tools does not get around the issue of designing for discrete viewport widths rather than for a continuum. However, they offer an ability to create multiple breakpoints within a single mockup, and to preview that mockup in a browser, simulating responsive behavior. This encourages the designer to focus on planning for a changing layout instead of thinking about discrete viewport widths in isolation.

    Conclusions

    Responsive web design is an economical long-term approach to building and maintaining a mobile website. When compared to the adaptive approach, responsive web design is less expensive to maintain over a long period of time. When compared to native applications (iOS, Android, etc.), it is a less costly alternative to develop, and it results in a web presence that’s easier to discover by search engines. That’s why responsive web design is an approach we favor at Caktus Group.

    In order for responsive web design to truly deliver on the promise of higher ROI, it must be done right. Finalizing high fidelity design mockups ahead of the development process runs the risk of draining resources and may result in disappointment. For that reason, at Caktus we prefer to begin the development process while the design is still in its low fidelity stage. That allows us to identify problems early and to pivot to optimize the design as needed.

    Philip SemanchukHow Best to Coerce Python Objects to Integers?

    Summary

    In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

    The Problem

    Suppose you had to write a Python function to convert string values representing temperatures to integers, like this list —

    ['22', '24', '24', '24', '23', '27']

    The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.

    Let’s Explore Some Solutions

    You might write a function like this —

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except ValueError:
            return None

    Here’s that function in action at the Python prompt —

    >>> print(force_to_int('42'))
    42
    >>> print(force_to_int('oops'))
    None

    That works! However, it’s not as robust as it could be.

    Suppose this function gets input that’s even more unexpected, like None

    >>> print(force_to_int(None))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in force_to_int
    TypeError: int() argument must be a string or a number, not 'NoneType'

    Hmmm, let’s write a better version that catches TypeError in addition to ValueError

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except (ValueError, TypeError):
            return None

    Let’s give that a try at the Python prompt —

    >>> print(force_to_int(None))
    None

    Aha! Now we’re getting somewhere. Let’s try some other types —

    >>> import datetime
    >>> print(force_to_int(datetime.datetime.now()))
    None
    >>> print(force_to_int({}))
    None
    >>> print(force_to_int(complex(3,3)))
    None
    >>> print(force_to_int(ValueError))
    None

    OK, looks good! Time to pop open a cold one and…

    Wait, I can still feed input to this function that will break it. Watch this —

    >>> class Unintable():
    ...    def __int__(self):
    ...        raise ArithmeticError
    ...
    >>>
    >>> trouble = Unintable()
    >>> print(force_to_int(trouble))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in force_to_int
      File "<stdin>", line 3, in __int__
    ArithmeticError

    Dang!

    While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.
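
    Here's a sketch of what such a wrapper might look like; the class and exception are hypothetical, just to show that int() can raise something other than ValueError or TypeError:

    class SensorNotReadyError(Exception):
        """Raised when the sensor hasn't finished its warm-up cycle."""

    class PressureSensor:
        def __init__(self):
            self.ready = False
            self.raw_reading = 0

        def __int__(self):
            # int(sensor) normally returns the current reading, but the
            # conversion itself can fail with a domain-specific error.
            if not self.ready:
                raise SensorNotReadyError("sensor is still warming up")
            return self.raw_reading

    sensor = PressureSensor()
    int(sensor)   # raises SensorNotReadyError, not ValueError or TypeError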

    And Finally, the Naked Except

    Since any exception is possible when calling int(), our code has to accommodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. They give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice —

    def force_to_int(value):
        """Given a value, returns the value as an int if possible.
        Otherwise returns None.
        """
        try:
            return int(value)
        except:
            return None

    At the Python prompt —

    >>> print(force_to_int(trouble))
    None

    Now the bones of the function are complete.

    Complete, Except For One Exception

    Graham Dumpleton‘s comment below pointed out that there’s a difference between what I call a ‘naked’ except —

    except:

    And this —

    except Exception:

    The former traps even SystemExit, which you don’t want to trap without good reason. From the Python documentation for SystemExit —

    It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

    The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.

    The Final Version

    We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

    def int_or_else(value, else_value=None):
        """Given a value, returns the value as an int if possible. 
        If not, returns else_value which defaults to None.
        """
        try:
            return int(value)
        # I don't like catch-all excepts, but since objects can raise arbitrary
        # exceptions when executing __int__(), then any exception is
        # possible here, even if only TypeError and ValueError are 
        # really likely.
        except Exception:
            return else_value

    At the Python prompt —

    >>> print(int_or_else(trouble))
    None
    >>> print(int_or_else(trouble, 'spaghetti'))
    spaghetti

    So there you have it. I’m happy with this function. It feels bulletproof. It contains an (almost) naked except, but that only covers one simple line of code that’s unlikely to hide anything nasty.

    You might also want to read a post I made about the exception handling choices in this post.

    I release this code into the public domain, and I’ll even throw in the valuable Unintable class for free!

    The image in this post is public domain and comes to us courtesy of Wikimedia Commons.

    Tim HopperQuerying data on S3 with Amazon Athena

    Athena Setup and Quick Start

    Last week, I needed to retrieve a subset of some log files stored in S3. This seemed like a good opportunity to try Amazon's new Athena service. According to Amazon:

    Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

    Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis.

    Athena uses Presto in the background to allow you to run SQL queries against data in S3. On paper, this seemed equivalent to and easier than mounting the data as Hive tables in an EMR cluster.

    The Athena user interface is similar to Hue and even includes an interactive tutorial where it helps you mount and query publicly available data. It was easy for me to mount my private data using the same CREATE statement I'd run in Hive:

    CREATE EXTERNAL TABLE IF NOT EXISTS default.logs (
        -- schema definition goes here
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    LOCATION 's3://bucket/path/';
    

    At this point, I could write SQL queries against default.logs. Queries run from the Athena UI run in the background; even if you close the browser window, the query continues to run. Up to 5 queries can be run simultaneously.

    Query results can be downloaded from the UI as CSV files. Results are also written as a CSV file to an S3 bucket; by default, results go to s3://aws-athena-query-results-<account-id>-region/. You can change the bucket by clicking Settings in the Athena UI.

    Up to this point, I was thrilled with the Athena experience. However, after this, I started to uncover the limitations.

    Athena Limitations

    First, Athena doesn't allow you to create an external table on S3 and then write to it with INSERT INTO or INSERT OVERWRITE. Thus, you can't script where your output files are placed. More unsupported SQL statements are listed here.

    Next, the Athena UI only allowed one statement to be run at once. Because I wanted to load partitioned data, I had to run a bunch of statements of the form `ALTER TABLE default.logs ADD PARTITION (d = numeric-date) LOCATION 's3://bucket/path/numeric-date/';`, one per day of data. Using the Athena UI would've required me to run these one day at a time. Thankfully, I was able to run them all at once in SQL Workbench.
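
    For anyone facing the same chore, a short script along these lines (the bucket, path, and date range are placeholders) can generate the day-by-day partition statements so they can be pasted into SQL Workbench in one go:

    from datetime import date, timedelta

    # Placeholder date range; adjust to match the days of logs in S3.
    start, end = date(2016, 12, 1), date(2017, 1, 31)

    statements = []
    day = start
    while day <= end:
        d = day.strftime('%Y%m%d')  # assumes the numeric-date format used in the S3 paths
        statements.append(
            "ALTER TABLE default.logs ADD PARTITION (d = {d}) "
            "LOCATION 's3://bucket/path/{d}/';".format(d=d)
        )
        day += timedelta(days=1)

    print('\n'.join(statements))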

    Third, Athena's output format is highly limited. It strictly outputs CSV files where every field is quoted. This was particularly problematic for me because I hoped to later load my data into Impala, and Impala can't extract text data from quoted fields! I was told by Athena support "We do plan to make improvements in this area but I don’t have an ETA yet."

    Finally, Athena fell flat on its face in the presence of bad records. I'm not sure whether I had bad GZIPs or malformed logs, but when I did, Athena stopped in its tracks. For my application, I needed my query engine to be able to ignore bad files. Adding to the frustration, even when a query failed, Athena would write partial output (up to the failure) to S3, yet the output files didn't provide any indication that they were partial, incomplete output.

    Conclusion

    My first encounter with Athena was a flop. I ended up switching to EMR and filtering my logs with Hive. Until it offers more control over output and better error handling, Athena will be of limited value to me.

    Caktus GroupUsing Priority in Scrum to address team anxiety

    In Scrum, the backlog of tasks is ordered by the Product Owner from highest to lowest business value - not merely prioritized - so that the team knows what the most valuable items are. This helps to prevent Product Owners/Project Managers from being able to say two or more Product Backlog Items (PBIs) are the “same priority.” And this makes sense for the most part. However there are times when this information is not enough.

    I am the Product Owner of a team, and we are coming to the last few sprints for a project (re-styling an already existing website, with some new features being added for this phase), and there is still a significant amount of high business value tickets in the backlog. The team is feeling anxious and overwhelmed by the sheer number of tickets they see sitting there, even knowing that the list is refined. In order to assuage this anxiety, and also to help make a plan that allows us to hit the deadline, I decided to make use of the priority field in JIRA.

    To keep things simple, I decided to use three priorities - High, Medium and Low. I started by ranking the backlog items on my own:

    • High priority PBIs are a must-have for this website to go live. These are items that I know as the client representative are non-negotiable.
    • Medium priority is for items that I think the client would want if we could get to them, but would probably be ok without for this phase of the project.
    • Low priority is for items that would not likely be missed by the client, the end-user, or our team.

    The PBIs included tasks as well as bugs. While Scrum states that bugs don’t belong in the backlog, that is where my team found it most useful to keep them.

    I then exported this list to be able to see all PBIs prioritized at a glance (PMs love Excel!), and reviewed it with my team to get their sense on whether my priorities matched their expectations. It was especially helpful on PBIs labeled as Technical Debt, since the developers have a better sense of which of these items are absolutely required for launch. It was also invaluable to ensure that our QA analyst had a say in what bugs were not critical for launch, and to ensure any critical bugs were not overlooked in my prioritizing.

    To my delight, a) the team didn’t change many of my priorities, and b) while this exercise obviously did not decrease the amount of work we still have to do, it did quell some of the anxiety around the seemingly endless backlog.

    And to those Agile purists out there, I am still refining the backlog in the “correct” way. But this exercise was valuable in helping align everyone’s priorities, and share with the team a bird’s-eye view of where we are at, and how far we have to go.

    Caktus GroupDjango is Boring, or Why Tech Startups (Should) Use Django

    I recently attended Django Under The Hood in Amsterdam, an annual gathering of Django core team members and developers from around the world. A common theme discussed at the conference this year is that “Django is boring.” While it’s not the first time this has been discussed, it still struck me as odd. Upon further reflection, however, I see Django’s “boringness” as a huge asset to the community and potential adopters of the framework.

    Caktus first began using Django in late 2007. This was well before the release of Django 1.0, in the days when startups and established companies alike ran production web applications using Subversion “trunk” (akin to the Git “master” branch) rather than using a released version of the software. Using Django was definitely not boring, because it required reading each commit merged to see if it added a new feature you could use and to make sure it wasn’t going to break your project. Although Django kept us on our toes in the early days, it was clear that Django was growing into a robust and stable framework with hope for the future.

    With the help of thousands of volunteers from around the world, Django’s progressed a lot since the early days of “tracking trunk.” What does it mean that the people developing Django itself consider it “boring,” and how does that change our outlook for the future of the framework? If you’re a tech startup looking for a web framework, why would you choose the “boring” option? Following are several reasons that Caktus still uses Django for all new custom web/SMS projects, reasons I think apply equally well in the startup environment.

    1. Django has long taken pride in its “batteries included” philosophy.

    Django strives to be a framework that solves common problems in web development in the best way possible. In my original post on the topic nearly 8 years ago, some of the key features included with Django were the built-in admin interface and a strong focus on data integrity, two features missing from Ruby on Rails, the other major web framework at the time.

    Significant features that have arrived in Django since that time include support for aggregates and query expressions in the ORM, a built-in application for geographic applications (django.contrib.gis), a user messages framework, CSRF protection, Python 3 support, a configurable User model, improved database transaction management, support for database migrations, support for full-text search in Postgres, and countless other features, bug fixes, and security updates. The entire time, Django’s emphasis on backwards compatibility and its generous deprecation policy have made it perfectly reasonable to plan to support and grow applications over 10 years or more.

    2. The community around Django continues to grow.

    In the tradition of open source software, users of the framework new and old support each other via the mailing list, IRC channel, blog posts, StackOverflow, and cost-effective conferences around the globe. The ecosystem of reusable apps continues to grow, with 3317 packages available on https://djangopackages.org/ as of the time of this post.

    A common historical pattern has been for apps or features to live external to Django until they’re “proven” in production by a large number of users, after which they might be merged into Django proper. Django also recently adopted the concept of “official” packages, where a third-party app might not make sense to merge into Django proper, but it’s sufficiently important to the wider Django community that the core team agrees to take ownership of its ongoing maintenance.

    The batteries included in Django itself and the wealth of reusable apps not only help new projects get off the ground quickly, they also provide solutions that have undergone rigorous code review by experts in the relevant fields. This is particularly important in startup environments when the focus must be on building business-critical features quickly. The last thing a startup wants to do, for example, is focus on business-critical features at the expense of security or reliability; with Django, one doesn’t have to make this compromise.

    3. Django is written in Python.

    Python is one of the most popular, most taught programming languages in the world. Availability of skilled staff is a key concern for startups hoping to grow their team in the near future, so the prevalence of Python should reassure those teams looking to grow.

    Similarly, Python as a programming language prides itself on readability; one should be able to understand the code one wrote 6-12 months ago. Although this is by no means new nor unique to Django, Python’s straightforward approach to development is another reason some developers might consider it “boring.” Both by necessity and convention, Python espouses the idea of clarity over cleverness in code, as articulated by Brian Kernighan in The Elements of Programming Style. Python’s philosophy about coding style is described in more detail in PEP 20 -- The Zen of Python. Leveraging this philosophy helps increase readability of the code and the bus factor of the project.

    4. The documentation included with Django is stellar.

    Not only does the documentation detail the usage of each and every feature in Django, it also includes detailed release notes, including any backwards-incompatible changes, along with each release. Again, while Django’s rigorous documentation practices aren’t anything new, writing and reading documentation might be considered “boring” by some developers.

    Django’s documentation is important for two key reasons. First, it helps both new and existing users of the framework quickly determine how to use a given feature. Second, it serves as a “contract” for backwards-compatibility in Django; that is, if a feature is documented in Django, the project pledges that it will be supported for at least two additional releases (unless it’s already been deprecated in the release notes). Django’s documentation is helpful both to one-off projects that need to be built quickly, and to projects that need to grow and improve through numerous Django releases.

    5. Last but not least, Django is immensely scalable.

    The framework is used at companies like EventBrite, Disqus, and Instagram to handle web traffic and mobile app API usage on behalf of 500M+ users. Even after being acquired by Facebook, Instagram swapped out their database server but did not abandon Django. Although early startups don’t often have the luxury of worrying about this much traffic, it’s always good to know that one’s web framework can scale to handle dramatic and continuing spikes in demand.

    At Caktus, we’ve engineered solutions for several projects using AWS Auto Scaling that create servers only when they’re needed, thereby maximizing scalability and minimizing hosting costs.

    Django into the future

    Caktus has long been a proponent of the Django framework, and I’m happy to say that remains true today. We established ourselves early on as one of the premier web development companies specializing in Django, we’ve written previously about why we use Django in particular, and Caktus staff are regular contributors not only to Django itself but also to the wider community of open source apps and discussion surrounding the framework.

    Django can be considered a best of breed collection of solutions to nearly all the problems common to web development and restful, mobile app API development that can be solved in generic ways. This is “boring” because most of the common problems have been solved already; there’s not a lot of low-hanging fruit for new developers to contribute. This is a good thing for startups, because it means there’s less need to build features manually that aren’t specific to the business.

    The risk of adopting any “bleeding edge” technology is that the community behind it will lose interest and move on to something else, leaving the job of maintaining the framework up to the few companies without the budget to switch frameworks. There’s a secondary risk specific to more “fragmented” frameworks as well. Because of Django’s “batteries included” philosophy and focus on backwards compatibility, one can be assured that the features one selects today will continue to work well together in the future, which won’t always be the case with frameworks that rely on third-party packages to perform business-critical functions such as user management.

    These risks couldn’t be any stronger in the world of web development, where the framework chosen must be considered a tried and true partner. A web framework is not a service, like a web server or a database, that can be swapped out for another similar solution with some effort. Switching web frameworks, especially if the programming language changes, may require rewriting the entire application from scratch, so it’s important to make the right choice up front. Django has matured substantially over the last 10 years, and I’m happy to celebrate that it’s now the “boring” option for web development. This means startups choosing Django today can focus more on what makes their projects special, and less on implementing common patterns in web development or struggling to perform a framework upgrade with significant, backwards-incompatible changes. It’s clear we made the right choice, and I can’t wait to see what startups adopt and grow on Django in the future.

    Caktus GroupCSS Grid, not Frameworks, are the Future

    At the 2016 An Event Apart Conference in San Francisco, I peeked under the hood of a new technology that would finally address all the layout woes that we as designers and developers face: CSS Grid Layout Module. At first I was a little skeptical - except for Microsoft Edge, browser support for Grid is currently non-existent - however its official release is actually not that far off. Currently it is enabled behind a flag in Chrome and Firefox, or you can download the latest nightly or developer versions of Firefox or Safari. Here’s my brief synopsis of why I think CSS Grid is going to change the landscape of the web forever, and why I think it’s so important from a design and developer perspective.

    Many website designs today are stuck in what I would call an aesthetic rut. That is, they are all comprised of similar design patterns (similar icons, sections, hero images, etc.) and are structured with common layout patterns. As many speakers at the conference pointed out, this gets boring, fast. The CSS Grid Layout Module is meant to address these concerns by implementing a dynamic method of creating elegant layouts easily, and across two dimensions. Where Flexbox only handled layout in one dimension at a time (either column or row direction), CSS Grid handles layout for columns and rows simultaneously. CSS Grid makes possible what we used to do in traditional print layout: the utilization of white space to create movement and depth, with very little code that is both responsive and easily adaptable to new content.

    CSS Grid involves very little markup. A simple display: grid with its subset of attributes is all it takes. Rather than bore you with examples, check out this nifty guide. What used to comprise hundreds of lines of code wrapped in a framework (Bootstrap, Foundation, Skeleton) is now accomplished with a few lines, and presumably fewer dependencies mean an increase in performance and decrease in page load times. Grid is a great tool to prototype and design with, simply because you can now get up and running with no setup or dependencies - everything is baked into the browser.

    The true power and beauty of Grid is that it lets you take complete control over layout placement, or let the browser do the work. You can specify column (or row) spacing, and have CSS Grid decide where to place your content. If you want to leverage more control over where your content goes on the page, you can specify where it goes with grid-column: start line/end line or grid-row: start line/end line, or a combination of both.

    One of the most exciting things about CSS Grid is that we can use it now to prototype and plan for the future. My challenge to you, as designers and developers, is to use it now, so that when CSS Grid is released, not only will your project already take advantage of all the new and wonderful possibilities CSS Grid offers, you will also have adopted a future-friendly approach for your project. Need inspiration? Check out My reinterpretation of a Japanese Magazine Cover with CSS Grid.

    Caktus GroupDjango Under the Hood 2016 Recap

    Caktus was a proud sponsor of Django Under the Hood (DUTH) 2016 in Amsterdam this year. Organized by Django core developers and community members, DUTH is a highly technical conference that delves deep into Django.

    Django core developers and Caktus Technical Manager Karen Tracey and CEO/Co-founder Tobias McNulty both flew to Amsterdam to spend time with fellow Djangonauts. Since not all of us could go, we wanted to ask them what Django Under the Hood was like.

    Can you tell us more about Django Under the Hood?

    Tobias: This was my first Django Under the Hood. The venue was packed. It’s an in-depth, curated talk series by invite-only speakers. It was impeccably organized. Everything is thought through. They even have little spots where you can pick up a toothbrush and toothpaste.

    Karen: I’ve been to all three. They sell out very quickly. Core developers are all invited and get tickets, plus some funding depending on sponsorship. This is the only event where some costs are covered for core developers. DjangoCon EU and US have core devs going, but they attend however they manage to get funds for it.

    What was your favorite part of Django Under the Hood?

    Tobias: The talks: they’re longer and more detailed than typical conference talks; they’re curated and confined to a single track so the conference has a natural rhythm to it. I really liked the talks, but also being there with the core team. Just being able to meet these people you see on IRC and mailing list, there’s a big value to that. I was able to put people in context. I’d met quite a few of the core team before but not all.

    Karen: I don’t have much time to contribute to Django because of heavy involvement in cat rescue locally and a full time job, but this is a great opportunity to have at least a day to do Django stuff at the sprint and to see a lot of people I don’t otherwise have a chance to see.

    All the talk videos are now online. Which talk do you recommend we watch first?

    Karen: Depends on what you’re interested in. I really enjoyed the Instagram one. As someone who contributed to the Django framework, to see it used and scaled to the size of Instagram’s 500 million plus users is interesting.

    Tobias: There were humorous insights, like the Justin Bieber effect. Originally they’d sharded their database by user ID, so everybody on the ops team had memorized his user ID to be prepared in case he posted anything. At that scale, maximizing the number of requests they can serve from a single server really matters.

    Karen: All the monitoring was interesting too.

    Tobias: I liked Ana Balica’s testing talk. It included a history of testing in Django, which was educational to me. Django didn’t start with a framework for testing your applications. It was added as a ticket in the low thousands. She also had practical advice on how to treat your test suite as part of the application, like splitting out functional tests and unit tests. She had good strategies to make your unit tests as fast as possible so you can run them as often as needed.

    What was your favorite tip or lesson?

    Tobias: Jennifer Akullian gave a keynote on mental health that had a diagram of how to talk about feelings in a team. You try to dig into what that means. She talked about trying to destigmatize mental health in tech. I think that’s an important topic we should be discussing more.

    Karen: I learned things in each of the talks. I have a hard time picking out one tip that sticks with me. I’d like to look into what Ana Balica said about mutation testing and learn more about it.

    What are some trends you’re seeing in Django?

    Karen: The core developers met for a half-day meeting the first day of the conference. We talked about what’s going on with Django, what’s happened in the past year, and what the future of Django is. The theme was “Django is boring.”

    Tobias: “Django is boring” because it is no longer unknown. It’s an established, common framework now used by big organizations like NASA, Instagram, Pinterest, US Senate, etc. At the start, it was a little known bootstrappy cutting edge web framework. The reasons why we hooked up with Django nine years ago at Caktus, like security and business efficacy, all of those arguments are ever so much stronger today. That can make it seem boring for developers but it’s a good thing for business.

    Karen: It’s been around for a while. Eleven years. A lot of the common challenges in Django have been solved. Not that there aren’t cutting edge web problems. But should you solve some problems elsewhere? For example, in third-party, reusable apps like channels and REST framework.

    Tobias: There was also recognition that Django is so much more than the software. It’s the community and all the packages around it. That’s what makes Django great.

    Where do you see Django going in the future?

    Karen: I hate those sorts of questions. I don’t know how to answer that. It’s been fun to see the Django community grow and I expect to see continued growth.

    Tobias: That’s not my favorite question either. But Django has a role in fostering and continuing to grow the community it has. Django can set an example for open source communities on how to operate and fund themselves in sustainable ways. Django is experimenting with funding right now. How do we make open source projects like this sustainable without relying on people with full-time jobs volunteering their nights and weekends? This is definitely not a “solved problem,” and I look forward to seeing the progress Django and other open source communities make in the coming years.

    Thank you to Tobias and Karen for sharing their thoughts.

    Philip SemanchukCreating PDF Documents Using LibreOffice and Python, Part 4

    This is the fourth and final post in a series on creating PDFs using LibreOffice and Python. The first three parts are here:

    They’re all a supplement to a talk I gave at PyOhio 2016.

    This final post is here to point you to a working code example that you can download from my Bitbucket repository. It’s enough to get you started so you can experiment with your own goals in mind.

    https://bitbucket.org/philip_semanchuk/pdfs_from_python

    One thing I mention in the code that’s worth repeating here is that the code uses ElementTree to manipulate XML. It’s sufficient for this demo, and the fact that it’s part of the Python standard library means you can run the demo without installing any third party libraries. For real world (i.e. non-demo) usage, I recommend lxml as a more robust and helpful alternative to ElementTree.
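
    To give a flavor of what that looks like, here is a tiny sketch (not the demo's actual code) of the kind of XML manipulation involved: parse an ODF content.xml, swap out a placeholder text node, and write the file back. The file name and the placeholder text are made up for illustration.

    import xml.etree.ElementTree as ET

    # Parse the document XML pulled out of the .odt archive
    tree = ET.parse("content.xml")
    root = tree.getroot()

    # Replace a hypothetical placeholder wherever it appears as element text
    for node in root.iter():
        if node.text == "{{ customer_name }}":
            node.text = "Jane Doe"

    # Write the modified XML back out
    tree.write("content.xml", xml_declaration=True, encoding="utf-8")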

    A Curious Coincidence: Stinkin’ Badges

    The title of my PyOhio talk was “We Don’t Need No Stinkin’ PDF Library: Build PDFs with Python the Lazy Way”. You know the “we don’t need no stinkin’ [whatever]” meme, don’t you? It’s from the Mel Brooks movie Blazing Saddles. (You can find the clip on YouTube.) Did you know that Blazing Saddles is quoting another movie?

    The night before I gave my talk, I walked from my AirBnB to a nearby bar and bottle shop. (It’s simply called “The Bottle Shop”. Ohioans are plain dealers, apparently). I settled in there, happy with a pint of stout. On the big screen they were playing an old black and white Western — The Treasure of the Sierra Madre.

    I didn’t realize until it happened on the screen that this movie is the inspiration for the “We don’t need no stinkin’ badges” quote, although no one ever actually says “We don’t need no stinkin’ badges”. The actual line is “Badges? We ain’t got no badges. We don’t need no badges! I don’t have to show you any stinkin’ badges!”

    It’s pretty close to the line from B. Traven’s novel of the same name.

    I didn’t have time in my talk to mention Blazing Saddles, the mysterious B. Traven, The Treasure of the Sierra Madre, Humphrey Bogart, The Bottle Shop, nor the stout. But I was amused by our brief coincidence in Columbus.

    Caktus GroupOn building relationships - Digital Project Management Summit Recap

    Photo of Elizabeth speaking to DPM 2016 Summit by David Jordan.

    When I first became a digital project manager, I struggled to find professional resources. There was a plethora of information available for traditional project management, but not much specifically for digital project management. Luckily, a colleague recommended the Digital PM Summit, sponsored by the Bureau of Digital.

    It's one of the first, and still one of the only, professional conferences in the United States for digital project managers, and it’s grown every year. I initially attended the Summit three years ago in Austin, TX and it was an eye-opening, informative, and motivational experience. I met many people who did the same work that I did! I don't know where they were hiding before, but I was thankful to finally connect with them. It was such a relief to learn that others had the same challenges that I did, and that I was not alone.

    I have attended the Summit every year since, and this year, I was invited to speak. I was one of twenty-two expert speakers and I was thrilled about the opportunity to present on one of the most important aspects of digital project management—relationships. I’ve found that positive working relationships are key not only to project success, but also to my success, my team’s success, and our client’s success. As project managers, we must focus on process and logistics to deliver quality projects on time, and all of that involves people.

    Investing in Relationships is Key to Project Success

    Projects are always about people, no matter where you work, and no matter what the project involves. Building positive working relationships can be challenging, but I’ve found that the best project managers invest in their team, clients, and stakeholders’ success, and not just in the project’s success. While it is possible to launch a project that successfully meets its goals, if the people involved are miserable, was it really a success? After all, the project is not going to pat you on the back, but the people involved would.

    I became a better project manager when I realized the importance of relationships, and when I recognized how much I could impact the people around me. Several years ago, as a brand new PM, I didn’t have the confidence that I do now, and it was difficult for me to take the lead. After a couple years, and thanks in large part to the Digital PM Summit, through which I learned skills that I could apply on the job, I became a more effective PM. I’m more flexible and adaptable, which is key to collaboration, and my interpersonal communication skills have improved.

    The importance of collaboration and communication were key points within my Digital PM Summit presentation, “Think Outside the Project Management Triangle.” The Project Management Triangle, or Iron Triangle, is a widely-known model of the typical constraints of project management that impact project quality—resources (budget and workers), project scope (features and functionality), and schedule (time and prioritization)—these are all components that project managers must consider and work with. In my experience, the Triangle is too limiting and overlooks relationships. The Triangle is a good basic model, but the best PMs think outside of the Triangle to positively leverage relationships in order to balance resources, scope, and schedule.

    The Project Management Triangle, or Iron Triangle

    Approximately fifty project managers attended my talk at the Digital PM Summit on October 13, 2016. By that point, I’d worked at Caktus for four weeks, which shaped my presentation because it was the first time I’d worked with external clients and the first time I wasn’t a lone PM.

    As an established, full service Django shop, Caktus includes a team of trained PMs who provide professional project management services to clients. Working with other PMs helped me feel more at home at Caktus, and I learned a lot from them in a few weeks. For example, the PM team taught me about the Agile Scrum process, which I was familiar with, but never practiced before. Scrum includes a product owner who serves as an extension of the client, championing the client’s goals and priorities to the development team. At Caktus, project managers also act as product owners. During the Digital PM Summit, some attendees were curious about how I made the shift from working in-house to working with external clients, and how the Scrum process impacted my transition. I was happy to inform them that while working with external clients is different from working in-house, there are still similarities, and that Scrum had been a refreshing change for me.

    Unity in the Project Management Community Raises our Standards

    It’s not unusual for a PM to be a lone wolf, like I was in my last job where I was connected with only one other digital PM who was in a different department. We quickly became friends and confidants based on our shared experiences. As a new digital PM, support from others is critical to success, and I’m glad the Bureau of Digital, which hosted the Digital PM Summit, provides a platform for project managers to connect and share their knowledge during and after the conference. I was honored to support their mission with my own presentation, and as it turned out, relationship building was a main theme during this year’s Summit.

    The Bureau of Digital’s leaders, Brett Harned, Carl Smith, and Lori Averitt have increasingly focused on building a supportive community of professional project managers through events like the Summit. This year, the conference brought together 223 talented individuals, and the conversations have not stopped, thanks to Slack, Twitter, and LinkedIn. The attendees are still sharing tips, tools, and strategies with each other, and they’re forming Meetup groups. Since digital project management is still evolving and growing, conversations and collaboration among practitioners and experts are crucial to creating a greater shared understanding of best practices and to raising industry standards as well as recognition, helping all of us to better serve our clients and teams. I’m thrilled to work at a place like Caktus that recognizes the value of digital project management, and supports my engagement within the PM community.

    The relationships I’ve made and the community support that I’ve received via the Digital PM Summit have been integral to my growth and success as a digital project manager. I would not be where I am today, and I certainly would not have presented at the Digital PM Summit, without support. What it comes down to is that no matter who you are or what your job is, none of us work or live in a bubble, and none of us are an island. We depend upon others. Perhaps Carl Smith, one of the conference organizers, said it best: “When you invest in others, they invest in you.”

    Additional Links

    Thinking Outside the Project Management Triangle

    Caktus GroupRapidCon 2016: RapidPro Developer's Recap

    Developer Erin Mullaney was just in Amsterdam for RapidCon, a UNICEF-hosted event for developers using RapidPro, an SMS tool built on Django. The teams that have worked on RapidPro and its predecessor RapidSMS have gotten to know each other virtually over the years. This marks the second time they’ve all come from across the globe to share learnings on RapidPro and to discuss its future.

    RapidPro has the potential to transform how field officers build surveys, collect data, and notify populations. It allows users with no technical background to quickly build surveys and message workflows. With over 100% cell phone saturation in many developing regions, SMS presents a cheap, fast means of reaching many people.

    Erin worked closely with UNICEF Uganda in the development of a data analytics and reporting tool called TracPro for RapidPro. The organizers invited her to speak about the tool with other RapidPro users.

    How was the conference?

    Erin: The conference was amazing and I was ecstatic to go. Meeting the folks who work at UNICEF for the first time was exciting because we normally only speak via audio over Skype. It was nice to see them in person. We had an evening event, so it was fun to get to know them better in a social atmosphere. It was also a great opportunity to get together with other technical people who are very familiar with RapidPro and to think about ways we could increase usage of this very powerful product.

    What was your talk about?

    Erin: The title of my talk was “TracPro: How it Works and What it Does”. TracPro is an open source dashboard for use with RapidPro. You can use it for activities like real-time monitoring of education surveys. Nyaruka originally built it for UNICEF and it’s now being maintained by Caktus.

    I was one of two developers who worked on TracPro at Caktus. We worked to flesh out the data visualizations including bar charts, line charts over date ranges and maps. We also improved celery tasks and added other features like syncing more detailed contact data from RapidPro.

    What do you hope your listeners came away with?

    Erin: I delved into the code for how we synced data locally via Celery and the RapidPro API, and how we did it in a way that is not server-intensive. I also had examples of how to build the visualizations. Both of those features were hopefully helpful for people thinking of building their own dashboards. Building custom dashboards in a short amount of time is really easy and fun. For example, it took me a ShipIt Day to build a custom RapidPro dashboard for PyCon that called the RapidPro API.

    What did you learn at RapidCon?

    Erin: People discussed the tools they were building. UNICEF talked about a new project, eTools, being used for monitoring. That sounds like an interesting project that will grow.

    RapidPro has seen exponential growth in usage, and Nyaruka and UNICEF are working really hard to manage that. It was interesting to learn about the solutions Nyaruka is looking at to deal with incredibly large data sets from places with a ton of contacts. They’ll be erasing unnecessary data and looking at other ways to minimize these giant databases.

    UNICEF is pretty happy with how RapidPro is working now and doesn’t expect to add too many new features to it. They’re looking ahead to managing dashboard tools like TracPro. So their focus is really on these external dashboards and building them out. The original RapidPro was really not built for dashboards.

    What was the best part of RapidCon for you?

    Erin: It was pretty cool to be in that room and to have been selected for this. I was one of only two women. Having them say “You have this knowledge that other developers don’t have” was rewarding. I felt like I had a value-add to this conference based on the past year and a half working on RapidPro-related projects.

    Will you be sharing more of your RapidPro knowledge in the future?

    Erin: So far, we’ve been the only ones giving talks about RapidPro, it seems. I gave a RapidPro talk at PyData Carolinas this year with Rebecca Muraya, Reach More People: SMS Data Collection with RapidPro, and during a PyCon 2016 sponsor workshop. I’ve been encouraged to give this talk at more conferences to spread the word about RapidPro further. I plan to submit it to a few 2017 conferences for sure!

    Thank you Erin for sharing your experience with us!

    To view another RapidPro talk Erin gave during PyData 2016 Carolinas, view the video here.

    Tim HopperData Scientists Need More Automation

    Many data scientists aren't lazy enough.

    Whether we are managing production services or running computations on AWS machines, many data scientists are working on computers besides their laptops.

    For me, this often takes the form of SSH-ing into remote boxes1, manually configuring the system with a combination of apt installs, Conda environments, and bash scripts.

    To run my service or scripts, I open a tmux window, activate my virtual environment, and start the process.2

    When I need to check my logs or see the output, I SSH back into each box, reconnect to tmux (after I remember the name of my session), and tail my logs. When running on multiple boxes, I repeat this process N times. If I need to restart a process, I flip through my tmux tabs until I find the correct process, kill it with a Ctrl-C, and use the up arrow to reload the last run command.

    All of this works, of course. And as we all know, a simple solution that works can be preferable to a fragile solution that requires constant maintenance. That said, I suspect many of us aren't lazy enough. We don't spend enough time automating tasks and processes. Even when we don't save time by doing it, we may save mental overhead.

    I recently introduced several colleagues to some Python-based tools that can help. Fabric is a "library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks." Fabric allows you to encapsulate sequences of commands as you might with a Makefile. Its killer feature is the ease with which it lets you execute those commands on remote machines over SSH. With Fabric, you could tail all the logs on all your nodes with a single command executed in your local terminal. There are a number of talks about Fabric on YouTube if you want to learn more. One of my colleagues reduced his daily workload by writing his system management tasks into a Fabric file.
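
    For example, here's a minimal sketch of a fabfile (assuming Fabric 1.x; the hostnames and log path are made up) that tails a service log on several boxes from your local terminal:

    from fabric.api import env, run

    # Hosts that remote commands will run on, over SSH
    env.hosts = ["worker1.example.com", "worker2.example.com"]

    def tail_logs():
        # Executed on each host in env.hosts
        run("tail -n 50 /var/log/myservice.log")

    Save it as fabfile.py, and running fab tail_logs executes the command on every host in turn.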

    Another great tool is Supervisor. If you run long-running processes in tmux/screen/nohup, Supervisor might be for you. It allows you to define the tasks you want to run in an INI file and "provides you with one place to start, stop, and monitor your processes". Supervisor will log the stdout and stderr to a log location of your choice. It can be a little confusing to set up, but will likely make your life easier in the long run.

    A tool I want to learn but haven't is Ansible, "a free-software platform for configuring and managing computers which combines multi-node software deployment, ad hoc task execution, and configuration management". Unlike Chef and Puppet, Ansible doesn't require an agent on the systems you need to configure; it does all the configuration over SSH. You can use Ansible to configure your systems and install your dependencies, even Supervisor! Ansible is written in Python and, mercifully, doesn't require learning a Ruby-based DSL (as does Chef).

    Recently I've been thinking that Fabric, Supervisor, and Ansible combined become a powerful toolset for management and configuration of data science systems. Each tool is also open source and can be installed in a few minutes. Each tool is well documented and offers helpful tutorials on getting started; however, learning to use them effectively may require some effort.

    I would love to see someone create training materials on these tools (and others!) focused on how data scientists can improve their system management, configuration, and operations. A screencast series may be the perfect thing. Someone please help data scientists be lazier, do less work, and reduce the mental overhead of dealing with computers!


    1. Thankfully I recently started taking better advantage of aliases in my ssh config

    2. When I have to do this on multiple machines, I'm occasionally clever enough to use tmux to broadcast the commands to multiple terminal windows. 

    Caktus GroupCommon web site security vulnerabilities

    I recently decided I wanted to understand better what Cross-Site Scripting and Cross-Site Request Forgery were, and how they compared to that classic vulnerability, SQL Injection.

    I also looked into some ways that sites protect against those attacks.

    Vulnerabilities

    SQL Injection

    SQL Injection is a classic vulnerability. It probably dates back almost to punch cards.

    Suppose a program uses data from a user in a database query.

    For example, the company web site lets users enter a name of an employee, free-form, and the site will search for that employee and display their contact information.

    A naive site might build a SQL query as a string using code like this, including whatever the user entered as NAME:

    "SELECT * FROM employees WHERE name LIKE '" + NAME + "'"
    

    If NAME is "John Doe", then we get:

    SELECT * FROM employees WHERE name LIKE 'John Doe'
    

    which is fine. But suppose someone types this into the NAME field:

    John Doe'; DROP TABLE EMPLOYEES;
    

    then the site will end up building this query:

    SELECT * FROM employees WHERE name LIKE 'John Doe'; DROP TABLE EMPLOYEES;'
    

    which might delete the whole employee directory. It could instead do something less obvious but even more destructive in the long run.

    This is called a SQL Injection attack, because the attacker is able to inject whatever they want into a SQL command that the site then executes.

    Cross Site Scripting

    Cross Site Scripting, or XSS, is a similar idea. If an attacker can get their Javascript code embedded into a page on the site, so that it runs whenever someone visits that page, then the attacker's code can do anything on that site using the privileges of the user.

    For example, maybe an attacker posts a comment on a page that looks to users like:

    Great post!
    

    but what they really put in their comment was:

    Great post!<script> do some nefarious Javascript stuff </script>
    

    If the site displays comments by just embedding the text of the comment in the page, then whenever a user views the page, the browser will run the Javascript - it has no way to know this particular Javascript on the page was written by an attacker rather than the people running the site.

    This Javascript is running in a page that was served by the site, so it can do pretty much anything the user who is currently logged in can do. It can fetch all their data and send it somewhere else, or if the user is particularly privileged, do something more destructive, or create a new user with similar privileges and send its credentials somewhere the bad guy can retrieve them and use them later, even after the vulnerability has been discovered and fixed.

    So, clearly, a site that accepts data uploaded by users, stores it, and then displays it, needs to be careful of what's in that data.

    But even a site that doesn't store any user data can be vulnerable. Suppose a site lets users search by going to http://example.com/search?q=somethingtosearchfor (Google does something similar to this), and then displays a page showing what the search string was and what the results were. An attacker can embed Javascript into the search term part of that link, put that link somewhere people might click on it, and maybe label it "Cute Kitten Pictures". When a user clicks the link to see the kittens, her browser visits the site and tries the search. It'll probably fail, but if the site embeds the search term in the results page unchanged (which Google doesn't do), the attacker's code will run.
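
    To make that concrete, here's a deliberately vulnerable sketch of a Django view (purely hypothetical, and not something you should ever write) that reflects the search term back into the page without escaping it:

    from django.http import HttpResponse

    def search(request):
        q = request.GET.get("q", "")
        # UNSAFE: if q contains "<script>...</script>", the victim's browser will run it
        return HttpResponse("<h1>Results for %s</h1>" % q)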

    Why is it called Cross-Site Scripting? Because it allows an attacker to run their script on a site they don't control.

    CSRF

    Cross Site Request Forgeries

    The essence of a CSRF attack is a malicious site making a request to another site, the site under attack, using the current user's permissions.

    That last XSS example could also be considered a CSRF attack.

    As another, extreme example, suppose a site implemented account deletion by having a logged-in user visit (GET) /delete-my-account. Then all a malicious site would have to do is link to yoursite.com/delete-my-account and if a user who was logged into yoursite.com clicked the link, they'd make the /delete-my-account request and their account would be gone.

    In a more sophisticated attack, a malicious site can build a form or make AJAX calls that do a POST or other request to the site under attack when a user visits the malicious site.

    Protecting against vulnerabilities

    Protections in the server and application

    SQL Injection protection

    Django's ORM, and most database interfaces I've seen, provide a way to specify parameters to queries directly, rather than having the programmer build the whole query as a string. Then the database API can do whatever is appropriate to protect against malicious content in the parameters.
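
    As a rough sketch of what that looks like (using the hypothetical employees table from above), the user's input is passed as a separate parameter and the driver handles quoting, so it can never terminate the SQL string:

    from django.db import connection

    def find_employees(name):
        with connection.cursor() as cursor:
            # The database driver escapes the parameter for us
            cursor.execute("SELECT * FROM employees WHERE name LIKE %s", [name])
            return cursor.fetchall()

    With the Django ORM you would more likely write something like Employee.objects.filter(name__icontains=name) (for a hypothetical Employee model) and never build SQL by hand at all.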

    XSS protection

    Django templates apply "escaping" to all embedded content by default. This converts characters that would ordinarily be special to the browser, like "<", into a form the browser will simply display rather than interpret. That means if content includes "<SCRIPT>...</SCRIPT>", instead of the browser executing the "..." part, the user will just see "<SCRIPT>...</SCRIPT>" on the page.
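
    You can see the effect directly with Django's escape utility (a quick sketch; the templates do this for you automatically):

    from django.utils.html import escape

    comment = "Great post!<script> do some nefarious Javascript stuff </script>"
    print(escape(comment))
    # Great post!&lt;script&gt; do some nefarious Javascript stuff &lt;/script&gt;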

    CSRF protection

    We obviously can't disable links to other sites - that would break the entire web. So to protect against CSRF, we have to make sure that another site cannot build any request to our site that would actually do anything harmful.

    The first level of protection is simply making sure that request methods like GET don't change anything, or display unvalidated data. That blocks the simplest possible attack, where a simple link from another site causes harm when followed.

    A malicious site can still easily build a form or make AJAX calls that do a POST or other request to the site under attack, so how do we protect against that?

    Django's protection is to always include a user-specific, unguessable string as part of such requests, and reject any such request that doesn't include it. This string is called the CSRF token. Any form on a Django site that does a POST etc has to include it as one of the submitted parameters. Since the malicious site doesn't know the token, it cannot generate a malicious POST request that the Django site will pay any attention to.
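
    One way to see the protection in action is with Django's test client, which can be told to enforce CSRF checks (a sketch; the URL and form field are hypothetical):

    from django.test import Client

    client = Client(enforce_csrf_checks=True)
    # No CSRF token in the POST data, so Django rejects the request
    response = client.post("/comments/", {"text": "Great post!"})
    print(response.status_code)  # 403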

    Protections in the browser

    Modern browsers implement a number of protections against these kinds of attacks.

    "But wait", I hear you say. "How can I trust browsers to protect my application, when I have no control over the browser being used?"

    I frequently have to remind myself that browser protections are designed to protect the user sitting in front of the browser, who for these attacks, is the victim, not the attacker. The user doesn't want their account hacked on your site any more than you do, and these browser protections help keep the attacker from doing that to the user, and incidentally to your site.

    Same-origin security policy

    All modern browsers implement a form of Same Origin Policy, which I'll call SOP. In some cases, it prevents a page loaded from one site from accessing resources on other sites, that is, resources that don't have the same origin.

    The most important thing about SOP is that AJAX calls are restricted by default. Since an AJAX call can use POST and other data-modifying HTTP requests, and would send along the user's cookies for the target site, an AJAX call could do anything it wanted using the user's permissions on the target site. So browsers don't allow it.

    What kind of attack does this prevent? Suppose the attacker sets up a site with lots of cute kitten pictures, and gets a user victim to access it. Without SOP, pages on that site could run Javascript that made AJAX calls (in the background) to the user's bank. Such calls would send along whatever cookies the user's browser had stored for the bank site, so the bank would treat them as coming from the user. But with SOP, the user's browser won't let those AJAX calls to another site happen. They can only talk to the attacker's own site, which doesn't do the attacker any good.

    CSP

    Content Security Policy (CSP)

    CSP is a newer mechanism that browsers can use to better protect from these kinds of attacks.

    If a response includes the CSP header, then by default the browser will not allow any inline javascript, CSS, or use of javascript "eval" on the page. This blocks many forms of XSS. Even if an attacker manages to trick the server into including malicious code on the page, the browser will refuse to execute it.

    For example, if someone uploads a comment that includes a <script> tag with some Javascript, and the site includes that in the page, the browser just won't run the Javascript.
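
    Here's a minimal sketch of what sending that header could look like from a Django view (a real site would more likely set it in middleware, for example with the django-csp package):

    from django.http import HttpResponse

    def index(request):
        response = HttpResponse("<h1>Hello</h1>")
        # Only allow resources from our own origin; inline scripts are refused
        response["Content-Security-Policy"] = "default-src 'self'"
        return response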

    Conclusion

    I've barely touched the surface on these topics here. Any web developer ought to have at least a general knowledge of common vulnerabilities, if only to know what areas might require more research on a given project.

    A reasonable place to start is Django's Security Overview.

    The OWASP Top Ten is a list of ten of the most commonly exploited vulnerabilities, with links to more information about each. The ones I've described here are numbers 1, 3, and 8 on the list, so you can see there are many more to be aware of.

    Philip SemanchukJust in Time to Vote!

    I just returned from Tunisia a couple of days ago.

    In 2014 and 2015 I worked for Caktus Group to help develop an SMS-based voter registration system on behalf of the Libyan government (specifically the High National Election Commission, or HNEC). The open source version of this system is called Smart Elect.

    For the last three weeks in Tunisia (which is next door to Libya and a whole lot safer), Caktus’ Tobias McNulty and I trained a dozen HNEC employees on how to use, develop, and maintain the system. We talked about Python, Django, open source culture, GitHub Flow, and, of course, the upcoming U.S. election.

    On the eve of that election, I thought it appropriate to express gratitude for my opportunity to participate in the messy sausage-making that is democracy. Good luck to our new Libyan friends; I hope they get the opportunity to do the same in the very near future.


    Caktus GroupManaging multiple Python projects: Virtual environments

    Even Python learning materials that get into very advanced language features rarely mention some practical things that would be very helpful to know as soon as you start working on more serious projects, like:

    • How to install packages written by others so that your code can use them, without just copying the files into your own project.
    • How to work on multiple projects on one computer that might depend on different packages, and even different versions of the same packages, without them interfering with each other.

    The key concept that helps to manage all this is the "virtual environment".

    A virtual environment is a way of giving each of your Python projects a separate and isolated world to run in, with its own version of Python and installed libraries. It’s almost like installing a completely separate copy of Python for each project to use, but it’s much lighter weight than that.

    When you create a virtual environment named "foo", somewhere on your computer, a new directory named "foo" is created. There's a "bin" directory inside it, which contains a "python" executable. When you run that python executable, it will only have access to the python built-in libraries and any libraries that have been installed "inside" that virtual environment.

    Using a Virtual Environment

    When working at the command line, you can put the virtual environment's "bin" directory first on your PATH, what we call "activating" the environment, and from then on, anytime you run python, you'll be running in the environment.
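
    A quick way to confirm which Python you're actually running (just a sketch) is to ask the interpreter itself:

    import sys

    print(sys.executable)  # e.g. /path/to/virtualenv/foo/bin/python when "foo" is active
    print(sys.prefix)      # the root of the active virtual environment (or of the system Python)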

    (This is one advantage of starting your scripts with:

    #!/usr/bin/env python
    

    rather than:

    #!/usr/bin/python
    

    By using the "/usr/bin/env" version, you'll get the first copy of Python that's on your PATH, and if you've activated a virtual environment, your script will run in that environment.)

    Virtual environments provide a "bin/activate" script that you can source from your shell to activate them, e.g.:

    $ . /path/to/virtualenv/foo/bin/activate
    $ python
    ... runs the python from virtual environment "foo"
    

    (but be sure to notice that you have to source the activate script, using . or source, and running the script normally will not activate your virtual environment as you might expect).

    After activating a virtual environment in a shell, a new command deactivate will become available, that will undo the activation.

    Activation is just a convenience for use in an interactive shell. If you're scripting something, just invoke Python or whatever other script or executable you need directly from the virtual environment's bin directory. E.g.:

    /path/to/env/bin/python myprogram.py
    

    or:

    /path/to/env/bin/script
    

    Anything that's been installed in the virtual environment and has executables or scripts in the bin directory can be run from there directly and will run in the virtual environment.

    Installing packages into a virtual environment

    In an activated virtual environment, you'll have a command pip that you'll use to install, update, and remove packages. E.g.:

    $ pip install requests==2.11.1
    

    will make the requests package, version 2.11.1, available to Python programs running in that virtual environment (and no other). It works by installing it in the lib directory within the virtual environment, which will only be on the library path for Python when running under that virtual environment.
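
    A quick sanity check (just a sketch) that the install landed in the environment you expected:

    import requests

    print(requests.__version__)  # '2.11.1' when run with this virtual environment's python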

    When the requests package releases version 2.12, you can upgrade to it with:

    $ pip install -U requests==2.12
    

    or just upgrade to the latest version with:

    $ pip install -U requests
    

    If you decide that your project no longer needs requests, you can remove it:

    $ pip uninstall requests
    

    In practice, we usually put a project's requirements in a file in the project's top directory named requirements.txt, one per line, and then use the -r option of pip to install them all at once. So we might have this in requirements.txt:

    requests==2.12
    psycopg2==2.6.1
    

    and then install all that with this command:

    $ pip install -r requirements.txt
    

    Of course, pip has all sorts of options and other features, which you can read about in the Pip documentation.

    Pip outside of virtual environments

    Pip is automatically provided for you inside all virtual environments, but you can also install it system-wide and use it to install Python packages for the whole system. But there are some things to be careful of:

    • Be aware of whether you're in a virtual environment or not; it's much more likely that you intend to install something in a virtual environment, but if you run pip outside of one, it'll happily install things to the whole system if you have permissions.
    • Using pip to install things to the whole system can create conflicts with Python packages provided by your operating system distribution. Pip will typically install things under /usr/local while OS packaging will install things under /usr, which can help you figure out which version of things you're getting.

    Creating virtual environments

    Python started including direct support for virtual environments in version 3.3, but for versions before that, including Python 2.7, you need to use a third-party tool called virtualenv to create them.

    virtualenv still works for all versions of Python, and since I still need to deal with Python 2.7 on a regular basis, and don't want to try to remember the details of two different ways to create virtual environments, I just use virtualenv for everything. Plus, I still think virtualenv is simpler to use and more flexible.

    Here's how simple using virtualenv is:

    $ virtualenv -p /path/to/python /path/to/new/virtualenv
    

    So if I wanted to create a virtual environment at $HOME/foo that was running Python 2.7, I could run:

    $ virtualenv -p /usr/bin/python2.7 $HOME/foo
    

    That just creates the virtual environment. When you're ready to use it, don't forget to activate it first.

    Note that whenever virtualenv creates a new virtual environment, it automatically installs pip into it. This avoids the chicken-and-egg problem of how to install pip into the virtual environment so you can install things into the virtual environment.

    Installing virtualenv

    You can often install virtualenv using a system package, but it will likely be an old version. Since virtualenv bundles pip, that means all virtual environments you create with it will start with an old version of pip, and will complain at you until you upgrade pip to the latest. (That's easy - pip install -U pip - but gets tiresome.)

    I prefer to just install the latest version of virtualenv, and ironically, the best way to install the latest version of virtualenv system-wide is using pip.

    First, you'll want to install pip on your system - that is, globally. If your distribution provides a package to install pip, use that to avoid possibly breaking your system's installed Python.

    I'll try to provide instructions for a variety of systems, but be warned that the only one I'm currently using is Ubuntu.

    For Debian or Ubuntu, you'd use sudo apt-get install python-pip. For RPM-based distributions, yum -y install python-pip. For Arch, try pacman -S python-pip. On Mac OS X, brew install python should install pip too.

    Only if your distribution does not provide a way to install pip, then you'll need to install it directly. First, securely download get-pip.py. You can click that link and save the file somewhere, or use wget from the command line:

    $ wget https://bootstrap.pypa.io/get-pip.py
    

    Then run get-pip.py, with system privileges:

    $ sudo python get-pip.py
    Collecting pip
      Downloading pip-8.1.2-py2.py3-none-any.whl (1.2MB)
        100% |████████████████████████████████| 1.2MB 932kB/s
    Collecting wheel
      Downloading wheel-0.29.0-py2.py3-none-any.whl (66kB)
        100% |████████████████████████████████| 71kB 8.1MB/s
    Installing collected packages: pip, wheel
    Successfully installed pip-8.1.2 wheel-0.29.0
    

    Now that you have a system pip, you can install virtualenv:

    $ sudo pip install virtualenv
    

    and anytime you want to make sure you have the latest version of virtualenv:

    $ sudo pip install -U virtualenv
    

    Learning more

    virtualenv and pip are just parts of the whole virtual environment ecosystem for Python. They're enough to get a lot done, but here are some other things you might find interesting at some point.

    • virtualenvwrapper provides shell shortcuts to make working with virtual environments easier. For example, instead of having to type . path/to/my-venv/bin/activate, you can just type workon my-venv.

    • If you want to make your own package installable by pip, see the Python Packaging User Guide.

      (And send a word of thanks to the people behind it, the Python Packaging Authority. For many years, the packaging side of Python was a poorly documented mess of different, conflicting, poorly documented tools (did I mention the poor documentation?). In the last couple of years, the PyPA has brought order out of this chaos and the Python development community is so much better for it.)

    Caktus GroupPresidential debate questions influenced by open source platform

    During the past two presidential debates, moderators from ABC and Fox asked candidates Hillary Clinton and Donald Trump voter-submitted questions from PresidentialOpenQuestions.com. The site, created by the bipartisan Open Debate Coalition (ODC) with the support of Caktus Group, is built on top of the open source Django web framework.

    “This coalition effort is a first-of-its-kind attempt to ensure moderators can ask questions that are not just submitted by the public, but voted on by the public to truly represent what Republican, Democratic, and Independent families are discussing around their dinner tables. Open Debates are the future,” said Lilia Tamm Dixon, Open Debate Coalition Director.

    Voters using PresidentialOpenQuestions.com submitted over 15,000 questions and cast more than 3.6 million votes for their favorite submissions. The selected debate questions had an unprecedented audience. According to Nielsen Media, 66.5 million viewers watched the second debate and 71.6 million the third debate.

    The ODC and Caktus teams continue to make improvements to the platform, readying new versions for use in political debates around the country. For national media coverage on the Open Debate Coalition and to learn more about their goals, see articles from The Atlantic, The Los Angeles Times, and Politico.

    Tim HopperUnderstanding Probabilistic Topic Models By Simulation

    I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specifically LDA) by using Python for simulation. Here's the description:

    Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.

    You can watch the video on Youtube:

    I gave a shorter version of the talk at PyData NYC 2015.

    Caktus GroupShipIt Day Recap Q3 2016

    This ShipIt day marks four years of ShipIt days at Caktus! We had a wide range of projects that people came together to build. Most importantly, we all had fun and learned through actively working on the projects. People explored new technologies and tools, and had a chance to dig a bit deeper into items that piqued their interest in their regular work.

    React + Django = django-jsx

    Calvin did some work inspired by a client project to create tools for working with React’s JSX DOM manipulation within Django projects. This bridge allows embedding of JSX in Django templates (even using Django template language syntax) to be compiled and then rendered on the page. Calvin released django-jsx up on Github and pypi, and is interested in feedback from people who use it.

    Django ImageField compression

    Dmitriy continued working on the TinyPNG compressed Django ImageField from the previous ShipIt Day. He’s shared his updates through the Github repository django_tinypng. This time Dmitriy worked on cleaning up the project in preparation for possibly submitting it to pypi. His work included adding documentation and a nice way to migrate pre-existing image fields in projects to the new compressed image field.

    Python microservices with asyncio

    Dan explored the asyncio capabilities of Python 3 via a long standing project of his. He had a previous project to control displaying videos. Issues came up when the player would lose connectivity and Dan wanted his program to be able to dynamically recover. Dan dug into the asyncio documentation head first, but was a bit overwhelmed by the scope of the library. Luckily, he found an excellent write up by Doug Hellmann in his Python Module of the Week series. Dan used what he learned to build an event loop, and focused on making his project more resilient to handle errors gracefully.
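
    As a rough illustration (not Dan's actual code) of the kind of resilience asyncio makes easy to express, here's a minimal sketch of a reconnect loop that keeps retrying a flaky connection, backing off between attempts, without blocking the rest of the program. The connect coroutine and its wait_closed method are hypothetical stand-ins.

    import asyncio

    async def keep_player_alive(connect, delay=5):
        while True:
            try:
                player = await connect()      # hypothetical coroutine that connects to the player
                await player.wait_closed()    # run until the connection drops
            except (ConnectionError, OSError):
                pass                          # swallow the error and reconnect
            await asyncio.sleep(delay)        # back off before trying again

    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(keep_player_alive(my_connect))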

    More Python microservices with asyncio

    Mark created a set of microservices working together including a worker, a web server to handle webhook requests, a web hook generator, and a logging server. These services communicated together using rabbitmq and asyncio. The work that Mark did on this was a fun but relevant diversion from his prep for his upcoming All Things Open talk next week on RabbitMQ and the Advanced Message Queueing Protocol.

    Microsoft Azure provisioning and deployment

    David worked with Microsoft Azure and compared it with our standard provisioning and deployment practices. He learned about how configuration and management tools around Azure compare to those of other cloud providers. As a test case, David built a test application using our SaltStack based django-project-template and worked on getting the test application up and running to identify any pain points.

    Elixir

    Neil shared with the team his explorations into using Elixir. Elixir is a functional language built on top of Erlang’s virtual machine (VM), but without some of the peculiarities of the older Erlang language. The Erlang VM was developed with extreme fault tolerance in mind for creating telecom software (eg. the electronic phone switching system) that would never go down and could even be upgraded without downtime. Neil delved into this high availability mindset by creating a test project with worker processes handling data storage and supervisor processes in place to restart failed worker processes. Overall, Neil found the exploration useful in that understanding a wide range of programming language paradigms helps you to think through challenges in any language, in different ways.

    Selenium testing

    Rebecca and Alex both worked on updating or adding front-end tests to projects via Selenium. Rebecca looked at updating the tests in django-scribbler in preparation of an upgrade of the open source project to support Django 1.10. Alex looked into using information from Mark Lavin’s 2015 DjangoCon talk on front-end testing and amazing documentation to add front-end tests to an existing project.

    Implicit biases

    Charlotte F., Liza, and Sarah facilitated Caktus team members taking an implicit bias test from Harvard’s Project Implicit. Caktus team members participated by taking the test and anonymously shared their results. Charlotte and Liza reviewed the responses and compared them to the average responses for all project implicit respondents. This was a great way for team members to examine and become aware of implicit biases that may arise when interacting with people inside and outside of Caktus.

    Team availability forecasts

    Sarah worked on a new tool for forecasting team availability via a tool that uses project schedules in their native format so that they do not need to be reformatted for forecasting. In working on this project, Sarah had a chance to learn some new spreadsheet creation tools and practice laying out a sheet in a way that can be sustainably maintained.

    UX design case study for cat adoption

    Charlotte M. and Basia Coulter invited everyone at Caktus to participate as clients would in a UX discovery workshop. Alley Cats and Angels is a local animal rescue that Caktus is particularly fond of. Alley Cats and Angels has an internal database to track information about cats getting enrolled in its programs, foster homes, adoption applications, adopters etc. It also has a public-facing website where its programs are described and relevant application forms can be submitted, and where cats available for adoption are featured. But there is no automated communication between the database and the public-facing website, and not all important information is being tracked in the database. That results in significant overhead of manual processes required to keep all information in order, and to facilitate programs. Using a story mapping technique, Charlotte and Basia worked with Caktii to map out a web-based application that would allow for an integration of the internal database and the public-website, automation of critical processes, and more complete information tracking. They identified important user flows and functionality, and broke them down into individual user stories, effectively creating a backlog of tasks that could be prioritized and leveraged in a sprint-based development process. They also determined which features were necessary for the first iteration of the application to deliver client value. By doing so, they defined a version of the minimum viable product for the application. At the end of the workshop they took some time to sketch paper prototypes of selected features and screens. The result of the workshop was a comprehensive set of deliverables (user flows, backlog of user stories, minimum viable product, and paper prototypes) that could serve as starting point for application development.

    Tim HopperUndersampled Radio Interview

    I was flattered to be asked to be on a burgeoning data science podcast called Undersampled Radio. You can listen here. We recorded the interview on a Google Hangout, so you can also watch it here.

    Caktus GroupDon't keep important data in your Celery queue

    The Celery library (previous posts) makes it as easy to schedule a task to run later as calling a function. Just change:

    send_welcome_email('dan@example.com')
    

    to:

    send_welcome_email.apply_async(args=['dan@example.com'])
    

    and rely on Celery to run it later.

    But this introduces a new point of failure for your application -- if Celery loses track of that task and never runs it, then a user will not get an email that we wanted to send them.

    This could happen, for example, if the server hosting your Celery queue crashed. You could set up a hot backup server for your queue server, but it would be a lot of work.

    It's simpler if you treat your Celery queue like a cache -- it's helpful to keep around, but if you lose it, your work will still get done.

    [Ed: the code snippets in this post are just intended as little illustrations to go with the ideas in this post, and by no means should they be taken as examples of how to implement these ideas in production. ]

    We can do that by changing our pattern for doing things in the background. The key idea is to keep the information about the work that needs to be done in the database, with the rest of our crucial data.

    For example, instead of:

    send_welcome_email.apply_async(args=['dan@example.com'])
    

    we might add a needs_welcome_email Boolean field to our model, and write:

    user.needs_welcome_email = True
    user.save()
    

    Now we know from our database that this user needs to get a welcome email, independently of Celery's queue.

    Then we set up a periodic task to send any emails that need sending:

    @task
    def send_background_emails():
        for user in User.objects.filter(needs_welcome_email=True):
            send_welcome_email(user.email)
            user.needs_welcome_email = False
            user.save()
    

    We can run that every 5 minutes; it'll be quite cheap to run if there's no work to do, and it'll take care of all the "queued" welcome emails that need sending.

    If we want the user to get their email faster, we can just schedule another run of the background task immediately:

    user.needs_welcome_email = True
    user.save()
    send_background_emails.apply_async()
    

    And the user will get their email as fast as they would have before.

    We will still want to run the background task periodically in case our queued task gets lost (the whole point of this), but it doesn't have to run as frequently since it will rarely have any work to do.

    By the way, I learned this while doing a code review of some of my co-worker Karen's code. This ability to continue learning is one of my favorite benefits of working on a team.

    Expiring tasks

    Now that we've made this change, it opens up opportunities for more improvements.

    Suppose we're scheduling our periodic task in our settings like this:

    CELERYBEAT_SCHEDULE = {
        'process_new_work': {
            'task': 'tasks.send_background_emails',
            'schedule': timedelta(minutes=15),
        },
    }
    

    Every 15 minutes, celery will schedule another execution of our background task, and if all is well, it'll run almost immediately.

    But suppose that our worker is unavailable for a while. (Maybe it lost connectivity temporarily.) Celery will keep on queuing our task every 15 minutes. If our worker is down for a day, then when it comes back, it'll see 24*4 = 96 scheduled executions of our task, and will have to run the task 96 times.

    In our case, we're not scheduling our task all that frequently, and the task is pretty lightweight. But I've seen times when we had thousands of tasks queued up, and when the workers were able to resume, the server was brought to its knees as the workers tried to run them all.

    We know that we only need to run our task once to catch up. We could manually flush the queue and let the next scheduled task handle it. But wouldn't it be simpler if Celery knew the tasks could be thrown away if not executed before the next one was scheduled?
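
    (For reference, "flushing the queue" usually means running Celery's purge command against your broker, which discards every waiting task message. Assuming your Celery app is named myapp, that's something like:)

    $ celery -A myapp purge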

    In fact, we can do just that. We can add the expires option to our task when we schedule it:

    CELERYBEAT_SCHEDULE = {
        'process_new_work': {
            'task': 'tasks.send_background_emails',
            'schedule': timedelta(minutes=15),
            'options': {
                'expires': 10*60,  # 10 minutes
            }
        },
    }
    

    That tells Celery that if the task hasn't been run within 10 minutes of when we scheduled it, it's not worth running at all and should just be thrown away. That's fine, because the next one will be scheduled five minutes after that.

    So now what happens if our workers stop running? We continue adding tasks to the queue - but when we restart the worker, most of those tasks will have expired. Every time the worker comes to a task on the queue that is expired, it will just throw it away. This allows it to catch up on the backlog very quickly without wasting work.
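
    The expires option isn't limited to the beat schedule, either. If you queue an extra run yourself, as in the earlier example, you can pass it straight to apply_async. A minimal sketch:

    # Skip this run entirely if no worker picks it up within 10 minutes.
    send_background_emails.apply_async(expires=10 * 60)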

    Conclusion

    As always, none of these techniques will be appropriate in every case, but they're worth keeping in your toolbox for the times when they are.

    Caktus GroupPyData Carolinas 2016 Recap

    We had a great time at the inaugural PyData Carolinas hosted nearby at IBM in the Research Triangle Park. People from Caktus presented a number of talks, and the videos are now up online.

    There were also many more fascinating talks about how people in and around North and South Carolina are using Python to do data analysis with Pandas, Jupyter notebooks, and more. It was a great event that brought together the strong communities around data and Python locally to celebrate their overlapping interests. We had a great time meeting folks and reconnecting with old friends at the after hours events hosted by MaxPoint and the Durham Convention & Visitors Bureau. Many thanks to all of the local organizers and sponsors who worked together to put on a great program and event. We can’t wait until the next one!

    Philip SemanchukCreating PDF Documents Using LibreOffice and Python, Part 3

    This is part 3 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 and part 2 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

    Here in part 3, I review the conversation we (the audience and I) had at the end of the PyOhio talk. I committed the speaker’s cardinal sin of not repeating (into the microphone) the questions people asked, so they’re inaudible in the video. In addition, we had some interesting conversations among multiple people that didn’t get picked up by the microphone. I don’t want them to get lost, so I summarized them here.

    The most interesting thing I learned out of this conversation is that LibreOffice can open PDFs; once opened they’re like an ordinary LibreOffice document. You can edit them, save them to ODF, export to PDF, etc. Is this cool, or what?

    First Question: What about Using Excel or Word?

    One of the attendees jumped in to confirm that modern MS Word formats are XML-based. However, he went on to say, the XML contains a statement at the top that says something like “You cannot legally read the rest of this file”. I made a joke about not having one’s lawyer present when reading the file.

    In all seriousness, I can’t find anything online that suggests that Microsoft’s XML contains a warning like that, and the few examples I looked at didn’t have any such warning. If you can shed any light on this, please do so in the comments!

    We also discussed the fact that one must invoke the office app (LibreOffice or Word, Excel, etc.) in order to render the document to PDF. LibreOffice has a reputation for performing badly when invoked repeatedly for this purpose. LibreOffice 5 may have addressed some of these problems, but as of this writing it’s still pretty new so the jury is still out on how this will work in practice.

    Another attendee noted that Microsoft Office can save to LibreOffice format, so if Word (or Excel) is your document-editing tool of choice, you can still use LibreOffice to render it to PDF. That’s really useful if MS Office is your tool of choice but you’re doing rendering on a BSD/Linux server.

    Question 2: What about Scraping PDFs?

    The questioner noted that scraping a semi-complex PDF is very painful. It’d be ideal, he said, to be able to take a complex form like the 1040 and extract key-value pairs of the question and answer. Is the story getting better for scraping PDFs?

    My answer was that for the little experience I have with scraping PDFs, I’ve used PDFMiner, and the attendee said he was using the same.

    Someone else chimed in that it’s a great use case for [Amazon’s] Mechanical Turk; in his case he was dealing with old faxes that had been scanned.

    Question 3: Helper Libraries

    Matt Wilson asked if it would make sense to begin building helper libraries to simplify common tasks related to manipulating LibreOffice XML. My answer was that I wasn’t sure since each project has very specific needs. Someone else suggested that one would have to start learning the spec in order to begin creating abstractions.

    In the YouTube comments, Paul Hoffman called our attention to OdfPy, a “thin abstraction over direct XML access”. It looks quite interesting.

    Comment 1: Back to Scraping

    One of the attendees commented that he had used Jython and PDFBox for PDF scraping. “It took a lot to get started, but once I started to figure out my way around it, it was a pretty good tool and it moved pretty speedily as compared to some of the other tools I used.” He went on to say that it was pretty complete and that it worked very well.

    Question 4: About XML Parsing

    The question was what I used to parse the XML, and my answer was that I used ElementTree from the standard library. Your favorite XML parsing library will work just fine.
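
    For anyone curious what that looks like, here is a minimal sketch (the filename is hypothetical): an ODF document is a ZIP archive whose body lives in content.xml, which the standard library can read directly.

    import zipfile
    import xml.etree.ElementTree as ET

    # ODF text elements live in this XML namespace.
    TEXT_NS = '{urn:oasis:names:tc:opendocument:xmlns:text:1.0}'

    # An .odt file is a ZIP archive; the document body is in content.xml.
    with zipfile.ZipFile('example.odt') as odt:
        root = ET.fromstring(odt.read('content.xml'))

    # Print the text of every paragraph in the document.
    for paragraph in root.iter(TEXT_NS + 'p'):
        print(''.join(paragraph.itertext()))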

    Question 5: Protecting Bookmarks

    The question was whether or not I did anything special to protect the bookmarks in the document. My answer was that I didn’t. (I’m not even sure it’s possible.) If you go through multiple rounds of editing with your client, those invisible bookmarks are inevitably going to get moved or deleted, so expect a little maintenance work related to that.

    Comment 2: Weasyprint

    One of the attendees commented that Weasyprint is a useful HTML/CSS to PDF converter. My observation was that tools in this class (HTML/CSS to PDF converters) are not as precise as either of the methods I outlined in this talk, but if you don’t need precision they’re probably a nice way to go.

    Question 6: unoconv in a Web Server

    Can one use unoconv in a Web server? My answer was that it’s possible, but it’s not practical to use it in-process. For me, it worked to do so in a demo of an intranet application, but that’s about as far as you want to go with it. It’s much more practical to use a distributed processing application (Celery, for example).

    One of the attendees concurred that it makes sense to spin it off into a separate process, but “unoconv inexplicably crashes when it feels like it”.
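
    One common way to do that, sketched below, is to shell out to unoconv from the worker process so that a hung or crashed conversion only affects that one job (the function and path here are illustrative, not production code):

    import subprocess

    def convert_to_pdf(path):
        # Run unoconv out of process; by default it writes the PDF next to
        # the source document.
        subprocess.run(['unoconv', '-f', 'pdf', path], check=True, timeout=120)

    convert_to_pdf('/tmp/report.odt')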

    Comment 3: Converting from Word

    The initial comment was that pandoc might help with converting from Word to LibreOffice. This started a conversation which I’d summarize this way:

    • LibreOffice can open MS Office docs, so use that instead of pandoc and save as LibreOffice
    • If you open MS Office documents with LibreOffice, double check the formatting because it doesn’t always survive the transition
    • LibreOffice can open PDFs for editing.

    Caktus GroupHow to tell if you’re building great user experience

    In web and software development, when we talk about user experience, we usually mean the experience you have when you are using a website or an application. Does it feel like you know where to click to get to your next destination? Do you know what to do in order to accomplish a task? Has anything you’ve clicked taken you to an unexpected page? Are you getting frustrated with or are you enjoying the website or app? Answers to these and similar questions are what describes the experience you’re having as a user. That’s user experience.

    Nobody wants to deliver a bad experience. It’s a common sense approach to build applications that people will want to use, especially if your bottom line depends on the adoption and usage of the application you’re building. Ultimately, it’s not cutting edge technology that’ll make your application successful; it’s the people who will pay for your service or product. And people who use technology have increasingly higher expectations about the experience your product offers.

    While there is a gut-level understanding around the idea that a happy customer is a customer more likely to pay, there seems to be less of a recognition that if a customer happens to be a user (of your application or website), it is the user you need to make happy.

    “Numerous industry studies have stated that every dollar spent on UX brings in between $2 and $100 dollars in return.” Peter Eckert, FastDesign

    Product Stakeholders, Designers, or Developers Usually Aren’t the Users

    It turns out that building great user experience isn’t easy. But it is possible if you invest the time in understanding your users and their predicaments. One of the most common mistakes made in software design and development is forgetting that product stakeholders, designers, or developers usually are not the user segments for whom the application is being built. Any interface or application flow you design without understanding your target users’ pain points, their current workflows, and their needs and wants, will be but an assumption. And as humans we tend to assume that others would think and act as we do. So long as you let your assumptions go untested, you’re not really building for your users.

    Methods to Build Great User Experience

    There are methods that help forge an understanding of users and the contexts within which they function. You can employ these methods before any design and development work is commenced, during the product design and development processes, and after the product is built and released. For the purpose of this article, I’ve chosen to group those methods as follows:

    • Product discovery
    • Qualitative user research (user interviews, contextual inquiry)
    • Usability testing (testing your application with real users)
    • Quantitative user research (data analytics and user surveys)
    • Heuristic evaluation (UX review)

    “While it's not realistic to use the full set of methods on a given project, nearly all projects would benefit from multiple research methods and from combining insights.” Christian Rohrer, Nielsen Norman Group

    • A new project should start with a discovery phase, during which you build a shared understanding with all stakeholders about the problem to be solved and the users for whom it is being solved.
    • Qualitative user research methods are used before product design and development begin, but they can continue into product development stages, and beyond. They allow you to ask users directly about their needs and pain points or to observe them in contexts in which they would be using your product.
    • A project in progress (and, of course, any existing project) always benefits from usability testing. Usability testing is an effective method to discover usability problems and to make adjustments consistent with user expectations. In-person usability testing sessions are relatively easy and inexpensive to set up. With three to five users per testing session, you have a great chance of discovering usability issues your application may be suffering from.
    • Quantitative user research methods give you insights into questions such as “who,” “what,” “where,” “how much,” “how many.” You can learn how many people visit your website, which pages they visit, who they are, and where they are coming from. You can also analyse user surveys and understand the distribution of users’ responses. All that data can help you identify your user segments and their behaviors.
    • Finally, in order to analyze an existing application against established UX conventions, you can conduct a heuristic evaluation, assessing the interface against a set of recognized usability heuristics.

    For Best Results, Use a Combination of Methods

    It would be difficult to include all user experience methods and tools on any given project. Time and budgets are always important factors to consider. A good approach is to leverage a combination of a small number of different methods, for example a discovery workshop and a user survey before a new project begins, and usability testing with a dash of quantitative methods once development is underway. You can customize your tool set for each project as needed in order to make sure you build as good a user experience as you possibly can.

    Tim HopperSharing Your Side Projects Online

    I gave a talk at Pydata Carolinas 2016 on Sharing Your Side Projects Online. Here's the abstract:

    Python makes it easy to create small programs to handle all kinds of tasks, and tools like Github make it easy and free to share code with the world. However, simply adding a *.py to a Github repository (or worse: a zip file on your personal website) doesn't mean other Python programmers will be able to run and use your code.

    For years, I've written one-off scripts and small programs to automate personal tasks and satisfy my curiosity. Until recently, I was never comfortable sharing this code online. In this talk, I will share good practices I've learned and developed for sharing my small projects online.

    The talk will include tips on writing reusable scripts, the basics of Git and Github, the importance of READMEs and software licenses, and creation of reproducible Python environments with Conda.

    Besides making your code more usable and accessible to others, the tips in this talk will help you make your Github profile a valuable component of your online résumé and open the door for others to improve your programs through Github pull requests.

    The video is now online. I sincerely hope others find it valuable.

    Caktus GroupPrinciples of good user experience (UX)

    Google “UX principles” and you’ll end up with a search results page that offers:

    • 5 principles...
    • 31 fundamentals….
    • 10 principles…
    • Guidelines…
    • Basics….

    So let’s get this out of the way: no single checklist will guarantee that you create a great user experience. Every project is different, and the development of every product should be tailored to the user segment the product is built for. But there are some principles that can guide design decisions no matter what and you will not go wrong if you follow them. Let’s talk about a few:

    • Consistency,
    • Content structure,
    • Affordances and signifiers, and
    • Feedback (it’s not opinions!).

    Consistency

    Consistency applies to a range of design contexts:

    • Visual design,
    • Typography,
    • User interface (UI) design,
    • Interactions design,
    • Markup (the HTML code written by developers),
    • Language (the copy written by content strategists),
    • And more.

    Consistency means that we use the same devices—whether visual, typographic, interactive, or linguistic—to convey the same type of information across the entire website or application. If a green checkmark means success, use the same shade of green, size, shape, and style, no matter where in the application it is used.

    Content structure

    [Image: content structure example]

    Content structure reflects the information architecture of the application. Every website or application has content that can be divided up into categories. Those categories have certain relationships to one another:

    • Some categories are like siblings, they reside at the same level of hierarchy;
    • Other categories have parent-child relationships, with parent categories containing the children within them.

    Even on a single web page, it is not only important to divide content into smaller chunks and to categorize it, but also to convey visually the relationships between the various pieces.

    Users do not read text on the screen word by word, but rather scan it to locate points of interest and drill deeper from there. For that reason, content on any interface must be scannable. Some techniques used to enhance user experience include:

    • Breaking up text into short paragraphs,
    • Using clearly distinguishable and semantic headings,
    • Displaying text as bulleted lists,
    • Using shorter line length, and
    • Aligning text to the left.

    Affordances and signifiers

    [Image: door knobs with different affordances]

    Affordance is a property of an object that allows us to use that object in a specific way. A well-designed affordance is accompanied by signifiers that communicate to us, “I can be used in this way!”

    Think about a lever door handle: you know that you need to press down on it in order to open the door. A spherical knob, on the other hand, communicates a need to turn it to accomplish the same outcome. The shape of these devices informs us about how they can be used. The same is true for websites and applications.

    Users recognize a piece of underlined text as a hyperlink. In this case, linked text is an affordance and the underline is a signifier that communicates clickability. When we talk about affordances and signifiers of an interface, we’re talking about interface elements and visual cues that accompany them that help users understand that those elements are interactive and what kind of interaction to expect from them.

    Feedback (not an opinion!)

    [Image: example of feedback]

    Feedback is any way in which the system communicates to the user that an action has occurred and what outcome results from that action. Take, for example, a task of filling out and submitting a form online. It’s important that you design the form to communicate feedback to the user upon submission.

    Form submission feedback could be provided as a text message, change of color, a subtle animation, or a combination of all of the above. No matter how the feedback is expressed, however, it must convey a meaning such as “You have been successful in submitting this form,” or “You have not been successful in submitting this form.”

    Users need to know what happens when they interact with an interface. They need to understand what result has come out of their specific interaction with the system, and what they should do next based on that outcome.

    Takeaway

    Whether you follow five principles, ten principles, or thirty-one fundamentals, you have to test your application with actual users. Designers and developers make assumptions along the way, no matter how diligent they are in applying principles of good UX to product development. Only testing those assumptions with actual people using the application will help weed out any lingering problems that may compromise the great user experience you want to offer your users.

    Caktus GroupWhat We’re Clicking - September 2016 Link Roundup

    Every 30 days, we look over the most popular and talked about links that we’ve either shared on social media or amongst ourselves. This month, a lot of the favorites were from our peers or from our own blog. Here’s our top five for September.

    Insights into software development from a quality assurance (QA) pro

    Our QA Analyst, Charlotte Fouque, sheds light onto what exactly quality assurance is and shares with us the intricacies of doing it well.

    NoSQL Databases: a Survey and Decision Guidance

    Felix Gessert of Baqend shares an overview of the NoSQL landscape and various tradeoffs in this highly detailed article. The article demonstrates how difficult it can be to match your database to your work/queries.

    Audiences, Outcomes, and Determining User Needs

    “Every website needs an audience. And every audience needs a goal. Advocating for end-user needs is the very foundation of the user experience disciplines. We make websites for real people. Those real people are able to do real things. Everyone is happy.”

    Digital development principles: a tech firm’s take on understanding ecosystems

    Caktus UX Designer Basia Coulter and Strategist Tania Lee talk about ways to understand existing ecosystems and build consensus behind goals and solutions.

    Creating Your Code Review Checklist

    In this DZone article, Erik Dietrich presents not only a code review checklist, but a philosophy: automate the easy stuff, code review the important stuff.

    Philip SemanchukThanks for PyData Carolinas

    [Image: my PyData pass]

    Thanks to all who made PyData Carolinas 2016 a success! I had conversations about eating well while on the road, conveyor belts, and a Fortran algorithm to calculate the interaction of charged particles. Great stuff!

    My talk was on getting Python to talk to compiled languages; specifically C, Fortran, and C++.

    Once the video is online I’ll update this post with a link.

     

    Caktus GroupCaktus Group @ PyData Carolinas 2016

    Tomorrow marks the official beginning of PyData Carolinas 2016 (though technically, the tutorials started today). This is the first time PyData has hosted a conference in our area. We’re especially proud of the way local leaders and members of meetups like TriPython, TechGirlz, Girl Develop It RDU, and PyLadies have worked in tandem to put this event together for the Python community.

    Caktus will be at PyData tomorrow and Friday as a Silver sponsor. We’re glad to be in the company of esteemed sponsoring organizations like IBM, RENCI, Continuum Analytics, and the Python Software Foundation.

    Come see us at the following PyData events and talks!

    Wednesday, September 14th


    7:00PM - 8:00PM
    Evening Social at the 21c Museum Hotel
    Join us after the tutorials for a social hosted by the Durham Convention & Visitors Bureau. More details here.

    Thursday, September 15th


    8:30AM - 5:30PM
    Caktus Booth
    We’ll have a booth with giveaways for everyone plus a raffle. We’ll also have a display of OpenDataPolicingNC, a project Caktus CTO Colin Copeland helped lead and one that earned Code for Durham a nod from the White House.

    11:30AM - 12:10PM
    Reach More People: SMS Data Collection with RapidPro (Room 1)
    Erin Mullaney (Caktus) and Rebecca Muraya (TransLoc) will share how to use RapidPro, an open source SMS survey data collection app developed by UNICEF, to collect data. They’ll also show you how to use RapidPro’s API to create your own data visualizations.

    11:30AM - 12:10PM
    Python, C, C++, and Fortran Relationship Status: It’s Not That Complicated (Room 2)
    Philip Semanchuk, a Caktus contractor, gives an overview of your many options for getting Python to call and exchange data with code written in a compiled language. The goal is to make attendees aware of choices they may not know they have, and when to prefer one over another.

    6:30PM - 8:30PM
    Drinks & Data (The Rickhouse, Durham)
    We're looking forward to this event, hosted by MaxPoint. It overlooks the park where the Durham Bulls play.

    Friday, September 16th


    8:30AM - 5:30PM
    Caktus Booth
    Do stop on by to say hello! We’d love to learn more about the different projects you’re working on.

    10:40AM - 11:20AM
    Identifying Racial Bias in Policing Practices: Open Data Policing (Room 2)
    Colin Copeland, Caktus Co-founder and CTO, Co-chief of Code for Durham, and a 2015 Triangle Business Journal 40 Under 40 awardee, will give a talk on OpenDataPolicingNC.com. His efforts were recognized via an invitation to the White House during Obama’s Police Data Initiative celebration. North Carolina developers and civil rights advocates used demographic data from nearly 20,000,000 unique NC traffic stops to create a digital tool for identifying race-based policing practices.

    11:30AM - 12:10PM
    You Belong with Me: Scraping Taylor Swift Lyrics with Python and Celery (Room 1)
    Mark Lavin, Caktus Technical Director, and Rebecca Conley, Caktus developer, will demonstrate the use of Celery in an application to extract all of the lyrics of Taylor Swift from the internet. Expect laughter and fun gifs.

    12:30PM
    Raffle drawing for a copy of Lightweight Django (O’Reilly)
    We’ll contact the winner just in time for the book signing. Lightweight Django is co-authored by Caktus Technical Director Mark Lavin.

    12:45 - 1:10PM
    Book signing of Lightweight Django (O’Reilly) with Mark Lavin
    Line up early! We only have limited copies to give away. Each time we’ve done a book signing, the line has been far longer than copies available. For those who aren’t able to get a copy of the book, we’ll have coupon cards for a discount from O’Reilly.

    Can’t join us?

    If you can’t join us at PyData Carolinas and there’s a talk of ours you want to see, we’ll have the slides available after the conference. You can also follow us on Twitter during PyData itself: @caktusgroup.

    Caktus GroupWhat Web Analytics Can’t Tell You About User Experience

    Is analytics data collected for a website, an application, or a game sufficient to understand what problems users encounter while interacting with it and what prevents their full engagement?

    Why would you want to engage your client in a discovery workshop or your client’s users in user interviews, user surveys, or usability testing sessions, if you can simply look at the data gathered by an analytics tool, and tell with a high level of precision what’s working, and what is not?

    “The biggest issue with analytics is that it can very quickly become a distracting black hole of “interesting” data without any actionable insight.” - Jennifer Cardello, Nielsen Norman Group

    What metrics do you get out of data analytics?

    Analytics tools track and assemble data from events that happen on an existing website or in an application.

    The types of quantitative data you can collect with an analytics tool include:

    • Number of visits/sessions
    • Average duration of visits/sessions
    • Number of unique visitors
    • Average time on page
    • Percentage of new visits
    • Bounce rate
    • Sources of traffic
    • List of pages where users exit the website/application

    It is an abundant source of information. Data analytics tell you what users do on your website and where—and on which pages—they do it.

    So what’s missing from this picture?

    While data analytics are incredibly powerful in identifying the “whats” and the “wheres” of your website/application’s traffic, they tell you nothing about the “why.” And without an answer to the “why,” you are a step away from misinterpretation.

    Data analytics can be misleading if not supported by insights from qualitative user research

    Let’s say you notice that an average visit time on a given page is high. You might be tempted to congratulate yourself for having created such an engaging experience that users spend several minutes on the page. But it is equally possible that the experience you have created is confusing. It takes users a lot of time to make sense of what they are looking at on the page, and they’re spending all that time in deep frustration.

    Quantitative data can track a user’s journey through your website or application. They help you ask better questions, verify hypotheses about patterns of usage, and optimize the application’s performance to align with desired user behaviors.

    What data analytics cannot do is identify usability issues. Usability issues and their causes are best diagnosed through usability testing.

    Don’t take my word for it

    UX professionals frequently report their own and their clients’ inability to draw conclusive answers from data analytics alone. Below are a few insights from conversations I’ve had with UX practitioners on the UX.guide Slack channel.

    Christian Ress, co-founder at PlaytestCloud (a mobile game usability testing platform), says that customers often come to them because they spotted issues during soft-launch through their analytics. They see, for example, low interaction with certain features, retention issues, or a much higher number of attempts for certain game levels, but they do not understand what is causing those problems. It is through remote usability and playability testing sessions that the causes of the problems signaled by quantitative data can be discovered. Remote usability and playability testing involves recording players and prompting them to think out loud during all gameplay sessions.

    David Sharek, the founder of UX.guide, finds the greatest challenge in data overload, when a lot of quantitative information is collected without a sufficient amount of time spent on defining the problem. David approaches an investigation into product performance and usability like any research experiment. He formulates a hypothesis and sets out to test it. The quantitative data he collects with the analytics tool Piwik helps him verify hypotheses about the “what” of user behavior. Then he drills deeper into the “why” by talking to users.

    Vivien Chang, a UX designer in Brisbane, points out that quantitative methods are used to confirm or disconfirm working hypotheses about the usage patterns within an application, and they require a significant amount of data to do so. Qualitative methods, on the other hand, are tools to gain an understanding of underlying reasons for user actions and users’ motivations. In other words, you collect quantitative data to learn how people use your website or application. That information in itself gives you little or no insight into what problems users might be encountering in the process. To identify and counter usability issues, you should conduct qualitative studies such as usability testing.

    What’s the secret sauce?

    When you build a product such as a website or an application, you must pay attention to user experience. Your product’s success is not merely dependent on a cutting edge technology you may have employed; it depends on users (or customers) adopting the product. And increasingly sophisticated and savvy users won’t settle for a mediocre experience. You must give them the best experience you can.

    How do you build a great experience? By taking strategic advantage of all the tools in your toolbox. You begin the journey by exploring the problem to be solved, understanding the users, and the broader context in which they function. Through discovery workshops, you build a shared understanding with all stakeholders and work together as a team to design a great solution. You monitor potential and actual usability pain points by testing the product iterations with users and adjusting the product’s design accordingly. You measure performance and monitor user behavior patterns with data analytics to further back up your product strategy decisions. Then you dig deeper to understand the causes of user actions by conducting more usability testing.

    There you have it; the secret sauce to understanding the “what,” the “where,” and the “why” of user experience by tying together quantitative and qualitative user research methods.

    Og MacielPodcasts I've Been Listening To Lately

    Podcasts

    For someone who has run his own podcast for several years (albeit not generating a lot of content lately), it took me quite some time to actually start listening to podcasts myself. Ironic, I know, but I guess the main reason behind this was because I was always reading code at work and eventually, no matter how hard I tried, I just couldn't pay attention to what was being said! No matter how interesting the topic being discussed was or how engaging the host (or hosts) were, my brain would be so focused on reading code that everything else just turned into white noise.

    Well, fast forward a couple of years and I still am reading code (though not as much as I used to due to a new role), and I still have a hard time listening to podcasts while at work... so I decided to only listen to them when I was not working. Simple, right? But it took me a while to change that for some reason.

    Anyhow, I now listen to podcasts while driving (which I don't really do a lot of since I work from home 99.99% of the time) or when I go for walks, and after a while I have started following a handful of them which are now part of my weekly routine:

    • All The Books which provides me with an up-to-date list of suggestions for what books to read next. They're pretty regular with their episodes, so I can always count on hearing about new books pretty much every week.
    • Book Riot for another dose of more news about books!
    • Hack the Entrepreneur to keep up with people who are making something about what they are passionate about.
    • Wonderland Podcast which I only started listening to a few weeks back, but it has turned into one of my favorites.
    • Science Vs another new addition to my list, with entertaining takes on interesting topics such as 'the G-spot', 'Fracking', 'Gun Control' and 'Organic Food'.

    Today I was introduced to Invisibilia and though I only listened to the first 10 minutes (I was given the link during working hours, so no go for me), I'm already very interested and will follow it.

    I do have other podcasts that I am still subscribed to, but these listed here are the ones that I am still following every episode. Maybe if I had to drive to work every day or went for walks more often, I would listen to more podcasts? Trust me though, I'd rather continue listening to only a small set of them than drive to work every day. Don't get me wrong, I love going to work, but that's 2 hours/day of my life that I'd rather spend at home :)

    Caktus GroupDigital development principles: a tech firm’s take on understanding ecosystems

    When we meet potential clients, we want to learn more about their software development needs. Beyond that, we’re deeply curious about the work they do, those involved, and the kind of impact they desire to make in the world.

    Digital Principle 2, "Understand the Existing Ecosystem" embraces this idea. In many ways, the Digital Principles are an extension of conversations that are ongoing throughout the greater technology community. We're an Agile company, and one of the four propositions of the Agile manifesto reads, “Customer collaboration over contract negotiation.” No collaboration is complete without the inclusion of the user and relevant communities. We share here one of the methods we use to uphold Digital Principle 2.

    A critical tool to building shared understanding: Discovery Workshop

    We strive to reach a shared understanding of the existing ecosystem and to build consensus behind goals and solutions. To best design for relevance and sustainability, we collaborate with as many stakeholders as possible during discovery workshops. A discovery workshop is a method for all stakeholders to acknowledge existing assumptions. We then:

    • State hypotheses,
    • Brainstorm ideas, and
    • Prototype solutions for any set of problems that need solving.

    The client’s software development needs remain front and center, but it is through the process of building a thorough and shared understanding of the ecosystem that we can arrive at tailored solutions that address the actual needs in a more impactful way.

    We use a physical, participatory process that focuses on centering the human perspective and contextual environment with all the complexities in between. Human beings learn best by doing. We also tend to make assumptions without being fully aware of them. Within the collaborative environment of a discovery workshop, we get to:

    • Represent a range of perspectives,
    • Create the perfect setting to dig deep into the questions that are being asked,
    • Reframe the existing questions, and
    • Discover new questions that may have been missed before.

    A discovery workshop allows all stakeholders to suspend the focus on building the product in order to think about creating experiences first. We use established and well-researched industry tools, like journey maps, to support our efforts.

    Imagine you want to build a web application to support tracking climate change in a region for local communities. In order to build a solution that will enhance rather than impede the work of community actors, it is not enough to make a list of desirable features. You can never be sure the list is complete, unless you fully understand the current workflows with their pain points and unfulfilled needs, the benefits the proposed application is expected to generate, and the contexts in which it will be used.

    Understanding the contexts in which applications are being used

    In order to build a successful product, we need to have a deep understanding of the outcomes the product is expected to bring about. And for that to happen, we need to learn about all contexts in which the product will function.

    • Who will be using the product?
    • What level of comfort with technology do its potential users have?
    • Where will people be using it?
    • Will they be using the product exclusively to fulfill the need or will they also be using alternative ways to accomplish the same goals?
    • If the latter, will the two paths be competing with one another or will they be complementary?
    • Does the product’s functionality need to be informed by external inputs or will it be entirely independent?

    Using threat modeling to mitigate risks to users

    Understanding the ecosystem is also a necessary part of threat modeling. Threat modeling requires understanding the physical and political spaces in addition to the digital touchpoints. Even with the best of intentions, data and assets can be co-opted and potentially used for harm. Understanding and planning risk mitigation strategies helps to improve the impact of the final product. These perspectives are critical for delivering the kinds of products that make a lasting impact on lives and have been an important aspect of our social impact work.

    Change happens and this change can save time and resources

    From our experience in conducting these workshops and closely partnering with our clients, it has been rare that the original pitch for a product did not change after having gone through the discovery workshop process. In some cases, we uncovered issues in business processes that needed to be addressed before a product could be useful or successful. In one case, for example, an original product pitch required a complex printing system. As we went through the discovery process, we determined that the necessary human resources required to maintain that printing system would be difficult to implement. We decided to take printing out of the priority feature set and thus saved the client time and money with this early discovery.

    Digital Principle 2 reflects best practice application design

    Investing time and energy into understanding the existing ecosystem is a core digital principle because of the incredible value it adds to the outcome of any project.

    Within the software industry, directly answering user needs by speaking to users and understanding their world is a long-cherished standard. Seeing this same approach promulgated throughout ICT4D via the Digital Principles can help more projects scale up from pilot projects which, in turn, can positively impact more lives meaningfully.

    Discovery workshops are a tool to uncover and dig into holistic and technical questions using industry tools and best practices. However, it is just the beginning of improving products and impact as ecosystems always change.

    Caktus GroupInsights into software development from a quality assurance (QA) pro

    Because quality assurance (QA) is all about creating a seamless application experience across any number of devices, it’s most successful when no one notices it. The craft and expertise behind the work of QA professionals, as a result, can sometimes feel hidden. Charlotte Fouque, Caktus’ QA Analyst, sheds light onto what exactly quality assurance is and the intricacies of doing it well.

    How did you get into QA?

    I came to QA because I speak French. I was doing language quality assurance testing, testing translations in social games. So I continued into software QA after that. I am very organized outside of work. I use Kanban in my personal life. It’s part of a natural impetus to create order out of chaos.

    What is quality assurance?

    QA for a web project means testing across selected devices and browsers to make sure the site works as intended. We hold the development team to their own definition of quality. Whatever they or the organization has set as their quality threshold, QA ensures that is being met.

    QA also serves as a user advocate. We have to think about whether the application feels good, feels enjoyable for users. Things don’t always line up between design intentions and technical implementation, and we have to be able to call out anything that doesn’t make sense. QA will make sure that the user is getting the best experience possible.

    When does QA play a role in software development?

    At Caktus, QA is involved from the beginning. We’re involved in estimates before a contract is signed. And we’re involved at every step of a project, from the very first sprint through development and release.

    What’s a typical day of QA for you?

    In a typical day, I attend all the scrum meetings in the morning for the teams I’m working with. I have a testing plan that adds on only whatever stories or group of tasks the teams are working on in that sprint. I write the test cases, or reproducible steps to test a feature, in the morning. A test case, for example, could be that the close button on the popup needs to highlight when a mouse hovers over it. This test case would go into a test matrix that contains intersections of all devices and browsers. There’s a pass/fail for each test matrix field, or each device against each browser. Then in the afternoon, I usually follow the testing plan: cross-browser, cross-device testing.

    But no two days are ever the same in an Agile environment, and it’s never boring!

    How do you highlight potential bugs and issues for the development team?

    If I find an issue, I will submit a ticket that contains all the relevant information for the developer to reproduce and debug it. I try to be as concise as possible (no developer wants to have to read paragraphs of text to figure out what the problem is!). Basically, I try to be as helpful as I can be in answering any questions they have or helping them track down where the issue is occurring. I want the bug fixed just as much as they want to fix it.

    I sometimes sit down with the team, or we bounce ideas off each other on how to make a feature better. In my role, I put myself in the user’s shoes in a way that’s hard for a developer that’s too close to the project. I talk to development teams about bugs from the user’s perspective rather than a developer’s. If a user came across this or that bug, would they think there’s something obviously wrong? How would a user behave to get around the issue? Would a user consider it a workflow blocker and leave the site? Sometimes developers don’t see this because they’re deep in the code; they might not have the distance to consider how the user is navigating the site or getting to that particular bug.

    What makes someone a good quality assurance professional?

    A good tester is able to pay attention to details and willing to drill down. If I see something wrong in something I’m testing, I have to pursue it, try to find steps to reproduce the issue, check to see if it’s happening across devices and browsers, and then follow up with the developers. It requires a lot of patience.

    We have to be able to find various small disparate issues but also larger overarching problems. For example, I’ll often find little bugs like links not highlighting, but I also find bigger issues with user flow or ways to navigate the website. Someone that’s doing QA has to be able to find all sorts of issues, from the large and small to difficult-to-define problems. It takes a good QA person to not only find that range of problems, but to also be able to articulate them in precise, intelligible bug reports.

    There are also different types of testing, and experienced QA people will have their own style. I like to test methodically, ensuring I hit everything I set in the test plan. Some people are better at exploratory testing, like clicking on things randomly; unusual user behaviors might uncover the software behaving in unexpected ways.

    What’s the best part of QAing?

    I find it really satisfying to pick apart features or entire websites, find little issues, follow up on them, and see them resolved quickly. It’s very rewarding to see a project come together, getting better and better every day.

    What’s the most challenging part of QAing?

    The most difficult part is that my imperative is to find bugs and get them fixed. But nothing is perfect - there will always be unresolved bugs that make it to release, and coming to terms with that is really hard! At what point is it good enough before you can say “ship it”? It’s important to have open lines of communication with the project manager or product owner so that they are aware of the existing issues and can prioritize them along with the rest of the development work.

    What is your favorite QA tool?

    The most recent one that I found has been really revolutionary for me: Ghostlab. Ghostlab does synchronized browser testing so I can look at the same application and interactions on as many browser windows and devices as I can see in front of me at the same time. It saves me tons of time and I really love it!

    If someone were starting a QA process from scratch and they were part of an agile company, what advice would you give them?

    I would say that the best way to approach that situation is to try to integrate the QA process with the existing development process. For example, at Caktus, we use scrum. QA actively participates in scrum as part of the development team. This is absolutely key. Developers would usually already have a verification and review step where they review each other’s code. QA comes in the step afterwards. They are part of the workflow and part of the success criteria for each task. Every task that goes through QA upholds the acceptance criteria or Definition of Done for the team.

    A good QA person has to be able to work within the team along with the developers. QA should not work in opposition to developers, but as part of development. We can’t just throw issues over the wall to developers and call it not our problem anymore - it’s essential to remember that our common goal is to build the best product possible.

    Caktus GroupWhat’s “User Experience” and Why It Matters

    Caktus recently welcomed UX designer Basia Coulter to our team. We sat down with her to discuss her perspective on user experience design. Basia, like many in tech, came to her role through a nonlinear path. She first earned a PhD in neurobiology while in Poland, and then came to the United States for a postdoctoral fellowship. The experience led to soul searching, including seven years in a Tibetan monastery in upstate New York where, along with her spiritual interests, she pursued a passion for design, particularly web design. She subsequently devoted herself to learning more about digital communication. Basia has been in the North Carolina area for 2.5 years and currently is part of the leadership team of the local chapter of Girl Develop It and a member of local organizations such as TriangleUXPA, Code for Durham, and AIGA Raleigh.

    Let’s start simple. What is user experience?

    The person who coined the term “user experience” or UX is Don Norman, a cognitive psychologist and a co-founder of Nielsen Norman Group, currently Director of Design Lab at the University of California - San Diego, who used to work for Apple. For him, user experience includes all aspects of end user interaction with the company, the company’s services, and the company’s products. It’s a very broad range of interactions and includes areas like marketing, customer service, product, and, really anything.

    What we have come to mean by user experience or “UX” colloquially is probably a narrower definition than that of Don Norman’s. We’re usually referencing some specific system such as an application or a piece of software. When we talk about UX, we think about how a user feels when they interact with that system. What experience do they have?

    The goal of good user experience design is to design and build products that are easy to use, that are a solution to an existing problem, and not a cause of frustration or source of more problems.

    What is the benefit of focusing on user experience to businesses and organizations?

    UX professionals help organizations understand their users and guide teams in employing best practices to build products that solve users’ problems. By designing experiences specifically around users’ needs, we improve customer satisfaction and, by extension, increase ROI and drive profit or app adoption.

    When businesses include UX research in the process of product design and development, they ensure they build solutions that target their particular user segment and they invest in solutions that address the specific pain points of their users. Having a UX designer on the team that builds your product means there is a dedicated person whose job it is to advocate for best user experience every step of the way. Great user experience means happy users; happy users translate to satisfied customers, and satisfied customers become loyal customers.

    You’re a UX professional that’s worked in many contexts and types of organizations. What is the UX professional’s role in application development at Caktus?

    One of the most exciting aspects of UX work is that it involves a variety of skills, and my role often depends on the project. At Caktus, I can support projects at the onset, even before a product is well understood; while the product is being defined; then while it’s being designed and developed; and finally toward the end, once it has been built and is undergoing testing.

    Before you can build a solution that addresses users’ needs, you have to understand those needs, you have to identify users’ pain points, and that’s why user research is so important. I recently attended a UXPin webinar during which the speaker, Satyam Kantamneni of UXReactor, said, “Any time you have a user, you’ve got to do user research.” I strongly agree with that statement.

    So my job could be leading a discovery workshop or a meeting where all stakeholders come to the table to brainstorm in the early stages of defining the product-to-be to understand the problem at hand and to uncover possible solutions. It could be doing user research by conducting user surveys or interviews. Or I could be doing a UX review of an existing application to determine whether or not it complies with best practices of usability and user experience design.

    Can you give us a brief overview of principles of good user experience design? We’ll have you dive deeper in a subsequent blog post.

    I think it is important to understand that user experience arises from many disciplines coming together. Those disciplines include information architecture, product strategy, content strategy, user research, visual design, interaction design, software development, usability, accessibility, cognitive psychology, and probably more.

    So when we talk about principles of good user experience, we’d have to talk about best practices within all of those domains. For the purpose of this conversation, we could talk about a few basic principles that help build good experience for an interface. I would say that anything that helps decrease the amount of mental processing, so-called “cognitive load”, that the user needs to do to be successful, and anything that guides the user in accomplishing a task within an application constitutes a principle of good user experience. That would include, among many other principles, consistency of visual design and interactions, solid and consistent content structure, and presence of affordances and feedback.

    To put it simply, people are more likely to engage with content that follows principles of good user experience than with content that does not. So if you want to retain visitors on your website or if you want to see more people subscribe to your application, you cannot afford to ignore those principles.

    You hold a PhD in neurobiology. What's the link you see between how the brain works and UX?

    Understanding sensory perception is very handy in UX design. Take a couple of aspects of visual perception. One, perception of color is contextual—the same color may be perceived differently in different contexts, for example against different backgrounds. Two, our brain cannot process all the information it is bombarded with at any given moment, so it weeds out what it renders irrelevant. The brain also fills in the gaps where information is missing and creates images of what we see as a representation, not a replica of the object of perception. In other words, we think we see what’s out there in the world around us, but in fact we see our brain’s constructs.

    So when we are designing, for example, an interface for users who primarily rely on vision, we need to keep in mind that they will not be processing every single element of that interface in order to make sense of it. Instead their brain will be constructing a representation of the interface based on the elements that get the most attention or are assessed as relevant. It is yet a different matter if we are designing an interface for users with vision impairments.

    Another topic very relevant to UX design is decision making. The lab where I did my postdoctoral studies was involved in investigating human emotions and decision making. It was there that I first encountered ideas around how emotional responses impact our decision making. I got a first glimpse of the notion that our rational mind is not the only and perhaps not even the primary actor in our day-to-day decision making. There are also studies showing, for example, that users make fast, snap judgement decisions about a website’s trustworthiness based on its aesthetics. Those are very important pieces of knowledge to keep in mind when designing an interface for users whose decisions and choices will be guided by that interface.

    Basia, you personally have had exposure to a wide set of cultural perspectives having come to the United States as an adult. What role does cultural perspective play in UX?

    It is critical. Take text as an example. If it’s written from left to right, our eyes will track it a different way than if it’s written top to bottom. Another example is color. We rely on color to convey meaning, but different colors have different meanings in various cultures. In western cultures, for example, white is associated with innocence and is often used in design for weddings. In other cultures, white signals death and mourning. So these are totally different connotations. If we use color to convey meaning, we will need to be mindful of different cultures to create the right experience for the user.

    There are also generational differences within the same culture that have an impact on user experience. You could think about those as subcultural differences. For instance, there is a trend in web design that’s called flat design, in which interface elements look flat and no visual techniques are used to give them a three-dimensional appearance. This trend has become controversial in the UX community; some UX professionals feel strongly that stripping interactive elements such as buttons off of their three-dimensionality removes important affordances and compromises usability. And in fact usability tests have shown that a lot of people have a hard time recognizing flat buttons as buttons. However, it turns out that Millennials and younger users do not have as much trouble with flat design as older users do. So if you’re designing an application for a younger audience, you might not have to worry so much about compromising the usability of your application by using flat design, but if you’re designing for an older generation, you should consider your flat design choices carefully.

    Caktus GroupPostgres Present and Future (PyCon 2016 Must-See Talk: 6/6)

    Part six of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

    Coming from a heavy database admin background, I found Craig Kerstiens’s “Postgres Present and Future” to be incredibly well organized and engaging. Of particular interest to me, because I am somewhat new to Postgres (having more background with MS SQL), was the deep dive into indexes in Postgres.

    Check out 5:44-8:39 to find out when to use different types of indexes, outside of the standard B-Tree. For instance, GIN indexes are helpful when searching for multiple values within a single column, i.e., an array field or a JSONB field.
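
    As a rough illustration (the table and column names here are made up, and psycopg2 is just one way to run the SQL), creating a GIN index on a JSONB column might look like this:

        # Hypothetical table/column names. GIN suits columns queried by
        # containment, e.g. WHERE attributes @> '{"color": "red"}'.
        import psycopg2

        conn = psycopg2.connect("dbname=example")  # assumed connection string
        with conn, conn.cursor() as cur:
            cur.execute(
                "CREATE INDEX IF NOT EXISTS items_attributes_gin "
                "ON items USING GIN (attributes);"
            )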

    Click over to 17:22-19:33 to learn about the new Bloom Filter in Postgres 9.6, which is coming out in a few months. This extension seems like it will be incredibly useful to speed up queries on wide tables with a bunch of different options.


    More in the annual PyCon Must-See Talks Series.

    Joe GregorioInertial Balance

    What to do when it's late at night and your high schooler says his table wasn't able to complete their physics lab today because they were missing equipment, and the teacher said, maybe half jokingly, that they could complete the lab at home if they didn't finish it in class? That's right, you build experimental equipment in your garage.

    This is the Inertial Balance we built from scratch using two hacksaw blades. It took about 20 minutes to build and then another 10 to actually run the experiment.

    I hope we don't have to "junkyard wars" all of his labs, but this was fun and quick to build.

    Joe GregorioGOP Climate Change Denial Timeline

    Building on The Republican race: Five degrees of climate denial, extended to the full seven stages:

    Stage 1: Denial
    Pre 2010 - The climate is not changing.
    Stage 2: Ignorance
    2010 - The climate might be changing, or it might not, we just don't know.
    Stage 3: GAIA Bashing
    2014 - Climate change is real, but it’s natural.
    Stage 4: We so tiny
    2016 - Climate change is real, but humans aren't the primary cause.
    Stage 5: We so poor
    2018 - OK, humans are the primary cause, but we can't afford to do anything about it.
    Stage 6: Acceptance
    2020 - This is awful, why didn't you tell us it would be this bad!?!?
    Stage 7: Revert to Form
    2024 - We would have fixed the climate if it wasn't for Obama.

    Caktus GroupWhat We’re Clicking - August Link Roundup

    Every month we collect the links we’ve shared on social media or amongst ourselves that have captured our interest. Here are our favorites from the past 30 days.

    Write an Excellent Programming Blog (TalkPython)

    One of the best ways to contribute to open source is by sharing knowledge. A. Jesse Jiryu Davis, a frequent speaker on this topic, shares his thoughts on writing excellent blog posts in this TalkPython podcast.

    Deploying Django + Python 3 + PostgreSQL to AWS Elastic Beanstalk (Real Python)

    We’ve been exploring this very same topic at Caktus. Here’s the blog description: “The following is a soup to nuts walkthrough of how to setup and deploy a Django application, powered by Python 3, and PostgreSQL to Amazon Web Services (AWS) all while remaining sane.”

    APIs: A Bridge Between Mobile Operators and Startups in Africa (Venture Capital for Africa)

    “In emerging markets, where mobile operators are the main enablers of the digital economy, operator APIs are a powerful channel for unlocking creativity and giving the startup ecosystem a boost. Every time an operator opens a new set of APIs, it creates a powerful cycle of innovation as startups can combine several APIs to create new services.”

    Breaking out of two loops (nedbatchelder.com)

    “A common question is, how do I break out of two nested loops at once? For example, how can I examine pairs of characters in a string, stopping when I find an equal pair?... make the double loop into a single loop, and then just use a simple break.”
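
    For example, one way to apply that idea to the equal-pair problem (a quick sketch of ours, not code from the article) is to loop over index pairs directly:

        # Iterate over all (i, j) index pairs with one loop, so a single
        # break ends the whole search as soon as an equal pair is found.
        import itertools

        s = "abcdeedcba"
        for i, j in itertools.product(range(len(s)), repeat=2):
            if i < j and s[i] == s[j]:
                print("equal pair at", i, j)
                break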

    Death of a survey (DevEx)

    In this article, there’s a discussion on how humanitarian organizations are now inundated with data. The most critical point is knowing what question to ask about the data: “What information could help both my organization and our partners do our work better?”

    Caktus GroupPython Nordeste 2016 Recap

    Image via @pythonnordeste #pyselfie

    I don’t know anyone there. I don’t know the language. What about this Zika virus? What about this political unrest? These were some of the doubts and fears racing through my mind at the start of my trip. I had barely settled back home from PyCon US when it was time to start making the trip to Python Nordeste. It’s a long set of flights to Teresina in the northeast region of Brazil, and I was alone.

    Those doubts vanished in almost an instant as I came off the plane to find many of the conference organizers there to greet me. From that moment I found that the Python community I’ve come to know and love so well, the community which has always been so warm and welcoming, is alive and well in Brazil.

    I had been asked months before to come and deliver a keynote at Python Nordeste 2016, which was now in its fourth year. The organizers explained that their mission was to have a conference catering to the poorer northeast region of Brazil, where many cannot afford to attend the larger PyCon Brazil, much less conferences outside of the country. I care about the diversity of voices in the Python community and about reaching out to underrepresented groups, so this mission spoke to me. I worked with them and with people at Caktus to make it possible to attend and speak.

    I thought about what I wanted to speak about for a long time. I didn’t know what they wanted or expected from me when they had asked me to speak. Should I give a technical talk on Python or Django? Should I try to inspire or motivate? Finally I settled on a single goal for my talk: to give a talk which only I could give. I wanted to share a piece of myself and my experience as a developer. The more I thought about it, the more I kept coming back to the idea of how my love of running has shaped my approach to development.

    The conference itself was three days. The first day was a set of tutorials followed by two days of talks. My keynote was set for just after lunch on the first day of talks. Since the tutorials were in Portuguese, I decided to take that first day to explore the city as well as continue to prepare my talk for the next day. I explored a local market full of artisan sellers. I walked up one of the main streets and saw the local shops, restaurants, and people going to work and school. It was a hot and humid walk. The feeling was similar to our summers in North Carolina, but it was winter there. The buildings were short, only a few stories tall, with many open-air spaces, likely due to the warm climate.

    This was my first experience with a single-track conference format, and overall I liked that everyone saw the same content. It also allowed questions to run a little bit over, which they did several times during the course of the two days. While the talks themselves, other than mine, were in Portuguese, I was able to follow slides with code samples or project demonstrations.

    When it came time for my talk, I was nervous but ready. I talked about my love of running, sharing pictures from various races. I talked about how getting better as a runner requires a long view. Progress is slow and it comes from consistency over time rather than big efforts all at once. That’s been my approach to improving as a developer as well. It’s based on steady and focused improvement. My running is focused on being better than myself rather than judging my success versus the ability or accomplishments of others, and I bring that same mentality to working on my programming skills. Unfortunately, my talk could not be translated into Portuguese live due to the cost, and I’m sure that left some attendees feeling excluded. Still, I received some positive reactions to my talk. One of the organizers, Francisco Fernandes, in particular shared how my talk related to his experience in graduate school and how it had touched him. In the end I felt I had met my goal and delivered a talk that was genuinely me, and that made the long trip worth it.

    As luck would have it, the Olympic torch was being carried through the city that night, and I had the opportunity to go see it and run alongside it for a brief period. I grew up swimming and always loved watching the summer Olympics. I never dreamed that I would come so close to the torch. Francisco had a connection with one of the torch bearers and brought it to the conference for the second day of talks, allowing people to have their picture taken with it. It felt like a once-in-a-lifetime experience.

    Mark with the Olympic torch

    The second day featured a number of talks which stirred a large amount of debate from the audience. Questions pushed well past the original time and the schedule fell behind. Everyone seemed comfortable with adjusting as needed. While the final lightning talks were eventually lost due to the additional time, there was a great effort to have a round-table discussion about the state of women in technology in their community. There were roughly a dozen women in attendance, less than 10%, and no women speakers. The organizers gave those who wanted to an opportunity to speak about their experiences, and some of the other attendees responded with questions or their own experiences. I was thankful to have people in the audience willing to translate for me so that I could keep up with the conversation. I hope this leads to more inclusion efforts in their community. After the conference ended I had the chance to visit a local hackerspace before continuing to a post-conference celebration.

    My first trip to Brazil was an absolutely amazing experience. There were times when my face hurt from so much smiling. The food was as amazing as the people. I'm so thankful to have had the opportunity to attend, and thankful to the organizers for their invitation and warm welcome. I’ve always enjoyed meeting people at PyCon US who use Python in places and ways different from my daily use, and Python Nordeste gave me a glimpse into a world I’d otherwise never have seen. I left feeling more excited and passionate about this community, wanting to share more and reach more people who love Python as I do.

    Philip SemanchukCreating PDF Documents Using LibreOffice and Python, Part 2

    This is part 2 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

    Here in part 2, I compare and contrast the two approaches I outlined in part 1 — the obvious approach of using ReportLab, and the LibreOffice approach that I think is underappreciated. Both approaches can be good in the right situation, but neither is better than the other all the time. In some cases, the difference is dramatic.

    Without further ado, here are the 10 categories in which I want to compare these two, and how each approach stacks up. (The compare/contrast portion of my PyOhio talk starts at the 17-minute mark.)

    1. Cross-Platform?

    Both ReportLab and the LibreOffice technique run on Windows, Linux, OS X, and BSD. I haven’t researched mobile operating systems like iOS and Android, but you’re not likely to want to construct PDFs on a mobile device.

    2. Python 2/3 Support?

    Both approaches can be used with Python 2 and 3.

    3. FOSS?

    ReportLab is under a BSD License.

    LibreOffice is under the MPL v2.0, which is a BSD/GPL hybrid. However, the details don’t matter much since you’re not going to use the source code anyway.

    4. Repairability?

    By repairability, I’m referring to the ease with which you can fix things that don’t behave the way you want them to.

    ReportLab scores very well here, because it’s pure Python and its BSD license gives you a lot of flexibility. You can read, debug, patch, and copy the code. When debugging, you can step directly from your code into ReportLab code. If you patch ReportLab, it’s easy to roll out a patched version to your servers using pip.

    LibreOffice, on the other hand, is a large office suite written in C++ (and maybe Java?). It’s orders of magnitude more complicated. Think of it as an unrepairable black box.

    5. Power?

    ReportLab includes lots of cool stuff out of the box, like bar, pie, line, and other kinds of charts, a table of contents generator, and probably lots of other things I don’t know about.

    It’s also extensible, so if you want something it doesn’t have (like a list-of-figures generator) you can write it or search online to see if someone else has already done it.

    LibreOffice has even more to offer, though. It’s an entire office suite, after all! Not only does it handle all of the normal text document things (like headings, foot/endnotes, autonumbered lists, etc.), it also lets you do more sophisticated things like embedding spreadsheets in documents. 1001 Creative Ways to Use an Office Suite could be a blog post all its own (or 1001 of them!).

    6. Scalability?

    ReportLab is just Python, so you can run multiple concurrent threads or processes just as with any other Python code.

    Unfortunately, LibreOffice does not scale. It’s not possible to run multiple LibreOffice processes simultaneously on one machine. For probably 99.99% of users, this isn’t a concern, but it can be a problem for automation. It means you have to be willing to create your PDFs synchronously.

    7. Speed?

    Warning: Guess approaching!

    My hunch is that ReportLab is faster, maybe by a lot. But that’s backed by no data whatsoever. Benchmarking would be time-consuming. It would require inventing a variety of relatively complex PDFs and generating them using both methods. And that still might not tell you much about your use case.

    In the grand tradition of arguing on the Internet, I’m not going to let my ignorance or lack of data keep me from having an opinion. But understand that it’s a guess, and take it with a huge grain of salt.

    8. Experimentation?

    You’re probably not producing this PDF for yourself, but for someone else. That someone might be an immediate co-worker, another department in your company, or a customer that’s in a completely different company. Experimenting with the output PDF is an important part of the process because it usually takes many tweaks to get the PDF to look the way your client wants.

    Just as with developing software, the end result will be a moving target as ideas evolve. And also like software development, you want your tools and process to add as little friction as possible to the evolution.

    With ReportLab, experimentation can be time-consuming. If you have a complex PDF, you’ll have a non-trivial amount of Python/ReportLab code to generate it. As code gets more complex, it gets harder to change. That’s not specific to ReportLab, it’s just a general software development principle. So when your client wants to change, say, how the page footer is formatted, or how figures are numbered, or the document font, the usual difficulties of maintaining code apply.

    With LibreOffice, changing the document is extremely easy because you’re using a tool built expressly for that purpose. It’s straightforward, and you can immediately see the results of your changes.

    9. Complexity?

    By complexity, I’m referring to the complexity of one’s code relative to the complexity of the PDF you’re trying to create.

    With ReportLab, the relationship is roughly linear. If you have a complex PDF, you’ll have correspondingly complex Python code to create it.

    With LibreOffice, the relationship is non-linear. Deleting and duplicating XML elements and changing text are easy. Creating new elements is difficult. For instance, our trivial PDF example contained two paragraphs and a table. As I demonstrate in part 1 and in my talk, it’s easy to add, delete, and change table rows, but if you asked me to add an image to that document, I would be stuck because there’s no image in the XML for me to copy.

    Obviously, I could add an image to the document and then see how that’s expressed in the XML, but that only works if I know in advance that I’m going to need an image.
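
    To make the easy cases concrete, here’s a rough sketch of duplicating a table row by editing an ODT file’s content.xml directly (the file name is hypothetical, and this isn’t the exact code from part 1):

        # Rough sketch only: the template file and its table layout are assumed.
        import copy
        import zipfile
        import xml.etree.ElementTree as ET

        TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

        with zipfile.ZipFile("template.odt") as odt:
            root = ET.fromstring(odt.read("content.xml"))

        table = root.find(f".//{{{TABLE_NS}}}table")    # first table in the doc
        rows = table.findall(f"{{{TABLE_NS}}}table-row")
        table.append(copy.deepcopy(rows[-1]))           # duplicate the last row
        # Writing the modified content.xml back into a new .odt is just the
        # reverse of the unzip step above.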

    10. Strengths

    ReportLab is a safe choice. It does one thing and it does it well. The fact that it’s extensible means you can always get it to do what you want (although you might have to write more code than you planned). It’s the well-traveled path, so you’ll be able to find fellow travelers (and their tutorials and advice). It can handle extremely varied output.

    Using the LibreOffice method is best when there’s a high ratio of static to dynamic content. Think about the extreme example of a 900-page PDF in which there’s only one paragraph of dynamic content. You would have to write very little code to populate that one paragraph, whereas with ReportLab you’d have to write code to generate all 900 pages, even though they never change.

    The LibreOffice method requires less code — maybe a lot less, depending on your situation. The tradeoff is that you have to do more document construction work, but to me that’s still a win for two reasons. First, you get to use a tool built expressly for that purpose. Second, it’s easier (and cheaper) to find LibreOffice/document editing skills than Python/software development skills. Your client might even be able to build most of the document, which will save them money and give them control over and investment in the outcome. That makes for a happy client.

    What’s Next

    In my next post in this series, I’ll discuss some of the questions asked at my PyOhio talk, and in the fourth and final post I’ll present some useful code snippets. Stay tuned!

     

    Caktus GroupBake the Cookies (PyCon 2016 Must-See Talk: 5/6)

    Part five of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

    One of the talks that had the most profound impact on me at PyCon was Adrienne Lowe’s talk, “Bake The Cookies, Wear the Dress: Connecting with Confident Authenticity”. It was really impressive to see a woman who is relatively new to coding talk about being herself and not being swayed by advice to appear "less feminine." Another really important point was that effective mentors need to model their slogging and struggle as well as their success. It's impossible for learners to emulate someone who appears to just magically "get" things. She used helpful metaphors from another passion of hers, cooking, that illustrated her points very clearly. Adrienne was forthcoming with her own personal challenges, which was brave and will be helpful to anyone listening to her talk who is experiencing similar challenges or who is in a position to mentor someone through those challenges.


    More in the annual PyCon Must-See Talks Series.

    Caktus GroupTrainspotting: Real-Time Detection (PyCon 2016 Must-See Talk: 4/6)

    Part four of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

    To see a real-life use of a Raspberry Pi with a GoPro, watch Data Scientist Chloe Mawer’s “Trainspotting: real-time detection of a train’s passing from video”. Mawer focuses on Caltrain in this video. Caltrain is a train for commuters traveling between Palo Alto and San Francisco, used by more than 18 million commuters in California. The train's schedule is unpredictable and there is a lack of trustworthy data on the train's status.

    Chloe Mawer, a Stanford PhD, designed an algorithm that uses OpenCV in Python to track the train's timing from video. Mawer talked through each facet of the OpenCV algorithm and how to read a video taken with a camera attached to a Raspberry Pi. It was incredibly interesting, especially because of my interest in public transit and public data. The slides are available on her GitHub.


    More in the annual PyCon Must-See Talks Series.

    Philip Semanchuk♡s to PyOhio

    • To conference volunteers too numerous to mention
    • To Jason, Eric, and Jan for their hospitality which helped me to feel at home away from home
    • To Oscar the AirBnB cat for headbutting me affectionately and repeatedly in the face at 5:45 AM, but only on the morning for which I had my alarm set for 6:15. (He let me sleep the other mornings.)

    I hope to see y’all at PyData Carolinas 2016!

    Oscar the cat at PyOhio

    Caktus GroupHow I Built a Power Debugger (PyCon 2016 Must-See Talk: 3/6)

    Part three of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

    While at PyCon 2016, I really enjoyed Doug Hellmann’s talk, “How I built a power debugger out of the standard library and things I found on the internet” (video below). It's listed as a novice talk, but anyone can learn from it. Doug talked about the process of creating this project more than the project itself. He talked about his original idea, his motivations, and how he worked in pieces towards his goal. His approach and attitude were refreshing, including talking about places where he struggled and how long the process took. It was a beautiful glimpse into the mind of a very smart, creative, and humble developer.


    More in the annual PyCon Must-See Talks Series.

    Tim HopperPhotos Featured on Smithsonian Magazine

    A few weeks ago, I introduced my wife to backpacking in the beautiful Grayson Highlands State Park in southwestern Virginia. Part of my reason for picking this location was to see the herd of wild ponies that live at 5,000' on the grassy balds.

    I shared some of my best pictures from the trip on Flickr under a Creative Commons license (CC BY-NC-ND 2.0). On Saturday, I stumbled across an article about the Grayson Highlands ponies on the Smithsonian Magazine website. I was pleasantly surprised to see they selected two of my images for the story! I've been spending more time lately exploring my longtime interest in wildlife photography, and I'm thrilled to see others sharing my work.

    You can find more of my photography on Flickr or Instagram.

    Wild Ponies of Grayson Highlands

    Untitled

    Caktus GroupShipIt Day Recap - July 2016

    We finished up last week with another successful ShipIt Day. ShipIt Days are quarterly events where we put down client work for a little bit and focus on learning, stretching ourselves, and sharing. Everyone chooses to work together or individually on an itch or a project that has been in the back of their mind for the last few months. This time, we stretched ourselves by trying out new frameworks, languages, and pluggable apps. Here are some of the projects we worked on during ShipIt Day:

    TinyPNG Image Optimization in Django

    Kia and Dmitriy started on django_tinypng. This project creates a new OptimizedImageField in Django which uses the tinify client for the TinyPNG service to compress PNG files. This means that files uploaded by users can be reduced in size by up to 70% without perceivable differences in image quality. Reducing image sizes can free up disk space on servers and improve page load speeds, significantly improving user experiences.
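
    As a sketch of possible usage (the import path is an assumption based on the project name, so check the django_tinypng README for the real one), the field drops into a model like any other Django field:

        # Hypothetical model; the import path below is assumed, not confirmed.
        from django.db import models
        from django_tinypng.fields import OptimizedImageField

        class Photo(models.Model):
            title = models.CharField(max_length=200)
            # Uploaded PNGs are run through the tinify client and stored compressed.
            image = OptimizedImageField(upload_to="photos/")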

    Maintaining Clean Python Dependencies / Requirements.txts

    Rebecca Muraya researched how we, as developers, can consistently manage our requirements files. In particular, she was looking for a way to handle second-level (and below) dependencies -- should these be explicitly pinned, or not? She found the pip-tools package as a possible solution and presented it to the group. Rebecca described pip-tools as a requirements file compiler which gives you the flexibility to describe your requirements at the level that makes sense to your development team, while keeping them consistently managed across development, testing, and production environments. Rebecca presented ideas for integrating pip-tools into our standard development workflows.

    Elm

    Neil and Dan each independently decided to build projects using Elm, a functional language for frontend programming. They were excited to demonstrate how they temporarily rearranged their concept of development to focus on state and state changes in data structures, and then on how those state changes would be drawn on the screen dynamically. Dan mentioned missing HTML templates, the norm in languages where not everything is a function, but loved that Elm's strict type system forced programmers to handle all cases (unlike Python). Neil dug not only into Elm on the frontend, but also into Yesod, a functional backend framework, and the Haskell language. Neil built a chat app using WebSockets and Yesod channels.

    Firebase + React = Bill Tracking

    Hunter built a bill tracking project using Google’s Firebase database and the React frontend framework. Hunter walked us through his change in thought process from writing code as a workflow to writing code that changes state and code that updates the drawing of the state. It was great to see the Firebase development tools and learn a bit more about React.

    Open Data Policing Database Planning

    Rebecca Conley worked on learning some new things about database routing and some of the statistics that go into the Open Data Policing project. She also engaged Caelan, Calvin’s son who was in the office during the day, to build a demonstration of what she had been working on.

    Mozilla’s DXR Python Parser Contributions

    DXR is a web-based code indexing and searching tool built by Mozilla. For his project, Jeff Bradberry decided to create a pull request contribution to the project that improves Python code indexing. Specifically, he used Python’s own Abstract Syntax Tree (AST), a way to introspect Python code and treat it as structured data to be analyzed. Jeff’s contribution improves the analysis of nested calls like a(b(c())) and chained calls like a().b().c().
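
    As a toy illustration of the kind of analysis the AST enables (this is not Jeff’s DXR code), walking the tree exposes both nested and chained calls as plain data:

        # Toy example only; DXR's real indexer is considerably more involved.
        import ast

        tree = ast.parse("a(b(c()))\na().b().c()\n")
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name):
                    print("call to name:", func.id)         # a, b, c
                elif isinstance(func, ast.Attribute):
                    print("call to attribute:", func.attr)  # .b(), .c()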

    Hatrack: We all wear lots of hats, switch contexts easily

    Rather than working on something completely new, Calvin decided to package up and share a project he has been working on off-and-on in his free time called Hatrack. Hatrack attempts to solve a problem that web developers frequently face: changing projects regularly means switching development environments and running lots of local development web servers. Hatrack notices which projects you try to load up in your browser and starts the development server automatically. For his project, Calvin put Hatrack up on NPM and shared it with the world. You can also check out the Hatrack source code on GitHub.

    Software Testing Certification

    Sometimes ShipIt Day can be a chance to read or review documentation. Charlotte went this route and reviewed the requirements for the International Software Testing Qualifications Board (ISTQB)’s certification programs. Charlotte narrowed in on a relevant certification and began reviewing the study materials. She came back to the group and walked us through some QA best practices, including ISTQB’s seven principles of software testing.

    Cross Functional Depth & Breadth

    Sarah began work to visualize project teams’ cross-functional specialties with an eye towards finding skill gaps. She built out a sample questionnaire for the teams and a method of visualizing the skill ranges in specific areas on a team. This could be used in the future when team members move between teams and for long-term planning.

    Demographic Data

    Colin and Alex each separately investigated adding demographic data into existing project data sets using Sunlight Labs’ Python bindings for the Census API. While the Census dataset contains tens of thousands of variables at various geographic resolution levels (states, counties, down to block groups), using the Census API and Sunlight Labs’ bindings made it relatively quick and painless.
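
    As a minimal sketch of what pulling a single variable looks like with those bindings (the API key and variable code are placeholders, and the call signature is from memory, so double-check it against the census package docs):

        # Placeholder key and variable code; verify the call style before relying on it.
        from census import Census

        c = Census("YOUR_API_KEY")
        # Total population (B01003_001E) for every state from the 5-year ACS.
        results = c.acs5.get(("NAME", "B01003_001E"), {"for": "state:*"})
        for row in results[:5]:
            print(row["NAME"], row["B01003_001E"])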

    Caktus GroupWhat We’re Clicking - July Link Roundup

    Here are the external links our team’s been chatting about and sharing on social media since the last roundup.

    Web Service Efficiency at Instagram with Python (Instagram)

    "Instagram currently features the world’s largest deployment of the Django web framework, which is written entirely in Python. We initially chose to use Python because of its reputation for simplicity and practicality, which aligns well with our philosophy of 'do the simple thing first.'"

    Survey: US City Open Data Census

    “Since its launch in 2014, the US City Open Data Census has helped push cities to make their data open and easily accessible online. The US City Open Data Census team is now looking at ways to improve the Census and make sure it's up-to-date with the needs of today's open data community. As someone interested in open data, we'd like to hear from you about how you think we can make the Census even better!”

    A beginners guide to thinking in SQL (Soham Kamani)

    “It’s always easy to remember something which is intuitive, and through this guide, I hope to ease the barrier of entry for SQL newbies, and even for people who have worked with SQL, but want a fresh perspective.”

    The state of containers: 5 things you need to know now (TechBeacon)

    “Docker adoption is up fivefold in one year, according to analyst reports. That's an amazing feat: One year ago, Docker had almost no market share. Now it's running on 6 percent of all hosts, according to a survey of 7,000 companies by Datadog, and that doesn't include hosts running CoreOS and other competing container technologies. “

    Well-Tempered API (K Lars Lohn)

    The Caktus team watched this during our public video lunch. Here’s a description: “I can play 400 year old music, but I can't open a Word document from 1990. Centuries ago, a revolution in music enabled compositions to last for centuries with no bit rot. There are innumerable parallels between music and software, why don't our programs last longer? Software Engineering has some things to learn from the parallel world of music.”

    Creating Your Code Review Checklist (DZone)

    “Learn about the steps of undergoing your rite of passage to review code and how to ask the right questions to make the process easier.”

    Extracting Video Metadata using Lambda and Mediainfo (Amazon)

    “In this post, I walk you through the process of setting up a workflow to extract technical metadata from multimedia files uploaded to S3 using Lambda.”

    Bootstrap 4: A Visual Guide (Bootply)

    “Here is a visual guide that will show you what’s new in Bootstrap 4 as compared to Bootstrap 3.”

    Footnotes