Planet TriZPUG

TriZPUG Events: PyCon 2012

PyCon is the largest annual gathering for the community using and developing the open-source Python programming language. PyCon is organized by the Python community for the community.

TriZPUG Events: TriZPUG January 2012 Meeting: Elementwise/Flask

Nathan Rice of UNC will talk about Elementwise, his vectorized function, method, and operator support for Python iterables and will also give an overview of the micro-framework Flask and the similarities and differences of Flask relative to Django and Pyramid. As always, spontaneous lightning talks of ten minutes or less on other topics are also welcome. Anything you've learned about Python, no matter how trivial, can be a lightning talk. Note: this meeting starts at 6pm as the doors to the building automatically lock at 7pm. Parking is available in the lot beside the building for those who show up early.

Calvin Spealman: A little trick for wide pages

We have wide monitors, and our eyes don't tend to handle wide text very well. This is why newspapers have lots of narrow columns, rather than stretching each story across the entire width of the paper.

Not all websites follow this tip, so drag this to your bookmark toolbar and squeeze the margin in 100px at a time, until you can read more naturally.

>squeeze!<

Yes, I could resize my window, but the same width isn't right for all pages, and most are padded or have sidebars. This is good when you only need some of the pages you have open to be narrower than the rest.

Calvin Spealman: ANN: Django Better Cache 0.5 Released

I am announcing the release of Django Better Cache 0.5 today. This release includes a move to Sphinx as a documentation tool and a new component, the bettercache.objects module, which provides a lightweight ORM-like interface for caching data.

Please read the full, but short, documentation over at Read The Docs for details on the bettercache {% cache %} tag and the bettercache.objects ORM, and have a much easier time with your caching needs.

Here is just a quick example of the new cache models, from the docs:

class User(CacheModel):
    username = Key()
    email = Field()
    full_name = Field()

user = User(
    username = 'bob',
    email = 'bob@hotmail.com',
    full_name = 'Bob T Fredrick',
)
user.save()

...

user = User.get(username='bob')
user.email == 'bob@hotmail.com'
user.full_name == 'Bob T Fredrick'
 

Calvin Spealman: Leaving Google AppEngine

Calvin Spealman
I can use AWS and work on technological engineering issues, or appengine and work on price-ological engineering issues. :-/


Maybe I should have taken myself seriously when I said this. In the past few months, I've barely done any feature or bug work on JournalApp, because it takes all my limited time and energy just to keep myself under quota, even though I'm the only user of the app. I can't keep that up and keep my sanity, and it is honestly an emotionally draining exercise. This is creating a toxin that affects everything I do, so I'm going to take it out of my life.

I don't know if I'll port JournalApp or not. I like it, it has been fun and useful, but I'll probably take project notes in Evernote from here on forward, and on paper again. I miss paper.

Caktus Group: Configuring a Jenkins Slave

We're pretty avid testers here at Caktus, and when one of our Django projects required upgrading to Python 2.7, we also needed to upgrade our Jenkins build environment. Luckily, Jenkins supports distributed builds, which allow a master install to delegate tasks to slave instances. This way we can continue to run our primary build system on Ubuntu 10.04, which defaults to Python 2.6, and delegate tasks to an Ubuntu 11.04 environment running Python 2.7. The setup is fairly easy, but since I didn't find much out there already, I figured I'd write up a quick post outlining what we did.

To start, we'll need a new machine. I set up an Ubuntu 11.04 instance on Linode. Then SSH in, upgrade the packages, and install a Java Runtime Environment:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install default-jre

That's the only package Jenkins needs by default. Next we'll set up a user for Jenkins to SSH as. To do this, we'll add a new user to the system and paste the master's SSH public key into that user's authorized keys file:

$ sudo useradd -m jenkins
$ sudo -u jenkins mkdir /home/jenkins/.ssh
$ sudo -u jenkins vim /home/jenkins/.ssh/authorized_keys2

Now the master Jenkins client can ssh to the slave without a password. Next we need to configure the Jenkins master to connect to the slave. Head over to the Master environment and navigate to "Manage Jenkins" and then "Manage Nodes". Click "New Node" in the sidebar and add a Dumb Slave. On the following page, fill in the following fields:

  • # of executors: 2 (controls the number of concurrent builds)
  • Remote FS root: /home/jenkins
  • Labels: python27 natty
  • Usage: Leave this machine for tied jobs only
  • Launch method: Launch slave agents on Unix machines via SSH. Also fill in the Host field with the address of your slave machine.

Hit save and your Jenkins master should open a connection to your slave machine. To use the new slave machine, update an existing Jenkins job and set the "Restrict where this project can be run" Label Expression to "python27". You'll need to install any project dependencies on the slave for it to build properly, but that's basically it!

Frank Wierzbicki: Jython dev notes part I: The Jython Exposer

One of my New Year's resolutions is to make Jython more friendly to new developers. One way to do that is to write up some notes on bits of Jython that are particularly mysterious to newcomers. I've boldly titled this post "Jython dev notes part I" to push myself to create more than one of these :)

It should be noted that I'm not shying away from making these notes highly technical - but I'm happy to edit them to make them more manageable later. Hopefully if I write enough of these up they can make up the beginnings of an advanced dev guide for Jython.

Recently I was asked how the Jython exposer works. The Jython exposer is the code that exposes the types that are written in Java as Python types in the Jython interpreter. It does this by rewriting the .class files during build time and adding the special methods that are seen at runtime. The augmented .class files end up in the build/exposed/ directory, and are ultimately built into the jython.jar distribution file. You can run the exposer from the Jython source like so:

ant expose


The exposer reads through each of the classes listed in CoreExposed.include and rewrites each of them. The rewritten classes get new methods generated for the Jython runtime to use as Python classes and methods. The exposer finds the code that it needs to operate on by finding Java annotations that have been defined for this purpose. An example is the easiest way to explain:

The string type in Python is called "str" and is implemented in Jython by src/org/python/core/PyString.java -- at the top of PyString.java is:


@ExposedType(name = "str", doc = BuiltinDocs.str_doc)


This tells the exposer to expose PyString.java as the "str" type and take its "__doc__" attribute from BuiltinDocs.str_doc. Note that there currently appears to be a bug where str.__doc__ ends up as None, but the documentation will show up if you run:


>>> help(str)


Further down in the source of PyString.java:


@ExposedNew
static PyObject str_new(PyNewWrapper new_, boolean init, PyType subtype, ...


Exposes the str_new Java method as the Python str __new__ method. (The __new__ method is a special class level method that acts as a str factory).

Next we'll look at some examples of ExposedMethod, which is the most common annotation that the exposer uses. ExposedMethod exposes general methods for Python classes.


@ExposedMethod(doc = BuiltinDocs.str___len___doc)
final int str___len__() {


The above code exposes the __len__ method, which is how the str type will respond to a call to the builtin len() function. Since the object itself is the only argument, the ExposedMethod annotation only needs a doc.


@ExposedMethod(type = MethodType.BINARY, doc = BuiltinDocs.str___eq___doc)
final PyObject str___eq__(PyObject other) {


The above exposes the __eq__ method. MethodType.BINARY indicates that this is a binary method that will take 2 arguments (the object itself and one other). The exposer has special handling for common types like this.


@ExposedMethod(defaults = {"null", "-1"}, doc = BuiltinDocs.str_split_doc)
final PyList str_split(String sep, int maxsplit) {


The above implements the split method of str, which takes 2 optional arguments. The "defaults" attribute assigns default values to the optional arguments.
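From the Python side, the three exposed methods above behave like the familiar str methods. The short snippet below is plain Python (it should behave the same under Jython or CPython) and is only meant to illustrate the correspondence; the comments restate what the annotations above declare rather than anything about the exposer's internals.

```python
s = "a b c"

# len() calls __len__, which Jython backs with str___len__
print(len(s), s.__len__())

# == calls __eq__, a binary method backed by str___eq__
print(s == "a b c", s.__eq__("a b c"))

# split() with no arguments falls back to the exposed defaults:
# sep "null" (None on the Python side) and maxsplit -1 (no limit)
print(s.split() == s.split(None, -1))

# Passing both arguments explicitly overrides the defaults
print(s.split(" ", 1))
```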

There is much more to the exposer, but unfortunately the code is the only real documentation other than this post at this time. The source for the exposer engine is in:

src/org/python/expose/

And the source for the generator and ant task are here:

src/org/python/expose/generate/

Caktus Group: Class-based views in Django 1.3

Django class-based views


Introduction

Django 1.3 added class-based views, but neglected to provide documentation to explain what they were or how to use them. So here's a basic introduction.


Example of a very basic class-based view

Let's start with an example of a very basic class-based view.

urls.py:

...
url(r'^$', MyViewClass.as_view(), name='myview'),
...

views.py:

from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = "index.html"

    def get(self, request, *args, **kwargs):
        context = {}  # compute what you want to pass to the template
        return self.render_to_response(context)

This will render your template index.html with the context you computed and return it as the content of an HttpResponse.


Introduction to class-based views

Now that we've seen the obligatory example, how about some instructions?

  • To create a class-based view, start by creating a class that inherits from django.views.generic.View or one of its subclasses.

  • In your URLconf, specify the view as the name of the new class, plus .as_view():

    url(r'urlpattern', MyViewClass.as_view(), ...)

  • In your class, write a get method that takes as arguments self (as always), request (the HttpRequest), and any other arguments from the request as specified in your URLconf.

  • In your get method, use the same logic you'd have used in an old view, except that you can assume the request method is GET. Return an HttpResponse as usual.

  • If you need to handle POST, write a post method, just like your get method except that you can assume the request method is POST.

  • Any request method that you don't write a handler method for will automatically get back a "method not allowed" response; you don't have to do anything special.

Example:

from django.views.generic import View

class MyViewClass(View):
    # arg1 and keyword are captured by the URLconf pattern;
    # do_something() and do_something_else() stand in for your own logic
    def get(self, request, arg1, keyword=None):
        return do_something()

    def post(self, request, arg1, keyword=None):
        return do_something_else()
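To make the dispatch behavior described in the list above more concrete, here is a simplified, framework-free sketch of the idea. It is not Django's actual code: Request, SketchView, and HelloView are hypothetical stand-ins, and the string return values take the place of real HttpResponse objects.

```python
class Request(object):
    """Hypothetical stand-in for HttpRequest; it only carries the method."""
    def __init__(self, method):
        self.method = method


class SketchView(object):
    http_method_names = ["get", "post", "put", "delete", "head", "options"]

    def dispatch(self, request, *args, **kwargs):
        # Look for a handler named after the lower-cased HTTP method;
        # fall back to "method not allowed" if the class doesn't define one
        name = request.method.lower()
        if name in self.http_method_names:
            handler = getattr(self, name, self.http_method_not_allowed)
        else:
            handler = self.http_method_not_allowed
        return handler(request, *args, **kwargs)

    def http_method_not_allowed(self, request, *args, **kwargs):
        return "405 Method Not Allowed"


class HelloView(SketchView):
    # Only GET is handled; every other method gets the 405 fallback
    def get(self, request, name="world"):
        return "Hello, %s" % name


view = HelloView()
print(view.dispatch(Request("GET")))
print(view.dispatch(Request("GET"), name="bob"))
print(view.dispatch(Request("POST")))
```

The point of the sketch is that a subclass only writes the handlers it cares about, and the base class decides which one to call.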

Handy subclasses of View

Django comes with a number of useful subclasses of View that provide, just by inheriting from them, some of the functionality that often ends up as boilerplate in views. You saw TemplateView being used already. You'll probably want to base your views on TemplateView almost anytime you're generating the content for a response.

Another useful one is RedirectView. This can be used to redirect all requests. Example:

from django.core.urlresolvers import reverse
from django.views.generic import RedirectView

class MyRedirectView(RedirectView):
    url = reverse(...)

That is a complete view, and will return a redirect to url on any GET, POST, or HEAD request.

You can optionally set permanent = False to return a temporary redirect instead of the default permanent redirect, and query_string = True to include any query string from the incoming request on the redirect URL:

from django.core.urlresolvers import reverse
from django.views.generic import RedirectView

class MyRedirectView(RedirectView):
    url = reverse(...)
    permanent = False
    query_string = True

Decorators

Unfortunately, using decorators with class-based views isn't quite as simple as using them with the old method-based views.

Maybe you're used to doing this:

from django.contrib.auth.decorators import login_required
from django.shortcuts import render

@login_required
def myview(request):
    context = ...
    return render(request, 'index.html', context)

With class-based views, you have to decorate the .dispatch() method of the class view, which means you have to override it just to decorate it. And you need to decorate the decorator, because the decorators provided by Django expect to be decorating method-based views, not class-based ones:

from django.contrib.auth.decorators import login_required
from django.shortcuts import render
from django.utils.decorators import method_decorator
from django.views.generic.base import View

class MyViewClass(View):

    def get(self, request, **kwargs):
        context = ...
        return render(request, 'index.html', context)

    @method_decorator(login_required)
    def dispatch(self, *args, **kwargs):
        return super(MyViewClass, self).dispatch(*args, **kwargs)

This is an area of class-based views that could use some improvement.

You could apply the decorator in urls.py without needing so much extra code:

urls.py:

from django.contrib.auth.decorators import login_required
...
    url(r'^$', login_required(MyViewClass.as_view()), name='myview'),
...

but that moves the policy from the view code to the URLconf, which is not where people will be expecting to have to look for it, so I wouldn't recommend it.
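To see why a decorator written for plain view functions needs adapting before it can wrap dispatch(), here is a rough, framework-free sketch of the idea. It is not Django's implementation of method_decorator: method_decorator_sketch, login_required_sketch, and the dict-based request are all hypothetical stand-ins.

```python
import functools


def login_required_sketch(view_func):
    """Stand-in for a function-view decorator: expects request as first argument."""
    def wrapped(request, *args, **kwargs):
        if not request.get("user"):
            return "redirect to login"
        return view_func(request, *args, **kwargs)
    return wrapped


def method_decorator_sketch(decorator):
    """Adapt a function decorator so it can decorate a method."""
    def _dec(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # Bind self first, so the decorator sees a callable whose
            # first argument is the request, just like a plain view function
            bound = functools.partial(method, self)
            return decorator(bound)(*args, **kwargs)
        return wrapper
    return _dec


class DemoView(object):
    @method_decorator_sketch(login_required_sketch)
    def dispatch(self, request):
        return "hello " + request["user"]


print(DemoView().dispatch({"user": "bob"}))
print(DemoView().dispatch({}))
```

Without the adapter, the decorator would receive self where it expects the request, which is the mismatch the extra wrapper exists to fix.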


Passing arguments to the view

The method signature for get(), post(), etc. in a view class is:

def get(self, request, *args, **kwargs)

Any unnamed values captured in the URLconf regular expression are passed in args, and any named values are passed in kwargs, just like before.

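You can pass extra arguments to your view using the third element of your URLconf, the same as before, or using a new technique: passing them to the .as_view() call in your URL settings. Note that as_view() only accepts keyword arguments that are already defined as attributes on the class, so extra_arg needs a class-level default. E.g.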

...
    url(r'^$', MyViewClass.as_view(extra_arg=3), name='myview'),
...

One warning: don't accidentally write MyViewClass(extra_arg=3).as_view(). That'll still appear to work, but that extra_arg is just thrown away.


Where's the beef?

So far, all we've done is the same behavior, written using a different syntax. But class-based views enable a whole new level of function.

Suppose you've got a view that displays some data on a web page, and you write it as a class-based view. Maybe something like this:

from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = 'index.html'

    def get(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return self.render_to_response(context)

Now you're asked to provide an HTTP API that returns the same data in json.

Start by refactoring your existing class slightly, moving your business logic out of the get() method:

from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = 'index.html'

    def compute_context(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return context

    def get(self, request, **kwargs):
        return self.render_to_response(self.compute_context(request, **kwargs))

Now, write a new class that subclasses your original class, uses the same method to compute the data, but overrides get() with different rendering code:

import json

from django.http import HttpResponse

class MyJsonViewClass(MyViewClass):
    def get(self, request, **kwargs):
        data = self.compute_context(request, **kwargs)
        # Very naive way to put your data into json, but a good starting place
        content = json.dumps(data)
        return HttpResponse(content, content_type='application/json')

Add a new URL to urls.py pointing to your new class-based view, and you're done. All the logic you worked out earlier is still in use, and the power of subclassing let you provide the data in a new format almost effortlessly.


Class-based views for common policy

The previous example was still something you could have done almost as easily with method-based views, by refactoring your code into separate methods and calling them from all your views.

A more powerful use of the new class-based views is to provide common functionality for many views. If you have a site with many views, and they all inherit from a common view, then you have the potential to change behavior across the site by changing that one view.

Previously, you would probably have used middleware for this kind of thing. The problem with middleware is that it's completely hidden from the view code. When working on your view, you won't even know middleware is affecting things unless you go look at the settings and track down each piece of middleware configured there.

Furthermore, middleware affects every request, not just the views you really wanted it for.

With a common class-based view, every view affected is declared to inherit from that view, making it obvious that we're inheriting behavior from elsewhere. With a good IDE, you can even jump straight to that superclass to inspect it. Any view that doesn't need the common behavior doesn't have to inherit it.


References

The only documentation page that really discussed class-based views in Django 1.3 is this one:

https://docs.djangoproject.com/en/1.3/topics/class-based-views/

Some of the rationale for the current design of class-based views, and pros and cons of some alternatives that were considered, are documented here:

https://code.djangoproject.com/wiki/ClassBasedViews

Beyond that, the best advice I can give is to go read the code. The code for the base View is surprisingly small, and can be found at django/views/generic/base.py.

Caktus Group: OpenBlock Geocoder, Part 3: External Geocoders

The OpenBlock geocoder is powerful and robust. It uses PostGIS for spatial queries, can extract addresses from bodies of text, and can understand block and intersection notation. We've run into a few issues with it, however, including a low geocoding success rate. This is a tough problem to solve and depends on a lot of factors (the extent of street and block data in OpenBlock, format of the street addresses, etc.), so your mileage may vary. Below I construct a simple test using Google's Geocoding API as an alternative.

Disclaimer: This is the third post in our OpenRural series reviewing OpenBlock and its geocoder. You may wish to read Part 1: Data Model and Geocoding and Part 2: Text Parsing and Entity Extraction before proceeding.

Adding news with OpenBlock's geocoder

The Schema and NewsItem models provide OpenBlock with a generic data model to associate news with geographic locations. You can find a fairly extensive introduction in the official documentation, so we won't go into too much detail here.

Since a NewsItem requires a geographic point, let's use the OpenBlock geocoder to find 123 East Franklin Street:

>>> from ebpub.geocoder import SmartGeocoder
>>> geocoder = SmartGeocoder()
>>> location_name = '123 East Franklin Street'
>>> point = geocoder.geocode(location_name)['point']
>>> point.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
You'll notice that point has a wkt attribute. WKT, or well-known text, is a text markup language for representing geometry objects. Here we have a POINT, but the language can represent many geometries, including LineStrings and Polygons.
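To show how little there is to a WKT point, here is a minimal sketch that formats and parses one by hand. The helper names point_to_wkt and parse_wkt_point are made up for this illustration; in practice GeoDjango's geometry objects handle this for you. Note that the order is x then y, which means longitude first, then latitude.

```python
import re

# Matches strings such as "POINT (-79.05 35.91)"
_POINT_RE = re.compile(r"^POINT\s*\(\s*(\S+)\s+(\S+)\s*\)$")


def point_to_wkt(lng, lat):
    """Format a longitude/latitude pair as a WKT POINT (x first, then y)."""
    return "POINT (%s %s)" % (lng, lat)


def parse_wkt_point(wkt):
    """Return (lng, lat) floats from a WKT POINT string."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError("not a WKT POINT: %r" % wkt)
    return float(match.group(1)), float(match.group(2))


lng, lat = parse_wkt_point("POINT (-79.0553588124999891 35.9133110937499964)")
print(lng, lat)
print(point_to_wkt(1.5, 2.5))
```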

We'll use the "Local News" schema in this example as it is pre-loaded in OpenBlock:

>>> from ebpub.db import models as ebpub
>>> schema = ebpub.Schema.objects.get(name='Local News')

Using this schema, we'll add a new NewsItem with the point created above:

>>> import datetime
>>> news = schema.newsitem_set.create(
...     title='Incident downtown',
...     description='Something happened downtown today!',
...     item_date=datetime.date.today(),
...     location=point,
...     location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'

That was easy. Now we have a NewsItem that OpenBlock is aware of and can be plotted on a map. However, what do we do if we can't geocode the address?

Using an External Geocoder

If we already have a geographic point, then we can circumvent the geocoder entirely:

>>> from django.contrib.gis.geos import Point
>>> manual_point = Point(-79.0553588124999891, 35.9133110937499964)
>>> news = schema.newsitem_set.create(
...     title='Incident downtown',
...     description='Something happened downtown today!',
...     item_date=datetime.date.today(),
...     location=manual_point,
...     location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'

This means we can also use an external geocoder. For example, we can use Google's Geocoding API with geopy. First, you'll need a Google Maps API key, which we'll use with geopy:

>>> GOOGLE_MAPS_API_KEY = '' # your Google Maps API key

Then we can use geopy to construct a new geocoder:

>>> from geopy import geocoders
>>> g = geocoders.Google(GOOGLE_MAPS_API_KEY)

And we can geocode our address:

>>> address = '123 East Franklin Street, Chapel Hill, NC'
>>> place, (lat, lng) = g.geocode(address)
>>> point = Point(lng, lat)
>>> point.wkt
'POINT (-79.0549350000000004 35.9136495999999994)'

You can even tap into OpenBlock's internals and build a Geocoder that OpenBlock can use:

from django.conf import settings
from django.contrib.gis.geos import Point

from geopy import geocoders
from geopy.geocoders.google import GQueryError

from ebpub.geocoder import Geocoder, DoesNotExist


class GoogleGeocoder(Geocoder):

    def __init__(self, *args, **kwargs):
        kwargs['use_cache'] = False # haven't implemented cache yet
        super(GoogleGeocoder, self).__init__(*args, **kwargs)
        self.geocoder = geocoders.Google(settings.GOOGLE_MAPS_API_KEY)

    def _do_geocode(self, location_string):
        try:
            place, (lat, lng) = self.geocoder.geocode(location_string)
        except (GQueryError, ValueError) as e:
            raise DoesNotExist(unicode(e))
        location = {'point': Point(lng, lat)}
        return location

This is a proof-of-concept geocoder we're using with OpenRural. You can find it on GitHub. Using this geocoder with a sample dataset from the North Carolina Secretary of State Corporation Filings, I was able to increase the geocoding success rate from about 37% to 95%. Again, your mileage will vary, but it can be useful to test out. We can't use Google's API for everything, though. Normal users are limited to 2,500 requests per day. Business accounts are allotted 100,000 requests. Additionally, Google requires you to display any points geocoded with their API on a Google Map. So you'll need to evaluate your needs before deciding on using Google's API.
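One way to keep request counts down, sketched here under assumptions rather than taken from OpenRural's code, is to try the local geocoder first and only fall back to the external one when it fails. Everything below is hypothetical: geocode_with_fallback, GeocodingFailed, and the two stand-in geocoder functions. A real version would catch OpenBlock's DoesNotExist instead of LookupError.

```python
class GeocodingFailed(Exception):
    """Hypothetical error raised when no geocoder can resolve a location."""


def geocode_with_fallback(location_name, geocoders):
    """Try each geocoder callable in order and return the first result."""
    errors = []
    for geocode in geocoders:
        try:
            return geocode(location_name)
        except LookupError as exc:
            # Remember why this geocoder failed, then try the next one
            errors.append("%s: %s" % (geocode.__name__, exc))
    raise GeocodingFailed("; ".join(errors))


def local_geocoder(name):
    # Stand-in for a local lookup that has no matching block data
    raise LookupError("no match for %r" % name)


def external_geocoder(name):
    # Stand-in for an external service; returns (lng, lat)
    return (-79.054935, 35.9136496)


point = geocode_with_fallback(
    "123 East Franklin Street", [local_geocoder, external_geocoder]
)
print(point)
```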

Caktus Group: Using Django and Celery with Amazon SQS

Amazon's Simple Queue Service (SQS) is a relatively new offering in the family of Amazon Web Services (AWS). It's also an appealing one, because it proposes to quickly and easily replace a common component of the stack in a typical web application, thereby obviating the need to run a separate queue server like RabbitMQ. While RabbitMQ (the typical favorite for Celery users) is not necessarily difficult to install or maintain, removing it from the stack of a web application means one less component that might fail. Offloading that service to AWS might also prove financially advantageous, especially for applications with a small to moderate queue volume.

While it's quite easy to use Celery with Amazon's Simple Queue Service (SQS), there's currently not a lot of information out there about how to do it. There's this post on the celery-users list that didn't leave me with much hope, and this question on StackOverflow that sounded slightly more promising. I still couldn't find a step-by-step how-to, however, and it ended up being quite easy, so here's my take:

  1. Upgrade to the latest versions of kombu, celery, and django-celery. At the time of this writing, those versions are 1.5.1, 2.4.5, and 2.4.2:

    pip install kombu==1.5.1
    pip install celery==2.4.5
    pip install django-celery==2.4.2
    
  2. Add the following lines to settings.py (or local_settings.py depending on your setup):

    BROKER_TRANSPORT = 'sqs'
    BROKER_TRANSPORT_OPTIONS = {
        'region': 'us-east-1',
    }
    BROKER_USER = AWS_ACCESS_KEY_ID
    BROKER_PASSWORD = AWS_SECRET_ACCESS_KEY
    

    In the above, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should point to the appropriate AWS access key and secret for the account you want to use. Pro tip: Use AWS's Identity and Access Management (IAM) to set up an API key and secret that only has access to the services your web application will use (typically one or more of SQS, SES, and SimpleDB).

  3. Finally, if you'll be running multiple servers or environments on the same AWS account (e.g., two different web apps or staging and production environments of the same app), you may want to customize the SQS queue name being used (the default is "celery"). To make this change, add the following lines to your settings.py (or again, local_settings.py):

    CELERY_DEFAULT_QUEUE = 'celery-myapp-production'
    CELERY_QUEUES = {
        CELERY_DEFAULT_QUEUE: {
            'exchange': CELERY_DEFAULT_QUEUE,
            'binding_key': CELERY_DEFAULT_QUEUE,
        }
    }
    

For the curious, Celery's support for SQS lies in the underlying Kombu library, the latest version of which includes a transport for SQS. While some posts I found (including the StackOverflow post) suggest using the BROKER_URL syntax for pointing to AWS, I found it simpler to use the BROKER_USER and BROKER_PASSWORD variables. I also saw some reports that slashes in your API secret could confuse the underlying URL parser, and since my API secret happened to include a number of slashes, I went straight to using BROKER_USER and BROKER_PASSWORD.
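The slash problem is easy to reproduce with Python's own URL parser, and percent-encoding the credentials is the usual workaround. The sketch below uses Python 3's urllib.parse and made-up credentials; build_broker_url is a hypothetical helper, and whether the kombu version above accepts percent-encoded credentials in BROKER_URL is an assumption that has not been verified here.

```python
from urllib.parse import quote, unquote, urlsplit


def build_broker_url(access_key, secret_key):
    """Percent-encode both credentials so "/" and "+" survive URL parsing."""
    return "sqs://%s:%s@" % (quote(access_key, safe=""), quote(secret_key, safe=""))


# Made-up credentials, for illustration only
raw_url = "sqs://EXAMPLEKEY:abc/def+ghi@"
quoted_url = build_broker_url("EXAMPLEKEY", "abc/def+ghi")

# The raw slash ends the host part early, so the password is lost
print(urlsplit(raw_url).password)

# The encoded form parses cleanly and decodes back to the original secret
print(unquote(urlsplit(quoted_url).password))
```

Using BROKER_USER and BROKER_PASSWORD, as in step 2, sidesteps the parsing question entirely.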

Anyways, I hope this helps someone else looking to solve the same problem, and don't hesitate to comment if you run into any issues or have a better way to go about this!

Caktus Group: OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction

This is the second post in our OpenRural series reviewing OpenBlock and its geocoder. OpenBlock Geocoder, Part 1: Data Model and Geocoding covers the internals of the OpenBlock geocoder and its geocoding capabilities. As this post builds upon topics covered there, you may wish to read Part 1 before proceeding. In this post we step back from the internals of the geocoder and explore how to use it along with other OpenBlock tools to parse unstructured text.

I'd also like to give a shout out here to Paul Winkler who was kind enough to answer questions and point me in the right direction on the topics below. Thanks Paul!

The Problem

OpenBlock's original design is centered around providing news at a hyper-local level. That is, down to your own city block. This allows interested citizens to see events ranging from police incidents, to restaurant inspections, to local news articles all aggregated on a map of your block. OpenBlock provides scraping tools to assist downloading this data from the web, but the obvious problem here is that most data isn't packaged or tagged with geographic information. Let's look at an example article teaser from The Daily Tar Heel in Chapel Hill, NC:

No. 4 North Carolina led Evansville 63-27 with just more than 14 minutes to go in the first half when senior forward Tyler Zeller scored his 999th career point at the Smith Center on Tuesday night.

The article mentions the game at the Smith Center, which is the location we want to extract and plot on a map. This is where OpenBlock's utilities for ingesting unstructured text help.

Places

Places are simple models containing only a name and geographic point. OpenBlock implements a mechanism to find places defined in the database from a body of text. For example, say we have the following string we'd like to parse:

>>> message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'

OpenBlock can extract "Varsity Theater" if we define it as a Place. You can create and import places in the OpenBlock admin, but to keep things simple, we'll just create one here:

https://gist.github.com/1469282

Here we created a new Point of Interest place (a place type which is loaded by default on any OpenBlock install) geocoded to 123 East Franklin Street. Now we need a way to parse places from strings. Most of this functionality is found in ebdata, which contains a Natural Language Processing package, nlp. We can use its place_grabber to extract matching places:

https://gist.github.com/1469286

We can feed this right back into the Place model to retrieve the database objects and their geographic locations:

https://gist.github.com/1469357

The parser is case sensitive however, so it'll fail if it's not an exact match:

>>> grabber("VARSITY THEATER")
[]

Obviously this is a brute-force method and requires you to pre-load all places of interest into the database beforehand. It's pretty rudimentary, but does provide this functionality out-of-the-box.
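To give a feel for what this kind of brute-force matching involves, here is a rough, database-free sketch. It is not OpenBlock's implementation: make_grabber is a hypothetical helper that takes a plain list of names instead of Place objects, and the (start, end, name) result shape is borrowed from the location grabber output shown in the next section.

```python
import re


def make_grabber(names):
    """Build a case-sensitive matcher for a fixed list of place names."""
    if not names:
        return lambda text: []
    # Longest names first, so "Varsity Theater" wins over a shorter "Varsity"
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    def grabber(text):
        return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

    return grabber


message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'
grab = make_grabber(["Varsity Theater", "Chapel Hill"])
print(grab(message))
print(grab("VARSITY THEATER"))
```

Because the pattern is compiled once when the grabber is built, repeated calls only pay for the regular expression scan.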

Locations

OpenBlock can also extract locations defined in the database. We already have cities loaded, so we'll use them in this example. Just like the place grabber, the location grabber is case sensitive, so we'll define a location synonym with the proper case:

>>> from ebpub.db.models import Location, LocationSynonym
>>> ch = Location.objects.get(name='CHAPEL HILL')
>>> LocationSynonym(pretty_name='Chapel Hill', location=ch).save()

By default, the location grabber ignores types of "city" and "borough". To keep things simple, we'll just create one that includes all location types:

>>> grabber = places.location_grabber(ignore_location_types=[])

Now we can use the grabber to extract locations:

>>> grabber(message)
[(50, 61, 'Chapel Hill')]

The OpenBlock grabbers cache the locations/places on instantiation, so if you plan to parse a lot of text in succession, you won't hit the database after the initial run. Cool!

Addresses

ebdata.nlp can also parse addresses. For example, let's use a simple string:

>>> from ebdata.nlp.addresses import parse_addresses
>>> parse_addresses('The Varsity Theater is located at 123 N Franklin St')
[('123 N Franklin St', '')]

Under the hood, OpenBlock uses a large regular expression to do this, so it's not actually hitting the database or attempting to do geocoding. You'll notice that it returns a list of 2-item tuples. The second item is for the city:

>>> parse_addresses('The individual was seen on 123 N Franklin St in Chapel Hill')
[('123 N Franklin St', 'Chapel Hill')]

It can parse block locations too:

>>> parse_addresses('The construction is on the 100 block of Franklin St.')
[('100 block of Franklin St.', '')]

And intersections:

>>> parse_addresses('The incident occurred at the intersection of Franklin and Hillsborough')
[('Franklin and Hillsborough', '')]
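OpenBlock's real expression is far larger and also handles blocks, intersections, and cities. Purely as an illustration of the regex-only approach, here is a toy pattern that finds simple street addresses. The pattern, the short list of street suffixes, and the find_addresses helper are all assumptions made up for this sketch.

```python
import re

# Toy pattern: house number, optional one-letter direction, a capitalized
# street name, and one of a few street suffixes
ADDRESS_RE = re.compile(r"\b\d+ (?:[NSEW] )?[A-Z][a-z]+ (?:St|Ave|Rd|Blvd)\b")


def find_addresses(text):
    """Return every substring that looks like a simple street address."""
    return ADDRESS_RE.findall(text)


print(find_addresses('The Varsity Theater is located at 123 N Franklin St'))
print(find_addresses('No. 4 North Carolina led Evansville 63-27'))
```

Since nothing is checked against street data, any text with this shape matches, which is where false positives come from.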

It all comes together with the geocoder:

https://gist.github.com/1469324

Conclusion

As you can see, OpenBlock provides a few useful utilities to parse unstructured text. They're fairly limited and, especially with the address parser, will most likely return a lot of false positives. But I think OpenBlock has provided a great starting point. Stay tuned for more posts on the inner workings of the OpenBlock project!

TriZPUG EventsTriZPUG December 2011 Meeting: Python Show and Tell

Come share your Python experience through lightning talks. Lightning talks are 5- to 10-minute extemporaneous expositions on a topic of interest to you, something you recently learned, kind of like a show and tell. We'll be meeting at Splatspace, a non-profit member-supported workshop and hacker meeting place. Splatspace is located in the basement of the Snow Building at 331 W. Main St. in Durham. Parking (free exit after 7pm) is in the back of the building in the lot off Ramseur St. on the downtown Durham loop (one way, approach Ramseur from W. Main St. or W. Chapel Hill St.). If you arrive after 7pm, please call 919-704-4225(HACK) to be let in the door.

Caktus GroupOpenBlock Geocoder, Part 1: Data Model and Geocoding

As Tobias mentioned in Scraping Data and Web Standards, Caktus is collaborating with the UNC School of Journalism to help develop Open Rural (the code is on GitHub). Open Rural hopes to help rural newspapers in North Carolina leverage OpenBlock. This blog post is the first of several covering the internals of OpenBlock and, specifically, the geocoder.

OpenBlock Data Model

The OpenBlock geocoder can only geocode from the data it has. It doesn't leverage a 3rd-party API or service. It only uses what's loaded in PostgreSQL (with PostGIS and GeoDjango) and, in this example, what comes from the US Census Bureau and local city and county GIS offices.

Further, the imported data is typically filtered by a bounding box setting in METRO_LIST. The setting, extent, is a list of leftmost longitude, lower latitude, rightmost longitude, upper latitude. This defines a bounding box - the range of latitudes and longitudes that are relevant to your area. A small or restrictive box will limit imported ZIP code and block data to areas that fall within the box.
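To make the bounding box idea concrete, here is a minimal, database-free sketch of the containment test that an extent implies. This is not OpenBlock's code: the function names normalize_extent and in_extent are made up for illustration, and the second test point is only an approximate location for downtown Raleigh. The extent itself is the larger one used later in this post.

```python
def normalize_extent(extent):
    """Return (min_lon, min_lat, max_lon, max_lat) regardless of input order."""
    lon_a, lat_a, lon_b, lat_b = extent
    return (min(lon_a, lon_b), min(lat_a, lat_b),
            max(lon_a, lon_b), max(lat_a, lat_b))


def in_extent(extent, lon, lat):
    """True if the point (lon, lat) falls inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = normalize_extent(extent)
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Extent covering all of Chapel Hill (the larger setting used later on).
CHAPEL_HILL = (-79.165922, 35.829095, -78.978468, 36.02426)

print(in_extent(CHAPEL_HILL, -79.076843, 35.929649))  # True: inside the box
print(in_extent(CHAPEL_HILL, -78.6382, 35.7796))      # False: roughly Raleigh
```

Anything that fails a test like this is what the importers report as "out of bounds".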

Let's look at an example with these shapefiles:

We'll start with a restrictive extent that only consists of downtown Chapel Hill:

METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.066272, 35.91671, -79.040481, 35.910663),
        # ...
    },
)

This selection loaded 2 ZIP codes:

$ django-admin.py import_nc_zips
Importing zip codes...
# ...
Skipping 27511, out of bounds
Skipping 27513, out of bounds
Created ZIP Code 27514 
Created ZIP Code 27516 
Skipping 27517, out of bounds
Skipping 27519, out of bounds
# ...
Created 2 zipcodes.

And limited the block data as well:

$ django-admin.py import_county_streets 37135
Importing blocks, this may take several minutes ...
Created 73 blocks
Populating streets and fixing addresses, these can take several minutes...
Populating the streets table
streets: created: 28
block_intersections: created: 160
Done.

Restricting the area will limit the ability of the geocoder. In this case, for example, it can geocode the intersection of Franklin and Henderson, which is right downtown, but not Franklin and Estes (don't worry, we'll get into more geocoding details in the next section). A map helps illustrate this more clearly. Below you can see the bounding box with pins on the two intersections:


<iframe frameborder="0" height="480" marginheight="0" marginwidth="0" scrolling="no" src="http://maps.google.com/maps/ms?msa=0&amp;msid=214893984853334421171.0004b3fa53ca228c33d80&amp;ie=UTF8&amp;t=m&amp;vpsrc=6&amp;ll=35.920335,-79.05015&amp;spn=0.033364,0.054932&amp;z=14&amp;output=embed" width="640"></iframe>
View OpenRural - Downtown Chapel Hill in a larger map

If we increase the bounding box, we'll get a lot more data:

METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # ...
    },
)

With an extent that encompasses all of Chapel Hill, the importer loaded 9 ZIP codes, 4302 blocks, 1699 streets, and 7189 intersections. Here's a map illustrating the larger extent:


<iframe frameborder="0" height="480" marginheight="0" marginwidth="0" scrolling="no" src="http://maps.google.com/maps/ms?msa=0&amp;msid=214893984853334421171.0004b3fa78bd0c932ef80&amp;ie=UTF8&amp;t=m&amp;vpsrc=6&amp;ll=35.929649,-79.076843&amp;spn=0.266881,0.439453&amp;z=11&amp;output=embed" width="640"></iframe>
View OpenRural - Orange County, NC in a larger map

It's up to the maintainer of an OpenBlock install to determine which extent to use as it is based on the specifics of the application. A large extent will import more ZIP codes and blocks and, therefore, will slow down geospatial queries and may include unwanted geographic areas.

Street

Now that we have NC Orange County data loaded, let's investigate this data with the OpenBlock models.

The Street model contains a catalog of all loaded streets. It's a simple model with only a few fields:

  • street
  • pretty_name
  • street_slug
  • suffix
  • city
  • state

In NC Orange County, we can see that the street data spans 4 cities:

>>> from ebpub.streets.models import Street
>>> Street.objects.order_by('city').values_list('city', flat=True).distinct()
[u'', u'CARRBORO', u'CHAPEL HILL', u'DURHAM', u'HILLSBOROUGH']

Some streets cross city lines and therefore contain two entries:

>>> Street.objects.filter(street_slug='rosemary-st').values_list('city', flat=True)
[u'CARRBORO', u'CHAPEL HILL']

And, for example, if we're looking for Franklin St. in Chapel Hill, NC, we can filter for it here:

<script src="https://gist.github.com/1467493.js?file=gistfile1.py"></script>

Blocks

Blocks are fundamental to OpenBlock and are used by the geocoder. OpenBlock defines a block as "a segment of a single street between one side street and another side street." The Block model is slightly more intricate than Street, but each entry basically represents the address range of a street for each block segment.

To start, we can see that Franklin St. is divided into roughly 32 blocks:

>>> from ebpub.streets.models import Block
>>> Block.objects.filter(street_slug='franklin-st').count()
32

It's sectioned into an east and west segment:

>>> Block.objects.filter(street_slug='franklin-st').order_by('street_pretty_name').values_list('street_pretty_name', 'predir').distinct()
[(u'Franklin St.', u'W'), (u'Franklin St.', u'E')]

And can have an address between 100 and 1899:

>>> from django.db.models import Min, Max
>>> Block.objects.filter(street_slug='franklin-st').aggregate(Min('from_num'), Max('to_num'))
{'from_num__min': 100, 'to_num__max': 1899}

So we can find the block that contains the 123 address:

<script src="https://gist.github.com/1467847.js?file=gistfile1.py"></script>
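As a rough, database-free illustration of what "the block that contains the 123 address" means, the sketch below filters a handful of made-up rows by address range. The field names (street_slug, predir, from_num, to_num) mirror the Block fields queried above, but BlockRow, find_blocks, and the sample ranges are hypothetical and not part of OpenBlock.

```python
from collections import namedtuple

# Hypothetical rows; only the field names come from the Block model above.
BlockRow = namedtuple("BlockRow", "street_slug predir from_num to_num")

BLOCKS = [
    BlockRow("franklin-st", "E", 100, 199),
    BlockRow("franklin-st", "E", 200, 299),
    BlockRow("franklin-st", "W", 100, 199),
]


def find_blocks(blocks, street_slug, number, predir=None):
    """Return blocks on the street whose address range contains number."""
    return [
        b for b in blocks
        if b.street_slug == street_slug
        and b.from_num <= number <= b.to_num
        and (predir is None or b.predir == predir)
    ]


print(find_blocks(BLOCKS, "franklin-st", 123, predir="E"))
print(len(find_blocks(BLOCKS, "franklin-st", 123)))  # 2: the E and W blocks
```

Without a direction, 123 matches both an east and a west block in this toy data, which is why the predir field matters. With the ORM, the same idea would presumably be expressed with from_num__lte and to_num__gte lookups.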

Also, on a side note, it's possible for some blocks to span cities:

<script src="https://gist.github.com/1467849.js?file=gistfile1.py"></script>

Geocoding

Now that we have a basic understanding of how the data is stored within OpenBlock, let's do some geocoding. Most of these examples will use the SmartGeocoder class. SmartGeocoder delegates to specific geocoders (AddressGeocoder, BlockGeocoder, and IntersectionGeocoder) based on how it interprets the string with regular expressions.
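The delegation idea can be sketched in a few lines. The patterns below are deliberately simplified and hypothetical; they are not the expressions OpenBlock uses, and classify is a made-up name. The sketch only shows how a string's shape can decide which kind of geocoder should handle it.

```python
import re

# Simplified, hypothetical patterns; OpenBlock's real expressions are larger.
BLOCK_RE = re.compile(r"^\d+\s+block\s+of\s+\S", re.IGNORECASE)
INTERSECTION_RE = re.compile(r"\S\s+(?:and|&)\s+\S", re.IGNORECASE)
ADDRESS_RE = re.compile(r"^\d+\s+\S")


def classify(location_string):
    """Guess which kind of geocoder a location string should be sent to."""
    text = location_string.strip()
    if BLOCK_RE.search(text):
        return "block"
    if INTERSECTION_RE.search(text):
        return "intersection"
    if ADDRESS_RE.search(text):
        return "address"
    return "unknown"


for example in ("123 East Franklin Street",
                "100 block of Franklin St.",
                "Franklin and Henderson"):
    print(example, "->", classify(example))
```

The block pattern is checked first because a block string also starts with a number and would otherwise be mistaken for an address.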

Addresses

To start, let's geocode "123 East Franklin Street":

<script src="https://gist.github.com/1467863.js?file=gistfile1.py"></script>

This one was pretty easy for the geocoder to parse and find. You can see that not only has it found the associated block, but it also knows the exact geographic point. However, this will fail if passed a non-existent address number (InvalidBlockButValidStreet):

<script src="https://gist.github.com/1467865.js?file=gistfile1.py"></script>

In this case, the geocoder was able to extract the address, but it failed to find the associated block in the database. Non-existent streets also fail (DoesNotExist):

<script src="https://gist.github.com/1467869.js?file=gistfile1.py"></script>

Intersections

The geocoder can locate intersections too:

<script src="https://gist.github.com/1467876.js?file=gistfile1.py"></script>

Notice how the intersection field is populated, rather than block. This will raise a DoesNotExist exception when an intersection is not found:

<script src="https://gist.github.com/1467885.js?file=gistfile1.py"></script>

Street Misspellings

OpenBlock provides a model, StreetMisspelling, to define street aliases. This allows you to map a bad street name to a good street name that exists in the database:

<script src="https://gist.github.com/1467895.js?file=gistfile1.py"></script>

Now geocoding "Glen Haven" will find "Glenhaven".

Multiple Cities

By default, OpenBlock is configured to work with a single city, which is defined in METRO_LIST:

# Metros. You almost certainly only want one dictionary in this list.
# See the configuration docs for more info.
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),

        # The major city in the region.
        'city_name': 'Chapel Hill', 
    },
)

The geocoder will fail if it locates a street that's associated with a city unknown to OpenBlock. For example, 100 Pine Street is in Carrboro and not Chapel Hill:

<script src="https://gist.github.com/1467903.js?file=gistfile1.py"></script>

This street exists in the database because our extent covers most of Orange County. Since we've set up OpenBlock to encompass an entire county, rather than a single city, we need to define additional cities. This can be accomplished in one of two ways:

  • Add additional dictionaries to METRO_LIST for each city
  • Import city locations into the database and tell OpenBlock to refer to these

We imported Orange County city boundary data above, so we'll use the latter:

METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),

        # Set this to True if the region has multiple cities.
        # You will also need to set 'city_location_type'.
        'multiple_cities': True,

        # The major city in the region.
        'city_name': 'Chapel Hill',

        # Slug of an ebpub.db.LocationType that represents cities.
        # Only needed if multiple_cities = True.
        'city_location_type': 'cities',
    },
)

Here we enabled multiple_cities and told OpenBlock that the slug of the location type representing cities is cities. Now 100 Pine Street will geocode properly:

<script src="https://gist.github.com/1467912.js?file=gistfile1.py"></script>

What's Next

Now that we've had an overview of the geocoder, we'll jump into OpenBlock's place, location, and address parser. Stay tuned!

Update: Read more in OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction.

Caktus GroupCaktus Group Django Sprint

Earlier last month Caktus hosted the 3rd Django Sprint at our Carrboro, NC office. 

Caktus GroupScraping Data and Web Standards

We're currently involved in a project with the UNC School of Journalism that hopes to help rural newspapers in North Carolina leverage OpenBlock.  The project is called OpenRural, and if you're a software developer you can find the latest code on GitHub.

OpenBlock needs geographic data to display, and that data can come from a variety of sources.  We've found a number of web sites that offer geographically interesting data to NC residents, and in this post I'd like to discuss my experience attempting to scrape (that is, programmatically navigate and extract data from) the Chapel Hill Police Department's (CHPD's) online database of crime reports.

The CHPD site advertises itself as powered by "Sungard Public Sector OSSI's P2C engine," and a quick Google for "P2C engine" shows that Chapel Hill is not the only city or county in North Carolina that happens to use this product.  Unfortunately, scraping the data on this site proved to be a non-trivial endeavor.

I opted to host and run my scraper script on ScraperWiki, which is a great tool for writing, testing, and running scraper scripts in a variety of scripting languages.  The site even manifests the scraped data in API form, so it could potentially be used as an abstraction layer between the scraped sites and OpenBlock (or any other consumer of the data).  The current state of the script can be found here:

https://scraperwiki.com/scrapers/chapel_hill_police_reports/

The script uses the Python mechanize library to navigate the site being scraped, and BeautifulSoup to find and extract data on the pages retrieved.  After telling mechanize to click the "I Agree" button on the CHPD web site's landing page, it was easy enough to submit the search form for the current day and return a listing of results.

While getting the initial list of results was fairly trivial, one issue I ran into when writing the scraper is that the site uses an odd method of retrieving and paginating results.  Looking at the HTML source, you will see that the search form is submitted by a small piece of JavaScript, like so:

function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

It turns out this little method is used to do quite a lot.  There are calls to it to do everything from sorting, to pagination, to link to other pages on the site.  It effectively works by setting the form action (via two hidden form inputs on the page) and then calling submit() on the form.
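For a scraper, the practical upshot is that a "click" is just a POST with two extra fields filled in. Here is a small sketch of building that request body by hand. The function name build_postback and all of the field values are hypothetical, and the __VIEWSTATE entry only stands in for whatever other hidden inputs the real page carries.

```python
from urllib.parse import urlencode


def build_postback(hidden_fields, event_target, event_argument=""):
    """Return the form fields a __doPostBack call would submit."""
    fields = dict(hidden_fields)  # copy so the caller's dict is untouched
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return fields


# Hypothetical values; a real page supplies its own hidden inputs and targets.
hidden = {"__VIEWSTATE": "abc123"}
payload = build_postback(hidden, "ctl00$Grid", "Page$2")
body = urlencode(payload)  # the string that would be sent as the POST body
print(body)
```

Every sort, pagination, or navigation action then differs only in the target and argument values.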

You may have also noticed that the form has method="post" set, rather than method="get", which means the web browser will send an HTTP POST (rather than an HTTP GET) every time you modify the form and click the Search button.  Per the HTTP/1.1 specification, POST requests should be used for requests that modify data on the server, whereas GET requests should be used to retrieve information at a given URL.  You can also tell that the site uses POST instead of GET by inspecting the URL in your browser; pages that use GET will typically have a portion of their URL that starts with a question mark and is followed by key/value pairs.  The link to the Google search above is an example of the GET method. Searching a site is by definition a retrieval operation (and typically does not involve modifying data on the server), so well-written search forms should use the GET rather than the POST HTTP method.
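As a quick illustration of what a GET-based search looks like, the sketch below puts the search parameters into the URL's query string and reads them back out again. The URL and parameter names are hypothetical and chosen only for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical search URL and parameter names, for illustration only.
base = "http://example.com/search"
params = {"q": "P2C engine", "page": "2"}

get_url = base + "?" + urlencode(params)
print(get_url)  # http://example.com/search?q=P2C+engine&page=2

# The same parameters can be recovered from the URL alone.
query = parse_qs(urlsplit(get_url).query)
print(query)
```

Because the whole request is captured in the URL, a GET search can be bookmarked, linked to, and safely repeated, which is exactly what a POST-only form gives up.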

Confusing POST and GET is a fairly elementary problem, but it's one that we see far too often on the web.  If you've ever been prompted by your browser to "re-submit a form" after hitting the back button, and warned that doing so may modify data on the server, the site you're using is probably not using the GET and POST HTTP methods properly.

In the case of the CHPD site, while it was easy enough to set the values of the hidden form inputs and re-submit the form using POST (after finding this post on StackOverflow, at least), for some reason the site still returns the first page of results to mechanize (even though it properly paginates in a real web browser). I'm still working on it, but in the meantime, check out the code and let me know if you have any ideas. :-)

Gary PosterClojure/conj: The Questions and the Answers

Wow. I just had a full immersion experience into the Clojure language and community, and it was awesome.  I'll write about it in two posts: questions I tried to answer, and notes from the presentations.  Here's the first. Thanks to my wife, my boss, and my employer, Canonical, I got to attend the Clojure/conj conference here in Raleigh, as well as the training session beforehand.  It was a full

Frank WierzbickiContributing to Jython

About a year and a half ago my dream job of doing nothing but Jython all day and night came to an end (By the way, does anyone want to pay me to do Jython all day and night? It's about the only thing that could pull me from my awesome job at Canonical which I'm overdue on writing about here). Anyway to re-integrate myself into society I had to go cold turkey on Jython for a while so I could learn how to have a regular job again. I've contributed to Jython here and there by coding some of this and that, but I've failed to take care of the most important part: helping new people that want to get involved in Jython. I've let that go on for too long and I need to turn things around and get back to doing that. Recently a frustrated patch author sent an email about how hard it is to become a Jython contributor. He has some patches that have been sitting around for a long time and I'm pretty ashamed that that is the normal course of things lately. So, as I start giving Jython a bit more of my spare time again, I plan to make it a priority to review patches and try to figure out how to grow the Jython developer community again. So send patches and I promise to look at them. In particular, if anyone wants to put together patches that fix failing tests in the default Jython branch that targets 2.6 compatibility, I'll be right on them. I'll put together another post on contributing to Jython soon.

Og MacielStarting a New Chapter: To Infinity, and Beyond!

New Chapter

Last October I celebrated a couple of milestones in my life:

  • 5 years living in North Carolina;
  • 2 years since I bought my first house;
  • 11 years married to my wife;
  • 5 years working for rPath;

Needless to say, each one of those milestones was very important to me, as they mark important decision points in my life! Every single one of them changed my life for the better and I can’t help but feel blessed that I have so many things to look forward to at the end of the year that remind me how lucky I have been!

Last week was also an important one for me, as I was offered and accepted a job at Red Hat as a Senior QA Engineer for their CloudForms team! I assure you that it was a bittersweet moment for me, for I really loved the work I had been doing at rPath. But the chance to work on Red Hat’s cloud initiative was too much of a temptation for me to pass up! I have been truly blessed for having had the chance to join rPath 5 years ago at a point where they were still trying to establish themselves among the big technology providers out there. Five years later they’ve accomplished their goal and I can look back and feel proud that all the late nights, cancelled vacations and hard work paid off!

This December I get to work on a very exciting project with a great bunch of guys trying to accomplish a similar task! I am extremely excited about the potential that CloudForms and its derivatives will bring to the masses and I can only hope to be able to look back five years from now and be able to celebrate another job well done!

To infinity, and beyond!

Calvin SpealmanANN: straight.plugin 1.2 Released

Erik Youngren contributed the loading of packages as plugin modules, which should be useful to a number of users who have module needs that don't fit in a single .py file. This release also includes .egg packages for Python 2.6, 2.7, and 3.2 versions.


Enjoy!

Get the new release here http://pypi.python.org/pypi/straight.plugin/1.2
Or, go straight to the source at http://github.com/ironfroggy/straight.plugin

TriZPUG EventsTriZPUG November 2011 Meeting: Grand Sprint Report Outs

We'll hear report-outs from both the NC Django Sprint and the Plone Conference Crushinator Sprint. As always, spontaneous lightning talks of ten minutes or less on other topics are also welcome. Anything you've learned about Python, no matter how trivial, can be a lightning talk. There's plenty of after hours parking in the decks on Partners Way.

Gary PosterClojure/conj: The Talks

The previous mammoth post was about the questions and answers I found at the Clojure/conj conference and training.  This mammoth post is my notes from the talks. Thanks to my employer, Canonical, for the opportunity to go to the conference.  When Canonical gives employees time to go to a conference, we have to summarize it.  The summaries are often company-internal emails, but I like blogging

TriZPUG EventsNC Django Sprint 2011

A development sprint is an excuse to get together, write some code, and have a good time doing it. The purpose of this sprint will be to help finish features and push out bug fixes in preparation for the Django 1.4 release. If you're interested in coming to work on other open source Django-based projects, that's welcome too. It doesn't require any previous experience and, if you don't have prior experience contributing to Django, it is the perfect opportunity to start. We'll be there at 9am both days to open the doors. It's a tradition to go out for drinks after the sprinting winds down, perhaps around 4 or 5, but this may be earlier or later depending on the general momentum of the sprint. The sprint is being hosted by Caktus Consulting Group.

Calvin SpealmanNaNoWriMo: Day 3

LONG DAY. Frustrated with the results today, but I am happy that I'm still on track. This is why I got extra words in the first two days!

Daily Goal: 1666

Today: 967

Total: 5005

Remaining: 44995

Remaining Daily: 1606
from my tumblr post

Calvin SpealmanNaNoWriMo: Day 2

I love how the story is revealing itself. I am not planning much, but many possibilities are popping into my head as I start to make connections in the story. Partly I wonder if I should avoid them; maybe they are too obvious, since I thought of them just from what I’ve written? I don’t know. I’ll probably run with it. I need the material, after all.


What is Roshyn after, anyway?


Daily Goal: 1666


Today: 2028


Total: 4038


Remaining: 45962


Remaining Daily: 1584

from my Tumblr post

Calvin SpealmanNaNoWriMo: Day 1

I started my novel. I hate the first thousand words, which were just meandering and going nowhere. I stumbled on something interesting for the second thousand.
Daily Goal: 1666
Today: 2010
Total: 2010
Remaining: 47,990
Remaining Daily: 1600
Today was a Good Writing Day. Even though I could not write in the morning, and only was able to get 300 words on the bus, I still finished out ahead.
from my Tumblr post

Calvin SpealmanNaNoWriMo: Day 0

I have signed up for NaNoWriMo 2011 with a novel draft I’ll be writing named The Compass and The Leash. I’ll make some notes here each day as I go.
Mirrored from my Tumblr post

TriZPUG EventsTriZPUG October 2011 Meeting: PLY

Joseph Tate built a front end to a rules system in a data processing system. To do so he designed a Domain Specific Language (DSL) to simplify creating triggering code by non-sophisticated users. Joseph will show you how he did that and what the PLY programming system looks like, and will compare PLY to some of the other common Python DSL tools like Yapps and SimpleParse. As always, spontaneous lightning talks of ten minutes or less on other topics are also welcome. Anything you've learned about Python, no matter how trivial, can be a lightning talk. Note: this meeting starts at 6pm as the doors to the building automatically lock at 7pm.

Caktus GroupDjango Without the Web

One of the things I like best about Django is how easy its ORM makes it to work with databases. Too bad Django is only for web applications. Sure, you could deploy a Django app and then make use of it from a non-web application using a REST API, but that would be too awkward.

But there is an easy way to use Django without the web! Here's the trick - write your application as Django management commands. Then you can run it from the command line. Just like 'manage.py syncdb' or 'manage.py migrate', you can run 'manage.py my_own_application' and your application has access to the full power of Django ORM.

Adding a new Django management command is surprisingly easy:

  1. Add a management/commands directory to your application.
  2. Create an anything.py file containing a class that extends django.core.management.base.BaseCommand or a subclass.
  3. Write a handle method that runs your application.
  4. Run 'manage.py anything'

Here's an example of a trivial command:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        print "Hello, world"

Create a management/commands directory in your application and save this there as 'hello.py'.

Now try it:

$ ./manage.py hello
Hello, world
$

How about doing something useful?  Here's an example that prints out all of your invoices, so you can see how easy it is to access your data:

from django.core.management.base import BaseCommand
from appname.models import Invoice

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        print "Invoices"
        for invoice in Invoice.objects.order_by('date'):
            print u"%s %s" % (invoice.date, invoice.summary)

I've used custom management commands to do things like importing data where something more complicated than loading a fixture was needed.

For more details, see the Django documentation.

Chris CallowayPlugging Leaks With Multiprocessing

At one time or another, we all have to deal with Python modules which are simply wrappers around code written in other languages such as C or Fortran. Particularly when it comes to scientific data, there are simply too many maintained and pre-existing code libraries to ignore. That Python is great glue for such libraries is one of the great things about Python.

However, code libraries written in static languages without garbage collection often have memory leaks. Sometimes the memory leaks are even well known. Python prevents its own objects from leaking memory through its garbage collection. But extension libraries can leave unfreed objects which are unreachable by Python's garbage collector. These leaks can eventually crash your Python process if allowed to build up by repeated invocations of the leaky code within the same program.

If you can isolate the offending leaky extension objects into a function, then you are in luck. Python will allow you to call that function in its own process. Then when the process ends, your operating system will reclaim all that leaked memory for you.

The multiprocessing module is your friend. Use it to run leaky extensions in their own processes:

import multiprocessing

def sir_leaks_a_lot(datum):
    # Put your leaky extension code here.
    # For instance, matplotlib functions which
    # crash with "std::bad_alloc" errors when
    # called repeatedly.
    pass

for datum in data:
    p = multiprocessing.Process(target=sir_leaks_a_lot, args=(datum,))
    p.start()
    p.join()
    assert not p.exitcode, \
           "Exitcode %s from processing %s" % (p.exitcode, datum)

The start() method will run sir_leaks_a_lot with its parameters bound to the elements of the args tuple in the call to Process(). The join() method will wait for the process to finish.

You could run multiple sir_leaks_a_lot processes at once instead of allowing each to finish one at a time. But then, that would allow the leaks to build up and crash your program again. So using join() causes each leaky process to get cleaned up before running the next one.

Now you are ready to generate tens of thousands of large matplotlib plots in a single cron job!

Joe Gregorioclient_secrets.json

The google-api-python-client has just added support for the client_secrets.json file format (in tip, a new release with the support is coming soon).

The file format is (loosely) defined here:

http://code.google.com/p/google-api-python-client/wiki/ClientSecrets

The oauth2client/google-api-python-client support is explained here:

http://code.google.com/p/google-api-python-client/wiki/ClientSecretsSupport

Copied from the 'Motivation' section:

Traditionally providers of OAuth endpoints have relied upon cut-and-paste as the way users of their service move the client id and secret from a registration page into working code. That can be error prone, along with it being an incomplete picture of all the information that is needed to get OAuth 2.0 working, which requires knowing all the endpoints and configuring a Redirect Endpoint. If service providers start providing a downloadable client_secrets.json file for client information and client libraries start consuming client_secrets.json then a large amount of friction in implementing OAuth 2.0 can be reduced.

Chris CallowayDispatcher Pattern Safety

This post is a rehash of a lightning talk I gave at the last TriZPUG meeting.

One of the more useful programming patterns when applied to Python is the Dispatcher Pattern. The getattr built-in function practically implements the entire pattern. It's one of those Python goodies which leads people to say that programming patterns are already built into Python.

The Dispatcher Pattern allows you to create plug-in architectures for your code. The idea is that you have a number of handler code objects for handling different types of data. When your code encounters data which needs handling, dispatcher code selects which handler should be used.

Some code is worth a thousand words. Here's an example I use when teaching the Dispatcher Pattern at PyCamp:

class Plugin(object):
    """Implement a pluggable architecture."""

    def handle_html(self, name):
        print "The HTML file", name, "has been handled."

    def handle_pdf(self, name):
        print "The PDF file", name, "has been handled."

    def handle_rtf(self, name):
        print "The RTF file", name, "has been handled."

    def handle_default(self, name):
        print "The file", name, "has been handled."

if __name__ == '__main__':
    import sys, os.path
    name = sys.argv[1]
    try:
        ext = os.path.splitext(name)[1][1:].lower()
    except IndexError:
        ext = 'default'
    plugin = Plugin()
    getattr(plugin, 'handle_' + ext, plugin.handle_default)(name)

The string name of a handler method is created within getattr's argument list. getattr returns the appropriate handler method and the handler is directly dispatched by call.

The handler objects need not be methods in a class. The handlers could be functions in a module which is imported. getattr doesn't care what kind of object the handlers are attributes of.

A plug-in architecture truly occurs when the handlers are modules or subpackages within a package for containing pluggable handlers. If the __all__ attribute of the package is properly maintained, then new handler modules may simply be dropped into the plug-in package's directory. By maintaining a naming convention for the handlers, the __init__.py module of the package may glob for a list of handlers to extend onto __all__:

import os, glob

__all__ = [os.path.splitext(os.path.basename(handler))[0]
           for path in __path__
           for handler in glob.glob(os.path.join(path, 'handle_*.py'))]

Such pluggable modules should implement well-defined functions by name (e.g., run, process, create, update) which may be accessed when the handler module is dispatched through getattr operating on the namespace into which the handlers are imported:

import sys
from plugins import *

datatype = 'pony'
getattr(sys.modules[__name__], 'handle_' + datatype).run()

The code above should execute the run function of the handle_pony module in the plugins package if the __all__ attribute of the plugins package was properly maintained in __init__.py.

Obviously, modules placed in a plugins package are trusted code. And as trusted code, we should expect such modules to handle all anticipated exceptions and clean up after themselves.

But in the real world, unanticipated exceptions occur. File formats being handled can change. Web service APIs might morph. Any number of conditions might occur which could lead to unhandled exceptions in a plug-in.

Hopefully our plug-ins don't simply swallow unhandled exceptions. But if they are unanticipated exceptions, we may expect them to bubble up to the dispatcher code.

If the dispatch code is one-shot, that is, not executed repeatedly or in a long-running process, then allowing the unanticipated exception to halt our program and display as a traceback may suffice. But for the most part, we want neither to let plug-ins crash our code nor to silence the exceptions which plug-ins don't handle.

Because we don't know the type of the unanticipated exception, our plug-in exception handler must cover all the bases. The traceback module is handy for making sure we know what occurred:

import sys, traceback

try:
    getattr(sys.modules[__name__], 'handle_' + datatype).run()
except:
    traceback.print_exc()

For long running processes, one further refinement helps us stay sane:

import sys, traceback

try:
    getattr(sys.modules[__name__], 'handle_' + datatype).run()
except KeyboardInterrupt:
    sys.exit()
except:
    traceback.print_exc()

By using these techniques, you can implement a plug-in architecture which won't crash your programs and will also let you know what went wrong when a plug-in goes awry.
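A variation on the same idea, offered as a sketch rather than a replacement: catching Exception instead of using a bare except leaves KeyboardInterrupt and SystemExit alone (neither derives from Exception since Python 2.5), so no separate clause is needed for Ctrl-C. The names run_plugin and broken_plugin are hypothetical.

```python
import traceback


def run_plugin(handler, *args, **kwargs):
    """Call handler; on an unexpected error print the traceback and carry on.

    Returns (True, result) on success and (False, None) on failure.
    """
    try:
        return True, handler(*args, **kwargs)
    except Exception:
        # KeyboardInterrupt and SystemExit are not Exception subclasses,
        # so Ctrl-C and sys.exit() still stop the process.
        traceback.print_exc()
        return False, None


def broken_plugin():
    raise ValueError("file format changed")


print(run_plugin(broken_plugin))   # traceback on stderr, then (False, None)
print(run_plugin(len, "pony"))     # (True, 4)
```

Returning a success flag lets a long-running dispatcher count or log failures without inspecting the traceback text.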

Caktus GroupCaktus 2012 Summer Internship Program

I'm excited to announce that Caktus is looking for candidates for our summer internship program. It is a 12 week paid position in our Carrboro, NC office. We're driving distance from UNC Chapel Hill, NC State University in Raleigh, and Duke in Durham, so students from all parts of the NC Research Triangle are welcome to apply.

We are looking for a web developer who enjoys working on a team and is excited to work on new and diverse projects. While working with us you will get to work on Django-powered web applications, learn about test driven development and other agile methodologies, perform front-end development in HTML, CSS and JavaScript (jQuery) and become familiar with Linux (Debian-flavor) desktop and server systems. Check out the full job posting here

If you'd like to spend your summer working with some great people on interesting projects please email us at jobs+website@caktusgroup.com with your resume and, if applicable, links to samples of code you have written. Kindly include a brief note describing why you would be a great fit for this opportunity.

Caktus GroupCaktus Hosts 3rd Django Sprint in North Carolina

Here at Caktus, we love Django and use it to make all of our web applications. To help support the Django community, we are hosting a development sprint on November 12th and 13th at our office in Carrboro, NC in preparation for the 1.4 release. The sprint is a great excuse for people to get together and focus their undivided attention on improving Django. You will be helping out by providing bug fixes, improving the documentation and also adding features to existing packages.

If you would like to participate in the sprint, no previous experience is necessary and this would be a great time to start contributing. Mark wrote a great blog piece about how to get started contributing to Django through sprinting that you can read here

We'll be here at 9:00 AM both days and the day usually ends between 4-5:00 PM, depending on the momentum, and afterwards everyone gets together for dinner and drinks. If you would like to attend, please RSVP at the Eventbrite and if you cannot make it to the office, please submit your name to the online roster

We look forward to seeing you!

Og MacielFor Those “Celebrating” Columbus Day

Depiction of Spanish atrocities in the New World

 

Caktus GroupCaktus Group Welcomes Designer and Front End Developer Julia Elman

I'm delighted to announce that Julia Elman has joined our growing team of web developers here at Caktus. Julia started her design career almost 10 years ago in an internal marketing group, and first learned about Django at the SXSW Interactive Festival in 2008. Prior to joining the Caktus team, Julia worked at the Lawrence Journal World (the birthplace of Django) and as a freelance designer.

Caktus is a seasoned team of web developers that creates interactive, content-rich sites and applications with the Django web framework. We put a strong emphasis on best practices, employ an agile method, and also actively participate in the Django development community.

For more information about Caktus and our team, check out our newly updated team page!

Chris CallowayAssignment Considered Harmful

We do a disservice to the understanding of Python when we refer to Python identifiers as "variables" and binding as "assignment." The words "variable" and "assignment" are particularly loaded with meanings from languages other than Python. And those meanings do not reflect the dynamic nature of Python.

For most languages, "variable" refers to a named location in memory and the act of assignment refers to populating that memory location with data. Subsequent assignment of a variable to another variable means to allocate yet another named location in memory and then copy data from one location to another. Data associates with only one variable name in this scheme.

In Python, however, objects are created in memory without the necessity of a name. In order to reference objects (and prevent them being garbage collected), we bind identifiers to objects. A single object may be bound to multiple identifiers (in multiple namespaces, even). And rebinding does not create new copies of objects implicitly.
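A short illustration of binding versus copying, using made-up identifiers (first, second, copied):

```python
first = [1, 2, 3]
second = first            # binds another identifier to the same list object
second.append(4)

print(first)              # [1, 2, 3, 4]
print(first is second)    # True

copied = list(first)      # copying has to be requested explicitly
copied.append(5)

print(first)              # [1, 2, 3, 4]
print(copied is first)    # False
```

The append through second is visible through first because there is only one list; nothing was stored "in" either name.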

The hazard in referring to identifiers as "variables" is the suggestion that objects are being "stored" in identifiers. And the pitfall in referring to binding as assignment is the suggestion that objects are being copied.

The use of the words "variable" and "assignment" when referring to identifiers and binding in Python is widespread, even penetrating the official Python documentation and the writings of the BDFL. It would promote a better understanding for newcomers to Python, however, if that unfortunate habit would be left behind.

TriZPUG EventsTriZPUG September 2011 Meeting: Python Show and Tell

Come share your Python experience through lightning talks. Lightning talks are 5 to 10 minute extemporaneous expositions on a topic of interest to you, something you recently learned, kind of like a show and tell. We'll be meeting at Splatspace, a non-profit member-supported workshop and hacker meeting place. Splatspace is located in the basement of the Snow Building at 331 W. Main St. in Durham. Parking (free exit after 7pm) is in the back of the building in the lot off Ramseur St. on the downtown Durham loop (one way, approach Ramseur from W. Main St. or W. Chapel Hill St.). If you arrive after 7pm, please call 919-704-4225(HACK) to be let in the door.

Calvin Spealman0x0002 Dev Diaries - JS Privates

A little experiment / toy I have, which I might use to keep some decoupling attempts straight.

Embedded code sample: http://jsfiddle.net/Kx8kW/1/embedded/

Using this, I'll be able to allow the various components to maintain properties on other objects they work with, safely, without interfering with one another.


For example, I have a Draggable behavior that can be enabled in a scene and allows objects in the scene to be, obviously, draggable. During this use, the behavior code may often need to set properties on the objects to track dragging state, but needs to do so without interfering with other behaviors setting properties of their own, even other instances of the same behavior. This simple utility enables this.

privates(this, entity).dragPath = [];
privates(this, entity).dragPath.push(curPos);


...


drawPath(privates(this, entity).dragPath);


The concept this is trying to express is that the first object owns a set of private properties attached to the second object, and this works much like private attributes in many languages but works with delegation and composition, rather than inheritance.


My rendering system will be using this to take the general description of a sprite object and attach to it loaded images, animation state, and flags needed to manage redraw orderings, without mucking in the namespace of the object itself.

Caktus GroupBulk inserts in Django

I recently found a way to speed up a large data import far more than I expected.

The task was to read data from a text file and create data records in Django, and the naive implementation was managing to import about 55 records per second, which was going to take far too long given the amount of data that needed to be imported.

My co-worker Karen Tracey suggested changing to bulk inserts. Instead of creating and saving one Django record at a time, we'd create a whole batch of Django objects, then save them all in one SQL operation. I figured reducing the number of database round-trips would speed things up somewhat, but was not prepared for the actual numbers - I'm consistently getting around two orders of magnitude improvement compared to single record inserts.

As I scaled up, I made one more change - instead of doing the insert in one batch, I limited each batch to a few hundred records. I didn't want to store an unlimited number of Django objects in memory at once, and some benchmarking showed that the benefit of batching the inserts leveled off at a few hundred records.

Caveats

There are a few differences from normal object creation. First, save() is not called on the instances, nor are post_save signals sent, and the model instances' primary keys are not set. If you're doing anything more complicated than dumping a bunch of data into the database, you'll probably need to stick with creating objects individually.

Also, the code we're using to do the bulk insert does not handle ForeignKeys properly. The workaround when creating the Django objects is to set the value of any ForeignKey field to the primary key of the object referred to, if any.

Example

Here's what code for a bulk insert might look like.

from bulkops import insert_many
from our_models import Book

objects = []
for data in data_source:
    # Assume data['foreign_key'] is a reference to another model
    # Change that to its primary key
    data['foreign_key'] = data['foreign_key'].pk
    objects.append(Book(**data))
    # Keep our batch size from getting too big
    if len(objects) > 200:
        insert_many(objects)
        objects = []
insert_many(objects)
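The batching logic can also be pulled out into a small helper so the loop body stays focused on building objects. This is a sketch under assumptions: batches is a hypothetical helper, and a list of batch sizes stands in for the real insert_many call.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for insert_many: record the size of each batch it would receive.
saved_sizes = []
for batch in batches(range(450), 200):
    saved_sizes.append(len(batch))

print(saved_sizes)  # [200, 200, 50]
```

Because the helper never yields an empty list, the final insert is only issued when there is something left to save.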

Django 1.4

The current development branch of Django has added a bulk insert feature, which seems likely to be included in Django 1.4. It's very similar to the code we're using here - just change "insert_many(objects)" to "Book.objects.bulk_create(objects)". That's subject to change before Django 1.4 is released, of course.

Credit

Credit goes to Karen for suggesting the approach to me, and Ole Laursen's blog post for the original idea and the implementation that we're using.

Links

Ole Laursen's blog post: http://ole-laursen.blogspot.com/2010/11/bulk-inserting-django-objects.html

Implementation: http://people.iola.dk/olau/python/bulkops.py

Original commit to Django development: https://code.djangoproject.com/changeset/16739

Calvin Spealman

New post on my game development progress.
I've been tinkering on a small JavaScript and Canvas engine for the last few weeks, and I'm mostly happy with the results, if not as happy with the progress. Still, it has been steady, so I'll focus on being happy for that part.
I'm starting to split the codebase into two parts: a JavaScript utility library, very similar to backbone.js, and the game engine on top of it.
Read the full post

Og MacielPodcast: Pete Savage

Pete Savage: Git In The Trenches

For those who follow my many different projects and enterprises, you probably already know that I have been hosting a podcast called Castálio Podcast, a bi-weekly show where I interview people from the Brazilian FOSS world and talk about their likes, dislikes and what events and factors shaped their lives!

When I asked my listeners if they would be interested in an episode in English with someone new and exciting, the answer was an overwhelming ‘Yes!’

So for my very first episode in English I chose to interview a good friend of mine of several years: Pete Savage! During the next 58 minutes we talked about how we first met through a PyGtk video he posted a while back during a Linux User Group meeting, how he first got involved with Edubuntu and then moved on to several other projects such as ProgBox and GeekDeck, about the books that he’s written including the reason for writing “Emblem Divide”, how much the Japanese culture plays a role in his daily life, and his Top 5 movies, books and movies! While the future of episodes in English is yet to be determined, this latest episode can be downloaded here and you can also subscribe to it via the following channels:

Og MacielDjango DevKit Appliance 1.3.1

Django

I rebuilt the appliance to use the latest Django 1.3.1 release to deliver the security fixes found in the previous version. There are also several other updated packages included.

If you want to play with this appliance, feel free to download it in the following formats:

Speaking of Raw Filesystem images, here’s how I currently use them with QEMU. In my .bashrc file I have an alias that will boot an image and redirect its internal ports 80 and 22 (apache and ssh) to my system’s ports 8080 and 2222 respectively. I also forward port 3389 for Windows systems.

sudo qemu-kvm -m 2048 -hda "$1" -boot c -soundhw ac97 -redir tcp:8080::80 -redir tcp:2222::22 -redir tcp:9999::3389

So when I call my alias and pass a raw filesystem image as an argument, I can then use localhost as the destination to my http and ssh connections.

Django Dev Kit on QEMU

I also have a special configuration in my .ssh/config file to make it easier for me to ssh to these virtual systems and not have to change my known_hosts file every time I boot a different system and try to ssh to localhost on port 2222:

Host qemu
User root
Port 2222
Hostname localhost
StrictHostKeyChecking no
UserKnownHostsFile /dev/null

Caktus GroupTesting Web Server Configurations with Fabric and ApacheBench

Load testing a site with ApacheBench is fairly straightforward. Typically you'd just SSH to a machine on the same network as the one you want to test, and run a command like this:

ab -n 500 -c 50 http://my.web.server/path/to/page/

The -n argument determines the number of requests to execute, and the -c argument determines the concurrency level, that is, how many requests will be running simultaneously at any given time.

For Python and Django web applications, Fabric is a popular tool for deploying code to and running other commands on remote servers. It's built in Python, and its simple syntax makes it easy to use as well. For more information and a primer on Fabric, check out the post that Colin Copeland wrote back in 2010, titled Basic Django deployment with virtualenv, fabric, pip and rsync.

Running ApacheBench from Fabric is useful because you can easily do other things like customize and update your web server configuration in an automated way. For example, here's a sample template for an Apache server configuration that I upload to our web servers using Fabric:

ServerName %(www_server_name)s

WSGIDaemonProcess my_site-%(environment)s processes=%(process_count)s threads=%(thread_count)s display-name=%%{GROUP}
WSGIProcessGroup my_site-%(environment)s
WSGIScriptAlias / %(apache_root)s/%(environment)s.wsgi

ErrorLog %(log_root)s/wsgi.error.log
LogLevel info
CustomLog %(log_root)s/wsgi.access.log combined

You'll notice the %s-style Python string formatting syntax in the Apache config. These are populated by Fabric's files.upload_template method when the file is copied to the remote server, and are based on variables you pass in to the context. Here's a sample Fabric method to upload your Apache configuration to the remote server:

# Imports used by the fabfile tasks in this post (Fabric 1.x API)
import datetime
import os

from fabric.api import env, run, sudo
from fabric.contrib import files

def _join(*items):
    """
    We're deploying to Linux, so hard code that type of path join here. Using
    os.path.join would not work when deploying from Windows.
    """
    return '/'.join(items)

def apache_graceful():
    sudo('/etc/init.d/apache2 graceful')

def update_apache_conf(process_count=15, thread_count=1):
    env.process_count = process_count
    env.thread_count = thread_count
    for ext in ['conf', 'wsgi']:
        source = os.path.join(env.deployment_dir, 'templates',
                              'apache.%s' % ext)
        dest = _join(env.home, 'apache.conf.d',
                     '.'.join([env.environment, ext]))
        # env supplies the template variables (environment, process_count, ...)
        files.upload_template(source, dest, context=env, mode=0755,
                              use_sudo=True)
    apache_graceful()

Specifying process_count and thread_count in the arguments to update_apache_conf() means that I can pass those in from the command line, like so:

fab staging update_apache_conf:10,3

This would install an Apache configuration on the server that starts up 10 mod_wsgi processes with 3 threads each.
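To see roughly what the template substitution produces for that command, here is a sketch using plain Python % formatting on the WSGIDaemonProcess line. The context values are assumptions chosen to match the example; note that arguments passed on the fab command line arrive as strings.

```python
template = (
    "WSGIDaemonProcess my_site-%(environment)s "
    "processes=%(process_count)s threads=%(thread_count)s "
    "display-name=%%{GROUP}"
)

# Values as they would arrive from "fab staging update_apache_conf:10,3".
context = {"environment": "staging", "process_count": "10", "thread_count": "3"}

rendered = template % context
print(rendered)
# WSGIDaemonProcess my_site-staging processes=10 threads=3 display-name=%{GROUP}
```

The doubled %% in the template is what lets a literal % survive formatting, so Apache still sees display-name=%{GROUP}.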

Running ApacheBench through Fabric is also easy to do, but here's a slightly more complex example I put together that saves the results in time-stamped folders, whose names also include the number of requests, concurrency level, process count, and thread count of the test:

def benchmark():
    config = {
        'number': 500,
        'concurrency': 50,
        'url': 'http://my.web.server/path/to/page/',
    }
    # prime the server with a few requests before logging any results
    run('ab -n 10 -c 1 {url}'.format(**config))
    context = dict(env)
    context.update(config)
    context['now'] = datetime.datetime.now().strftime('%Y-%m-%d_%H:%M:%S')
    dir_name = '{now}_n={number},c={concurrency}'
    if 'process_count' in context and 'thread_count' in context:
        dir_name += '_p={process_count},t={thread_count}'
    dir_name = dir_name.format(**context)
    context['test_dir'] = os.path.join('test_runs', dir_name)
    run('mkdir -p {0}'.format(context['test_dir']))
    for x in range(4):
        context['test_file'] = os.path.join(context['test_dir'],
                                            'ab{0}.txt'.format(x))
        run('ab -n {number} -c {concurrency} {url} > '
            '{test_file}'.format(**context))

You can run these commands together to update the Apache configuration and run a benchmark with a single line from the shell, like so:

fab staging update_apache_conf:10,5 benchmark

This would update the Apache configuration on the remote server, run a few requests to prime the server, and then run the specified ApacheBench test 4 times and save the results in text files in a timestamped directory.

To test lots of different server configurations at once with minimal user interaction, you can further script this by wrapping the above command in a Bash for loop, like so:

for process_count in {1..76..5}; do fab staging update_apache_conf:$process_count,1 benchmark; done

This command iterates from 1 through 76, in steps of 5 (1, 6, 11, 16 ... 76), sets the Apache configuration to use that number of processes, and runs a separate benchmark for each configuration.
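For anyone whose shell lacks stepped brace expansion, the same sequence of commands can be generated in Python. This is only a sketch that builds the command strings; it does not run them.

```python
# Process counts matching the Bash brace expansion {1..76..5}.
process_counts = list(range(1, 77, 5))

commands = [
    "fab staging update_apache_conf:%d,1 benchmark" % count
    for count in process_counts
]

print(process_counts[:4])   # [1, 6, 11, 16]
print(process_counts[-1])   # 76
print(len(commands))        # 16
```

Each string could be handed to subprocess or printed into a shell script, depending on how the benchmark run is driven.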

Anyway, that's just a little insight into how one might deploy and test a Python or Django application using Fabric and ApacheBench. Hope you find it helpful!

David RayObligatory First Post

When I decided to create a more formal web presence, I realized that I really needed to eat my own dog food, so to speak. Since I primarily use Python in my development work, I thought it only fair to explore the world of python blogging.

The trouble is, I don't have (nor particularly want, at this point in time) a virtual server that I have full control over. Not wanting to deal with the help desk to determine how easy it would be to run more than simple scripts on my LAMP based web hosting provider, I started looking into static blog generators.

This is a Pelican based blog. It was brought to my attention in my Twitter feed. Since I am heavily dependent on ReStructuredText for documentation purposes, and am fairly familiar with the templating engine (Jinja2), I decided to embrace the Pelican, and give it a go.

Og MacielSixth Annual Packt Open Source Awards

Packt Publishing

The 2011 Open Source Awards were launched in the first week of August by Packt, inviting people to submit nominations for their favorite Open Source projects. Now in their sixth year, the Awards continue in their aim of encouraging, supporting, recognizing and rewarding all Open Source projects.

The 2010 Open Source Award Winners included the Open Source Content Management System (CMS) Award winner CMS Made Simple, Open Source JavaScript Libraries Award winner jQuery and Pimcore the winner of the Most Promising Open Source Project Award.

The 2011 Awards will feature a prize fund of $24,000 with several new categories introduced and the vote of the public becoming more influential. This year all CMS projects will compete in an even tighter contest in the Open Source CMS Award category with the now defunct Hall of Fame CMS finalists re-entered into the CMS category. Projects such as Drupal and Joomla! will face off with CMS Made Simple and MODx for the first time since 2008.

While the Most Promising Open Source Project and the Open Source JavaScript Libraries categories will be back for a second year, Packt is introducing new categories for Open Source Business Applications, Open Source Multimedia Software and Open Source Mobile Toolkit and Libraries. These new categories will ensure that the Open Source Awards remain committed to providing the platform to recognise excellence within the community while supporting Open Source projects both new and old.

“We’ve managed to continue to provide new levels of accessibility for Open Source projects, while encouraging a more competitive nature in the contest by increasing the public votes influence. Additionally, we thought it would be a great idea to reward more projects thus we’ve introduced sub-category awards across a number of the categories during the voting stage. We expect the Awards this year to be bigger and better.” said Julian Copes, organizer of this year’s Awards.

Packt has opened up nominations for people to submit their favorite Open Source projects for each category at www.PacktPub.com/open-source-awards-home . The top five in each category will go through to the final, which begins mid-September. For more information on the categories, read Packt’s recent announcement: www.packtpub.com/blog/2011-open-source-awards-announcement

Having bought books from them before, I’m very happy to support their initiative and invite readers not only to participate in this event but to check out their books and eBooks as well!

Caktus GroupGetting Started using Python in Eclipse

Eclipse with the PyDev module has a lot to offer the Python programmer these days. If you haven't looked at PyDev before, or not in a while, it's worth checking out.

Here are some of my favorite features:

  • One-keystroke navigation to the definitions of variables, methods, classes
  • Code completion, including automatically adding import statements
  • Clean up imports
  • Refactoring, including renaming across projects
  • Clean up whitespace

There are many more. I recommend taking a look at the PyDev web site and blog to see what might appeal to you.

Getting Eclipse and PyDev

If you're already using Eclipse, you can add PyDev to it. If not, you also have the option to get a version of Eclipse with PyDev already included. You install PyDev into your existing Eclipse the same way you install any other Eclipse add-on: first tell Eclipse where to find the add-on, then install it.

  • In Eclipse 3.6 and 3.7, select Help/Install New Software...
  • On the panel that pops up, click "Add..." at the top right.
  • Enter any name (e.g. "PyDev")
  • Enter http://pydev.org/updates as the Location, then click OK.
  • In the list of available software, select PyDev. 
  • Click Next, Next, accept the license, Finish.
  • If Eclipse asks whether to trust the PyDev certificate, agree.
  • When the install is complete, allow Eclipse to restart.

To get Eclipse with PyDev already installed, go to http://www.aptana.com/products/studio3/download and download Aptana Studio for your platform. Aptana Studio 3.0.4 is Eclipse 3.6 plus PyDev plus other add-ons.

Preferences

There are some preferences in Eclipse you probably want to change if you'll be working with Python.  Open the preferences by selecting Window/Preferences, then use search to find and set these:

  • Insert spaces for tabs: checked, but note that the PyDev editor ignores this and you need to make a similar setting in the PyDev settings for editing Python files.
  • Show whitespace characters:
    • In Eclipse 3.6, you probably want this off except when you're looking for trailing whitespace.
    • In Eclipse 3.7, you can check the box and then click on "whitespace characters" and set just the trailing whitespace visible, which is unobtrusive enough to leave enabled all the time.
  • Replace tabs with spaces when typing: checked.  This is the one that PyDev obeys.
  • Right trim lines: checked, otherwise you end up with a lot of lines with just indentation on them.
  • Add newline at end of file: checked.
  • Auto-Format editor contents before saving: If you check this, every time you save a file PyDev will fix it to comply with the other settings on this preferences page. That's great if you're working on your own project, but not so good if you're doing maintenance on somebody else's project and don't want to make random changes to white-space all over the place.

Explore the other PyDev settings. The "Code Analysis" section is particularly interesting, as it lets you control the kinds of things that PyDev marks as errors or warnings.

Finally, at least one Python interpreter needs to be configured.  Still in Preferences, go to PyDev/Interpreter - Python.  For now, just click "Auto Config" and click OK on the dialog that pops up.  Then click OK to close Preferences.  PyDev will take a while to analyze the python installation and libraries.

Perspective

Select Window/Open Perspective/Other and choose PyDev.

Starting to use Eclipse and PyDev with a project

I typically use Eclipse with Django projects, though I haven't tried PyDev's Django-specific features yet.

When I want to work with a project in Eclipse, first I check it out locally. Then here are the steps I follow:

  • File/New/Project (not PyDev project, I don't like the PyDev new project wizard)
  • Choose General/Project, click Next
  • Enter a project name
  • Uncheck "use default location" and set the location to the top directory of my project
  • Click Finish
  • Right-click on the project and select PyDev/Set as Pydev Project
  • Right-click on the project and select Properties
  • go to PyDev - PYTHONPATH
  • In the Source Folders tab, use "Add source folder" to add folders that need to be on your python path for your project to work.  Often this is either the top-level project folder or a folder immediately inside it.

Using PyDev with virtualenv

If you use virtualenv (and if not, why not?), there are a couple additional steps to take.

First, add the interpreter from your virtual environment as another Python interpreter:

  • Open Preferences
  • Go to PyDev/Interpreter - Python
  • Click "New..."
  • For the Executable, navigate to your virtual environment's bin directory and select the Python interpreter there.
  • Choose another name for your interpreter if you want, probably something shorter than the default.  I like to use the name of the virtual environment, with "-env" appended.
  • Click OK
  • Now here's the tricky part - a dialog will pop up asking which library folders to add.  Keep the defaults but you also need to add your system python library directories - e.g. /usr/lib/python2.6, /usr/lib64/python2.6, and /usr/lib/python2.6/plat-linux.  Otherwise PyDev won't be able to find all the libraries your python interpreter will be using.
  • Click OK

Then, set the new interpreter as the interpreter for your project:

  • Right-click the project and select Properties
  • Go to Pydev - Interpreter/Grammar
  • Under Interpreter, select your new interpreter
  • Click OK

Now PyDev should be able to find any libraries you have installed in the virtual environment when needed. 

If you install additional libraries, you might need to go back to the interpreter definitions, click "Apply", and tell PyDev which interpreters it should scan again. Until you do that, PyDev might not notice your new libraries.

For more information, see http://pydev.blogspot.com/2010/04/pydev-and-virtualenv.html

Caktus GroupCaktus Consulting Group Sponsors DjangoCon 2011

DjangoCon 2011 is coming up next week and I'm excited to announce that Caktus is sponsoring the conference again this year! It is being held once again in beautiful Portland, Oregon from September 5th through the 10th. We've grown quite a bit since last year: there will be 9 team members (Colin, Tobias, Karen, Mark, Dan, Scott, George, Caleb and myself) attending the conference this year.

We are all really excited to hear some great talks, meet other Django developers and learn more about our all time favorite framework. You can read about why we like it so much in our blog post Why Caktus Uses Django. 

TriZPUG EventsSeattle PyCamp 2011

University of Washington Marketing hosts the inaugural Seattle PyCamp 2011, sponsored by the Seattle Plone Gathering, at The Paul G. Allen Center for Computer Science & Engineering on the campus of the University of Washington. For beginners, this ultra-low-cost Python Boot Camp makes you productive so you can get your work done quickly. PyCamp emphasizes the features which make Python a simpler and more efficient language. Following along with example Python PushUps™ speeds your learning process in a modern high-tech classroom. Become a self-sufficient Python developer in just five days at PyCamp!

J. Cliff DyerDon’t waste your iterators!

Hey all. I kind of put everything in this blog. I hope much of it will be useful to somebody, but most people will probably only care about some of what I write here. Today, I’m writing about python programming. Some days, I use this space as my workout log. If you are using a feed reader, and you only want to see certain kinds of content, you can actually subscribe to individual categories within my blog, by clicking on the category name on the right –> and then using that URL in your feed reader.

If you’re only here for the fitness stuff, feel free to move on.

There’s a pattern I see fairly often in code where someone uses a function that returns a sequence of some sort, filters it, and then wants to use the first result that matches the filter. It looks something like this:

    return [x for x in foo if len(x) > 4][0]

This works, and looks “pythonic” (it uses list comprehensions after all!), but it’s actually a fairly slow and wasteful way to get the results we want.

The actual example I saw which prompted me to post this was from a fun post by Jeff Elmore which explained creating a wu-name generator in six lines of python.

import urllib
from lxml.html import fromstring
def get_wu_name(first_name, last_name):
    """
    >>> get_wu_name("Jeff", "Elmore")
    'Ultra-Chronic Monstah'
    """

    w = urllib.urlopen("http://www.recordstore.com/wuname/wuname.pl",
                       urllib.urlencode({'fname':first_name, 'sname': last_name}))
    doc = fromstring(w.read())
    return [d for d in doc.iter('span') if d.get('class') == 'newname'][0].text

What this will do is find every span in the document, and check to see if its class is ‘newname’. In order to do this, it has to scan the entire document, which may contain a significant amount of unwanted material.

We don’t need a comprehensive list of matching spans. We just need one, and then we can take it and move on. With a list comprehension, we can’t even ask for the first one until the whole document has been processed.

We’re actually better off going through this the old-fashioned way, by using an if nested in a for-loop, and returning the result.

   for d in doc.iter('span'):
       if d.get('class') == 'newname':
           return d.text

But we like our brevity, and python provides us with generator expressions, which look like list comprehensions, but don’t do any actual work when they get created. With them, we can ask for the first object before we even begin scanning the document, so python knows to stop processing as soon as it finds the right one. A generator also doesn’t hold its results in memory; it passes them back as they are retrieved, one at a time. We both save the memory it would take to build up a list of spans *and* get to stop searching the moment we find a span matching our conditional.

The bad news is that we can’t just write:

   return (d for d in doc.iter('span') if d.get('class') == 'newname')[0].text

If we do, we get an exception:

TypeError: 'generator' object is not subscriptable

We’re trying to index into a generator, but the iterator protocol doesn’t support indexing to a particular member. All you can do is start at the beginning and work through it one at a time.

The good news is that since we want the first one anyway, all we have to do is start iterating through the generator, and stop after grabbing the first item.

   generator = (d for d in doc.iter('span') if d.get('class') == 'newname')
   for each in generator:
       # The function returns on the first pass through
       # the loop, forestalling any further processing
       return each.text

But now we’ve lost the terseness of our list comprehension again. All we’ve done is move the if clause out of the for loop and into the generator. Not much of an improvement.

If you understand how generators do their job, though, you can actually maintain the terseness of the original list comprehension version, while enjoying the improved performance of the generator version. Each time a for loop advances over a generator, it retrieves the next value by calling the generator’s .next() method. So rather than relying on a for loop to go through our iterator for us, we can step through it manually using this method.

   return (d for d in doc.iter('span') if d.get('class') == 'newname').next().text

In python 3 the method is called .__next__() instead of .next(), so this approach isn’t quite compatible across python versions. For python 3, we could use:

   return (d for d in doc.iter('span') if d.get('class') == 'newname').__next__().text

If cross-python compatibility is important to you, or if you would rather not muck around with dunder methods, there is a builtin next() function, available since python 2.6, which takes an iterator and calls the appropriate .next() or .__next__() method on it, returning the result. So now we can write:

   return next(d for d in doc.iter('span') if d.get('class') == 'newname').text

We’ve gotten the results we wanted quickly and efficiently, with no appreciable loss of code clarity.
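
To see the difference for yourself, here is a small sketch that counts how many items each version actually looks at. The word list and the make_checker helper are made up for illustration; they stand in for the spans and the class check in the wu-name code.

```python
def make_checker(log):
    """Return a predicate that records every word it is asked about."""
    def is_long(word):
        log.append(word)
        return len(word) > 4
    return is_long

# Hypothetical stand-in data for the spans in the document.
words = ["wu", "tang", "monstah", "chronic", "ultra"]

# List comprehension: every word is checked before [0] is applied.
seen_by_list = []
check = make_checker(seen_by_list)
first_from_list = [w for w in words if check(w)][0]

# Generator expression: checking stops at the first match.
seen_by_gen = []
check = make_checker(seen_by_gen)
first_from_gen = next(w for w in words if check(w))

print(first_from_list, len(seen_by_list))  # monstah 5
print(first_from_gen, len(seen_by_gen))    # monstah 3
```

Both versions hand back the same word, but the generator version only had to look at three of the five.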

Now, in this particular case, it’s not a huge deal one way or another. Our bottleneck is going to be pulling the document down from the internet, and the page is fairly short, so we’re not going to be wasting too much time scanning over it. But there are times when this trick can save you quite a bit of time. Imagine scanning through a logfile that hasn’t been placed under log rotation, and has hundreds of megabytes of data in it. Imagine if the processing we were doing on each item in the list comprehension took ten minutes. Or imagine if we were calculating wu names for a million users, where every hundredth of a second spent getting a wu-name translated to nearly three more hours of running time. In any of these cases, knowing when to use iterators, and how to use them effectively, can make a big difference.
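
Here is a rough sketch of the logfile case. The first_matching_line function and the throwaway log file are my own inventions for the example, not part of any library. One thing to watch for: plain next() raises StopIteration when nothing matches (the list version would have raised IndexError), so this sketch uses the two-argument form of next(), which returns a default instead.

```python
import os
import tempfile

def first_matching_line(path, needle):
    """Return the first line in the file containing needle, or None."""
    with open(path) as logfile:
        # File objects are iterators too, so lines are read only as needed.
        matches = (line.rstrip("\n") for line in logfile if needle in line)
        return next(matches, None)

# Build a small throwaway log file to try it on.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("INFO starting up\n")
    tmp.write("ERROR disk full\n")
    tmp.write("ERROR still full\n")
    demo_path = tmp.name

first_error = first_matching_line(demo_path, "ERROR")
no_match = first_matching_line(demo_path, "CRITICAL")
os.remove(demo_path)

print(first_error)  # ERROR disk full
print(no_match)     # None
```

On a file with hundreds of megabytes in it, the search stops reading as soon as the first matching line turns up.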


TriZPUG EventsTriZPUG August 2011 Meeting: FabulAWS and Limone

Tobias McNulty will present FabulAWS, a tool for automated server provisioning using a declarative syntax in Python. Chris Rossi will present Limone, a library for generating content types from a Colander schema. There will most likely be an after meeting at a nearby location with food and beverage.

Caktus GroupLightning Talk Lunch: Service Page API

Leading the second talk of our Caktus Lightning Talk Lunch series, Calvin Spealman presented on the Service Page API:

The Service Page API is a prototype and proof of concept to deliver a wide range of browser plugins across multiple browsers and to extend the APIs available to websites a user visits by allowing plugins to extend the Javascript API with new libraries, integrate with external services, and more. It puts the power in the user’s hands to control which services can interact. This talk covers the problems with the current state of browser extensions and the difficulty in building them across multiple browsers consistently, and how the Service Page API is a solution to this, with code examples.

The slides from this talk are available on talks.caktusgroup.com and the code can be found on GitHub. Follow his project on GitHub to stay up-to-date on its status!

Caktus GroupManaging Client Expectations Amid Shifting Deadlines

Estimating development time is notoriously difficult, and when moving deadlines are added to the mix, shift happens.

Estimating development time for clients is difficult enough without having to second guess deadlines. Yet despite the best efforts, if your company has a healthy deal flow, it’s almost inevitable that you’ll eventually have a project deadline shift.

It seems an inexorable law of nature that deadlines always move forward. Projects slated for 10 weeks suddenly become 6 week sprints, and 4 week projects suddenly turn into 14 days of pain. Shifting deadlines cause a lot of stress even when clients and project managers communicate perfectly, but they are an absolute nightmare if either party doesn’t take responsibility early to communicate a new set of expectations.

Since I have the most experience managing projects, I’ll speak from the perspective of a project manager. Here are 3 steps that I have found to significantly reduce stress when clients need to alter the delivery schedule:

1. Do not commit to new milestones without internal communication.

When I was in the 4th grade, my best friend and I spent most of our free time together. Normally I would call him on the phone and while still talking, ask my parents if I could do whatever he and I were planning (normally riding bikes). With my friend on the line I would ask “Mom and Dad, can I go out bike-riding with Casey in an hour?” This approach often made my parents frustrated. I would be made to hang up, talk to them, and call back.

That little story might seem unrelated, but it is very similar to a client calling to say they need to release 4 weeks earlier than planned. Even if nobody could have predicted a change in the schedule, as the manager of that project, you now have a problem. You and your colleagues likely have other projects in the pipe, an abundance of work for that period, a vacation day or two, and a web of unseen working commitments.

The client will be under more pressure still and wants to hear you say “okay, no problem, we can get that done 4 weeks early.” Even if you think your team can do it, never make that assumption when speaking with a client. The best way to handle the situation is to tell the client you have to talk with the team members and get back to them. Then, start talking with your team.

2. Consult with team members.

Internal discussion is especially necessary when there are aspects of a project that the project manager doesn’t understand 100%. A lack of understanding could be due to the limitations of a particular PM or the vastness of the project, but the person doing the coding is almost always in a better place than a manager to evaluate changes in engineering specs and deadlines.

In the case of a 4 week project adjustment, the operative question is how to balance the altered client interests with the original contract. Can features be cut? Can other projects be sidelined for a week? Are people willing to work overtime? Obviously the desirability of answers to these questions will depend on your specific situation, but the important part is to have the discussion. This will get everyone on the same page and present a unified front to the client. It makes for more professional presentation and management.

3. Clearly communicate internal resolutions to the client.

After the internal meeting, contact the client and communicate the talking points in clear, direct statements. It’s easy, especially when under pressure from clients, to waffle, but resist the urge. A statement like “I spoke to the team and we aren’t sure the deployment schedule is realistic given the change in the deadline” is terrible because it isn’t crystal clear. What you mean to say in this case is “we cannot deploy on time.” So why not say it? Obviously don’t be rude, but straight-forward, simple communication will avoid future misunderstanding.

Simply go through what your team can realistically achieve in plain language, making sure to address the most critical deliverables. I have found it is best to lead with the items that won’t get done to client satisfaction, and conclude with items that will. The balancing effect of good and bad news tends to lean more in favor of positive reactions when the good news is the last thing mentioned.

Conclusion

The easiest way to make your team members hate you and your client take their business elsewhere is to overpromise and underdeliver on a tight schedule. When deadlines change mid-project, the project is immediately restricted to a less desirable set of outcomes. Using the communication techniques outlined above, however, project managers can often turn bad situations into opportunities for glowing client stories. Sticking to mutual expectations is what drives client satisfaction, and even when circumstances conspire to restrict expectations, you can still impress.

Caktus GroupJunior Django Developer Wanted

Caktus is currently seeking a junior Django developer for our team. The ideal candidate would have 6 months of experience building dynamic web applications in any language, at least 3 months of experience using Python and Django, and a basic understanding of relational databases such as PostgreSQL and MySQL. The junior developer position will consist of data modeling complex business ideas, creating and integrating Django applications in new projects, maintaining existing Django projects and assisting with Django deployments.

 If this sounds like something you may be interested in or know someone who might be, check out the entire job posting here. Also, if you would like to apply, please submit your resume and code samples to jobs+website@caktusgroup.com. 

Og MacielLost in Translation

Deformed Man Toilet

Strange Juice

Og MacielIcons on menus for Openbox

Openbox 3.5.0 was released yesterday, and with it several bugs got fixed and a few new features were added. Out of these features the one that I liked most was the ability to add icons to menus (and submenus as well)! Yeah, I know some other window managers already do this, but as someone who enjoys running Openbox because of its simplicity and its limitless keyboard binding possibilities, I was sure glad to see some eye candy make its way into it.

So, if you want to try it, make sure that your distribution has the new Openbox compiled with Imlib2. Next, add the following line to the <menu> section of your rc.xml file:

<showIcons>yes</showIcons>

Then, modify your menu.xml by adding an icon attribute to whatever menu item or menu you want to add an icon to.

<menu id="apps-net-menu" icon="/usr/share/icons/Tango/24x24/apps/internet-web-browser.png"/>
<menu id="apps-net-menu" label="Internet">
    <menu id="apps-net-browsers" label="Browsers">
        <item label="Firefox" icon="/usr/share/icons/hicolor/24x24/apps/firefox.png">
        <action name="Execute">
          <command>firefox</command>
          <startupnotify>
            <enabled>yes</enabled>
            <wmclass>Firefox</wmclass>
          </startupnotify>
        </action>
        </item>
        .
        .
        .
</menu>

Make sure to restart openbox and enjoy your fancy new menu!

TriZPUG EventsTriZPUG July 2011 Meeting: Crushinator

Josh Johnson will introduce Crushinator, an up-and-coming, interactive, iterative skeleton-building application and general replacement for PasteScript. As always, spontaneous lightning talks of ten minutes or less on other topics are also welcome. Anything you've learned about Python, no matter how trivial, can be a lightning talk. There's plenty of parking at CCC. The after-meeting will take place at Milltown, right around the corner within easy walking distance.

Og MacielPodcast: Aline Duarte Bessa – Accerciser


Another episode of my Castálio Podcast, this time with Aline Duarte Bessa, another Brazilian who is participating in the GNOME Women Outreach Program (GWOP). Even with a fever, a cold, and technical issues getting a working system to record this show, she was gracious enough to spend some of her free time telling me about her current task of updating the developer’s documentation for Accerciser.

The episode, recorded in Brazilian Portuguese, can be downloaded here!

Kurt GrandisPython Hack Night #3

We had a good turnout for TriZPUG’s third Python Hack Night tonight. All in all, nine local pythonistas showed up at MetaMetrics in Durham and dug right in. There was good conversation and it seems like progress was made on most fronts. We had a wide range of projects including: personal websites, a scrum workflow tool, a computational teaching problem, a game project, a nose plugin, a Django-based charting framework, and more. We even had an impromptu game AI-building competition emerge.

I was pleased with the results and would love to see one at least once a month. Let’s see what August brings.

TriZPUG EventsPython Hack Night Take Three

Hack Nights are a great opportunity to work with other local Python developers. All levels of experience are welcome and encouraged. Whether you want to learn more about Python, help others out, or bang out some cool projects with other folks, come on out. If you want to let people know about a project you’d like some help with, or are looking to join or start some work, feel free to post about it on the mailing list. Drop Kurt an email (address below in full article) if you are interested in attending so he has an idea what to plan for.

Og MacielCastálio Podcast, or Three Times is a Charm

Castálio Podcast

The idea of doing a podcast is something that has tickled my fancy for quite some time. As a matter of fact, I am already the survivor of 2 other mildly successful attempts, which sadly died prematurely due to commitment issues from my partners and my inability to record and edit audio using Linux.

About 6 months ago the need to scratch this itch came back in full force, and thanks to the support of my friend Evandro Pastor, Castálio Podcast was born! Instead of getting together with a couple of friends and discussing technology and current events, I wanted to do something a bit different and avoid the typical routine by turning it into a talk show-like interview program. Every other week I’d invite someone from the Brazilian Free and Open Source world and have a chat about their childhood, upbringing, and the tv shows, movies, books and music that shaped them into who they are now. To quote my good friend Kurt, it would be “the equivalent to opening a friend’s MP3 collection and seeing what that person was like”!

Shortly after we recorded our second or third episode, Evandro, who not only served as the host (while I chatted with our guest) but also recorded and edited the audio, had a severe case of tendinitis and was not able to participate anymore. Once again I was faced with the old dilemma of not having anyone else to run a podcast or to record and edit it… but this time I didn’t want to see another attempt at creating a podcast die a premature death… I had too much invested in it already!

So for the last 6 months I’ve been interviewing, editing, publishing and maintaining the podcast during my free time and having a blast! So far all of my guests are Brazilians and the podcast itself caters for the Brazilian and/or Hispanic audience, but an episode in English is in the works for the very near future. Through the 12 episodes I’ve recorded so far I had a chance to chat with some really interesting, fun and exciting people, such as Igor Soares (Fedora Ambassador), Lucas Rocha (GNOME and new Mozillian), Johan Dahlin (Stoq), Diego Zacarão (Transifex) and many others who totally opened themselves up to me (and our followers). I learned a lot from these chats and was inspired by their stories of success, failure, innovation and adventures!

Did you know that Johan is a Swede but lives in Brazil and speaks fluent Brazilian Portuguese? That Lucas used to be part of a Guns and Roses, Metallica, Iron Maiden and Nirvana cover band when he was only 13 years old? During the many hours spent talking to my guests I learned about their likes and dislikes, what books and music they enjoy, and indirectly became more acquainted with the real person behind the IRC nick/email address!

I’m really glad I decided to continue with this project and didn’t give up early on. You’ll be glad to know that Evandro is recovering from his tendinitis and may be able to drop by one of these days. The next 2 episodes are already recorded and more guests are scheduled to talk about their passions and projects on different topics, including arduino and writing books! If you know someone whose story you’d like to learn a bit more about, drop me a line or a comment and hopefully I’ll be able to schedule something.

Here’s to the next 6 months!

Footnotes