Planet TriZPUG

#!/usr/bin/env python
from __future__ import print_function
import sys
from straight.command import Command, Option, SubCommand

class List(Command):
    def run_default(self, **extra):
        for line in open(self.parent.args['filename']):
            print(line.strip('\n'))

class Add(Command):
    new_todo = Option(dest='new_todo', action='append')

    def run_default(self, new_todo, **extra):
        with open(self.parent.args['filename'], 'a') as f:
            for one_todo in new_todo:
                print(one_todo.strip('\n'), file=f)

class Todo(Command):
    filename = Option(dest='filename', action='store')
    list = SubCommand('list', List)
    add = SubCommand('add', Add)

if __name__ == '__main__':
    Todo().run(sys.argv[1:])
$ ./todo.py todo.txt add "Write an example tool"
$ ./todo.py todo.txt add "Get the documentation cleaned up and on readthedocs"
$ ./todo.py todo.txt add "Blog about the project"
$ ./todo.py todo.txt list
Write an example tool
Get the documentation cleaned up and on readthedocs
Blog about the project
Started the process of getting jiggy with Clojure at work and didn’t like the idea of using Eclipse for my day to day work… so I started looking at how to make vim and clojure get along and came across a great post! Here are the distilled notes plus minor tweaks to get anyone out there trying to do the same thing going:

NOTES:

English: My Red Hat took me and the kids to the NC Museum of Art this last Sunday! It was perfect too as we got to see a lot of the exhibits and still managed to find time to eat at Lilly’s. Did you know that the entrance is free and the museum is sponsored by the State of North Carolina for our viewing and learning pleasure? Take that NYC and your extremely expensive entrance fees! :P
Portuguese (translated to English): My Red Hat took me and my daughters to the NC Museum of Art this past Sunday! It was a more than perfect day, since we managed to see a lot of the exhibits and still found a little time to eat at Lilly’s pizzeria. Did you know that admission is completely free and that the museum is sponsored by the State of North Carolina for our delight? Take that, NYC, and your exorbitant entrance prices! :P
… or, how I brought an old python code back from the dead!

Inspired by Kenneth Reitz’s recent post and spurred by recent events, I decided to turn an old Python script I wrote a while back into something that is (hopefully) easier to find than by sheer luck.
I’m talking about ChoppedPress, my script that lets you split WordPress exported XML files into smaller files that can be easily imported into new WordPress installations. I’m sure that some of you have experienced the frustration of not being able to import this XML file due to upload size constraints on your host providers… One of my close friends, who mostly provides support for WordPress, gave me the idea a while back and that is how the script came about. Little did I know that other people have found it useful too, especially for migrating away from WordPress! As a matter of fact, I too used it when I moved to Tumblr, but that’s another story.
So this afternoon I took some time during my lunch break to create a repository and put together some very basic structure to give ChoppedPress a proper home (yay GitHub Pages!!!). For the first time I also uploaded something I created to PyPI… Sure, this may not be a big deal to some of you out there, but I can hardly contain my excitement. :)
Overall, I’m still enjoying a nice buzz from the experience. Obviously, I look forward to comments, suggestions and/or improvements to the code, but more than anything, I hope this will be useful to you too!
I’m excited to announce that Caktus is a sponsor of the first SwitchPoint 2012 conference that is being held in the Saxapahaw Ballroom in Saxapahaw, NC. It is being organized by IntraHealth, an organization that mobilizes local talent for sustainable and accessible health care around the world. The conference is bringing together a number of people from different industries and disciplines to discuss how technology and ideas can increase global health equity. There are quite a few great speakers from RedHat, USAID, WorldBank and the United Nations. The Caktus team is really excited to attend Switchpoint and sponsor this incredible event.
I'm excited to announce that on June 9th and 10th, Caktus will be hosting our first Django bootcamp. It will be a two day intensive bootcamp session where you'll learn the basics of developing a web application using Django through constructing a crossword drill application, created by the Caktus staff. It will go over the architecture of Django and also different third party applications that will allow you to enhance the finished product. For more information regarding our bootcamp, you can check out the schedule of the day's events.
The class will be taught by two of the founding members of Caktus, Tobias McNulty and Colin Copeland; Karen Tracey, one of the core committers of Django; and Mark Lavin, one of our lead developers.
Register today to get the early bird rate!
Simon Sinek
I consider myself very lucky for having gotten to a point in my professional career where I can choose where I want to work, a place where I can try to make a difference, and more than anything, a place where I believe in what I do! Thanks to my good friend Joe Baltimore for sharing this video today, a great way to kick off the weekend! Simon Sinek: How great leaders inspire action
The NY Times’ limit of 10 articles/month for non-subscribing online readers means I won’t be reading their Books section anymore. Reading it had become a good habit and something I looked forward to on Sunday mornings: as I drink my coffee and enjoy some peace and quiet while the kids are still asleep, I enjoy catching up with the latest books and reviews. This weekend ritual usually ends with a trip down to the public library with the whole family. Both of my kids are already checking out more books and movies than my wife and I together.
But back to this decision by the NY Times: I understand that a for-profit establishment wants to, well, make money, and there’s nothing wrong with that for sure. But for this weekend-only, Books-section-only reader the options don’t make a lot of sense: paying $6 for the Sunday edition so I can read a small subset of it is a bit expensive, and I am not sure how much the online version costs.
I'm excited to announce that Caktus is looking for a developer to join our team on a contract basis!
We're looking for a strong software developer who enjoys working on a team and is excited to learn and experiment with new technologies. We do have a preference for local candidates, but will consider all submissions. Initial work will focus on maintaining small Django-powered websites. This position will involve managing existing Django projects, data modeling complex business ideas and deploying Django sites.
You will be working in Linux (Debian-flavor) production environments with Apache and WSGI. At least 6 months Python/Django experience is required. Relational database experience is a must. HTML/CSS and JavaScript experience are also a must, and jQuery is a plus. If you think you might be a great fit or know someone who might be, check out the full job posting here.
Earlier this morning I received the following email from The Pragmatic Programmers:
Dear Og Maciel,
This is just to let you know that Pragmatic Guide to Git (eBook) has recently been updated. You own an electronic version of this book, and so you’ll be able to download this latest version. We have also sent it to Amazon.com for delivery to your kindle.
Changes in This Release
- Third printing: includes a few minor errata fixes.
You can get the update either by logging in to your Bookshelf Home Page, or (if you’re already logged in) by downloading it from here.
Dave and Andy
Awesome, right? Not only did they inform me of an updated version of a book I bought from them, they also automatically sent it to my Kindle! Moreover, I can download my ebook in PDF, mobi or epub format, all without any senseless “protection mechanism”. It is this type of attention and treatment that has won me over, and whenever I need to buy a technical book, I immediately check their store.
Also worth mentioning is their monthly, free publication PragPub magazine, also available in many different electronic types.
Anyhow, I don’t get any type of financial incentive for writing this up, so don’t feel that I’m trying to push some type of affiliate code in order to make money. I happen to enjoy their service and attitude toward their customers and, if you ever decide to buy anything from them, I hope your experience will be as enjoyable as mine.
I'm excited to announce that Caktus is looking for candidates for our front end developer/designer summer internship program. It is a 12 week paid position in our Carrboro, NC office. We're driving distance from UNC Chapel Hill, NC State University in Raleigh, and Duke in Durham, so students from all parts of the NC Research Triangle are welcome to apply.
We are looking for someone who is passionate about emerging technologies, design and user interface/interaction. While working with us you will get to work on Django-powered web applications and perform front end development in HTML, CSS and Javascript. You'll work closely with the Caktus staff to create mockups and wireframes of design ideas and user stories. Check out the full job posting here.
If you'd like to spend your summer working with some great people on interesting projects please email us at jobs+website@caktusgroup.com with your resume and, if applicable, links to samples of code you have written. Kindly include a brief note describing why you would be a great fit for this opportunity.
Horde Scout (source:wikipedia)
I gave my PyCon talk this weekend–“Militarizing Your Backyard With Python: Computer Vision and the Squirrel Hordes.” I was not prepared for the number of people who caught me after the talk and throughout the conference telling me about their own battles. Thanks again for all the recommendations about how to improve my firepower, tracking, and classification accuracy.
As requested I’ve posted the presentation on SlideShare and here’s the final squirrel encounter video and the actual PyCon Presentation.
One great resource that was really helpful for getting ideas about sentry guns is Project Sentry Gun. There is Wiring and Processing code to get you started as well as a premade Arduino shield if you’re interested. The folks at Servocity were also very helpful in sizing servos for my project.
As time permits, I’ll post some additional articles detailing the various steps of my project that folks seem to be interested in.
Thanks all. Good luck and don’t get captured.
This past Feb. 5th I was greeted early in the morning with the following email:
Congratulations for reaching 90 days of service with Red Hat!
It is hard for me to believe that it has been 3 months already since I started this new chapter in my career! My days have been filled with so many new things that it may explain why it literally feels like it was only yesterday that I left rPath to join the CloudForms QE Team here at Red Hat!
I’m still going through the transition period of coming off a super fast paced startup environment to a (much, much) bigger company trying to solve a similar challenge. There is not a single day that goes by that I don’t meet someone new or learn yet another trick about YUM or RPM. Keeping track of names, faces, where they sit and what they do has been a challenge on its own, but I believe I’m making some progress. Being the global company that we are, it is not always obvious where the person you spent the last few hours working with on IRC is from…
So for the last 3 months I’ve been learning all I can absorb about all the different projects that are being developed here! I feel that I have learned a lot but there is still a lot to learn, which is awesome! When I think about the massive talent pool that we have and the caliber and enthusiasm of my co-workers, plus the magnitude of the challenges ahead and the impact that our projects will have in the enterprise world, I can’t help but feel that I am at the right place and at the right time!
Can I top this off? Yes, I can! For the first time in my life I can proudly say that everything that I work on is not only truly open sourced but also has a thriving community of collaborators outside work! In other words, anyone from outside Red Hat can see the issues I’ve worked on, what’s currently assigned to me and even download and play with the source code of the project!
So these first 90+ days have been a blur of excitement and learning for me and I’m really excited about the upcoming months and all the good stuff that is yet to come down the pipe! It is a great time to be a Red Hatter!
Caktus is sponsoring Pycon 2012 in Santa Clara, CA this coming weekend! Nearly the entire office will be attending this year's event, which means Mark, Caleb, Calvin, David, Karen, Dan, Tobias, Colin, Julia, Nicole and I will be on site contributing and learning with the rest of the Python community. Nicole and I will be in charge of manning the booth, and so if you managed to wrangle tickets to the sold-out event, we invite you to stop by our booth #213 and say hello! Also Karen, Mark and Calvin will be sprinting after the talks, working on Django and Python3 tickets.
Apart from the excitement of the incredible schedule and attendees, we will definitely be entering ourselves in the drawing for the Aldebaran robot, which has entranced Calvin in particular.
Lately, I’ve been working on creating a simplified work flow for my front end work here at Caktus. There are all sorts of new and helpful tools to optimize the creative process, allowing for faster iterations, and greater overall enjoyment. As with any new tool, there are a few options to choose from: LESS and SASS. Having read lots of reviews and reading through the documentation, I’ve decided LESS is more for me.
LESS provides lots of useful tidbits that you’ve always wanted to do with your style sheets but never imagined possible. For example, variables immediately make your stylesheets less painful to muddle through:
@darkred: #CD0000;
h1 {
    color: @darkred;
}
I love that you can use plain text to map to commonly reused values, making it much easier to remember which is which! This is just the tip of the iceberg when it comes to what you can do with LESS.
Since this post is about work flow, I’ll leave it to the authors to explain how to utilize it over here: http://lesscss.org/
One thing that we struggled with is what our work flow would look like when compiling LESS. Being primarily a Django shop, we decided to use django-compressor to create a seamless way to create, edit, version and deploy our LESS files. Here is an example of how we use it in our templates:
{% load compress %}
{% if debug %}
    {# This is the client-side way to compile less and an ok choice for local dev #}
    <link rel="stylesheet/less" type="text/css" media="all" href="{{ STATIC_URL }}less/style.less" />
    <script src="{{ STATIC_URL }}js/less-1.1.3.min.js"></script>
{% else %}
    {# This is the nifty django-compressor way to compile your less files in css #}
    {% compress css %}
    <link rel="stylesheet" type="text/less" media="all" href="{{ STATIC_URL }}less/style.less" />
    {% endcompress %}
{% endif %}
Because the {% if %} statement is only true if DEBUG = True in your settings.py file, which should be your local development default anyway, django-compressor will only get to work once your site is deployed to a live or production environment. (Django's debug context processor sets the debug variable only when DEBUG = True and the request comes from an address listed in INTERNAL_IPS.) You can keep saving and editing your LESS files without having to think about it. This setup also helps you get started on your project without having to delve into installing django-compressor or npm locally, which is the next step!
The LESS documentation recommends you use node.js in order to get everything running. You’ll need to install npm and then use it to install LESS. Then, install django-compressor and place the following in your settings.py file:
COMPRESS_PRECOMPILERS = (
    ('text/less', 'lessc {infile} {outfile}'),
)
INTERNAL_IPS = ('127.0.0.1',)
And there you have it! Your LESS file is all ready to automatically compile on your server.
Read more about django-compressor
This morning I read a post on readwriteweb about gamification of recruiting and companies hiring using one minute bug fixing tests. You can view the post here if you're really interested. This post really bothered me, because it shows that there are a lot of people that don't get it when it comes to attracting top notch talent.
I'm not going to say that gamification of recruiting and the one minute tests are going to lead to bad hires. The set of people who do well in those circumstances is not disjoint from the set of good coders, and it will weed out some people who really don't belong. The problem is that they aren't selecting for the specific skillset that makes a really great developer.
Let me make it easy for you. The type of developer you want views themselves as an artist. They are in it for the long haul, and are constantly looking to improve their craft. These people gravitate to interesting challenges, a chance to learn from other like minded people and creative autonomy. You might get some people like this with contests, but mostly you are going to get bright, inexperienced college age kids who feel like they have something to prove. To attract the artists, you need to show them something interesting. Make an artist curious and he or she will find you.
This dovetails nicely with the best way to weed out artists from wannabes - code. The GitHire people got it right; how you interact on a project is how you will perform on the job (minus in person social issues). After providing something interesting to hack on, get people involved by offering bounties on bugs and features. The money doesn't matter so much, but if you explain that you want to screen for new hires from the community around your project, it will show people that you are serious. If someone looks promising, have them sign an NDA and bring them in as a contractor until you're 100% certain they are a good fit.
I'm sure some people who read this will think to themselves "how do I create a community around my product while protecting my secret sauce?". You people need to get over yourselves. There are lots of big companies with hundreds of talented programmers at the ready who have probably already rejected your exact idea, "secret sauce" included. The reason they aren't doing it is that it is a questionable prospect, and only big balls and sharp execution are going to make it work. Big businesses are risk averse. They will happily copy your business model and attempt to out-market you, though. Having a strong community is a great bulwark against this.
I like Python a lot; it has some nice properties, and there are a lot of well designed libraries for it. That doesn't mean it doesn't frustrate me...
The word "pythonic" is thrown around a lot in the Python community. Like most ideas, it looks great at first glance. Unfortunately, the Python community is often dogmatic about being "pythonic". That is a great way to be part of a culture, but it constrains your ability to address some problems.
This is a presentation I put together for a talk on the subject...
<iframe frameborder="0" height="342" src="https://docs.google.com/present/embed?id=dhrnc3gb_5dv4x5jd3" width="410"></iframe>
I’ve just recently started migrating my blog to Tumblr, and in the process of importing my archives from WordPress I seem to have caused some issues with certain aggregators that are now picking up posts from 2007… Yesterday I also triggered a massive torrent on both Twitter and Facebook…
Please accept my apologies for the inconvenience… more on the Tumblr migration to follow!
class FriendsLookup(forms.Form):
    username = forms.CharField(required=True)

    @CachedFormMethod(expires=60*15)  # expire in 15 minutes
    def get_friends_list(self, include_pending=False):
        username = self.cleaned_data['username']
        friends = Friendship.objects.filter(
            from_user__username=username)
        if include_pending:
            friends = friends.filter(status__in=(PENDING, APPROVED))
        else:
            friends = friends.filter(status=APPROVED)
        return friends
[frank jython]$ ./dist/bin/jython
Jython 2.6a0+ (default:9c9c311c201b, Feb 10 2012, 10:29:32)
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_23
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'bytes' is not defined
[frank jython]$ hg diff
diff --git a/src/org/python/core/__builtin__.java b/src/org/python/core/__builtin__.java
--- a/src/org/python/core/__builtin__.java
+++ b/src/org/python/core/__builtin__.java
@@ -305,6 +305,7 @@
         dict.__setitem__("Ellipsis", Py.Ellipsis);
         dict.__setitem__("True", Py.True);
         dict.__setitem__("False", Py.False);
+        dict.__setitem__("bytes", PyString.TYPE);
         dict.__setitem__("str", PyString.TYPE);
[frank jython]$ ./dist/bin/jython
*sys-package-mgr*: processing modified jar, '/home/frank/hg/jython/jython/dist/jython-dev.jar'
Jython 2.6a0+ (default:9c9c311c201b+, Feb 13 2012, 09:52:10)
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_23
Type "help", "copyright", "credits" or "license" for more information.
>>> x = bytes('foo')
>>> x
'foo'
>>> type(x)
<type 'str'>
This is how they might look...
Please note that I've tried to capture the spirit of various languages, with some twists for entertainment value. Don't take it personally if the car I selected for your language of choice isn't what you'd like; this is purely for entertainment :)
Fortran
Sturdy and dependable, though not particularly maneuverable or sexy.
Lisp
Though not a language in its own right, LuaJIT is an amazing technical accomplishment and that little car is just too cute not to share.
Clojure
from bettercache.objects import CacheModel
from django.contrib import auth

class User(CacheModel):
    username = Key()
    email = Field()
    full_name = Field()

    def from_miss(self, username):
        user = auth.models.User.objects.get(username=username)
        self.email = user.email
        self.full_name = user.get_full_name()
If get() does not find the object in the cache, it will create a new object and call from_miss() with the key parameter username to populate it:
user = User.get(username="bob")
<iframe frameborder="0" height="342" src="https://docs.google.com/present/embed?id=dhrnc3gb_2hpwd6mnn" width="410"></iframe>
After posting about design by contract and symbolic expressions in Python, I was curious about the current state of formal methods. From what I can tell, not much has changed since the 80's, which is a real shame, because I feel formal methods have a lot to offer.
The driving use-case for formal methods in most instances is verification. They provide this in spades, but usually with a significant cost in man hours. Part of the reason for this is that most code is written in a highly imperative style with machine types that provide little for a theorem prover to latch on to. People must go back and make declarative statements about the code and data (which are not always guaranteed to be correct).
Fans of statically typed languages might try to call this a win, but that is actually not the case. Statically typed languages are only slightly better off than dynamically typed languages here, because generally language types only tell you that a certain set of properties is expected, with values of some other uninformative type. As an example, if I wanted to prove that a function has an inverse, I need to prove that each operation has an inverse; there are a great many mathematical functions that do not have a general inverse, but have an inverse function for a subset of their domain. If I know the input is constrained to an invertible domain, I can proceed. Just knowing that the input is a float doesn't help me.
If formal methods were just about proving code so I can be lazy about tests, they wouldn't be exciting to me. The reason I am really interested is that formal methods can be used to generate code. As an example, let's look at Prolog. In Prolog, you make declarative logical statements, then perform queries, which are resolved by chaining first order logic statements until a path from your starting point to your endpoint is found or the search space is exhausted. Prolog frames this in terms of querying relations like (socrates, man), (man, mortal) therefore (socrates, mortal). If we imagine this in terms of defining a socrates_mortal function rather than performing a query for (socrates, mortal), it should be clear what I mean. If every function in your language is treated like a declarative statement of relation between X (the inputs) and Y (the outputs), you could use a mathematical proof assistant (in this case, probably using higher-order logic) to produce code from some X to some other Y. If a solution returned is not satisfactory for whatever reason, you could provide additional constraints. If no solution is found, it is possible in some circumstances that the proof assistant could suggest a function that would provide a solution and ask you to code it. Additionally, if you change a function, the proof assistant could immediately tell you if that breaks anything.
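To make the chaining idea concrete, here is a minimal sketch in Python rather than Prolog. It only handles simple (x, y) pairs and searches breadth-first for a chain linking a start to a goal; the find_chain name and the data layout are assumptions made for illustration, not how a real Prolog engine or proof assistant works.

```python
from collections import deque

def find_chain(relations, start, goal):
    """Breadth-first search for a chain of relations linking start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for (x, y) in relations:
            # Follow every relation that begins where the current path ends.
            if x == path[-1] and y not in seen:
                seen.add(y)
                queue.append(path + [y])
    return None  # search space exhausted without reaching the goal

relations = [("socrates", "man"), ("man", "mortal")]
chain = find_chain(relations, "socrates", "mortal")
print(chain)
```

Asking for a chain in the opposite direction, from "mortal" to "socrates", returns None, which corresponds to the search space being exhausted.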
I feel that the combination of provably correct code (which saves an enormous amount of time in testing) and automatic code generation makes formal methods unstoppable. In order for this sort of technique to work, metadata has to be ubiquitous in the language. As I mentioned previously, metadata does not imply everything is typed, but rather that there are constraints on the behavior of everything. As an example, take str.split(), for which we know the following:
As you can see, there is actually quite a lot of useful information, even from a simple function like split. This is the sort of stuff a theorem prover needs in order to go to town. Then, instead of writing a function that goes through a bunch of steps to complete a process, I just specify the start and end points, add some constraints on the solution and let the proof assistant give me a provably correct solution automatically.
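As a rough illustration of the kind of facts that hold for split, the sketch below checks a handful of properties of text.split(sep) for a non-empty separator. These are examples chosen here for illustration, not necessarily the properties the post had in mind, and check_split_properties is a hypothetical helper.

```python
def check_split_properties(text, sep):
    """Assert a few facts that hold for text.split(sep) when sep is non-empty."""
    parts = text.split(sep)
    # The result is always a list of strings.
    assert isinstance(parts, list)
    assert all(isinstance(p, str) for p in parts)
    # There is exactly one more piece than there are separators.
    assert len(parts) == text.count(sep) + 1
    # No piece contains the separator.
    assert all(sep not in p for p in parts)
    # Joining the pieces with the separator restores the input.
    assert sep.join(parts) == text
    return parts

print(check_split_properties("a,b,,c", ","))
```

Each assertion is a constraint a prover could rely on without knowing anything else about the input string.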
There are complications in this process of course. For one, exceptions throw a minor monkey wrench in the works. The search process is also capable of producing very bad functions, so a modified search strategy would be required. These are relatively minor issues though, and are both dwarfed by larger issues:
Despite the minor issues and major paradigm shift required in what it means to program, I feel that formal methods provide the best method to handle the trend towards massively increasing software complexity.
class User(CacheModel):
    username = Key()
    email = Field()
    full_name = Field()

user = User(
    username = 'bob',
    email = 'bob@hotmail.com',
    full_name = 'Bob T Fredrick',
)
user.save()
...
user = User.get(username='bob')
user.email == 'bob@hotmail.com'
user.full_name == 'Bob T Fredrick'
I recently made the case for symbolic programming in the Python ideas group. Reactions were mixed, but that is to be expected with any significant suggestion. I think fleshing out the idea somewhat, adding some use cases, and perhaps creating an example implementation in PyPy would go a long way towards demonstrating how amazingly useful this could be.
I came to this point after being frustrated with the way data manipulation is typically handled in Python. List comprehensions, map and other methods of performing a function on every element of a Python iterable just feel kind of clumsy when you are focused on data transformation, particularly when you start to get any sort of expression complexity. The standard Python technique to handle this is to assign intermediate values; however, this breaks down when considering a potentially infinite sequence of input. Additionally, if there are side effects when accessing a variable that are partially dealt with in the next step of computation, by performing everything at once you minimize the time that these effects are present in the system. This can be mitigated to some degree by returning generators everywhere, but that clutters your code.
After some comments from Nick Coghlan, I put together a little library called Elementwise to allow chaining, generative expressions that can be lazily applied to every element of an iterable (or even a graph of iterables). This was a fun project to write, and I got to check out a lot of areas of Python I don't usually explore, such as the alternate use of FunctionType and cell objects. I hope some people who read this take the time to check it out :)
I appreciate the code-as-data philosophy of Lisp, and how thoroughly Mathematica approaches symbolic programming. Let me give a little example of how I imagine those features would appear behind Python tinted lenses...
Assume for a moment that we have a special SymbolicType type where any attempt at attribute lookup (including special methods that skip getattribute access like __add__ and __mul__) causes the expression represented by the calling stack frame to be folded into a coroutine object and returned immediately. You realize the data by send()ing to the coroutine (whatever the method is called, evaluate has a nice ring too). These SymbolicTypes would take a default value like a function argument as well, to provide support for function partials.
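A very rough approximation of this behavior can be sketched today with operator overloading. The Symbol class below is hypothetical: it records operations as nested closures instead of folding a stack frame into a coroutine, so it illustrates only the deferred-evaluation part of the idea, not the proposed language feature.

```python
import operator

class Symbol:
    """A placeholder that records operations instead of performing them."""

    def __init__(self, fn=lambda value: value):
        self._fn = fn  # function that replays the recorded expression

    def _binary(self, op, other):
        # Build a new Symbol whose function applies op after this one's.
        def replay(value):
            right = other.evaluate(value) if isinstance(other, Symbol) else other
            return op(self._fn(value), right)
        return Symbol(replay)

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __getattr__(self, name):
        # Defer attribute lookup until a concrete value is supplied.
        return Symbol(lambda value: getattr(self._fn(value), name))

    def __call__(self, *args, **kwargs):
        # Defer calling a deferred attribute, e.g. x.upper().
        return Symbol(lambda value: self._fn(value)(*args, **kwargs))

    def evaluate(self, value):
        """Realize the recorded expression against a concrete value."""
        return self._fn(value)

x = Symbol()
expr = (x + 1) * 2        # nothing is computed yet
shout = x.upper() + "!"   # attribute access and calls are deferred too

print(expr.evaluate(3))
print(shout.evaluate("hi"))
```

The expression objects can be reused against any number of inputs, for example [expr.evaluate(n) for n in range(3)], which is the lambda-free chaining the post is after.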
Subclasses of this SymbolicType can hook into attribute access just like any standard Python class. These might not behave like standard Python methods, because I feel it is more useful to hook your logic in after you have received a value. What is more likely is that overridden behavior in the __new__ method of SymbolicType's metaclass would wrap supplied methods, treating them like callback functions.
With this, you can chain together operations on SymbolicType instances generatively, creating abstract expressions. This lets you define functions in-line in a manner similar to lambda, over multiple lines, in a clean manner, without the annoying lambda statements everywhere. You could even take the next step by writing your functions in an entirely symbolic manner, and use them as templates for non symbolic functions. This would allow you to override specific parts of functions in a very clean way. Additionally, functions that operate on other functions would become much more powerful. As in Mathematica, you could create closed form solutions by operating on the functions rather than the data they generate. Logic programming (which is pretty neat but can be somewhat counter-intuitive in use) would be easily usable in the context of regular code.
To be sure, this is all a big departure from Python as it exists now. The beauty of it is that it is all available as a result of the SymbolicType's behavior on attribute access. With this one feature, the rest of Python will grow and evolve to incorporate symbolic features over time, in an organic way. Nobody will be forced to change the way they code if they don't want to use SymbolicTypes. A lot of the nice things about languages like Haskell/Lisp, Prolog and Mathematica would be available without sacrificing Python's intuitiveness and general "pleasantness of use".
I recently lamented in the Python mailing lists that a great deal of useful information was being thrown away. Initially, my thought was that perhaps returning some kind of parameterized generic collection instead of an ordinary collection would provide a clean way to preserve this information. After some interesting back and forth, I realized that the conversation would be better framed in terms of general metadata. From this perspective, Python 3 already has a tool that is fairly well suited to solve the problem: Annotations. Because it was decided to let a solution evolve organically, a chicken and egg problem was created, where tool makers don't want to support specific annotations because nobody is using them, and nobody uses the annotations because no tools take advantage of them. Additionally, the standard library is a major offender with regard to throwing information away, and a third party library can't fix that in any sort of clean way.
I believe that a fairly well tested paradigm exists to deal with the metadata issue while simultaneously bringing a lot of additional benefits to the table: Design by Contract. There is a PEP for the addition of contracts to Python; however, the reference implementation is very rough and dated. I looked at the discussions related to PEP 316, and came across the following objections:
The "overengineering" view is purely philosophical, and I disagree with it on the grounds that, if taken to its logical conclusion, Python would be relegated to basic scripting and use as an educational language.
I understand the "unit tests are cleaner and more effective" argument, but I feel that it is misguided for the following reasons:
The point about implementing contracts using annotations or third party libraries is valid, however I would argue that having standard library functions implement contracts would greatly enhance the scope and magnitude of static analysis and optimization opportunities. The only issue with this (and it is significant) is that contracts are defined in such a way that subclassing has very specific effects on pre and post condition strength, which will probably not be uniformly respected.
I think abstract base classes provide the best mechanism for specifying contracts. The main issue with this is that constantly creating abstract base classes to define contract elements is likely to become tedious, although this represents an opportunity for third party libraries. Another potential issue with this is that if abstract classes are not used, you are basically just type checking, which might be useful in some contexts but should be discouraged in general.
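As a rough illustration of what a contract could look like without any language changes, here is a minimal decorator-based sketch. The contract() decorator, the __contract__ attribute, and the largest() example are all hypothetical names of my own; this is not the PEP 316 syntax and it ignores the subclassing rules mentioned above.

```python
import functools


def contract(pre=None, post=None):
    """Attach an optional precondition and postcondition to a function."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The precondition sees exactly the arguments the caller passed.
            if pre is not None and not pre(*args, **kwargs):
                raise AssertionError("precondition failed for %s" % func.__name__)
            result = func(*args, **kwargs)
            # The postcondition sees the result first, then the arguments.
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError("postcondition failed for %s" % func.__name__)
            return result
        # Keep the conditions around as metadata so tools can inspect them.
        wrapper.__contract__ = {"pre": pre, "post": post}
        return wrapper
    return decorate


@contract(pre=lambda items: len(items) > 0,
          post=lambda result, items: result in items)
def largest(items):
    return max(items)
```

Calling largest([]) fails at the precondition instead of deep inside max(), and the conditions stay available on largest.__contract__ for any tool that wants to read them.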
We're pretty avid testers here at Caktus and when one of our Django projects required upgrading to Python 2.7, we also needed to upgrade our Jenkins build environment. Luckily, Jenkins supports distributed builds to allow a master install to delegate tasks to slave instances. This way we can continue to run our primary build system on Ubuntu 10.04, which defaults to Python 2.6, and delegate tasks to an Ubuntu 11.04 environment running Python 2.7. The setup is fairly easy, but since I didn't find much out there already, I figured I'd write up a quick post outlining what we did.
To start, we'll need a new machine. I set up an Ubuntu 11.04 instance on Linode. Then SSH in, upgrade the packages, and install a Java Runtime Environment:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install default-jre
That's the only package Jenkins needs by default. Next we'll set up a user for Jenkins to SSH as. To do this, we'll add a new user to the system and copy the master's SSH public key:
$ sudo useradd -m jenkins
$ sudo -u jenkins mkdir /home/jenkins/.ssh
$ sudo -u jenkins vim /home/jenkins/.ssh/authorized_keys2
Now the master Jenkins client can ssh to the slave without a password. Next we need to configure the Jenkins master to connect to the slave. Head over to the Master environment and navigate to "Manage Jenkins" and then "Manage Nodes". Click "New Node" in the sidebar and add a Dumb Slave. On the following page, fill in the following fields:
Hit save and your Jenkins master should open a connection to your slave machine. To use the new slave machine, update an existing Jenkins job and set the "Restrict where this project can be run" Label Expression to "python27". You'll need to install any project dependencies on the slave for it to build properly, but that's basically it!
I really like Werkzeug and Flask a lot. My main frustration in using them is having to write views that dispatch to other functions based on a variety of information in the request; it makes the code harder to read and debug.
My ideal routing engine would be able to route to a view function using absolutely anything in the request, as well as available layer 3/4 protocol information. Some things I would like to route by that aren't usually routable include:
Currently, if you want to deal with this stuff using Werkzeug routing, you either end up with huge, messy view functions or small view functions that redirect to a variety of other functions, which can make following the code execution path slightly more difficult. Additionally, this is an easy place for bugs to occur, and unless you create a mini-router abstraction you are probably violating DRY.
People familiar with URL routing engines will probably be wondering two things right about now:
To answer the first point, I have been thinking about an algorithm that won't just avoid serious slowdown, but will actually be incredibly fast, likely much faster than current approaches.
Resolving a route involves the following steps:
This route data structure is closely related to a radix tree, and borrows much of the tree construction process from decision trees. The benefit of this structure is that it will be possible to identify most routes in O(log n). Only routes that are highly ambiguous due to poor use of matching elements will approach the O(n) time behavior seen in most routing engines.
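To show why a tree-shaped route table avoids scanning every route, here is a small sketch of a segment trie that also dispatches on the HTTP method. It is not the algorithm described above (there is no decision-tree construction and no routing on headers or protocol data); RouteNode, Router, and the "<name>" wildcard syntax are assumptions made purely for illustration.

```python
class RouteNode:
    def __init__(self):
        self.children = {}    # literal path segment -> RouteNode
        self.wildcard = None  # (parameter name, RouteNode) for "<name>" segments
        self.handlers = {}    # HTTP method -> handler


class Router:
    def __init__(self):
        self.root = RouteNode()

    def add(self, path, method, handler):
        node = self.root
        for segment in filter(None, path.split("/")):
            if segment.startswith("<") and segment.endswith(">"):
                # The first wildcard name registered at a node is kept.
                if node.wildcard is None:
                    node.wildcard = (segment[1:-1], RouteNode())
                node = node.wildcard[1]
            else:
                node = node.children.setdefault(segment, RouteNode())
        node.handlers[method] = handler

    def resolve(self, path, method):
        """Walk one node per path segment; literals win over wildcards."""
        node, params = self.root, {}
        for segment in filter(None, path.split("/")):
            if segment in node.children:
                node = node.children[segment]
            elif node.wildcard is not None:
                name, node = node.wildcard
                params[name] = segment
            else:
                return None, {}
        return node.handlers.get(method), params


router = Router()
router.add("/users/<id>", "GET", "show_user")
router.add("/users/<id>", "DELETE", "delete_user")
router.add("/users/new", "GET", "new_user")
```

The cost of resolve() depends on the number of segments in the requested path rather than on how many routes are registered, which is the property the tree structure is meant to buy.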
I've always been irritated by having to maintain Sphinx documentation separate from my Python code. Literate Programming is a nice way to deal with the separation of documentation and code. There is a fairly well developed literate programming tool for Python called PyLit that is capable of performing round trip conversion between reStructuredText and Python, and it has a certain amount of charm. I haven't adopted PyLit because I would prefer to stay a little closer to the standard Sphinx API documentation format; it is easy to find things and people are used to it.
What features would the ideal union of PyLit and Sphinx-autodoc have?
Technical issues:
ant expose
@ExposedType(name = "str", doc = BuiltinDocs.str_doc)
>>> help(str)
@ExposedNew
static PyObject str_new(PyNewWrapper new_, boolean init, PyType subtype, ...
@ExposedMethod(doc = BuiltinDocs.str___len___doc)
final int str___len__() {
@ExposedMethod(type = MethodType.BINARY, doc = BuiltinDocs.str___eq___doc)
final PyObject str___eq__(PyObject other) {
@ExposedMethod(defaults = {"null", "-1"}, doc = BuiltinDocs.str_split_doc)
final PyList str_split(String sep, int maxsplit) {
src/org/python/expose/
src/org/python/expose/generate/
Django class-based views
Django 1.3 added class-based views, but neglected to provide documentation to explain what they were or how to use them. So here's a basic introduction.
Let's start with an example of a very basic class-based view.
urls.py:
...
url(r'^$', MyViewClass.as_view(), name='myview'),
...
views.py:
from django.views.generic.base import TemplateView
class MyViewClass(TemplateView):
    template_name = "index.html"
    def get(self, request, *args, **kwargs):
        context = {}  # compute what you want to pass to the template
        return self.render_to_response(context)
This will render your template index.html with the context
you computed and return it as the content of an HttpResponse.
Now that we've seen the obligatory example, how about some instructions?
To create a class-based view, start by creating a class that inherits
from django.views.generic.View or one of its subclasses.
In your URLconf, specify the view method as the name of the new
class, plus .as_view():
url(r'urlpattern', MyViewClass.as_view(), ...)
In your class, write a get method that takes as arguments self
(as always), request (the HttpRequest), and any other arguments
from the request as specified in your URLconf.
In your get method, use the same logic you'd have used in an old
view, except that you can assume the request method is GET. Return an
HttpResponse as usual.
If you need to handle POST, write a post method, just like your get
method except that you can assume the request method is POST.
Any request method that you don't write a handler method for will automatically get back a "method not allowed" response; you don't have to do anything special.
Example:
from django.views.generic import View
from django.shortcuts import render
class MyViewClass(View):
    def get(self, request, arg1, keyword=value):
        return do_something()
    def post(self, request, arg1, keyword=value):
        return do_something_else()
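If it helps to see why unhandled request methods are rejected automatically, here is a framework-free sketch of the idea: look up a handler named after the HTTP method and refuse when there isn't one. SketchView, MethodNotAllowed, and Greeting are stand-ins I made up for illustration; Django's real View is more involved and returns an HTTP 405 response rather than raising.

```python
class MethodNotAllowed(Exception):
    """Stands in for an HTTP 405 "method not allowed" response."""


class SketchView:
    http_method_names = ["get", "post", "put", "delete", "head", "options"]

    def dispatch(self, method, *args, **kwargs):
        name = method.lower()
        # Only known HTTP methods are looked up, and only if a handler exists.
        handler = getattr(self, name, None) if name in self.http_method_names else None
        if handler is None:
            raise MethodNotAllowed(method)
        return handler(*args, **kwargs)


class Greeting(SketchView):
    # Only GET is handled, so every other method is rejected by dispatch().
    def get(self, name):
        return "hello " + name
```

Greeting().dispatch("GET", "world") reaches the get() handler, while Greeting().dispatch("POST", "world") raises MethodNotAllowed without any extra code in the subclass.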
Django comes with a number of useful subclasses of View that provide
some of the function that often ends up as boilerplate in views, just
by inheriting from them. You saw TemplateView being used already.
You'll probably want to base your views on TemplateView almost
anytime you're generating the content for a response.
Another useful one is RedirectView. This can be used to redirect
all requests. Example:
from django.core.urlresolvers import reverse
from django.views.generic import RedirectView
class MyRedirectView(RedirectView):
    url = reverse(...)
That is a complete view, and will return a redirect to url on any
GET, POST, or HEAD request.
You can optionally set permanent = False to return a temporary
redirect instead of the default permanent redirect, and query_string
= True to include any query string from the incoming request on the
redirect URL:
from django.core.urlresolvers import reverse
from django.views.generic import RedirectView
class MyRedirectView(RedirectView):
    url = reverse(...)
    permanent = False
    query_string = True
Unfortunately, using decorators with class-based views isn't quite as simple as using them with the old method-based views.
Maybe you're used to doing this:
from django.contrib.auth.decorators import login_required
from django.shortcuts import render
@login_required
def myview(request):
    context = ...
    return render(request, 'index.html', context)
With class-based views, you have to decorate the .dispatch() method of
the class view, which means you have to override it just to decorate
it. And you need to decorate the decorator, because the decorators
provided by Django expect to be decorating method-based views, not
class-based ones:
from django.contrib.auth.decorators import login_required
from django.shortcuts import render
from django.utils.decorators import method_decorator
from django.views.generic.base import View
class MyViewClass(View):
    def get(self, request, **kwargs):
        context = ...
        return render(request, 'index.html', context)
    @method_decorator(login_required)
    def dispatch(self, *args, **kwargs):
        return super(MyViewClass, self).dispatch(*args, **kwargs)
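The reason the decorator itself needs decorating is that a function-style decorator expects the request as the first argument, while a method receives self first. The following plain-Python sketch shows one way such an adapter can work. It is not Django's implementation: adapt_for_methods, login_required_sketch, and the dict used as a request are all assumptions for illustration.

```python
import functools


def login_required_sketch(view_func):
    """Function-style decorator: expects the request as the first argument."""
    @functools.wraps(view_func)
    def wrapper(request, *args, **kwargs):
        if not request.get("user"):
            return "redirect to login"
        return view_func(request, *args, **kwargs)
    return wrapper


def adapt_for_methods(decorator):
    """Turn a function decorator into one that can wrap a method."""
    def decorate(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # Bind self first so the decorator sees the request as argument one.
            bound = functools.partial(method, self)
            return decorator(bound)(*args, **kwargs)
        return wrapper
    return decorate


class Page:
    @adapt_for_methods(login_required_sketch)
    def dispatch(self, request):
        return "page for " + request["user"]
```

With this in place, Page().dispatch({"user": "ann"}) reaches the method body, and Page().dispatch({}) is stopped by the login check.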
This is an area of class-based views that could use some improvement.
You could apply the decorator in urls.py without needing so much extra code:
urls.py:
from django.contrib.auth.decorators import login_required
...
url(r'^$', login_required(MyViewClass.as_view()), name='myview'),
...
but that moves the policy from the view code to the URLconf, which is not where people will be expecting to have to look for it, so I wouldn't recommend it.
The method signature for get(), post(), etc. in a view class is:
def get(self, request, *args, **kwargs)
Any unnamed values captured in the URLconf regular expression are passed in args, and any named values are passed in kwargs, just like before.
You can pass extra arguments to your view using the third element
of your URLconf, the same as before, or using a new technique -- passing
them to the .as_view() call in your url settings. E.g.
...
url(r'^$', MyViewClass.as_view(extra_arg=3), name='myview'),
...
One warning - don't accidentally write MyViewClass(extra_arg=3).as_view().
That'll still appear to work, but that extra_arg is just thrown away.
So far, all we've done is the same behavior, written using a different syntax. But class-based views enable a whole new level of function.
Suppose you've got a view that displays some data on a web page, and you write it as a class-based view. Maybe something like this:
from django.views.generic.base import TemplateView
class MyViewClass(TemplateView):
    template_name = 'index.html'
    def get(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return self.render_to_response(context)
Now you're asked to provide an HTTP API that returns the same data in json.
Start by refactoring your existing class slightly, moving your business
logic out of the get() method:
from django.views.generic.base import TemplateView
class MyViewClass(TemplateView):
    template_name = 'index.html'
    def compute_context(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return context
    def get(self, request, **kwargs):
        return self.render_to_response(self.compute_context(request, **kwargs))
Now, write a new class that subclasses your original class, uses the
same method to compute the data, but overrides get() with different
rendering code:
import json
from django.http import HttpResponse
class MyJsonViewClass(MyViewClass):
    def get(self, request, **kwargs):
        data = self.compute_context(request, **kwargs)
        # Very naive way to put your data into json, but a good starting place
        content = json.dumps(data)
        return HttpResponse(content, content_type='application/json')
Add a new URL to urls.py pointing to your new class-based view, and you're done. All the logic you worked out earlier is still in use, and the power of subclassing let you provide the data in a new format almost effortlessly.
The previous example was still something you could have done almost as easily with method-based views, by refactoring your code into separate methods and calling them from all your views.
A more powerful use of the new class-based views is to provide common function for many views. If you have a site with many views, and they all inherit from a common view, then you have the potential to change behavior across the site by changing that one view.
Previously, you would probably have used middleware for this kind of thing. The problem with middleware is that it's completely hidden from the view code. When working on your view, you won't even know middleware is affecting things unless you go look at the settings and track down each piece of middleware configured there.
Furthermore, middleware affects every request, not just the views you really wanted it for.
With a common class-based view, every view affected is declared to inherit from that view, making it obvious that we're inheriting behavior from elsewhere. With a good IDE, you can even jump straight to that superclass to inspect it. Any view that doesn't need the common behavior doesn't have to inherit it.
The only documentation page that really discussed class-based views in Django 1.3 is this one:
https://docs.djangoproject.com/en/1.3/topics/class-based-views/
Some of the rationale for the current design of class-based views, and pros and cons of some alternatives that were considered, are documented here:
https://code.djangoproject.com/wiki/ClassBasedViews
Beyond that, the best advice I can give is to go read the code. The
code for the base View is surprisingly small, and can be found at
django/views/generic/base.py.
The OpenBlock geocoder is powerful and robust. It uses PostGIS for spatial queries, can extract addresses from bodies of text, and can understand block and intersection notation. We've run into a few issues with it, however, including a low geocoding success rate. This is a tough problem to solve and depends on a lot of factors (the extent of street and block data in OpenBlock, format of the street addresses, etc.), so your mileage may vary. Below I construct a simple test that uses Google's Geocoding API as an alternative.
Disclaimer: This is the third post in our OpenRural series reviewing OpenBlock and its geocoder. You may wish to read Part 1: Data Model and Geocoding and Part 2: Text Parsing and Entity Extraction before proceeding.
The Schema and NewsItem models provide OpenBlock with a generic data model to associate news with geographic locations. You can find a fairly extensive introduction in the official documentation, so we won't go into too much detail here.
Since a NewsItem requires a geographic point, let's use the OpenBlock geocoder to find 123 East Franklin Street:
>>> from ebpub.geocoder import SmartGeocoder
>>> geocoder = SmartGeocoder()
>>> location_name = '123 East Franklin Street'
>>> point = geocoder.geocode(location_name)['point']
>>> point.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
We'll use the "Local News" schema in this example as it is pre-loaded in OpenBlock:
>>> from ebpub.db import models as ebpub
>>> schema = ebpub.Schema.objects.get(name='Local News')
Using this schema, we'll add a new NewsItem with the point created above:
>>> import datetime
>>> news = schema.newsitem_set.create(
...     title='Incident downtown',
...     description='Something happened downtown today!',
...     item_date=datetime.date.today(),
...     location=point,
...     location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
That was easy. Now we have a NewsItem that OpenBlock is aware of and can be plotted on a map. However, what do we do if we can't geocode the address?
If we already have a geographic point, then we can circumvent the geocoder entirely:
>>> from django.contrib.gis.geos import Point
>>> manual_point = Point(-79.0553588124999891, 35.9133110937499964)
>>> news = schema.newsitem_set.create(
...     title='Incident downtown',
...     description='Something happened downtown today!',
...     item_date=datetime.date.today(),
...     location=manual_point,
...     location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
This means we can also use an external geocoder. For example, we can use Google's Geocoding API with geopy. First, you'll need a Google Maps API key, which we'll use with geopy:
>>> GOOGLE_MAPS_API_KEY = '' # your Google Maps API key
Then we can use geopy to construct a new geocoder:
>>> from geopy import geocoders
>>> g = geocoders.Google(GOOGLE_MAPS_API_KEY)
And we can geocode our address:
>>> address = '123 East Franklin Street, Chapel Hill, NC'
>>> place, (lat, lng) = g.geocode(address)
>>> point = Point(lng, lat)
>>> point.wkt
'POINT (-79.0549350000000004 35.9136495999999994)'
You can even tap into OpenBlock's internals and build a Geocoder that OpenBlock can use:
from django.conf import settings
from django.contrib.gis.geos import Point
from geopy import geocoders
from geopy.geocoders.google import GQueryError
from ebpub.geocoder import Geocoder, DoesNotExist
class GoogleGeocoder(Geocoder):
    def __init__(self, *args, **kwargs):
        kwargs['use_cache'] = False  # haven't implemented cache yet
        super(GoogleGeocoder, self).__init__(*args, **kwargs)
        self.geocoder = geocoders.Google(settings.GOOGLE_MAPS_API_KEY)
    def _do_geocode(self, location_string):
        try:
            place, (lat, lng) = self.geocoder.geocode(location_string)
        except (GQueryError, ValueError) as e:
            # Report lookup failures the way OpenBlock's own geocoders do
            raise DoesNotExist(unicode(e))
        location = {'point': Point(lng, lat)}
        return location
This is a proof-of-concept geocoder we're using with OpenRural. You can find it on GitHub. Using this geocoder with a sample dataset from the North Carolina Secretary of State Corporation Filings, I was able to increase the geocoding success rate from about 37% to 95%. Again, your mileage will vary, but it can be useful to test out. We can't use Google's API for everything though. Normal users are limited to 2,500 requests per day. Business accounts are allotted 100,000 requests. Additionally, Google requires you to display any points geocoded with their API on a Google Map. So you'll need to evaluate your needs before deciding on using Google's API.
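Given the daily request limit, one way to combine the two geocoders is to try the local one first and only spend an external request when it fails. The sketch below uses plain callables in place of real geocoders; FallbackGeocoder, GeocodeFailed, and the in-memory counter are hypothetical, and a real setup would need to reset the counter each day and persist it across processes.

```python
class GeocodeFailed(Exception):
    """Raised when no geocoder could resolve the address."""


class FallbackGeocoder:
    def __init__(self, primary, fallback, daily_limit=2500):
        self.primary = primary          # callable: address -> point
        self.fallback = fallback        # callable: address -> point
        self.daily_limit = daily_limit  # quota for the fallback service
        self.fallback_calls = 0         # caller is expected to reset this daily

    def geocode(self, address):
        try:
            return self.primary(address)
        except GeocodeFailed:
            pass  # fall through to the external service
        if self.fallback_calls >= self.daily_limit:
            raise GeocodeFailed("fallback quota used up for today")
        self.fallback_calls += 1
        return self.fallback(address)
```

Addresses the local geocoder can handle never count against the quota, so the external service is reserved for the cases that would otherwise fail.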
Amazon's Simple Queue Service (SQS) is a relatively new offering in the family of Amazon Web Services (AWS). It's also an appealing one, because it proposes to quickly and easily replace a common component of the stack in a typical web application, thereby obviating the need to run a separate queue server like RabbitMQ. While RabbitMQ (the typical favorite for Celery users) is not necessarily difficult to install or maintain, removing it from the stack of a web application means one less component that might fail. Offloading that service to AWS, especially for applications with a small to moderate queue volume, might also prove financially advantageous.
While it's quite easy to use Celery with Amazon's Simple Queue Service (SQS), there's currently not a lot of information out there about how to do it. There's this post on the celery-users list that didn't leave me with much hope, and this question on StackOverflow that sounded slightly more promising. I still couldn't find a step-by-step how-to, however, and it ended up being quite easy, so here's my take:
Upgrade to the latest versions of kombu, celery, and django-celery. At the time of this writing, those versions are 1.5.1, 2.4.5, and 2.4.2:
pip install kombu==1.5.1
pip install celery==2.4.5
pip install django-celery==2.4.2
Add the following lines to settings.py (or local_settings.py depending on your setup):
BROKER_TRANSPORT = 'sqs'
BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-east-1',
}
BROKER_USER = AWS_ACCESS_KEY_ID
BROKER_PASSWORD = AWS_SECRET_ACCESS_KEY
In the above, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should point to the appropriate AWS access key and secret for the account you want to use. Pro tip: Use AWS's Identity and Access Management (IAM) to set up an API key and secret that only have access to the services your web application will use (typically one or more of SQS, SES, and SimpleDB).
Finally, if you'll be running multiple servers or environments on the same AWS account (e.g., two different web apps or staging and production environments of the same app), you may want to customize the SQS queue name being used (the default is "celery"). To make this change, add the following lines to your settings.py (or again, local_settings.py):
CELERY_DEFAULT_QUEUE = 'celery-myapp-production'
CELERY_QUEUES = {
    CELERY_DEFAULT_QUEUE: {
        'exchange': CELERY_DEFAULT_QUEUE,
        'binding_key': CELERY_DEFAULT_QUEUE,
    }
}
For the curious, Celery's support for SQS lies in the underlying Kombu library, the latest version of which includes a transport for SQS. While some posts I found (including the StackOverflow post) suggest using the BROKER_URL syntax for pointing to AWS, I found it simpler to use the BROKER_USER and BROKER_PASSWORD variables. I also saw some reports that slashes in your API secret could confuse the underlying URL parser, and since my API secret happened to include a number of slashes, I went straight to using BROKER_USER and BROKER_PASSWORD.
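If you do want to try the BROKER_URL route, the usual way to keep slashes in a secret from confusing a URL parser is to percent-encode the credentials first. The sketch below is written for Python 3 (on Python 2 the same function lives at urllib.quote). The sqs_broker_url helper and the sample credentials are made up, and I have not verified that the Kombu version mentioned above accepts a URL in this form, so treat it as an assumption to test.

```python
from urllib.parse import quote


def sqs_broker_url(access_key, secret_key):
    """Build an sqs:// URL with both credentials percent-encoded."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return "sqs://%s:%s@" % (quote(access_key, safe=""), quote(secret_key, safe=""))


# Placeholder credentials, not real keys.
BROKER_URL = sqs_broker_url("AKIDEXAMPLE", "abc/def+ghi")
```

With the placeholder values above, the slash and plus sign in the secret end up as %2F and %2B, so the only unescaped separators left in the URL are the ones the parser expects.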
Anyways, I hope this helps someone else looking to solve the same problem, and don't hesitate to comment if you run into any issues or have a better way to go about this!
This is the second post in our OpenRural series reviewing OpenBlock and its geocoder. OpenBlock Geocoder, Part 1: Data Model and Geocoding covers the internals of the OpenBlock geocoder and its geocoding capabilities. As this post builds upon topics covered there, you may wish to read Part 1 before proceeding. In this post we step back from the internals of the geocoder and explore how to use it along with other OpenBlock tools to parse unstructured text.
I'd also like to give a shout out here to Paul Winkler who was kind enough to answer questions and point me in the right direction on the topics below. Thanks Paul!
OpenBlock's original design is centered around providing news at a hyper-local level. That is, down to your own city block. This allows interested citizens to see events ranging from police incidents, to restaurant inspections, to local news articles all aggregated on a map of your block. OpenBlock provides scraping tools to assist downloading this data from the web, but the obvious problem here is that most data isn't packaged or tagged with geographic information. Let's look at an example article teaser from The Daily Tar Heel in Chapel Hill, NC:
No. 4 North Carolina led Evansville 63-27 with just more than 14 minutes to go in the first half when senior forward Tyler Zeller scored his 999th career point at the Smith Center on Tuesday night.
The article mentions the game at the Smith Center, which is the location we want to extract and plot on a map. This is where OpenBlock's utilities for ingesting unstructured text help.
Places are simple models containing only a name and geographic point. OpenBlock implements a mechanism to find places defined in the database from a body of text. For example, say we have the following string we'd like to parse:
>>> message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'
OpenBlock can extract "Varsity Theater" if we define it as a Place. You can create and import places in the OpenBlock admin, but to keep things simple, we'll just create one here:
(Embedded code sample: https://gist.github.com/1469282)
Here we created a new Point of Interest place (which is loaded by default on any OpenBlock install) geocoded to 123 East Franklin Street. Now we need a way to parse places from strings. Most of this functionality is found in ebdata, which contains a Natural Language Processing package, nlp. We can use its place_grabber to extract matching places:
(Embedded code sample: https://gist.github.com/1469286)
We can feed this right back into the Place model to retrieve the database objects and their geographic locations:
(Embedded code sample: https://gist.github.com/1469357)
The parser is case sensitive, however, so it'll fail if it's not an exact match:
>>> grabber("VARSITY THEATER")
[]
Obviously this is a brute-force method and requires you to pre-load all places of interest into the database beforehand. It's pretty rudimentary, but does provide this functionality out-of-the-box.
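To give a feel for what a brute-force, case-sensitive grabber does, here is a small stand-alone sketch that searches a string for a fixed list of names and reports where each one occurs. It is not OpenBlock's implementation: make_grabber and the (start, end, name) return format are assumptions for illustration, and the names are passed in directly rather than loaded from the database.

```python
import re


def make_grabber(names):
    """Return a function that finds any of the given names in a string."""
    if not names:
        return lambda text: []
    # Longest names first so "Varsity Theater" wins over a shorter "Varsity".
    ordered = sorted(names, key=len, reverse=True)
    # Compiled once up front, so repeated calls only pay for the search itself.
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    def grabber(text):
        return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
    return grabber


sketch_grabber = make_grabber(["Varsity Theater"])
message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'
```

Because the pattern is built from the exact names, sketch_grabber(message) finds the theater while sketch_grabber("VARSITY THEATER") finds nothing, which mirrors the case sensitivity described above.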
OpenBlock can also extract locations defined in the database. We already have cities loaded, so we'll use them in this example. Just like the place grabber, the location grabber is case sensitive, so we'll define a location synonym with the proper case:
>>> from ebpub.db.models import Location, LocationSynonym
>>> ch = Location.objects.get(name='CHAPEL HILL')
>>> LocationSynonym(pretty_name='Chapel Hill', location=ch).save()
By default, the location grabber ignores types of "city" and "borough". To keep things simple, we'll just create one that includes all location types:
>>> grabber = places.location_grabber(ignore_location_types=[])
Now we can use the grabber to extract locations:
>>> grabber(message)
[(50, 61, 'Chapel Hill')]
The OpenBlock grabbers cache the locations/places on instantiation, so if you plan to parse a lot of text in succession, you won't hit the database after the initial run. Cool!
ebdata.nlp can also parse addresses. For example, let's use a simple string:
>>> from ebdata.nlp.addresses import parse_addresses
>>> parse_addresses('The Varsity Theater is located at 123 N Franklin St')
[('123 N Franklin St', '')]
Under the hood, OpenBlock uses a large regular expression to do this, so it's not actually hitting the database or attempting to do geocoding. You'll notice that it returns a 2-item tuple. The second item is for the city:
>>> parse_addresses('The individual was seen on 123 N Franklin St in Chapel Hill')
[('123 N Franklin St', 'Chapel Hill')]
It can parse block locations too:
>>> parse_addresses('The construction is on the 100 block of Franklin St.')
[('100 block of Franklin St.', '')]
And intersections:
>>> parse_addresses('The incident occurred at the intersection of Franklin and Hillsborough')
[('Franklin and Hillsborough', '')]
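For a sense of how regex-only address extraction works, here is a drastically simplified pattern. It is far smaller than whatever OpenBlock actually uses: ADDRESS_RE, find_addresses, and the short suffix list are my own assumptions, it does not capture the city, and it does not handle intersections.

```python
import re

ADDRESS_RE = re.compile(
    r"\b(\d+(?: block of)?"          # house number, optionally "100 block of"
    r"(?: [NSEW])?"                  # optional one-letter direction
    r"(?: [A-Z][a-z]+)+"             # one or more capitalized street words
    r" (?:St|Ave|Rd|Blvd|Dr)\b\.?)"  # a short, hypothetical list of suffixes
)


def find_addresses(text):
    """Return every substring that looks like a street address."""
    return ADDRESS_RE.findall(text)
```

Since nothing is checked against real street data, a pattern like this will happily match strings that merely look like addresses, which is where the false positives come from.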
It all comes together with the geocoder:
(Embedded code sample: https://gist.github.com/1469324)
As you can see, OpenBlock provides a few useful utilities to parse unstructured text. They're fairly limited and, especially with the address parser, will most likely return a lot of false positives. But I think OpenBlock has provided a great starting point. Stay tuned for more posts on the inner workings of the OpenBlock project!
As Tobias mentioned in Scraping Data and Web Standards, Caktus is collaborating with the UNC School of Journalism to help develop Open Rural (the code is on GitHub). Open Rural hopes to help rural newspapers in North Carolina leverage OpenBlock. This blog post is the first of several covering the internals of OpenBlock and, specifically, the geocoder.
The OpenBlock geocoder can only geocode from the data it has. It doesn't leverage a 3rd-party API or service. It only uses what's loaded in PostgreSQL (with PostGIS and GeoDjango) and, in this example, what comes from the US Census Bureau and local city and county GIS offices.
Further, the imported data is typically filtered by a bounding box setting in METRO_LIST. The setting, extent, is a list of leftmost longitude, lower latitude, rightmost longitude, upper latitude. This defines a bounding box - the range of latitudes and longitudes that are relevant to your area. A small or restrictive box will limit imported ZIP code and block data to areas that fall within the box.
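The bounding box test itself is simple enough to sketch in a few lines. The in_extent helper below is hypothetical and is not how OpenBlock's importers are written (they work with PostGIS geometries); the extent and the sample point are values that appear elsewhere in this series.

```python
def in_extent(lng, lat, extent):
    """True if the point falls inside (left lng, lower lat, right lng, upper lat)."""
    left, lower, right, upper = extent
    return left <= lng <= right and lower <= lat <= upper


# Extent covering all of Chapel Hill, in the order described above.
CHAPEL_HILL_EXTENT = (-79.165922, 35.829095, -78.978468, 36.02426)

# 123 East Franklin Street, as geocoded by OpenBlock.
franklin_st = (-79.0553588124999891, 35.9133110937499964)
```

Anything for which in_extent() returns False would be reported as "out of bounds" and skipped during import.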
Let's look at an example with these shapefiles:
We'll start with a restrictive extent that only consists of downtown Chapel Hill:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.066272, 35.91671, -79.040481, 35.910663),
        # ...
    },
)
This selection loaded 2 ZIP codes:
$ django-admin.py import_nc_zips
Importing zip codes...
# ...
Skipping 27511, out of bounds
Skipping 27513, out of bounds
Created ZIP Code 27514
Created ZIP Code 27516
Skipping 27517, out of bounds
Skipping 27519, out of bounds
# ...
Created 2 zipcodes.
And limited the block data as well:
$ django-admin.py import_county_streets 37135
Importing blocks, this may take several minutes ...
Created 73 blocks
Populating streets and fixing addresses, these can take several minutes...
Populating the streets table
streets: created: 28
block_intersections: created: 160
Done.
Restricting the area will limit the ability of the geocoder. In this case, for example, it can geocode the intersection of Franklin and Henderson, which is right downtown, but not Franklin and Estes (don't worry, we'll get into more geocoding details in the next section). A map helps illustrate this more clearly. Below you can see the bounding box with pins on the two intersections:
If we increase the bounding box, we'll get a lot more data:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # ...
    },
)
With an extent that encompasses all of Chapel Hill, the importer loaded 9 ZIP codes, 4302 blocks, 1699 streets, and 7189 intersections. Here's a map illustrating the larger extent:
It's up to the maintainer of an OpenBlock install to determine which extent to use as it is based on the specifics of the application. A large extent will import more ZIP codes and blocks and, therefore, will slow down geospatial queries and may include unwanted geographic areas.
Now that we have NC Orange County data loaded, let's investigate this data with the OpenBlock models.
The Street model contains a catalog of all loaded streets. It's a simple model with only a few fields:
In NC Orange County, we can see that the street data spans 4 cities:
>>> from ebpub.streets.models import Street
>>> Street.objects.order_by('city').values_list('city', flat=True).distinct()
[u'', u'CARRBORO', u'CHAPEL HILL', u'DURHAM', u'HILLSBOROUGH']
Some streets cross city lines and therefore contain two entries:
>>> Street.objects.filter(street_slug='rosemary-st').values_list('city', flat=True)
[u'CARRBORO', u'CHAPEL HILL']
And, for example, if we're looking for Franklin St. in Chapel Hill, NC, we can filter for it here:
<script src="https://gist.github.com/1467493.js?file=gistfile1.py"></script>
Blocks are fundamental to OpenBlock and are used by the geocoder. OpenBlock defines a block as "a segment of a single street between one side street and another side street." The Block model is slightly more intricate than Street, but each entry basically represents the address range of a street for each block segment.
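The address-range idea can be sketched without Django at all. The snippet below is a simplified stand-in, not the real Block model: `BlockRange` and `find_blocks` are hypothetical names, the sample rows are made up, and only the `street_slug`, `predir`, `from_num`, and `to_num` field names are borrowed from the queries in this post.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a block row; the real model has more fields.
BlockRange = namedtuple("BlockRange", "street_slug predir from_num to_num")

# Made-up sample data for illustration only.
BLOCKS = [
    BlockRange("franklin-st", "E", 100, 199),
    BlockRange("franklin-st", "E", 200, 299),
    BlockRange("franklin-st", "W", 100, 199),
]


def find_blocks(blocks, street_slug, number, predir=None):
    """Return the blocks on a street whose address range contains `number`."""
    return [
        b for b in blocks
        if b.street_slug == street_slug
        and b.from_num <= number <= b.to_num
        and (predir is None or b.predir == predir)
    ]


# With a direction the match is unique; without one, both E and W blocks match.
print(find_blocks(BLOCKS, "franklin-st", 123, predir="E"))
print(find_blocks(BLOCKS, "franklin-st", 123))
```

The second call shows why the directional prefix matters: the same house number can exist on both the east and west segments of a street.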
To start, we can see that Franklin St. is divided into 32 blocks:
>>> from ebpub.streets.models import Block
>>> Block.objects.filter(street_slug='franklin-st').count()
32
It's split into east and west segments:
>>> Block.objects.filter(street_slug='franklin-st').order_by('street_pretty_name').values_list('street_pretty_name', 'predir').distinct()
[(u'Franklin St.', u'W'), (u'Franklin St.', u'E')]
Its address numbers range from 100 to 1899:
>>> from django.db.models import Max, Min
>>> Block.objects.filter(street_slug='franklin-st').aggregate(Min('from_num'), Max('to_num'))
{'from_num__min': 100, 'to_num__max': 1899}
So we can find the block that contains the 123 address:
<script src="https://gist.github.com/1467847.js?file=gistfile1.py"></script>
Also, on a side note, it's possible for some blocks to span cities:
<script src="https://gist.github.com/1467849.js?file=gistfile1.py"></script>
Now that we have a basic understanding of how the data is stored within OpenBlock, let's do some geocoding. Most of these examples will use the SmartGeocoder class. SmartGeocoder delegates to specific geocoders (AddressGeocoder, BlockGeocoder, and IntersectionGeocoder) based on how it interprets the string with regular expressions.
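As a rough illustration of that kind of regex-based delegation, here is a toy dispatcher. The patterns and the `pick_geocoder` function are my own assumptions for the sake of the example; OpenBlock's actual parsing rules are more involved and are not reproduced here. Only the three geocoder class names come from OpenBlock.

```python
import re

# Illustrative patterns only; these are NOT OpenBlock's real regular expressions.
INTERSECTION_RE = re.compile(r"\s+(?:and|&|at)\s+", re.IGNORECASE)
BLOCK_RE = re.compile(r"^\d+\s+block\s+of\s+", re.IGNORECASE)
ADDRESS_RE = re.compile(r"^\d+\s+\S")


def pick_geocoder(location_string):
    """Return the name of the geocoder a SmartGeocoder-style dispatcher might choose."""
    text = location_string.strip()
    if INTERSECTION_RE.search(text):
        return "IntersectionGeocoder"
    if BLOCK_RE.match(text):
        return "BlockGeocoder"
    if ADDRESS_RE.match(text):
        return "AddressGeocoder"
    return None  # nothing recognizable to delegate to


for sample in ("123 East Franklin Street",
               "Franklin and Henderson",
               "100 block of Franklin St"):
    print(sample, "->", pick_geocoder(sample))
```

The order of the checks matters in a sketch like this: the more specific shapes (intersection, block) are tested before the generic "number followed by a street" pattern.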
To start, let's geocode "123 East Franklin Street":
<script src="https://gist.github.com/1467863.js?file=gistfile1.py"></script>
This one was pretty easy for the geocoder to parse and find. You can see that not only has it found the associated block, but it also knows the exact geographic point. However, this will fail if passed a non-existent address number (InvalidBlockButValidStreet):
<script src="https://gist.github.com/1467865.js?file=gistfile1.py"></script>
In this case, the geocoder was able to extract the address, but it failed to find the associated block in the database. Non-existent streets also fail (DoesNotExist):
<script src="https://gist.github.com/1467869.js?file=gistfile1.py"></script>
The geocoder can locate intersections too:
<script src="https://gist.github.com/1467876.js?file=gistfile1.py"></script>
Notice how the intersection field is populated, rather than block. This will raise a DoesNotExist exception when an intersection is not found:
<script src="https://gist.github.com/1467885.js?file=gistfile1.py"></script>
OpenBlock provides a model, StreetMisspelling, to define street aliases. This allows you to map a bad street name to a good street name that exists in the database:
<script src="https://gist.github.com/1467895.js?file=gistfile1.py"></script>
Now geocoding "Glen Haven" will find "Glenhaven".
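Conceptually, an alias table like this is just a lookup applied before the street search. The sketch below uses a plain dictionary instead of StreetMisspelling rows. The `STREET_ALIASES` table, both function names, and the choice to normalize names to upper case are assumptions made for illustration, not OpenBlock behavior.

```python
# Hypothetical in-memory alias table; OpenBlock keeps these in StreetMisspelling rows.
STREET_ALIASES = {
    "GLEN HAVEN": "GLENHAVEN",
}


def normalize(name):
    """Upper-case the name and collapse runs of whitespace (an assumed convention)."""
    return " ".join(name.upper().split())


def correct_street_name(name, aliases=STREET_ALIASES):
    """Map a known bad spelling to its good spelling; pass other names through."""
    key = normalize(name)
    return aliases.get(key, key)


print(correct_street_name("Glen Haven"))
print(correct_street_name("Franklin"))
```

Names without an alias pass through unchanged apart from normalization, so the lookup is safe to apply to every query.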
By default, OpenBlock is configured to work with a single city, which is defined in METRO_LIST:
# Metros. You almost certainly only want one dictionary in this list.
# See the configuration docs for more info.
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # The major city in the region.
        'city_name': 'Chapel Hill',
    },
)
The geocoder will fail if it locates a street that's associated with a city unknown to OpenBlock. For example, 100 Pine Street is in Carrboro and not Chapel Hill:
<script src="https://gist.github.com/1467903.js?file=gistfile1.py"></script>
This street exists in the database because our extent covers most of Orange County. Since we've set up OpenBlock to encompass an entire county, rather than a single city, we need to define additional cities. This can be accomplished in one of two ways:
We imported Orange County city boundary data above, so we'll use the latter:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # Set this to True if the region has multiple cities.
        # You will also need to set 'city_location_type'.
        'multiple_cities': True,
        # The major city in the region.
        'city_name': 'Chapel Hill',
        # Slug of an ebpub.db.LocationType that represents cities.
        # Only needed if multiple_cities = True.
        'city_location_type': 'cities',
    },
)
Here we enabled multiple_cities and told OpenBlock that the slug of the location type representing cities is cities. Now 100 Pine Street will geocode properly:
<script src="https://gist.github.com/1467912.js?file=gistfile1.py"></script>
Now that we've had an overview of the geocoder, we'll jump into OpenBlock's place, location, and address parser. Stay tuned!
Update: Read more in OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction.
We're currently involved in a project with the UNC School of Journalism that hopes to help rural newspapers in North Carolina leverage OpenBlock. The project is called OpenRural, and if you're a software developer you can find the latest code on GitHub.
OpenBlock needs geographic data to display, and that data can come from a variety of sources. We've found a number of web sites that offer geographically interesting data to NC residents, and in this post I'd like to discuss my experience attempting to scrape (that is, programmatically navigate and extract data from) the Chapel Hill Police Department's (CHPD's) online database of crime reports.
The CHPD site advertises itself as powered by "Sungard Public Sector OSSI's P2C engine," and a quick Google for "P2C engine" shows that Chapel Hill is not the only city or county in North Carolina that happens to use this product. Unfortunately, scraping the data on this site proved to be a non-trivial endeavor.
I opted to host and run my scraper script on ScraperWiki, which is a great tool for writing, testing, and running scraper scripts in a variety of scripting languages. The site even exposes the scraped data through an API, so it could potentially be used as an abstraction layer between the scraped sites and OpenBlock (or any other consumer of the data). The current state of the script can be found here:
https://scraperwiki.com/scrapers/chapel_hill_police_reports/
The script uses the Python mechanize library to navigate the site being scraped, and BeautifulSoup to find and extract data on the pages retrieved. After telling mechanize to click the "I Agree" button on the CHPD web site's landing page, it was easy enough to submit the search form for the current day and return a listing of results.
While getting the initial list of results was fairly trivial, one issue I ran into when writing the scraper is that the site uses an odd method of retrieving and paginating results. Looking at the HTML source, you will see that the search form is submitted by a small piece of JavaScript, like so:
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
It turns out this little function is used to do quite a lot. The page calls it for everything from sorting, to pagination, to linking to other pages on the site. It works by recording the requested action in two hidden form inputs on the page (__EVENTTARGET and __EVENTARGUMENT) and then calling submit() on the form.
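Emulating that function from a script comes down to filling in the same two hidden inputs and posting the form yourself. The sketch below only builds the encoded POST body with the standard library and sends nothing over the network. It is not the code from my scraper, `build_postback_body` is a hypothetical helper, and the hidden field and control names are made-up placeholders (a real page supplies its own).

```python
from urllib.parse import parse_qs, urlencode


def build_postback_body(hidden_fields, event_target, event_argument=""):
    """Build the form-encoded body that a __doPostBack call would submit.

    hidden_fields: dict of the form's current input names and values.
    """
    fields = dict(hidden_fields)  # copy so the caller's dict is not modified
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Placeholder values for illustration; a real page defines its own names.
hidden = {"__VIEWSTATE": "abc123"}
body = build_postback_body(hidden, "ctl00$results$grid", "Page$2")

print(body)
print(parse_qs(body)["__EVENTTARGET"])
```

The body then has to be sent as a POST to the form's URL, along with whatever cookies the site set when the "I Agree" button was clicked.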
You may have also noticed that the form sets method="post" rather than method="get", which means the web browser sends an HTTP POST (rather than an HTTP GET) every time you modify the form and click the Search button. Per the HTTP/1.1 specification, POST should be used for requests that modify data on the server, whereas GET should be used to retrieve information at a given URL. You can also tell that the site uses POST instead of GET by inspecting the URL in your browser; pages that use GET will typically have a portion of their URL that starts with a question mark and is followed by key/value pairs. The link to the Google search above is an example of the GET method. Searching a site is by definition a retrieval operation (and typically does not involve modifying data on the server), so well-written search forms should use GET rather than POST.
Confusing POST and GET is a fairly elementary problem, but it's one that we see far too often on the web. If your browser has ever prompted you to "re-submit a form" after you hit the back button, warning that doing so may modify data on the server, the site you're using is probably not using the GET and POST HTTP methods properly.
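The difference is easy to see by building both kinds of request with Python's standard library. Nothing is sent here, the example.com URL is a placeholder, and the search parameter name `q` is just an assumption for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "P2C engine"}
encoded = urlencode(params)

# GET: the parameters travel in the URL, after a question mark.
get_request = Request("https://example.com/search?" + encoded)

# POST: the same parameters travel in the request body, so the URL stays bare.
post_request = Request("https://example.com/search", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Because the GET version carries everything in the URL, it can be bookmarked, linked to, and reloaded safely, which is exactly what you want from a search page.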
In the case of the CHPD site, while it was easy enough to set the values of the hidden form inputs and re-submit the form using POST (after finding this post on StackOverflow, at least), for some reason the site still returns the first page of results to mechanize (even though it properly paginates in a real web browser). I'm still working on it, but in the meantime, check out the code and let me know if you have any ideas. :-)