Using MapReduce with Django-nonrel on App Engine.

A while ago, I read that the best way to clean up App Engine’s datastore is to use the MapReduce API. For one, you delete datastore entities in parallel. For another, the datastore returns a maximum of 1000 entities per query, so if you did it serially you would have to loop through queries, potentially taking longer than the maximum execution time allowed for a single App Engine HTTP request.
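To make the serial alternative concrete, here is a minimal sketch of that loop, run against an in-memory stand-in rather than the real datastore. `FakeDatastore`, `fetch_keys`, and `delete_serially` are made-up names for illustration; only the 1000-entity limit comes from the text above.

```python
QUERY_LIMIT = 1000  # maximum entities returned per query, as described above


class FakeDatastore(object):
    """In-memory stand-in for the datastore (hypothetical, illustration only)."""

    def __init__(self, key_count):
        self.keys = list(range(key_count))

    def fetch_keys(self, limit):
        # A real keys-only query would be capped in the same way.
        return self.keys[:limit]

    def delete(self, keys):
        doomed = set(keys)
        self.keys = [k for k in self.keys if k not in doomed]


def delete_serially(store, limit=QUERY_LIMIT):
    """Delete everything one query at a time; return the number of round trips."""
    round_trips = 0
    while True:
        keys = store.fetch_keys(limit)
        if not keys:
            return round_trips
        store.delete(keys)
        round_trips += 1
```

Every pass through the loop is a query plus a delete inside the same request, so a large kind can easily run past the request deadline. The MapReduce API spreads the same work across shards instead.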

Having changed my schema a bit, my Django app started failing when it loaded old data from my datastore. I decided to try out MapReduce to clean up some of the invalid old objects. I discovered that the MapReduce API doesn’t work well with Django models. It turns out the InputReader classes provided with the API use App Engine’s Python db API. Fortunately, the source is included, so I could write my own InputReader to map Django models instead of db models.
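The mismatch seems to come down to kind names: a db model is stored under its class name, while Django-nonrel appears to store a model under its Django table name. Below is a rough sketch of the two naming schemes, assuming no explicit `db_table` is set in the model’s `Meta` and ignoring any length truncation. Both helpers are hypothetical and are not part of Django or the MapReduce bundle.

```python
def default_db_table(app_label, model_name):
    """Mimic Django's default table name: '<app_label>_<lowercased model name>'."""
    return "%s_%s" % (app_label, model_name.lower())


def db_model_kind(model_name):
    """An App Engine db.Model's kind defaults to its class name."""
    return model_name
```

For a hypothetical `BlogPost` model in a `blog` app, the first helper gives `blog_blogpost` and the second gives `BlogPost`, so a reader that derives the kind from the class name would look in the wrong place.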

I left the readers yielding raw datastore entities, without converting them back to Django models. This suited me well, since I was looking to fetch entities that wouldn’t properly convert to my new Django models anyways. Here’s the code for the InputReader classes. I’ve tested it with App Engine SDK 1.6.2 (with the MapReduce bundle).

    import djangoappengine.main  # sets up the Django-nonrel environment

    from django.db.models.sql.query import Query
    from google.appengine.datastore import datastore_query
    from mapreduce import util
    from mapreduce.input_readers import AbstractDatastoreInputReader


    class DjangoKeyInputReader(AbstractDatastoreInputReader):
        """An input reader that takes a Django model ('app.models.Model')
        and yields Keys for that model"""

        def _iter_key_range(self, k_range):
            # Resolve the Django model to the kind name used in the datastore.
            query = Query(util.for_name(self._entity_kind)).get_compiler(
                using="default").build_query()
            raw_entity_kind = query.db_table
            query = k_range.make_ascending_datastore_query(
                raw_entity_kind, keys_only=True)
            for key in query.Run(
                    config=datastore_query.QueryOptions(
                        batch_size=self._batch_size)):
                yield key, key


    class DjangoEntityInputReader(AbstractDatastoreInputReader):
        """An input reader that takes a Django model ('app.models.Model')
        and yields entities for that model"""

        def _iter_key_range(self, k_range):
            # Resolve the Django model to the kind name used in the datastore.
            query = Query(util.for_name(self._entity_kind)).get_compiler(
                using="default").build_query()
            raw_entity_kind = query.db_table
            query = k_range.make_ascending_datastore_query(raw_entity_kind)
            for entity in query.Run(
                    config=datastore_query.QueryOptions(
                        batch_size=self._batch_size)):
                yield entity.key(), entity
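
For completeness, here is a rough sketch of how a cleanup mapper might sit on top of `DjangoEntityInputReader`. Treat all of it as assumption rather than the code I actually ran: the property names, the module names `cleanup` and `readers`, and the job name are hypothetical, and the keyword names in `build_job_spec` reflect my understanding of `mapreduce.control.start_map`, so check them against the source in your copy of the bundle. It also assumes the handler is called with the raw entity and that `op.db.Delete` accepts a key.

```python
# Hypothetical property names that the new schema requires.
REQUIRED_PROPERTIES = ("title", "created")


def is_stale(entity, required=REQUIRED_PROPERTIES):
    """Return True if a raw (dict-like) datastore entity lacks a required property."""
    return any(name not in entity for name in required)


def delete_stale(entity):
    """Mapper handler: yield a delete operation for each stale entity."""
    if is_stale(entity):
        # Imported lazily so this module also loads where the bundle is absent.
        from mapreduce import operation as op
        yield op.db.Delete(entity.key())


def build_job_spec(model_path):
    """Keyword arguments intended for mapreduce.control.start_map (names assumed)."""
    return {
        "name": "Delete stale entities",
        "handler_spec": "cleanup.delete_stale",
        "reader_spec": "readers.DjangoEntityInputReader",
        "mapper_parameters": {"entity_kind": model_path},
    }
```

Under those assumptions, starting the job would look something like `control.start_map(**build_job_spec("app.models.Model"))`. The same handler, reader, and `entity_kind` values could also go into a `mapreduce.yaml` entry, as in the Mapper API getting started guide.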

The most time-consuming part of the project was trying to figure out the MapReduce API documentation. There are a few versions that come up when I do a Google search. It turns out it’s all the same package, but the documentation comes from various dates.

The Mapper API is an older version that covers only the mapping portion of the pipeline. This is what I used, and it still works. It has an easy getting started guide. The documentation is outdated, yet it still contains details about the mapping portion that are missing from the newer documentation. Ignore the old download, which is still sitting around, and use the latest MapReduce bundle, which includes the old Mapper API.

The latest documentation covers the full MapReduce pipeline. However, it only glosses over the pipeline at a high level and isn’t very useful for actual implementation.

What you want to download is the latest MapReduce bundle from the App Engine SDK download page.