MongoDB from Python:
Accessing MongoDB from Python applications is easy and familiar to many Python developers. PyMongo uses Python's rich dictionary support to provide an API very similar to MongoDB's native JavaScript query syntax, so with some understanding of how queries execute you can reason about performance as well. There is also a second API built upon MongoDB's atomic update operators ($set, $push, etc.) which truly leverages the full power of MongoDB and its document-level atomicity.
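As a quick, hedged illustration of that second API (a minimal sketch using the PyMongo 2.x-era Connection and update() calls that match the examples below; newer PyMongo releases spell these MongoClient and update_one()):

import pymongo

conn = pymongo.Connection()
db = conn.tutorial

# Atomically set one field and append to an array within a single document,
# without reading and rewriting the rest of the document.
db.test.update(
    {'name': 'My Document'},            # match criteria
    {'$set': {'subdocument.a': 5},      # atomic field update
     '$push': {'ids': 4}},              # atomic array append
)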
MongoDB with MongoEngine:
Python and PyMongo allow direct coding against MongoDB from Python. This is most appropriately compared to programming at the level of raw SQL for RDBMSes. That level is a necessary building block, but for most applications, working at a higher level and building upon custom classes is more appropriate. This module explores one of the most popular Object-Document Mappers for Python and MongoDB: MongoEngine.
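To give a flavor of that higher level (a minimal sketch; the Article class, its fields, and the tutorial database name are hypothetical):

from mongoengine import Document, StringField, ListField, connect

connect('tutorial')  # assumed database name

class Article(Document):
    # A custom class whose instances map to documents in a collection.
    title = StringField(required=True)
    tags = ListField(StringField())

Article(title='My Document', tags=['mongodb', 'python']).save()
print(Article.objects(tags='mongodb').first().title)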
Entity Design:
Designing entities in MongoDB, and in document databases more generally, is very different from designing third-normal-form SQL tables. To be successful with MongoDB, you will need to master this skill as a developer. Getting your entity design correct is key to high-performance, flexible applications.
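For example, where third normal form would split a blog post and its comments across two joined tables, a document design typically embeds the comments inside the post (a minimal sketch; the posts collection and its fields are hypothetical):

# One self-contained 'post' document: comments live inside it as
# embedded subdocuments rather than in a separate, joined table.
post = {
    'title': 'Entity design in MongoDB',
    'author': 'alice',
    'comments': [
        {'user': 'bob', 'text': 'Nice post'},
        {'user': 'carol', 'text': 'Agreed'},
    ],
}
db.posts.insert(post)  # one write stores, and one read retrieves, the whole entity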
Connecting and Inserting into the Database:
One thing that's nice about the PyMongo connection is that it's automatically pooled. What this means is that PyMongo maintains a pool of connections to the MongoDB server that it reuses over the lifetime of your application. This is good for performance, since PyMongo doesn't need to go through the overhead of establishing a connection each time it performs an operation. Mostly, this happens automatically. You do, however, need to be aware of the connection pooling, since you may need to manually notify PyMongo that you're "done" with a connection in the pool so it can be reused.
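For instance (a hedged sketch against the PyMongo 2.x API used in this tutorial, where end_request() returned the current thread's socket to the pool; the call was removed in PyMongo 3, which manages the pool entirely automatically):

import pymongo

conn = pymongo.Connection()
db = conn.tutorial
db.test.find_one()  # borrows a pooled socket for this thread
conn.end_request()  # tells PyMongo we're "done", releasing the socket for reuse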
The easiest way to connect to a MongoDB database from Python is shown below (in recent PyMongo releases, pymongo.MongoClient replaces the older pymongo.Connection):
In: import pymongo
In: conn = pymongo.Connection()
Inserting documents begins by selecting a database. To create a database, you do… well, nothing, actually. The first time you refer to a database, the MongoDB server creates it for you automatically. Once you have your database, you need to decide which "collection" to store your documents in. To create a collection, you do… right – nothing.
In: db = conn.tutorial
In: db.test
Out: Collection(Database(Connection('localhost', 27017), u'tutorial'), u'test')
In: db.test.insert({'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out: ObjectId('4f25bcffeb033049af000000')
Here the insert command returned an ObjectId value. This is the value that PyMongo generated for the _id property, the "primary key" of a MongoDB document. We can also specify the _id manually if we want; we don't have to use ObjectIds:
In: db.test.insert({'_id': 42, 'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out: 42
Indexing and Profiling:
Simply put, indexes are the single biggest contributor to extremely high-performance MongoDB deployments and applications. Make sure your applications use indexes to full advantage. Finding the queries that need to be optimized can be tricky, especially when there is a translation layer in the middle, such as an ODM like MongoEngine.
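Creating an index from PyMongo is a one-liner (a minimal sketch against the test collection used above; create_index() is the standard PyMongo call, and the Out: index names are illustrative):

In: db.test.create_index('a')
Out: u'a_1'
In: db.test.create_index([('a', 1), ('name', 1)])
Out: u'a_1_name_1'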
MongoDB has an extremely fast query type, the "covered query," that it can use in some cases where it doesn't have to scan any documents, only the index entries. This happens when the only data you're returning from a query is part of the index:
In: db.test.find({'a':2}, {'a':1, '_id':0}).explain()
Out: ...
     u'indexBounds': {u'a': [[2, 2]]},
     u'indexOnly': True,
     u'isMultiKey': False,
     ...
Here the indexOnly field is true, specifying that MongoDB only had to inspect the index (and not the actual collection data) to satisfy the query.
GridFS:
MongoDB has a facility, GridFS, to store, classify, and query files of virtually unlimited size, whether they contain binary data, text data, etc. This module introduces GridFS and shows you how to work with it from Python. You can upload, download, and list files in GridFS. You can also attach custom attributes to your GridFS files, which can then be used for rich reporting and querying that does not exist in standard file systems.
Creating a GridFS instance to use:
>>> from pymongo import MongoClient
>>> import gridfs
>>>
>>> db = MongoClient().gridfs_example
>>> fs = gridfs.GridFS(db)
Every GridFS instance is created with, and will operate on, a specific database instance.
Saving and Retrieving Data:
The simplest way to work with GridFS is to use its key/value interface. To write data to GridFS, use put():
>>> a = fs.put("hello world")
put() creates a new file in GridFS, and returns the value of the file document's "_id" key. Given that "_id", we can use get() to get back the contents of the file:
>>> fs.get(a).read()
'hello world'
In addition to putting a str as a GridFS file, we can also put any file-like object (an object with a read() method). GridFS will handle reading the file in chunk-sized segments automatically. We can also attach additional attributes to the file as keyword arguments:
>>> b = fs.put(fs.get(a), filename="foo", bar="baz")
>>> out = fs.get(b)
>>> out.read()
'hello world'
>>> out.filename
u'foo'
>>> out.bar
u'baz'
>>> out.upload_date
datetime.datetime(...)
The attributes we set in put() are stored in the file document, and are retrievable after calling get(). Some attributes (like "filename") are special and are defined in the GridFS specification.
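Those custom attributes live in the underlying fs.files collection, so you can list and query them like ordinary documents (a minimal sketch continuing the example above; bar is the attribute we just set):

>>> fs.list()
[u'foo']
>>> db.fs.files.find_one({'bar': 'baz'})['_id'] == b
True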
Aggregation Framework:
The aggregation framework in MongoDB allows you to execute rich queries and transformations on the server. While normal queries return documents in exactly the structure in which they were stored, the aggregation framework, like map-reduce, is much more flexible. It can transform, group, and query data, as well as act as a data pipeline on the server.
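A pipeline is expressed as a list of stage documents (a minimal sketch against the test collection used earlier; note that PyMongo 3+ returns a cursor from aggregate(), while 2.x returned a result document):

# Group the 'test' documents by their embedded 'subdocument.a' value
# and count how many fall into each group, entirely on the server.
pipeline = [
    {'$match': {'ids': {'$exists': True}}},   # filter stage
    {'$group': {'_id': '$subdocument.a',      # group stage
                'count': {'$sum': 1}}},
]
for doc in db.test.aggregate(pipeline):
    print(doc)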
Replication:
Replication is key to MongoDB's fault tolerance. It can also be used for data locality across data centers, scaled-out reads, offsite backups, reporting without performance degradation, and more. PyMongo makes working with replica sets easy. Here we'll launch a new replica set and show how to handle both initialization and normal connections with PyMongo.
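Connecting is a matter of listing one or more seed members and naming the set (a minimal sketch; the hosts and the rs0 set name are hypothetical):

>>> from pymongo import MongoClient
>>> # PyMongo discovers the remaining members and monitors the set for failover
>>> client = MongoClient('mongodb://db0.example.com:27017,db1.example.com:27017/?replicaSet=rs0')
>>> client.tutorial.test.find_one()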
Sharding:
MongoDB is a high-performance database even in single-server mode. However, to truly leverage MongoDB's performance potential, you will need to use sharding. This technique allows you to run a cluster of MongoDB servers working in concert, each holding some portion of the data and sharing some portion of the queries. It is sharding that gives MongoDB the ability to scale horizontally on commodity hardware. To actually add the shards to the cluster, go through the query routers, which are now configured to act as your interface with the cluster. You can do this by connecting to any of the query routers like this:
mongo --host query0.example.com --port 27017
This will connect to the appropriate query router and open a mongo prompt. You will add all of your shard servers from this prompt.
To add the first shard, type:
sh.addShard( "shard0.example.com:27017" )
You can then add your remaining shard servers in this same interface. You do not need to log in to each shard server individually.
sh.addShard( "shard1.example.com:27017" )
sh.addShard( "shard2.example.com:27017" )
sh.addShard( "shard3.example.com:27017" )
If you are configuring a production cluster, complete with replica sets, you must instead specify the replica set name and a replica set member to establish each set as a distinct shard. The syntax would look something like this:
sh.addShard( "rep_set_name/rep_set_member:27017" )