Alfresco and MongoDB

by Harry Moore

Why Alfresco

  • Management of structured and document based content.
  • Metadata.
  • Custom repository services.
  • Aspects, custom models and behaviors.
  • Workflow.
  • Templating (Form based structured content).
  • Web Scripts (The greatest custom API framework – EVER!).
  • Gets your content ready!

Why MongoDB

  • Flexible (No Schema) storage structure.
  • BSON (binary JSON) “Document” storage “feels” more natural than flat tables and foreign keys.
  • High traffic, web-ready Scaling:
    • Read scaling with Replica Sets.
    • Write scaling and distributed data with sharded clusters.
  • GridFS interface for storing large files.
  • Develop faster, Deploy easier, Scale bigger
  • More than a dozen supported language drivers (even more community supported drivers) – http://www.mongodb.org/display/DOCS/Drivers
  • Many large production deployments: Disney, Forbes, Shutterfly, Craigslist, MTV, SourceForge, SAP, and more – http://www.mongodb.org/display/DOCS/Production+Deployments

Why would you use Alfresco and MongoDB together

Rothbury Software has been an Alfresco Platinum Partner since 2006. Alfresco’s proven Content Management system provides a collaborative environment for content creation and control.

Rothbury also recently partnered with 10gen, the creators of MongoDB. MongoDB is a highly scalable and flexible storage solution. This combination provides for control of your content repository and the ability to get your content to a LOT of people, anywhere in the world, very quickly; i.e. publish your content on the web.

Together, Alfresco and MongoDB offer the best enterprise level technology stack for authoring and delivering content to the web. Maybe you don’t want your entire content repository exposed to the web. You want to deliver content to selected channels. Ex.:

  • Components of a campaign targeted to the web
  • Product related downloads
  • Merge transactional data with web content
  • Mobile – deliver the content but let the web site worry about presentation

How to Deploy Content from Alfresco to MongoDB

There are several options available:

  • Push approach – Alfresco hosted code updates MongoDB from a custom Alfresco Action using the MongoDB Java driver.
  • Pull approach – Standalone application pulls content from Alfresco (download servlet, CMIS API, custom Web Script, etc.) and updates MongoDB.

The push approach would work well in situations where you want to deploy individual pieces of content as they change (maybe from a behavior policy bound to an add Aspect event). We’ll look at an example of how to “push” in this article.

The pull approach works best in batch situations where you need to deploy many content updates at once. You will probably want to schedule the deployments too, and offload the heavy lifting of the deployment to another process or application server. Look for a future article for an example of a “pull”.

Here is an example of a push

The method used to push documents from Alfresco to MongoDB should be flexible. We may want to deploy from a workflow, from an action, or when a property changes. I’m going to make the deployment component a “service” and expose it as a root scoped JavaScript object named ‘mongoService’.

This is an example, not a production-ready solution. For example, you wouldn’t want to open a new connection to the database each time you wanted to insert a document.

I’ll start off by creating an interface, MongoService, that describes my service. I’m going to have a method on my service named “insert”. It will actually perform an “upsert” in MongoDB. An upsert is similar to an update but will create the document if it does not already exist. If the document does exist it will be replaced with the document we are inserting.

Now create the implementation class MongoServiceImpl.java. You’ll notice that the code that sets the content of the MongoDB document checks the size of the Alfresco node’s content. If it is larger than 1 megabyte, it streams the content to GridFS using nodeRef.toString() as the name of the GridFS file so it will be easy to find later. If the content is less than 1 megabyte, it is inserted directly into the document as a string. It will be up to the client application reading the documents from MongoDB to determine if it needs to go to GridFS for the content.
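As a rough sketch of that size check (not the code from MongoServiceImpl), the routing decision could look like the following. The class, enum and method names are made up for illustration. I am assuming that 1 megabyte means 1024 * 1024 bytes and that content of exactly that size stays inline, and the reader-side check relies on the ‘content’ property being null when the content went to GridFS.

```java
// Illustration only: hypothetical names, no Alfresco or MongoDB classes involved.
class ContentRoutingSketch {

    // Assumption: "1 megabyte" is taken as 1024 * 1024 bytes.
    static final long GRIDFS_THRESHOLD_BYTES = 1024L * 1024L;

    enum Target { INLINE_STRING, GRIDFS }

    /** Writer side: decide where a node's content should be stored. */
    static Target targetFor(long contentSizeBytes) {
        return contentSizeBytes > GRIDFS_THRESHOLD_BYTES ? Target.GRIDFS : Target.INLINE_STRING;
    }

    /** Reader side: a missing inline value means the content has to be fetched from GridFS. */
    static boolean mustReadFromGridFs(Object inlineContent) {
        return inlineContent == null;
    }

    public static void main(String[] args) {
        System.out.println(targetFor(200L * 1024L));         // a small text file
        System.out.println(targetFor(5L * 1024L * 1024L));   // a 5 MB PDF
    }
}
```

A client that finds no inline content would then look the file up in GridFS by the nodeRef string used as its file name.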

GridFS is a very nice feature of MongoDB, which lets you store files.

Implement the interface (the “insert” method):

There are several constructors to choose from when creating a Mongo object. I chose the one that takes a List of ServerAddress objects. Using a list of ServerAddresses I can give the Java driver a list of “seed” nodes to choose from if it should lose a connection. In a ReplicaSet scenario, you really only need to connect to one of the mongod processes in the cluster and the driver will grab all the cluster information it needs from that one server to reconnect if the master should change (via an election process). A seed list is also useful for sharded clusters, to give the driver a list of the mongos processes. The Java driver will determine which of the mongos processes to use.
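If the seed nodes come from a configuration property, they first have to be turned into host and port pairs. The sketch below shows one way to do that with only the JDK. The comma-separated "host:port" format and the class and method names are my own assumptions, not something Alfresco or the driver prescribes. Each resulting pair would then be handed to a ServerAddress(host, port) before building the list for the Mongo constructor.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Illustration only: parses a hypothetical "host1:27017,host2" style property.
class SeedListSketch {

    // MongoDB's default port, used when an entry has no explicit port.
    static final int DEFAULT_MONGO_PORT = 27017;

    static List<InetSocketAddress> parseSeeds(String seeds) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : seeds.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon < 0 ? trimmed : trimmed.substring(0, colon);
            int port = colon < 0
                    ? DEFAULT_MONGO_PORT
                    : Integer.parseInt(trimmed.substring(colon + 1));
            // Unresolved on purpose: no DNS lookup is needed just to carry host and port.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    public static void main(String[] args) {
        for (InetSocketAddress seed : parseSeeds("mongo1.example.com:27017, mongo2.example.com")) {
            System.out.println(seed.getHostString() + " -> " + seed.getPort());
        }
    }
}
```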

Get a reference to the database server: mongo = new Mongo()

Setting the WriteConcern to SAFE tells the driver to wait for a confirmation that the data was written to at least the primary node in the cluster. This slows down writes but lets you detect errors, so you will know whether your data made it to MongoDB.

mongo.setWriteConcern(WriteConcern.SAFE);

There are several options for configuring WriteConcern, including WriteConcern.NONE, which is fire and forget: you will never know if your data made it to the database.

Get a DB object: DB mongoDb = mongo.getDB(database);

Note that the database does not need to exist on the MongoDB server to get a reference to it. MongoDB will create the database (and the collection for that matter) the first time you write data to it.

Now the collection:

DBCollection dbCollection = mongoDb.getCollection(collection);

Build the BSON document to send to MongoDB

Note the use of Alfresco’s nodeRef string as the document _id. This way I don’t have to store another identifying value in Alfresco to find the correct document to update later on. You can see this in the dbCollection.update. The first argument is a selector used to find any existing document to update:

dbCollection.update(new BasicDBObject("_id", nodeRef.toString()), document, true, false);

In addition to passing the document we built from the nodeRef properties, the third argument says to perform an upsert. Otherwise we would need to do an insert the first time a document is deployed. We are working with a single document at a time so the final argument (set to false here) specifies we are not performing a “multi” update.
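To see why the upsert flag matters on the first deployment, here is a small in-memory model of that call. It is only a sketch: a HashMap stands in for the collection, the class and method names are hypothetical, and the “multi” flag is left out because we only ever touch one document.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map keyed by _id stands in for a MongoDB collection.
class UpdateFlagsSketch {

    final Map<String, Map<String, Object>> collection = new HashMap<>();

    /** Returns the number of documents written (0 or 1). */
    int update(String id, Map<String, Object> document, boolean upsert) {
        if (!collection.containsKey(id) && !upsert) {
            return 0; // plain update: nothing matches the selector, so nothing is written
        }
        Map<String, Object> stored = new HashMap<>(document);
        stored.put("_id", id);
        collection.put(id, stored); // creates the document or replaces the existing one
        return 1;
    }

    public static void main(String[] args) {
        UpdateFlagsSketch model = new UpdateFlagsSketch();
        String nodeRef = "workspace://SpacesStore/example-node"; // made-up id

        System.out.println(model.update(nodeRef, Map.of("name", "draft.txt"), false));
        System.out.println(model.update(nodeRef, Map.of("name", "draft.txt"), true));
        System.out.println(model.update(nodeRef, Map.of("name", "final.txt"), true));
        System.out.println(model.collection.get(nodeRef).get("name"));
    }
}
```

Without the upsert flag, the first call writes nothing, which is why a plain update would force a separate insert the first time a document is deployed.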

Local helper methods

Get a list of the tags (or categories) applied to the Alfresco document. This method will return a list of the names of the tags (or categories); not their paths. So it works well for tags but for categories we would probably want to construct a path from the root category in the classification:
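For categories, one way to build such a path is to walk from the category up to the root. This is a sketch only: the parent lookup is a plain Map standing in for whatever Alfresco service you would actually query, the names are hypothetical, and it assumes category names are unique and the hierarchy contains no cycles.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Illustration only: the Map is a stand-in for a real category lookup.
class CategoryPathSketch {

    /**
     * parentOf maps a category name to its parent's name.
     * Root categories have no entry. Assumes the hierarchy has no cycles.
     */
    static String pathFor(String category, Map<String, String> parentOf) {
        Deque<String> segments = new ArrayDeque<>();
        String current = category;
        while (current != null) {
            segments.addFirst(current);      // prepend, so the root ends up first
            current = parentOf.get(current); // null once we pass the root
        }
        return "/" + String.join("/", segments);
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "Laptops", "Hardware",
                "Hardware", "Products");
        System.out.println(pathFor("Laptops", parentOf));
    }
}
```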

If the content is large (I arbitrarily chose 1 megabyte) then store the content using MongoDB’s GridFS interface. The data is still in the same MongoDB database but in a different collection, two collections, actually. One to store the metadata for the document and another to store the file’s content broken into “chunks”:

Get an input stream to read the Alfresco node’s content. Get a GridFS object and use it to create a GridFSInputFile file from an input stream. Set the file name to the nodeRef String. Then save the GridFSFile.
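To picture what “broken into chunks” means, here is a small sketch that splits a stream into fixed-size pieces, which is roughly what happens before the pieces are written to the .chunks collection. It is not the driver’s code: the class name is made up, the chunk size in main is arbitrary, and it needs Java 11 or later for readNBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustration only: models chunking, it does not talk to MongoDB.
class ChunkingSketch {

    /** Splits the stream into pieces of at most chunkSize bytes, in order. */
    static List<byte[]> split(InputStream in, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        try {
            while (true) {
                byte[] chunk = in.readNBytes(chunkSize); // shorter than chunkSize only at end of stream
                if (chunk.length == 0) {
                    break;
                }
                chunks.add(chunk); // the list index plays the role of the chunk sequence number
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] content = new byte[10];
        List<byte[]> chunks = split(new ByteArrayInputStream(content), 4);
        for (int n = 0; n < chunks.size(); n++) {
            System.out.println("chunk " + n + ": " + chunks.get(n).length + " bytes");
        }
    }
}
```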

For content less than 1 megabyte in size just get it as a String and store it directly in the document in a property named ‘content’. If the ‘content’ property is expected to grow over time with subsequent updates then it would be best practice to store the content in another collection and use a DBRef (similar to a foreign key) in the ‘content’ property here. This is because MongoDB tries to update a document in place if it can, which is very fast. However, if the new document is larger than the older one, MongoDB may need to move things around or even allocate more space for the collection (an expensive operation).

Housekeeping. These give Spring a place to inject our dependencies.

Create the Spring bean that will expose our service to JavaScript

ScriptMongoService.java:

Create the Spring context

Finally, we need to wire up the Spring beans to expose the service to JavaScript: rs-mongodb-repository-context.xml:

Test Script

We’ll need a script to run the service. Create the following script, test-mongo.js, in Data Dictionary/Scripts:

Build and deploy into Alfresco:

You will need the official MongoDB Java driver in your classpath. You can download it from GitHub: https://github.com/mongodb/mongo-java-driver/downloads

Copy the jar file to Alfresco’s webapp (Ex.: {tomcat}/webapps/alfresco/WEB-INF/lib) or package it into an AMP.

Compile and deploy the jar into Alfresco and restart.

Create a space rule to run the script:

Create a space rule on a folder in Share. The rule should fire the script action with the above script whenever a document is modified and it has the tag “mongo” applied.

Start MongoDB

Create a configuration file. This is not necessary if you run with all defaults. You don’t want to run with smallfiles and noprealloc in production.

/etc/mongodb.conf:

Start MongoDB from a terminal:

Deploy some content

Create some content and apply the tag:

Check the database for the data

Log in to MongoDB and run a query:

If you sent a file whose content is larger than 1 megabyte you will see ‘null’ for the ‘content’ property. Look in the alfrescoLargeFiles.files and .chunks collections in the mongo shell:

Note that you will not see these files if you use the mongofiles command-line tool, because that tool assumes the default bucket name “fs” but we used “alfrescoLargeFiles”. See https://jira.mongodb.org/browse/SERVER-1970

Conclusions

I have worked with Alfresco since 2006 and it has been a blast. There isn’t much you can’t do with it, integrate it with, or build on top of it. MongoDB is making a big splash in the “Big Data” market. Events have been packed with interested technology managers and developers. Rothbury Software believes these technologies complement each other in a way that will benefit current and future clients. We have several active projects using Alfresco and MongoDB.

Look for a future article on how the document data authored in Alfresco and deployed to MongoDB can be consumed by mobile apps.