Lately I’ve been involved in a project where we used Documentum’s dump/load feature to copy a lot of documents from one repository to another. We successfully copied millions of documents, folders and other objects, but this success did not come easy. In this blog I would like to share some of the issues we had for the benefit of others using dump and load.
A standard tool
Dump and load is a tool that can be used to extract a set of objects from a Documentum repository into a dump file and load them into a different repository. Dump and load is part of the Documentum Content Server. This means it can be used with any Documentum repository in the world. The tool is documented in the Documentum Content Server Administration and Configuration Guide (find it here on the EMC Support site). The admin guide describes the basic operation of dump and load, but does not discuss its limitations. There is also a good Blue Fish article about dump and load that provides a bit more background.
A fragile tool
Dump and load only works under certain circumstances. Most importantly, the repository must be 100% consistent, or a dump will most likely fail. So my first tip: always run dm_clean, dm_consistencychecker and dm_stateofdocbase jobs before dumping and fix any inconsistencies found.
The dump tool has limitations. Dump can be instructed to dump a set of objects using a DQL query. The dump tool will run the query and dump all selected objects. It will also dump all objects that the selected objects reference. That includes the objects ACLs, folders, users, groups, formats, object types, etc.etc. This is done in an effort to guarantee that the configuration in the target repository will be ok for the objects to land. This feature causes a lot of trouble, especially when the target repository has already been configured with all the needed object types, formats, etc. It causes a 100 object dump to grow into a dump of thousands of objects, slowing the dump and load process. Worse, the dump tool will dump any objects that are referenced from the original objects by object ID. This causes the folder structure for the selected documents to be included as well as the content objects, but it can also cause other documents to be included, including everything that these documents reference (it it s recursive process). This method can backfire, for instance if you select audit trail objects for instance, all objects that they reference will be included in the dump.
Now this would not have been so bad if the dump tool had not had size limitations, but it does. We found for instance that it is impossible to dump a folder that has more than 20.000 objects in it (though your mileage may vary). The dump tool just fails at some point in the process. We discussed it with EMC Support and their response was that the tool has limitations that you need to live with.
As another example we came across a repository where a certain group had many supergroups. This group was a member of more than 10.000 other groups. This was also too much for the dump tool. Since this group was given permissions in most ACLs, it became impossible to do any dumps in that repository. In the end we created a preparation script that removed this group from the other groups and a post-dump script to restore the group relations.
The load tool has its own limitations. Most importantly we found that the bigger the dump file, the slower the load. This means that a dump file with 200.000 objects will not load in twice the time it takes to load 100.000 objects, it will take longer. We found that in our client’s environment we really needed to keep the total object count of the dumps well below 1 million, or the load time would go from hours to days. We learned the hard way when we had a load fail after 30 hours and we needed to revert it and retry.
Secondly, objects may be included in multiple dump files, for instance when there are inter-document relations. For objects like folders and types this is fine, the load tool will see that the object already exists and skip it. Unfortunately this works differently for documents. If a document is present in 3 dump files, the target folder will hold 3 identical documents after they have been loaded. Since you have no control over what is included in a dump file and you cannot load partial dump files, there is little you can do to prevent these duplications. We’ve had to create de-duplication scripts to resolve this for our client. We also found that having duplicates can mean that the target docbase can have more documents than the source and that the file storage location or database can run out of space. So for our production migration we temporarily increased the storage space to prevent problems.
Another limitation concerns restarting of loads. When a load stops half way through, it can be restarted. However we have not seen any load finish successfully after a restart in our project. Instead it is better to revert a partial load and start it all over. Revert is much quicker than loading.
Finally we found that after loading, some meta data of the objects in the target repository was not as expected. For instance some fields containing object IDs still had IDs of the source repository in them and some had NULL IDs where there should have been a value. Again we wrote scripts to deal with this.
As a final advice I would encourage you to run all the regular consistency and cleaning jobs after finishing the loading process. This includes dm_consistencychecker, dm_clean, dm_filescan, dm_logpurge etc. This will clean up any stuff left behind by deleting duplicate documents and will ensure that the docbase is in a healthy state before it goes back into regular use.
As you may guess from this post, we had an exiting time in this project. There was a tight deadline, we had to work long hours, but we had a successful migration and I am proud of everyone involved.