deduplication | Another DAM Blog

December 24, 2010
by Henrik de Gyor

Why should I pay for the DAM when the entire organization uses it?

Someone asked me this question and remembered that I wrote about it briefly in an earlier blog post, but wanted me to elaborate. So here is a blog post to explain it.

Let us say one group or department is the original requester of a DAM solution within an organization. Likely this same department becomes the business owner, stakeholder and/or sponsor of the DAM solution. This same department or group pays to administer and maintain the DAM. They might pay for any monthly/quarterly/annual licensing fees and/or service level agreements (SLA) for the DAM solution as well. Now let us say other departments see the value in using the DAM to keep the organization’s branding, graphics, photographs, publications, presentations, reports, video or other intellectual property (IP). The DAM gets more user adoption by more departments. Now who pays for the DAM within this organization?

Often, what occurs is the original requester, sponsor or stakeholder continues paying for the DAM solution. Because of this, they might say “Wait, I am paying out of my department’s budget for other departments to benefit from this solution as well? What’s in it for me? Why should I pay for the DAM when the entire organization uses it?”

Consider this idea: “Why am I the only one paying for it? If we share the DAM, share the cost.”

Enter the idea of chargeback: charging each department that requests to acquire, create or use something for the actual cost of the resources it used, so the department that paid up front is reimbursed. This idea is likely a change for many companies in how they deal with budgets and how departments are held accountable for the resources they use. It also keeps in check a department which may overtax another department’s resources. So, with this idea every department or group has its own budget as usual, but since every DAM user should have a unique login (right?) and possibly different collections of assets they can access or share, why not split the total cost based on actual usage of the DAM solution per department? Charge each department based on its usage of the DAM solution.

If one department uses the DAM more than another department by a measurable amount or percentage, should they pay a larger share of the cost each month/quarter/year? Should each department be able to share this cost evenly or should each department pay for what they use based on a percentage? Or have one department pay for it all?

How do you measure usage of the DAM? With usage reports from the DAM which could list:

Which DAM users (by individual login) accessed the DAM? (keeping individual user accountability)
Who has the most active DAM users within a given period of time?
Who wants/needs/asks for the most time in administration, maintenance, support and/or training?
When did they access the DAM? (keeping time accountability)
How often did those users or group of users access the DAM? (time based usage)
How long did they access the DAM over a period of time? (number of minutes or hours)
How much was downloaded/exported from the DAM? (by the number of assets and/or file size if bandwidth is measured)
How much was uploaded/imported to the DAM? (by the number of assets and/or file size if bandwidth is measured)
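As a rough illustration of splitting the cost by usage, here is a minimal sketch in Python. The department names, the download counts and the total cost are all made-up figures, and the function name is my own. A real usage report from your DAM would supply the numbers, and you might weight several of the measures listed above rather than just one.

```python
def split_cost_by_usage(total_cost, usage_by_department):
    """Split total_cost across departments in proportion to their usage."""
    if not usage_by_department:
        return {}
    total_usage = sum(usage_by_department.values())
    if total_usage == 0:
        # No recorded usage at all: fall back to an even split.
        even_share = round(total_cost / len(usage_by_department), 2)
        return {dept: even_share for dept in usage_by_department}
    return {
        dept: round(total_cost * used / total_usage, 2)
        for dept, used in usage_by_department.items()
    }


# Hypothetical figures: assets downloaded per department in one quarter.
downloads = {"Marketing": 600, "Sales": 300, "Training": 100}
charges = split_cost_by_usage(12000.00, downloads)
print(charges)
```

With these made-up numbers, Marketing did 60% of the downloads, so it would carry 60% of the quarterly cost.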

I would recommend looking at what you are paying for internally and externally to gauge the costs of doing business.

Some DAM vendors charge for bandwidth (how many GB is uploaded/downloaded to/from DAM within a given period). Some don’t.

Server space costs money regardless of whether it is under your own IT department’s domain, a vendor’s domain or in the cloud. Who is using the storage space?

Is the data deduplicated? Do you want to dedupe the DAM data to minimize duplicate assets?

Some DAM vendors charge per DAM login or per concurrent user. Some DAM systems limit how many users you can have or the total users at one time. Can your organization add/remove DAM users without the vendor’s help?

How much does it cost to administer, support, maintain a DAM and train the DAM users? How much does it cost in errors and problems when you don’t?

Why should I pay for the DAM when the entire organization uses it?

Are these costs of doing business worth sharing as you share business tools such as a DAM solution?

February 19, 2010
by Henrik de Gyor

What do you use to deduplicate assets?

Continuing on my earlier blog post about deduplication of DAM assets…

If you deduplicate assets, what method do you use?

February 12, 2010
by Henrik de Gyor

Do you deduplicate your DAM assets?

Do you deduplicate the assets in your Digital Asset Management (DAM) solution? Please answer this quick anonymous poll:

February 11, 2010
by Henrik de Gyor

How do I avoid duplicate assets in a DAM?

In some cases, organizations have unique file naming conventions, but file names are often created by people, which more often yields not-so-unique file names. One person names a file one way and another person names the exact same file another way because they use it differently or in a different place. This demonstrates a clear lack of consistency and governance, which happens way too often. It is especially true if you are not using a Digital Asset Management (DAM) solution with clear guidelines and stop gaps to catch these sorts of things as part of a workflow.

We end up with collections of assets which may have:

many similar file names (are these true redundancies or simply the result of a poor file naming convention?)
some assets with the same file names (whether the content is the same or not throughout a folder structure)
every asset having different file names even though some may be exactly the same (content wise)
assets just copied multiple times across folder structures (which makes sorting by file name out of the question)

So here is the dilemma. What do you do with exact duplicates?

Throw away all your IP and start over? (Not wise)
Painstakingly open and look through each and every asset to try comparing and contrasting them all, one asset at a time? (some people do this, as painful, time-consuming and expensive as it sounds since it may even take a subject matter expert to review some assets)
Have a computer “just do something about it”?
Use a file browser to make this review process go faster? (getting warmer, but we can actually do better than visually checking each asset, with or without any metadata)
And we have not even begun to discuss different versions of the same assets and different file types (different file extensions)
What would you do?

Some DAM solutions will look for matching file names and catch those during the upload process to the DAM (based on matching file names, regardless of whether the actual content is duplicated or not, as described earlier).

There is an even better way…

Enter the world of algorithms. Yes, an algorithm can be complex code. Do not worry, because this complex code can be nicely packaged into easy-to-use and very powerful tools for data deduplication (also referred to as ‘deduping’ or ‘dedupe’ in shorthand). The algorithm reads every bit of each asset, regardless of file name, and creates a checksum. A checksum is a string of letters and numbers (alphanumeric) which acts like a fingerprint for that asset. If an asset is an exact duplicate (really beyond any visual comparison), it will produce the same checksum. If two assets have the same checksum, they are, for all practical purposes, exact duplicates.
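To make the fingerprint idea concrete, here is a minimal sketch in Python using MD5 (one such algorithm, discussed further on) through the standard hashlib module. The function name and the two file names are made up for illustration. The two temporary files stand in for the same asset saved under two different names.

```python
import hashlib
import os
import tempfile


def md5_checksum(path, chunk_size=1024 * 1024):
    """Read a file in chunks and return its MD5 checksum as hex text."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for large assets.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two hypothetical file names holding exactly the same bytes.
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "IMG_0042.jpg")
    second = os.path.join(folder, "homepage-banner.jpg")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"same bytes, different file names")
    first_checksum = md5_checksum(first)
    second_checksum = md5_checksum(second)

print(first_checksum == second_checksum)  # True
```

The file names never enter the calculation. Only the contents do, which is why the two checksums match.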

How does it work with assets?

Add a period in a text file
Move a line in graphic
Clip an audio file
Color correct a photograph

Doing any of these changes the bits which make up these assets, and that will yield a different checksum. If two assets have the same arrangement of bits, you will get the same checksum.
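The first case in the list, adding a period to a text file, can be tried in a few lines. This is only a sketch with made-up text held in memory rather than in a real file, again using MD5 from Python's standard hashlib module.

```python
import hashlib

original = b"Our brand guidelines, version 3"
exact_copy = bytes(original)   # same arrangement of bits
edited = original + b"."       # one period added at the end

original_checksum = hashlib.md5(original).hexdigest()
copy_checksum = hashlib.md5(exact_copy).hexdigest()
edited_checksum = hashlib.md5(edited).hexdigest()

print(original_checksum == copy_checksum)    # True: exact duplicate
print(original_checksum == edited_checksum)  # False: one character differs
```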

So how accurate is it?

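One algorithm called MD5 produces a 128-bit checksum, which means there are 2 to the 128th power (roughly 3.4 times 10 to the 38th power) possible values. The odds of two different assets accidentally producing the same checksum are therefore far smaller than 1 in 1 Octillion (that is 10 to the 27th power or 1,000,000,000,000,000,000,000,000,000 in standard American English usage). That should be accurate enough for quite a while, don’t you think? Read on. This gets better.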

Regardless of the operating system (OS) or the computer (PC/MAC/whatever) you use to run this algorithm, MD5 can catch exact duplicate assets and when it does, it will produce the exact same checksum or fingerprint.

I briefly mentioned MD5 during a metadata webinar and people got really excited.

Where did MD5 come from?

MD5 was originally designed in 1991 by MIT’s Ron Rivest as a cryptographic hash function producing 128-bit hash values. While the intelligence community has moved on to other algorithms due to security concerns, MD5 was long a standard among software vendors for verifying that a download arrived exactly as intended (not corrupted or hacked). For security purposes, MD5 has now been superseded by SHA-256, part of the SHA-2 family published as a U.S. federal standard. Because longer hashes such as the 512-bit SHA-512 can be even more taxing on a system, MD5 is still one of the commonly used data deduplication methods. Note that MD5 is not recommended for SSL, password security or any other security use today. We are talking about using MD5 just for data deduplication here, not security.

How common is this tool?

A command for MD5 is built into most UNIX-based machines (on a Mac, it is available through Apple’s Terminal application). There are a number of PC programs which use MD5 (or SHA-256) and are available online for a nominal fee. Some DAM systems are available with MD5. Some DAM systems are available with a less powerful algorithm called CRC32, which produces a 32-bit checksum.

What does an MD5 checksum look like?

5d41402abc4b2a76b9719d911017c592
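That particular value is the MD5 checksum of the five-letter text string hello. Here is a small Python sketch using the standard hashlib module to reproduce it. Each of the 32 characters is a hexadecimal digit carrying 4 bits, which is where the 128 bits come from.

```python
import hashlib

checksum = hashlib.md5(b"hello").hexdigest()

print(checksum)           # 5d41402abc4b2a76b9719d911017c592
print(len(checksum) * 4)  # 128 (bits)
```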

To technology folks, this is exciting stuff with major potential. For the rest of us, you do not need to run away, but understand that a DAM should be able to create, read and compare these values. A DAM should also be able to report on this along with the rest of the metadata for every asset available.

What are the benefits of MD5?

We can run MD5 on a collection of assets (in a DAM or not) and compare the checksums. If any checksums match, you just found duplicate assets. Several MD5 tools do this comparison of checksums. Handle duplicates however your organization deems fit in a systematic manner. Just be aware of where the assets were intended to be used, particularly if the file names do not match.
We could also search for assets using the checksums (even as metadata on a per asset basis in the DAM if you assign a field to it) to reduce duplicates.
We could ask a DAM vendor to compare the checksum of every upload against all the checksums already in the DAM (one per asset).
It is very common to have many duplicate assets within an organization. Some organizations have run MD5 on their assets and reduced duplicate assets by over 80%.
MD5 can even work on a string of text (outside of a file) to verify if it is the same as another string of text.
This can reduce the server storage taken up by duplicates. Why would we want to store exact duplicates repeatedly?
MD5 runs on any Operating System and any computer which can handle the checksum function.
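The first benefit, running MD5 over a collection and comparing the checksums, might look something like the Python sketch below. The function names, folder layout and file names are invented for the demonstration, and a temporary folder stands in for a real collection of assets.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def md5_checksum(path, chunk_size=1024 * 1024):
    """Read a file in chunks and return its MD5 checksum as hex text."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group every file under root by checksum; keep groups of two or more."""
    paths_by_checksum = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            paths_by_checksum[md5_checksum(path)].append(path)
    return {
        checksum: sorted(paths)
        for checksum, paths in paths_by_checksum.items()
        if len(paths) > 1
    }


# Hypothetical collection: the same image saved twice under different
# names in different folders, plus one unrelated file.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "campaign"))
    samples = {
        "logo.png": b"fake image bytes",
        os.path.join("campaign", "brand_final.png"): b"fake image bytes",
        "report.txt": b"quarterly report",
    }
    for relative_path, content in samples.items():
        with open(os.path.join(root, relative_path), "wb") as handle:
            handle.write(content)
    duplicates = find_duplicates(root)
    duplicate_names = [
        sorted(os.path.basename(path) for path in paths)
        for paths in duplicates.values()
    ]

print(duplicate_names)  # [['brand_final.png', 'logo.png']]
```

The sketch only reports the matches. What to do with each group of duplicates is still a decision for your organization, as noted above.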

What are the risks behind MD5?

If we have embedded metadata (metadata embedded inside the asset) that is edited differently between two duplicate assets, you may get a different checksum (duplicate not found).
Layered masks not visible to the naked eye may throw MD5 off if one asset has a layered mask and another asset with duplicate content does not (duplicate not found).
Collisions may happen. MD5 is no longer recommended for any security needs. SHA-256 trumps MD5.
The MD5 tool may tax your system performance while creating and comparing checksums for a collection of assets. SHA-256 is even more taxing on a system. SHA-1 may be less taxing on a system, but can also have collisions (not good for security).
This does not necessarily identify or eliminate all duplicates, but MD5 can help address most of them.
People may continue creating and acquiring duplicate assets, but deduplication on a DAM system will help act as a stop gap to additional duplicates being introduced to the DAM.

How to use MD5 on assets?

You could…

Run MD5 on all assets already in the DAM (dedupe existing DAM assets)
Run MD5 on all assets to be uploaded on an ongoing basis and compare those checksums to the checksums of assets already existing in the DAM (dedupe all asset uploads against existing DAM assets)**

**Note this may, depending on the DAM system, require either:

A configuration of the DAM (varies among DAM systems) if it already exists as a feature
A customization to the DAM for this process to be automated upon upload (if it is not an available feature to the DAM system)
A manual effort prior to upload to the DAM or even outside of the DAM (which may catch fewer duplicates if neither the customization nor configuration is available).
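For the second option, checking each upload against what is already in the DAM, here is a rough Python sketch of the idea. The ChecksumIndex class, its method and the asset IDs are all hypothetical. A real DAM would keep the checksums in its own database and expose this through its own configuration or customization, so treat this as an illustration of the logic only.

```python
import hashlib


class ChecksumIndex:
    """Hypothetical in-memory index of the checksums of stored assets."""

    def __init__(self):
        self._asset_by_checksum = {}

    def check_upload(self, asset_id, data):
        """Return the ID of an existing duplicate, or None if data is new.

        New content is recorded under asset_id so later uploads are
        compared against it as well.
        """
        checksum = hashlib.md5(data).hexdigest()
        existing_id = self._asset_by_checksum.get(checksum)
        if existing_id is not None:
            return existing_id
        self._asset_by_checksum[checksum] = asset_id
        return None


index = ChecksumIndex()
first_result = index.check_upload("asset-001", b"annual report, final")
second_result = index.check_upload("asset-002", b"annual report, final")
third_result = index.check_upload("asset-003", b"annual report, revised")

print(first_result)   # None: nothing like it in the DAM yet
print(second_result)  # asset-001: exact duplicate of an existing asset
print(third_result)   # None: the content differs
```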

Where can we find more information about data deduplication?

Google it.
Ask DAM vendor(s) whether they have data deduplication methods in their DAM system. Many people (including vendors) may not be aware of the need for data deduplication. If your DAM vendor does not have it, ask for data deduplication to be part of their roadmap of upcoming improvements with accompanying documentation. The more people ask, the more likely the vendor will add this to their roadmap.

Let us know when you are ready for assistance in deduplicating your digital assets for your business or consulting for your Digital Asset Management needs.

How do you avoid duplicate assets in a DAM?
