Continuing on my earlier blog post about deduplication of DAM assets…
If you deduplicate assets, what method do you use?
Do you deduplicate the assets in your Digital Asset Management (DAM) solution?
Some organizations have unique file naming conventions, but file names are usually created by people, which more often than not yields not-so-unique file names. One person names a file one way and another person names the exact same file another way because they use it differently or in a different place. This demonstrates a clear lack of consistency and governance, and it happens way too often. It is especially true if you are not using a Digital Asset Management (DAM) solution with clear guidelines and stopgaps to catch these sorts of things as part of a workflow.
We end up with collections of assets which may have:
So here is the dilemma. What do you do with exact duplicates?
Some DAM solutions look for matching file names during the upload process and flag those, regardless of whether the actual content is duplicated or not (as described earlier).
There is an even better way…
Enter the world of algorithms. Yes, an algorithm can be complex code. Do not worry, because this complex code can be nicely packaged into easy-to-use and very powerful tools for data deduplication (also referred to as ‘deduping’ or ‘dedupe’ in shorthand). The algorithm reads every bit of each asset, regardless of file name, and creates a checksum. A checksum is a string of letters and numbers (alphanumeric) which acts like a fingerprint for that asset. If an asset is an exact duplicate (really beyond any visual comparison), it will produce the same checksum. If two assets have the same checksum, they are, for all practical purposes, exact duplicates.
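To make the fingerprint idea concrete, here is a minimal sketch in Python using the standard hashlib module. The function names and file names are made up for illustration, and the "assets" are just a few stand-in bytes rather than real images. It writes the same content under two different names and checks that the checksums match.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large asset never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_checksum_demo():
    """Write identical bytes under two names and compare the checksums."""
    with tempfile.TemporaryDirectory() as folder:
        first = Path(folder) / "logo_final.png"        # hypothetical file name
        second = Path(folder) / "Logo-for-web-v2.png"  # hypothetical file name
        payload = b"stand-in bytes for a real image"
        first.write_bytes(payload)
        second.write_bytes(payload)
        return md5_checksum(first) == md5_checksum(second)


print(same_checksum_demo())
```

The file name never enters the calculation. Only the bytes inside the file do, which is why renamed copies are still caught.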
How does it work with assets?
Doing any of these changes the bits which make up the asset, and that will yield a different checksum. If two assets have the same arrangement of bits, they will always have the same checksum.
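As a small illustration, assuming an asset can be treated as nothing more than a run of bytes, the sketch below compares an untouched copy and a slightly altered copy against the original. The byte strings are placeholders, not real assets.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 checksum of a bytes object as a hex string."""
    return hashlib.md5(data).hexdigest()


original = b"stand-in bytes for a real asset"
untouched_copy = bytes(original)                # same bits, separate object
altered_copy = original + b"."                  # a single extra byte

same_bits_match = md5_hex(original) == md5_hex(untouched_copy)
changed_bits_match = md5_hex(original) == md5_hex(altered_copy)

print(same_bits_match)     # identical bits give an identical checksum
print(changed_bits_match)  # one changed byte gives a different checksum
```

Even the smallest edit produces a completely different checksum, so the two values cannot be used to judge how similar two assets are, only whether they are identical.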
So how accurate is it?
One algorithm, called MD5, produces a 128-bit checksum, which means there are 2 to the 128th power (roughly 3.4 x 10 to the 38th power) possible values. The odds of two different assets accidentally sharing a checksum are therefore far smaller than 1 in 1 octillion (that is 10 to the 27th power or 1,000,000,000,000,000,000,000,000,000 in American English usage). That should be accurate enough for quite a while, don’t you think? Read on. This gets better.
Regardless of the operating system (OS) or the computer (PC/Mac/whatever) you use to run this algorithm, MD5 will produce the exact same checksum, or fingerprint, for exact duplicate assets.
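For a sense of scale, the arithmetic can be sketched in a few lines of Python. The only input is that an MD5 checksum is 128 bits long. The collision estimate uses the usual "birthday" approximation (number of asset pairs divided by number of possible checksums), which is an assumption that holds for accidental matches only, not for files deliberately crafted to collide.

```python
MD5_BITS = 128
possible_checksums = 2 ** MD5_BITS
octillion = 10 ** 27


def accidental_collision_odds(asset_count):
    """Approximate chance that any two of asset_count different assets
    happen to share a checksum (birthday approximation)."""
    pairs = asset_count * (asset_count - 1) // 2
    return pairs / possible_checksums


print(possible_checksums)
print(possible_checksums > octillion)
print(accidental_collision_odds(10 ** 9))  # a library of one billion assets
```

Even for a hypothetical library of a billion assets, the estimated chance of a single accidental match is vanishingly small.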
I briefly mentioned MD5 during a metadata webinar and people got really excited.
Where did MD5 come from?
MD5 was designed in 1991 by MIT’s Ron Rivest as a cryptographic hash function producing 128-bit hash values. For years, MD5 was a standard among software vendors for verifying that a download arrived exactly as intended (not corrupted or hacked). The intelligence community and others have since moved to other algorithms due to security concerns, and for security purposes MD5 has been superseded by newer algorithms such as SHA-256, part of the SHA-2 family published as a U.S. federal standard. Larger algorithms such as the 512-bit SHA-512 can be more taxing on a system, and MD5 is still one of the commonly used data deduplication methods. Note that MD5 is not recommended for SSL, password security or any other security purpose today. We are talking about using MD5 just for data deduplication here, not security.
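If you would rather not rely on MD5, swapping in another algorithm is usually a small change. The sketch below uses Python's standard hashlib module and a made-up helper name, checksum, to show the same stand-in bytes fingerprinted with MD5, SHA-256 and SHA-512. It is an illustration of the size difference, not a recommendation of any one algorithm.

```python
import hashlib


def checksum(data, algorithm="md5"):
    """Return the hex checksum of a bytes object using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


data = b"stand-in bytes for a real asset"
for name in ("md5", "sha256", "sha512"):
    value = checksum(data, name)
    # Each hex character carries 4 bits, so length * 4 is the checksum size.
    print(name, len(value) * 4, "bits", value)
```

The longer checksums take more room to store and somewhat more work to compute, which is the trade-off described above.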
How common is this tool?
An MD5 command is built into UNIX-like machines. On a Mac, the md5 command is available through Apple’s Terminal application, and many Linux systems include md5sum. There are a bunch of PC programs which use MD5 (or SHA-256) and are available online for a nominal fee. Some DAM systems are available with MD5. Some DAM systems are available with a less powerful algorithm called CRC32, which produces a 32-bit checksum.
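To see why CRC32 is considered less powerful, compare the size of the two values. This sketch uses Python's standard zlib and hashlib modules; the helper names are made up for illustration.

```python
import hashlib
import zlib


def crc32_hex(data):
    """Return the CRC32 of a bytes object as 8 hex characters (32 bits)."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def md5_hex(data):
    """Return the MD5 of a bytes object as 32 hex characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


data = b"stand-in bytes for a real asset"
print("CRC32:", crc32_hex(data))
print("MD5:  ", md5_hex(data))
print("possible CRC32 values:", 2 ** 32)
print("possible MD5 values:  ", 2 ** 128)
```

With only about 4.3 billion possible values, CRC32 is far more likely than MD5 to give two different assets the same value in a large collection.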
What does an MD5 checksum look like?
5d41402abc4b2a76b9719d911017c592
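The sample value above is 32 hexadecimal characters (digits 0-9 and letters a-f), which is how a 128-bit checksum is normally written. As it happens, it is the MD5 checksum of the five-letter text "hello", which can be checked with a short Python sketch:

```python
import hashlib

sample = "5d41402abc4b2a76b9719d911017c592"
computed = hashlib.md5(b"hello").hexdigest()

print(len(sample))         # number of hex characters in the sample
print(computed == sample)  # does "hello" produce the sample checksum?
```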
To technology folks, this is exciting stuff with major potential. For the rest of us, you do not need to run away, but understand that a DAM should be able to create, read and compare these values. A DAM should also be able to report on this along with the rest of the metadata for every asset available.
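Outside of a DAM, the create, compare and report steps can be sketched against a plain folder of files. This is a rough illustration in Python, not how any particular DAM product works. The function names and the demo file names are invented, and the byte-for-byte confirmation step is an optional extra added here for peace of mind.

```python
import filecmp
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Create: return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Compare: map each checksum to the files sharing it (groups of 2+ only)."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[md5_checksum(path)].append(path)
    return {value: paths for value, paths in groups.items() if len(paths) > 1}


def confirmed(paths):
    """Optional check: compare every file to the first one, byte for byte."""
    return all(filecmp.cmp(paths[0], other, shallow=False) for other in paths[1:])


def demo():
    """Report: build a tiny folder of stand-in files and list the duplicates."""
    with tempfile.TemporaryDirectory() as folder:
        root = Path(folder)
        (root / "brochure.pdf").write_bytes(b"same content")
        (root / "brochure_copy_FINAL.pdf").write_bytes(b"same content")
        (root / "photo.jpg").write_bytes(b"different content")
        report = find_duplicates(root)
        return [(sorted(p.name for p in paths), confirmed(paths))
                for paths in report.values()]


print(demo())
```

Pointing find_duplicates at a real folder instead of the demo folder gives a list of every group of files with identical content, whatever they are named.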
What are the benefits of MD5?
What are the risks behind MD5?
How to use MD5 on assets?
You could…
**Note this may, depending on the DAM system, require either:
Where can we find more information about data deduplication?
Let us know when you are ready for assistance in deduplicating your digital assets for your business or consulting for your Digital Asset Management needs.
How do you avoid duplicate assets in a DAM?
With a Digital Asset Management (DAM) system, or any system containing intellectual property within an organization, it is common to have a unique login (username and password) for every individual user with access. Unique logins should not be limited to people with a specific level of access, a particular role or a certain level of permissions. Everyone with access to the DAM should have one.
Why? A few reasons…unless you enjoy seeing your IP sold on an online auction
Security
When someone leaves an organization (for any reason), they should not walk away with any access to any intellectual property (IP), applications or digital assets which are owned and/or licensed by the organization. Removing their unique login reduces the potential risk of your competitors having direct access to your DAM. This also limits the risk of IP spreading wildly out of control. This goes hand in hand with the use of permissions and role structure.
Accountability
Unique logins allow a certain level of accountability for every user. Everyone should be kept accountable for what they do (or don’t do), regardless of their role, title and/or seniority. True accountability does not play favorites. It should be as clear as black and white.
Reporting
Once you establish individual logins, it should be easy to report who has:
Reporting capabilities are common in many DAM systems. Reporting also allows you to measure the performance of the system, user adoption and the results users get from the DAM. Unique logins per individual allow administrators, at least, to pinpoint exactly who did what with which assets and when.
As a best practice, passwords should be changed at a regular interval (such as every few months) for additional integrity. There are some regulations which mandate that passwords be changed often. Can your DAM users change their own passwords?
What does a strong password look like?
If possible, explore the option of having a single sign-on (SSO) feature for time savings so users only need to remember one unique username and password for all the systems they access instead of different logins for different systems.
Do you have unique logins for each DAM user?