Introduction to the Genetic Archive

by Michael Friend

2021-09-08

This essay presents a proposal for a future approach to motion picture archiving, referred to as the genetic archive. The fundamental principle is that it is not necessary (nor desirable) to keep every version of every element created in the process of shooting, posting, and distributing a work; instead, the archive should hold only the minimal set of original source elements needed to reproduce the work, together with the metadata and recipes that are necessary and sufficient to assemble all valid versions. The term “genetic” archive is used because the critical source elements essential for the creation of viewable and distributable masters may be considered the “DNA” of the work, while the versions derived for distribution and mastering are the “children” of this “parental” material at the work’s core.

We take as axiomatic that media works will need to be remastered and optimized indefinitely for new technical contexts, while maintaining the creative vision of the originators. We also believe that the most primordial images – camera files and VFX – are the unique and superior data resource for all future applications. We have already seen improvements in extraction and debayering algorithms, and we can expect future images from the archive to improve as such algorithms continue to advance. We also take the “work” as axiomatic: it is the notion that defines the archive as the components necessary to make the valid versions of a show. We do not consider the excess data valuable. There may be some stock footage or other material that we have to account for – for example, we would have our stock footage library extract any neutral material from camera files – but the idea that a work is infinitely expansible into new editorial versions is incorrect. There will be only a few editorial versions of the work, which will be perennially reconfigurable by the archive for future technical contexts. There will always be exceptions to this practice, such as super-productions or idiosyncratic works. Documentaries have special attributes, and VFX material is currently starting to displace simple camera files in ways that will change the organization of the deep archive.

The first driver for this kind of archive is deduplication along two axes. Currently, major productions produce gigantic archives of camera files and dailies. Over 90% of this data is useless and irrelevant, because we do not normally go back and use the alternate takes that make up so much of the archive. The notion of the “work” means that the archive contains only the materials required to make any valid version of the show. Modern practice with respect to director's cuts, extended versions, and the like means that we will not have to go back into the unused camera files to make any of the valid versions of the show. So redundancy abatement #1 is the reduction of camera data by using the EDLs to isolate and archive only the files used in a valid version. The second abatement is along the temporal axis. Today, for an archive, we create a DI or CTM, a DSM or VAM, and frequently a DCDM and HDRDM. These are all excellent elements, but they are individually very expensive and must be migrated or stored in the cloud. And yet, in the future, these elements will not be used; they are largely artifacts of today's preferred distribution and display modes. So instead of carrying this data volume perennially, we want to archive just the primordial files that are the source of the production (mainly the used camera files with handles, and the VFX with layers) together with the instructions. This frees the archive from the perennial cost of holding all these full-weight versions.
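
To make the first abatement concrete, the following is a minimal sketch, in Python, of how used camera material might be isolated from an EDL. It assumes a simplified CMX 3600-style event line, a 24 fps project, and one-second handles; the reel names, regular expression, and handle length are illustrative assumptions, not part of the proposal itself.

```python
"""Select only the camera material referenced by a cut, with handles."""
import re
from collections import defaultdict

FPS = 24          # assumed project frame rate
HANDLES = 24      # assumed handle length (one second) on either side of each used range

# Simplified CMX 3600-style event: event no., reel/clip, video track, transition, src in, src out, ...
EVENT = re.compile(
    r"^\d+\s+(\S+)\s+V\s+\S+\s+"
    r"(\d{2}:\d{2}:\d{2}:\d{2})\s+(\d{2}:\d{2}:\d{2}:\d{2})"
)

def tc_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert HH:MM:SS:FF timecode to an absolute frame count."""
    hh, mm, ss, ff = (int(p) for p in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * fps + ff

def used_ranges(edl_text: str, handles: int = HANDLES) -> dict[str, list[tuple[int, int]]]:
    """Return, per reel/clip, the source frame ranges a valid version actually uses."""
    ranges: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for line in edl_text.splitlines():
        m = EVENT.match(line)
        if not m:
            continue                      # skip comments, notes, audio-only events
        reel, src_in, src_out = m.groups()
        start = max(0, tc_to_frames(src_in) - handles)
        end = tc_to_frames(src_out) + handles
        ranges[reel].append((start, end))
    return ranges

# Any reel absent from `used_ranges`, or any frames outside every listed range,
# would be a candidate for exclusion from the genetic archive.
```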

The second driver is organization. Today, we take in this mass of material with very little documentation or strong organization. We need to capture more of the process metadata that allows us to organize and document the full deep archive (this is clearly an opportunity for IMF). Organization will save time and money in remastering. The reduction of data discussed above helps this process because it eliminates the need to track and document useless data, and it makes the path to remastering much clearer. To be clear, it seems likely that future iterations of the product will be created using things like EDLs, CDLs, “comparison engine” software, and AI for matching data into a new technical space. In other words, while we want to collect process metadata and some of the components of the workflow, we think that these will eventually be superseded by modern data management tools.
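
As a sketch of what capturing process metadata might look like in practice, the record below pairs each transformation in the chain with the instructions that produced it. The field names are hypothetical illustrations, not a proposed schema.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessRecord:
    """One hypothetical process-metadata entry in the genetic archive.

    The intent is that every transformation between camera file and finished
    master is documented well enough to be reproduced later, or superseded by
    better tools. Field names here are illustrative only.
    """
    element_id: str          # e.g. a camera file or VFX element identifier
    stage: str               # "conform", "grade", "reframe", "HDR trim", ...
    tool: str                # software used at that stage
    instructions: str        # path to the EDL/CDL/project file carrying the recipe
    inputs: list[str] = field(default_factory=list)   # upstream element ids
    notes: str = ""          # free-text context from the production
```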

We foresee the capture of the full production archive, and the reduction and organization of that data into a genetic archive of the work.

What is in the genetic archive? It is the aggregate of original data files used in the production. The production archive is the vast amount of heterogeneous data created in the course of the production. The genetic archive is the organization of the critical and most primordial forms of that data for redeployment in the future.

Unlike the traditional archive, which is normally a collection of finished elements, the genetic archive is a set of resources which parent all valid versions of a show, plus the instructions to finish these resources as valid versions. This is similar in some ways to an IMF, which is a collection of resources organized to produce any desired version; those variations are assembled by playlists. A genetic archive would likewise be organized as playlists – one for each version (theatrical, home video, extended version). The genetic archive is not organized to localize the versions – localization is normally handled by the IMF. But the playlists of a genetic archive would assemble the DI, the DSM, the HDRDM, and the DCDM, and would parent not only new or modified IMFs as necessary but also any future ‘technical versions’ required. For example, the genetic archive would be the highest-quality source for the creation of new elements – a super-HDR version, an 8K version, an HFR/HDR hybrid – whatever technical variations may be called for in future contexts.
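
The playlist idea can be sketched as a simple data structure, loosely analogous to an IMF composition playlist: one playlist per valid version, each referencing spans of genetic-archive resources plus the instruction sets needed to finish that version. All identifiers and field names below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A span of one genetic-archive resource used in a version."""
    resource_id: str      # used camera file, VFX element, or audio stem
    src_in: int           # first frame (or sample) used
    src_out: int          # last frame (or sample) used, exclusive

@dataclass
class VersionPlaylist:
    """One playlist per valid version, loosely analogous to an IMF CPL.

    A sketch only: a real archive would also reference the grading, reframing,
    and mastering instructions needed to finish the version in a given context.
    """
    version: str                    # "theatrical", "home video", "extended"
    picture: list[Segment]
    sound: list[Segment]
    instruction_sets: list[str]     # EDLs, CDLs, HDR trims, etc. for this version

# Illustrative example of one valid version assembled from archive resources.
theatrical = VersionPlaylist(
    version="theatrical",
    picture=[Segment("A001_C003", 120, 264), Segment("VFX_0140_v012", 0, 96)],
    sound=[Segment("final_mix_5.1", 0, 190_000)],
    instruction_sets=["theatrical_conform.edl", "theatrical_grade.cdl"],
)
```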

The genetic archive is organized out of the massive body of production material that constitutes the “production archive,” drawing from it the primordial elements used to make the show:

  • Camera files: charts and tests, all cameras, dailies. From the totality of camera files, the genetic archive would select and preserve only the used shots. This typically includes multiple units, cameras and file types.
  • VFX: the genetic archive for VFX would consist of the final elements plus all the plates, composites, mattes, and layers as uncomposited elements, so that the VFX can be adjusted in remastering.
  • Other visual data: opticals and other visual editing elements used to make finished versions of the product, as well as stock footage and titling. These elements may have different levels of quality, but they are the ‘most original’ state of what was used in the show. Opticals are composite units built to support specific effects in imagery in addition to VFX. Stock footage is generally externally acquired, so its quality depends on what the production accessed; in the case of documentary stock footage, this may be distressed archival footage derived from many sources, standards, and levels of quality. Titling includes both text and image, and is normally a layer or alpha-channel resource that applies text to picture. Titling also includes art titles that may exist as a standalone resource like a finished effect, but wherever possible the separate components of the title build should be maintained.

(The elements described in the three categories above constitute the entire source for the image of a given product.)

  • Audio: the ‘collated archive,’ all production data and conformed audio, from the earliest stems to the final mixes. The retention of the audio components as individual items allows full access to the mix in remastering. Also, since audio data is lightweight and is normally delivered in a Pro Tools matrix, it is relatively economical to organize and archive.
  • Metadata: EDLs and CDLs; Baselight, Resolve, and other DLs (framing, etc.); color correction (DI suite) data covering re-coloring, windowing and local adjustment, cropping, and matting; HDR data and other processing data (e.g., “Dolby Vision”); ASC CDL metadata (a sketch of the ASC CDL transform follows this list). Increasingly, the modern production workflow is guided by metadata of many different types. The most primary metadata is simple descriptive information, but there is also a tremendous amount of process metadata that describes the changes – in editing, color, shape, texture, and look – made to the work.
  • Editorial drives: mainly Avid or other drives that contain the developmental history of the product and instructions on how data was processed. The editorial drives hold the assembly history of the product and the process metadata specific to the various versions of the film, including the organization of the various valid versions of the work as well as para-data – gag reels, screen tests, alternate scenes, etc.
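
Because ASC CDL metadata figures in the list above, here is a small Python sketch of the CDL's slope/offset/power and saturation transform as it is commonly implemented (negative values clamped before the power step, Rec. 709 luma weights for saturation). The grade values are invented for illustration and come from no real production.

```python
def apply_asc_cdl(rgb, slope, offset, power, saturation):
    """Apply ASC CDL slope/offset/power, then saturation, to one RGB triplet."""
    # Per-channel SOP: out = clamp(in * slope + offset) ** power
    sop = [max(0.0, c * s + o) ** p for c, s, o, p in zip(rgb, slope, offset, power)]
    # Saturation around Rec. 709 luma, as commonly specified for the ASC CDL
    luma = 0.2126 * sop[0] + 0.7152 * sop[1] + 0.0722 * sop[2]
    return [max(0.0, luma + saturation * (c - luma)) for c in sop]

# Illustrative use on a mid-grey input with invented grade values.
graded = apply_asc_cdl(
    rgb=[0.18, 0.18, 0.18],
    slope=[1.05, 1.00, 0.98],
    offset=[0.01, 0.00, -0.01],
    power=[0.95, 1.00, 1.02],
    saturation=0.9,
)
```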

These are the data classes that are produced by the production, and a targeted subset of these resources becomes the genetic archive.
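
One way to picture the targeted subset is as a retention policy over those data classes. The mapping below simply restates the selections described above in machine-readable form; the labels are illustrative, not a formal taxonomy.

```python
# A sketch of a retention policy over the production-archive data classes:
# class -> what the genetic archive keeps. Labels are illustrative only.
RETENTION_POLICY = {
    "camera_files":         "used shots only (per EDL), with handles",
    "vfx":                  "final elements plus plates, mattes, and uncomposited layers",
    "other_visual":         "opticals, stock footage, titling components as acquired",
    "audio":                "collated archive: stems through final mixes",
    "metadata":             "EDLs, CDLs, grading, HDR, and other process data",
    "editorial_drives":     "assembly history, version organization, para-data",
    "distribution_masters": None,   # recreatable from the above; not carried perennially
}
```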

Many resources are important to retain in a modern media enterprise, but not all of these resources are archival, and not all of them are equally important in an archival context. For example, a DSM or DI is more important than a DCP or a ProRes proxy. In restoration, an SRC may very well be more important than a more processed IPDC. This is because the former types of data are richer and can support a much wider variety of processing and remastering activities. There is a hierarchy of values implicit in the traditional archive, with the most original data (the cut original camera negative) at its apex. From the perspective of the genetic archive, the more primordial files take precedence over derivative elements. Generally speaking, the distribution class of materials is secondary – members of that class can be recreated from the primordial archive elements (that is, if all localization assets are lost, new localization material can be recreated from the primary archive; but if original camera files are lost, that level of data is lost and irreplaceable).

The reality of the workflow in the motion imaging business, which starts with the most original data and systematically destroys data in a deliberate process to create a highly specific viewing experience, is at the core of this principle. The process of replacing an element or creating an element for a new technical context is, each time it is performed, a ‘downhill’ process through which data is permanently lost in the derivative element (the replaced or new master). This process may be performed multiple times in order to modulate data for different target environments, but it depends on the availability of the primordial, super-abundant data resource. This is not to suggest that this data is the only locus of value in the archive, which also retains the instructions used to create each version of a product and reference data that can be used to support remastering.
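
A tiny numeric illustration, not drawn from the essay itself, of why the ‘downhill’ step is irreversible: quantizing 16-bit source code values down to a 10-bit deliverable and back shows that neighboring source values collapse together and can never be told apart again. The specific bit depths are an assumed example.

```python
# Illustration of lossy derivation: 65,536 source levels collapse onto 1,024.
def to_10_bit(code_16: int) -> int:
    return code_16 >> 6          # derive the lower-precision master

def back_to_16_bit(code_10: int) -> int:
    return code_10 << 6          # the discarded low bits cannot be recovered

source_values = [40000, 40001, 40063]
derived = [to_10_bit(v) for v in source_values]        # all three become 625
restored = [back_to_16_bit(v) for v in derived]        # all three become 40000
assert len(set(derived)) == 1 and restored != source_values
```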

The idea of a genetic archive is to maintain a resource of the minimal elements necessary to reproduce the work, and the instruction sets necessary to assemble all valid versions. One of the objectives of the genetic archive is to collect all process metadata and instructions so as to be able to reproduce any piece of the workflow chain between primordial camera file and final picture data ready to go to distribution. It is unlikely that these metadata instructions will be useful in the future, as we will probably use AI to conform, reprocess, and re-grade new versions of a work in new technical environments rather than replicating the arcane workflow of the original production. Nevertheless, the metadata and editorial drives that contain the record of these instructions are important resources for the transitional state of archives. They usually contain information that can support and improve the process of remastering, regardless of whether that process stays close to the original production workflow or is executed using AI or machine learning. Similarly, the organization of primordial material need not follow EDLs if there is a ‘comparison engine’ that can sort out the primordial data for remastering, but it will be helpful to have correct and comprehensive EDLs in the archive. Even if the sets of instructions are not re-used, they may contain important technical data, and since the instructions are comparatively small amounts of data, they should be retained in the archive. It is important to understand remastering not as emulation of the original post-production workflow that created the final master, but as a process performed on the primordial data set in order to match the final visual values of the reference in the original and subsequent technical contexts.
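
As a purely hypothetical sketch of what a ‘comparison engine’ could do, the code below matches a frame of a finished reference back to candidate primordial frames by comparing precomputed signatures (for instance, coarse grayscale thumbnails of decoded frames). The signature form and the brute-force search are assumptions made here for illustration; a production tool would use far more robust perceptual features and tolerate regrading.

```python
def distance(sig_a: list[float], sig_b: list[float]) -> float:
    """Mean absolute difference between two frame signatures."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)

def match_frame(reference_sig: list[float],
                candidates: dict[str, list[list[float]]]) -> tuple[str, int]:
    """Find the (clip id, frame index) in the primordial data closest to a reference frame.

    `candidates` maps clip ids to per-frame signatures and is assumed non-empty.
    This only shows the shape of the search, not a viable matching method.
    """
    best = None
    for clip_id, frames in candidates.items():
        for idx, sig in enumerate(frames):
            d = distance(reference_sig, sig)
            if best is None or d < best[0]:
                best = (d, clip_id, idx)
    return best[1], best[2]
```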

Over time, new technical requirements for masters will have to be accommodated. Each time a new version is added by the remastering team, a new instruction set is added to the archive. One of the most challenging aspects of this process is the QC phase, in which the new master is recreated from the new instructions. Once the ability to recreate a version is confirmed, it is not necessary to retain a physical iteration of the data, and the ongoing work of deduplication is sustained. This process suggests the importance of having stable libraries of data at the root of the process. The genetic archive is open with respect to software: better debayering, grain-reduction, or up-resing algorithms that become available can be applied to the primordial data to meet processing requirements for new display environments. Because there is no point in emulating the original process, improvements can be made all along the arc from code values to screen.
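
The QC step described above could be supported by something as simple as a per-frame checksum manifest recorded when the original master is approved; a recreated version is then verified against it before the stored copy is released. The sketch below assumes bit-exact recreation, which is itself an assumption; a looser, perceptual comparison might be preferred in practice.

```python
import hashlib
from pathlib import Path

def frame_digest(path: Path) -> str:
    """Checksum of one rendered frame file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_recreation(recreated_dir: Path, reference_manifest: dict[str, str]) -> bool:
    """Confirm a version recreated from archived instructions matches its reference.

    `reference_manifest` maps frame filenames to digests recorded when the
    original master was approved. Only after this check passes would the stored
    physical copy of that master be released from the archive.
    """
    for name, expected in reference_manifest.items():
        candidate = recreated_dir / name
        if not candidate.exists() or frame_digest(candidate) != expected:
            return False
    return True
```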

The genetic archive is a method for addressing some of the current problems of archiving data for moving images. It is not a solution to every archive situation, nor does it engage with future forms of production which might produce a different type of data. But it does apply to the last twenty years of mainstream motion pictures, as well as the data that will be captured from legacy film archives.