Krista Jamieson @ArchiveThoughts
52 tweets, 9 min read
The Basics of Digital Archiving: A Thread. There will be many analogies to physical archives. It will not mention technologies/software at all. The tech details will be simplified (for your reading pleasure and based on my own knowledge).
First: what content am I talking about? Born digital and digitized material (not going there on the "what is a record?" debate), under the stewardship of an archive, museum, special collection, or library. I am NOT talking about the "archive" button of your email or your laptop.
What is archiving? Preserving and providing access. That doesn't change just because something is digital. It looks different, but the goals are the same. Preservation means storage, security, integrity, conservation, etc.
Access means available, discoverable (descriptions, but also search systems), and respectful of restrictions and legal obligations, from donation agreements to privacy laws to freedom of information (if you're a public sector archive). I like to call this "Appropriate Access."
The most commonly cited workflow for digital archiving is the Open Archival Information System reference model, also called the OAIS model. BUT this model is ONLY about processing digital records. It doesn't talk about how to ingest them or how to provide appropriate access.
So while I don't want to talk about how digital stuff is donated to an archive, I have to because integrity and blah blah blah. I am only going to talk about the technical side and not the admin side.
Digitized and born digital materials come in many different file formats and on many carriers or points of transfer. You can get floppy disks, CDs, USB keys, external hard drives, entire computers, email attachments, transfers of materials from the cloud, etc.
In ALL of these cases, you need to do two things: 1) make sure that you received what they intended to send you (no extra or hidden files they did not intend to transfer but that all the files they did intend to transfer were included). Confirm the contents with the donor.
2) make sure that you received what they sent you (as in, make sure none of the bits and bytes that make up the digital files got lost along the way. This is the digital version of making sure a box wasn't lost in the mail).
The combination of these two things ensures the INTEGRITY of the donation. Make sure your donor signs off on EVERYTHING. Their job isn't done until you know you have what you were supposed to get and that it's all there. This is part of the paperwork of digital donations.
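To make check 1 concrete, here is a minimal sketch, assuming the donor gives you a plain list of the files they meant to transfer. The function name and the example file names are hypothetical, made up for illustration. Anything it flags goes back to the donor for confirmation.

```python
def compare_to_donor_list(intended, received):
    """Compare the donor's intended file list with what actually arrived."""
    intended_set = set(intended)
    received_set = set(received)
    return {
        # Files the donor meant to send that never arrived.
        "missing": sorted(intended_set - received_set),
        # Files that arrived but the donor never listed (extra or hidden files).
        "unexpected": sorted(received_set - intended_set),
    }


# Hypothetical example data, not from a real donation.
report = compare_to_donor_list(
    intended=["letters/1995-03.doc", "photos/family.tif"],
    received=["letters/1995-03.doc", ".DS_Store"],
)
print(report)
# {'missing': ['photos/family.tif'], 'unexpected': ['.DS_Store']}
```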
Format Identification: now, this can actually come before or after you finish finalizing with your donor; it depends on your institution. Formats matter a lot for preservation. You need to know what you have if you want to take care of it.
Formats are the equivalent of knowing the difference between nitrate film (needs to be kept cold) (ok there is more to it than that, but for simplicity, that is a basic requirement) and glass plate slides (which can't freeze or they become brittle and can crack). It matters!
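One common way to identify a format is to look at the first few bytes of the file rather than trusting the extension. This is only a rough sketch with a handful of well-known signatures. The function name is my own, and real identification tools check far more formats and far more carefully.

```python
# A few well-known file signatures ("magic bytes") found at the start of a file.
SIGNATURES = [
    (b"%PDF", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_format(header):
    """Guess a format from the first bytes of a file (pass at least 12 bytes)."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    # WAVE files are RIFF containers: "RIFF", a 4-byte size, then "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAVE"
    return "unknown"


print(identify_format(b"%PDF-1.4\n"))  # PDF
```

In practice you would read the header with something like `open(path, "rb").read(12)` and pass it in.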
Some places have what are called Format Registries. Here is Simon Fraser University Archives' format registry: sfu.ca/content/dam/sf…. Format registries lay out what are preferred and acceptable formats for different types of documents. This informs preservation.
So what is digital preservation? Roughly speaking it can be broken into 2 parts: bitstream and renderability. Digital stuff (ALL digital stuff) is written in 1s and 0s, called bits. The bitstream is all of the bits that go into making a file.
Renderability (no idea if this is what anyone else calls it, but it's what I call it) is being able to open a file and have it look right. Digital files are complicated. They need operating systems and multiple layers of software to line up for them to be opened.
Remember that digital files (even those in the cloud) are stored on something physical, too. "Cloud" storage is just a server farm. It still exists somewhere concrete. Servers can crash or get flooded or get broken in half. Vandalism and accidents and weather happen.
(Cloud storage server farms are often backed up multiple times in different locations so that even if the power goes out or the building burns to the ground, all your stuff isn't lost.)
Remember how at the beginning I said it was important to make sure all the bits and bytes got transferred to your institution? That's the bitstream. One of the ways to check it is to create a checksum (also called a hash) and compare it before and after transfer.
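A minimal sketch of that before-and-after comparison, assuming SHA-256 as the checksum (the thread doesn't name an algorithm, so that choice is mine, as are the function names). The donor or sender computes the checksum before transfer, and you recompute it on your end.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(checksum_before, path_after_transfer):
    """True if the file that arrived has the same checksum as the one that was sent."""
    return sha256_of_file(path_after_transfer) == checksum_before
```

If the two values differ, at least one bit changed somewhere along the way and the file should be transferred again.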
Bits and bytes can be missing for a few different reasons, usually during a transfer, but that can also happen randomly: this is called bit rot. Sometimes it's a hardware issue, sometimes it's a transcription error and sometimes it's an ???? error.
Dropping bits can have no effect on the renderability of a file or it can corrupt the entire file. It can be hard to tell what is going to happen and it depends on which specific bits rot away.
So the first part of preservation is this: keeping your bitstream intact. Keep it duplicated multiple times. Check on all of them to make sure that what you put onto the storage medium (on the bit level) is still there. Make sure they all match. Replace those that have corrupted.
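Here is a toy sketch of that routine, assuming you recorded a SHA-256 checksum when the file first came in. The copies are held in memory as bytes to keep the example short, and the storage location names are hypothetical. In real life each copy lives on a different storage medium.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def find_corrupted_copies(recorded_checksum, copies):
    """Return the storage locations whose copy no longer matches the recorded checksum."""
    return sorted(
        location for location, data in copies.items()
        if checksum(data) != recorded_checksum
    )


def repair_copies(recorded_checksum, copies):
    """Overwrite each corrupted copy with an intact one; return what was replaced."""
    corrupted = find_corrupted_copies(recorded_checksum, copies)
    intact = [location for location in copies if location not in corrupted]
    if corrupted and not intact:
        raise RuntimeError("every copy fails its fixity check; nothing to restore from")
    for location in corrupted:
        copies[location] = copies[intact[0]]
    return corrupted


original = b"archival record"
recorded = checksum(original)
# Simulate bit rot: flip a single bit in the first byte of one copy.
rotted = bytes([original[0] ^ 0x01]) + original[1:]
copies = {"onsite disk": original, "offsite tape": rotted, "cloud": original}
print(repair_copies(recorded, copies))  # ['offsite tape']
```

Note that a single flipped bit is enough to change the checksum, which is exactly why the check works.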
Even though that part seems hard, it is actually the easier part of digital preservation. The harder part is keeping your files renderable. Operating systems and softwares change. They are updated constantly. A ".doc" file from 1995 and one from 2018 are not the same.
Some formats are locked to certain proprietary softwares (like AutoCAD for architecture and interior design work). Some are open or have lots of softwares that can read them (PDF, for example). Some have been stable for decades (WAVE) and some haven't (video formats, for example).
There are 3 main approaches to solving this, which are often used in conjunction: normalization, migration, and emulation. Emulation is creating a virtual environment to open your file, like recreating a 1995 operating system to open a WordPerfect file.
Normalization is converting files to a handful of standardized formats. So, turning all word processing files into .doc, all audio files into .wav, all image files into .tiff, etc. This gives you fewer formats to be dealing with.
Migration is updating the version of the format to the latest version. So turning a .doc file from 1995 into a .doc file from 2018.
Emulation, normalization, and migration are usually used in conjunction. You don't want to migrate 100 different formats forward constantly, so you normalize them to a dozen formats and then migrate that dozen formats forward.
Or you normalize the formats from 100 to 12 and then migrate them to a certain version and then create an emulator for that version so you don't have to continue migrating them forward, etc. Normalization may be one way you use your file format registry.
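A normalization policy can be pictured as a lookup table. This is a hypothetical sketch: the targets are the examples used earlier in this thread (.doc, .wav, .tiff), and the list of incoming extensions is invented for illustration. Your institution's format registry would define the real table.

```python
import os

# Target format per content type, using this thread's examples.
NORMALIZATION_TARGETS = {
    "word processing": ".doc",
    "audio": ".wav",
    "image": ".tiff",
}

# Hypothetical mapping of incoming extensions to content types.
FORMAT_TYPES = {
    ".wpd": "word processing",
    ".doc": "word processing",
    ".mp3": "audio",
    ".aiff": "audio",
    ".jpg": "image",
    ".bmp": "image",
}


def normalization_target(filename):
    """Return the extension a file should be normalized to, or None if not covered."""
    extension = os.path.splitext(filename)[1].lower()
    content_type = FORMAT_TYPES.get(extension)
    if content_type is None:
        return None  # Not in the policy: needs a human decision.
    return NORMALIZATION_TARGETS[content_type]


print(normalization_target("memo.WPD"))  # .doc
```

A real workflow would identify the format from the file itself instead of trusting the extension.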
But remember: if your bitstream isn't preserved, the file won't render no matter what you do. And with enough technical skills (and time/money, etc) someone could technically create an emulator for almost any file.
Every time you normalize or migrate, you will need to preserve the new bitstream the same way you did the original! Normalizing or migrating WILL change the bitstream! (It's embarrassing how long that took me to realise. Let's not talk about it.)
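A tiny demonstration, using a text re-encoding as a stand-in for a real format conversion (that stand-in is my simplification). The words a reader sees are identical, but the bytes are not, so the converted file needs its own checksum.

```python
import hashlib

# The same word stored in an older encoding, then "normalized" to UTF-8.
original_bytes = "caf\u00e9".encode("latin-1")
normalized_bytes = original_bytes.decode("latin-1").encode("utf-8")

same_text = original_bytes.decode("latin-1") == normalized_bytes.decode("utf-8")
original_checksum = hashlib.sha256(original_bytes).hexdigest()
normalized_checksum = hashlib.sha256(normalized_bytes).hexdigest()

print(same_text)                                 # True
print(original_checksum == normalized_checksum)  # False
```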
Currently most of us are keeping the original bitstream AND the normalized and/or migrated file. We're paranoid. It's like keeping a deteriorated and demagnetized VHS tape after it's been transferred to a new tape and digitized. JUST IN CASE!
Important note: you will have to keep doing these things... forever? Yup, FOREVER. There is no one and done with digital. It needs constant care. There is no finding it 300 years later and hoping for the best. All we can do is get it to the next intervention.
But let's pretend you got the whole bitstream and renderability thing sorted (LOLZ) we also need to store it somewhere. I already said you need multiple copies in case something happens to your storage carrier.
You also need cyber security to prevent hacking and digital vandalism, viruses, etc etc etc. You need to assume a certain percent of your harddrives or digital tape or whatever WILL fail because there is simply an error rate in manufacturing.
(And of course, the moment you rely on one copy will be the exact moment you hit that statistically probable failure.) Don't hope it won't happen. Plan contingencies so that WHEN it happens it doesn't screw you over. Because it will happen.
There are lots of types of storage and I will let the IT people duke it out over the best kinds, but all I will say is that this is not the place to cheap out. If it's in an archive in the first place, it needs good quality storage. Period. I know we're expensive, but that's life.
Alright, let's consider preservation handled for the time being. How access? What access?! WHO ACCESS?!?!?! All good questions. Depends on the content. These are also useful questions to raise when talking with the donor, to get them to flag privacy concerns.
Digital stuff can be made available to users in ways other than putting it online. Please remember this. Donors may not want their life google-able. Digital is also super useful for accessibility. You can't screen reader or caption or translate paper.
One option is to have reading room stations for digital materials. Which sounds antiquated, I know, but is it really different than having people come in to look at a ledger book? Not actually. It isn't any more or less accessible than reading room access to hard copies.
There likely will be digital materials that can ONLY be accessible onsite. Most countries' (not all!) copyright legislation allows archives to provide access onsite but not online. We can digitize almost anything but not necessarily put it online (check locally to confirm!).
Copyright (in Canada) is a balance of creator rights and user rights. Putting it online is publishing it, but there is little if any commercial value to a lot of archival stuff (and we don't charge for researcher use). It puts us in a weird place for copyright.
In Canada, copyright is also not enforced by the government: the copyright holder has to sue you for infringement. They sue whoever published their stuff, so if an archive puts it online, it would be the archive getting sued. If a researcher does it, the researcher gets sued.
So what matters here is who is the copyright holder and is the material still in copyright? The copyright holder is (usually) the creator. But the work you do for your institution belongs to your employer.
I like to remember that staff photographers for newspapers are not the copyright holder for photos they take (the newspaper, as their employer, is), but freelancers don't give up their copyright when they sell their photo to a newspaper. (unless they signed a contract saying so)
When copyrighted material goes public domain is... complicated, currently changing (thanks new NAFTA! & Disney!), and in Canada, depends on media type. Photos and audio have different rules than text. It's great. /s
Talk to a lawyer. All I am going to say is that (in Canada, anyway) managing copyright is all about managing risk and figuring out how much risk your institution is willing to take. Aka how likely are you to be sued if you put something online & are you ok with that risk?
Some donations are also restricted until a certain date, but that doesn't mean no one can access those files until then. Donors can access stuff restricted by them. You need to be able to provide that access even if no one else can see it.
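As a sketch of that rule, assuming the only inputs are who is asking, who the donor is, and the restriction date (real donation agreements are messier, and the function name and people here are hypothetical):

```python
from datetime import date


def can_access(user, donor, restricted_until, today):
    """Donors can always see what they restricted; everyone else waits for the date."""
    if user == donor:
        return True
    # No restriction date means the material is open.
    return restricted_until is None or today >= restricted_until


restriction = date(2040, 1, 1)
print(can_access("donor", "donor", restriction, date(2025, 6, 1)))       # True
print(can_access("researcher", "donor", restriction, date(2025, 6, 1)))  # False
```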
All in all, archivists need to be a bit of a jack of all trades to figure this shit out, digital is COMPLICATED and requires a lot more long term, consistent, and concerted effort than hard copy, and (maybe) putting things online makes access 100x more complex and risky than ever.
48 tweets. I beat my last archival education thread by 1 tweet, Whoot whoot! Also: ...sorry. Hope it was helpful at least? Feel free to ask questions!
Also: never once in this thread did I talk about metadata! Oops! So yeah, there are like 6 kinds of metadata, all of which matter for digital archiving! Descriptive: about the content. Technical: about the files themselves. Structural: about how the files go together.
Administrative: info about the digital materials and the donation, includes restrictions, etc. Preservation: logging every single intervention a file has ever had. Accessibility: info about what accessibility features are built into the file.
Lots to say about all of them! Not going to say anything about any of them right now! The most common metadata format for digital preservation is currently METS.
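To give a feel for the preservation kind ("logging every single intervention"), here is a bare-bones sketch of an event log. The field names are my own invention for illustration and are not taken from METS or any other standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class PreservationEvent:
    file_id: str         # Which file was touched.
    event_type: str      # e.g. "fixity check", "normalization", "migration".
    outcome: str         # e.g. "pass", "fail", "new file created".
    checksum_after: str  # Checksum of the file once the event is done.
    timestamp: str = field(default_factory=utc_now)


preservation_log = []


def log_event(file_id, event_type, outcome, checksum_after):
    """Append one intervention to the log and return it."""
    event = PreservationEvent(file_id, event_type, outcome, checksum_after)
    preservation_log.append(event)
    return event
```

Every fixity check, normalization, and migration would get its own entry, so the file's whole history stays traceable.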