1 upvotes, 2 direct replies (showing 2)
View submission: Ask Anything Wednesday - Engineering, Mathematics, Computer Science
I have heard it asserted that a human's complete genetic sequence requires 1 to 4 GB of disk space, depending on the encoding and compression mechanisms. If I wanted to preserve my genetic sequence for a future civilization to discover more than a millennium from now, what existing (non-theoretical) storage medium would best survive a duration of thousands of years under ideal conditions?
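For scale, that 1 to 4 GB figure can be sanity-checked with back-of-the-envelope arithmetic (a rough sketch assuming ~3.1 billion base pairs; real formats like FASTA, BAM, or CRAM add headers, quality scores, and their own compression):

    # Order-of-magnitude storage estimates for one human genome.
    # Assumes ~3.1 billion base pairs; real-world files differ.
    BASE_PAIRS = 3.1e9

    two_bit_packed = BASE_PAIRS * 2 / 8   # A/C/G/T at 2 bits per base
    ascii_text = BASE_PAIRS * 1           # 1 byte per base, uncompressed text
    gzipped_text = ascii_text * 0.3       # ~3x text compression (assumption)

    for label, size in [("2-bit packed", two_bit_packed),
                        ("plain text", ascii_text),
                        ("gzipped text", gzipped_text)]:
        print(f"{label:14s} ~{size / 1e9:.2f} GB")

That lands between roughly 0.8 GB and 3.1 GB, which brackets the range quoted above.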
Could our modern standard NTFS/EXT4 disk formatting structure and our UTF encoding be reverse engineered without a priori knowledge of our language and alphabetic system?
Comment by oviforconnsmythe at 05/03/2025 at 23:56 UTC
6 upvotes, 0 direct replies
The most robust and best long-term 'storage medium' for genomic data is DNA. The oldest DNA sample we have extracted and sequenced is 1-2 million years old. Yes, it suffers from environmental degradation, but if stored properly (e.g. with the intent to preserve it for 1000y) it is remarkably stable. As technology advances over the next millennium, it's far more likely that genomic data will be reliably decoded from a universal, standardized 'language' like DNA than from modern-day digital encoding/compression. That's assuming a scientist 1000y from now even has the hardware to connect to today's storage devices. My first PC (early 2000s) had an IDE-interface HDD - try finding an IDE cable/adapter nowadays, just 25y later. It can be done, but it's rare. Also, note that digital storage media would need to be protected from geomagnetic disturbances (like the Carrington event) to avoid destruction.
That said - to answer your question - an optical disk or photographic film would probably be ideal, as strange as that sounds. Both technologies rely on optical "engraving" of data (e.g., from my understanding, with a CD/DVD a laser engraves binary code into a reflective layer in the disk that is later decoded based on the pattern of reflection). With film, bombardment with photons alters the chemistry of silver halide crystals in the film such that an image is imprinted. After developing the film, passing light through it reveals the imprint. I'm not sure what the storage capacity would be for film, but I imagine the limitations come down to the scale at which data can be imprinted and later read. Look up the Arctic World Archive - they used a film-based medium to store data in the permafrost layers of the Arctic.
Comment by Cadoc7 at 05/03/2025 at 23:58 UTC
4 upvotes, 1 direct replies
What existing (non-theoretical) storage medium would best survive a duration of thousands of years under ideal conditions
Stone tablets.
There is no digital storage hardware that would survive a millennium, much less several. Tape is the longest-lasting standard medium we have, and you generally want to replace that every 20-30 years. There are some specialized formats used by archivists that might get you a bit further, but nowhere close to a millennium.
Preserving digital data that long would require a RAID-like system for mutual error correction. That would in turn require nearly constant electricity (you can have outages, but you wouldn't want it off for, say, an entire year), an ongoing supply of hardware to replace failed modules, and technicians to do the replacements. And you'd really want it in multiple sites to protect against disasters (man-made or natural).
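To illustrate the 'mutual error correction' idea (a minimal sketch of RAID-5-style XOR parity, not how any production array is actually implemented): any single lost block can be rebuilt from the survivors, which is exactly why the scheme only keeps working as long as someone is around to swap out failed hardware.

    # Minimal sketch of XOR parity: one parity block lets you rebuild
    # any single missing data block from the remaining ones.
    from functools import reduce

    def xor_blocks(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [b"GATTACA!", b"CCTAGGAT", b"TTTTAAAA"]  # equal-sized blocks
    parity = xor_blocks(data_blocks)                       # stored on a separate device

    # Simulate losing block 1, then rebuild it from the survivors plus parity.
    rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
    assert rebuilt == data_blocks[1]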
Could our modern standard NTFS/EXT4 disk formatting structure and our UTF encoding be reverse engineered without a priori knowledge of our language and alphabetic system?
This question assumes that they can read the bytes in the first place. Just building compatible hardware would be a monumental achievement for some kind of alien (or even far-future) archeologist. It is hard to overstate how many abstraction layers there are in computing, even for stuff as relatively low-level as a file system implementation. Just reading from a disk is a complex interplay between the OS, the CPU, the motherboard, RAM, and even the controller in the hard drive, with each layer of hardware having its own (usually multiple!) protocols to talk to the other pieces of hardware. It would be a major, maybe unsolvable, challenge just to get to the point where you can start reverse engineering the contents if you didn't have a starting spot already. It's not something you can stumble through - the modern ecosystem is a teetering pile that was haphazardly tossed together across decades of mutual bootstrapping, and it's a miracle any of it actually works. Reverse engineering from first principles might be the work of centuries and might never succeed.
Ignoring that part, UTF on its own? Kinda. You would treat it like any other unknown language. In the same way that ancient languages can have meanings guessed from context clues, you could guess that a given byte sequence means something specific. But it would be very, very difficult and nowhere near exhaustive. Most ancient language studies benefit from extra context - the Rosetta Stone, paintings, carvings, oral traditions, etc. - that allows the connection between, say, a hieroglyph and a picture of bread. Those contexts generally aren't available when you're working with digital data - it's all just bits. And UTF makes things harder by containing multiple alphabets, non-printable characters, variable-length characters, modifier characters, non-language characters, and so on. ASCII would be easier because of its much smaller character set and regular format, but even then it would be rough.
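One structural hint a hypothetical future decoder would have: UTF-8 is self-synchronizing, so its lead/continuation byte pattern is statistically detectable long before anyone knows what the code points mean. A small sketch of that structure (my own illustration, not something from the comment above):

    # Classify UTF-8 bytes by their high bits. The regular lead/continuation
    # pattern is the kind of structure you could spot statistically without
    # knowing what any code point actually encodes.
    def classify(byte):
        if byte < 0x80:
            return "single-byte sequence (ASCII range, 0xxxxxxx)"
        if 0x80 <= byte <= 0xBF:
            return "continuation byte (10xxxxxx)"
        if 0xC0 <= byte <= 0xDF:
            return "lead byte of a 2-byte sequence (110xxxxx)"
        if 0xE0 <= byte <= 0xEF:
            return "lead byte of a 3-byte sequence (1110xxxx)"
        if 0xF0 <= byte <= 0xF4:
            return "lead byte of a 4-byte sequence (11110xxx)"
        return "never appears in valid UTF-8"

    sample = "DNA ≈ 3.1 Gbp 🧬".encode("utf-8")
    for b in sample:
        print(f"{b:#04x}  {classify(b)}")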
Simple file systems might be possible, but the more complex the file system, the harder it would be. Again, being able to read our hardware would be a massive challenge in itself, and it is even more relevant here because file system formats cannot be divorced from hardware. Most modern file systems treat an SSD and an HDD differently - HDDs prefer contiguous physical data locations (de-fragging is the process of moving files around to maximize file contiguity), while SSDs don't care and can freely shard a file across a billion cells. That said, file system formats are much more regular and precise, built for a specific purpose, so someone who knew what the structure was intended for might have some success. The hard job would be distinguishing the file system's metadata from the data of the files it stores. There would be a lot of obstacles, though, and I would not expect total success.
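To make the metadata-versus-data point concrete, here's a toy on-disk layout (entirely made up - nothing like real NTFS or ext4): a magic number, a couple of offsets and lengths, and then the file contents they describe. Discovering that the first bytes describe the later bytes is exactly the step a reverse engineer would have to make.

    # Toy "file system" image: a tiny superblock followed by one file's data.
    # The superblock fields (magic, version, offset, length) describe the data
    # bytes - the metadata/data split a reverse engineer would have to discover.
    import struct

    image = (
        b"TOYFS\x00"                  # magic number
        + struct.pack("<HI", 1, 22)   # version, offset where file data starts
        + struct.pack("<I", 6)        # length of the single stored file
        + b"\x00" * 6                 # padding up to the data offset
        + b"GATTAC"                   # the file's actual contents
    )

    magic = image[:6]
    version, data_offset = struct.unpack_from("<HI", image, 6)
    (file_len,) = struct.unpack_from("<I", image, 12)
    data = image[data_offset:data_offset + file_len]

    print(magic, version, data_offset, file_len, data)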