2021-11-30 The difference between archiving and record keeping

I’m sitting on the sofa, drinking cold tea. I watched some Star Trek, prepared some sourdough, talked to my sister, wrestled with encrypted emails in Emacs, and now I’m finally ready to write something.

I dislike bots: programs that scour the net in search of information to feed their databases. All the search engines on the web work this way. They vacuum it all up and ask questions later.

Did authors consent to their work being used in this way? Maybe they want visitors interested in the text they wrote to find their pages, sure – but did they also consent to making Google the most privacy-invading data kraken in the world, to making its engineers rich? We exist in a tug of war between publishers of newspapers wanting to be paid by search engines for the tiniest snippets, and Google benefiting immensely from the texts written and media produced by countless individuals. It seems to me that there is currently no way to answer this question. We, as a society, have not found a way to deal with the situation, that is to say, our legal processes have not led to satisfactory answers. In the meantime, the search engines continue to hoover it all up, keep copies of our works, decide who gets to see our work, whose work is silently forgotten, train their algorithms on our works, fill their pockets on our works. We have no say.

I’m not just talking about Google. Look at the web server logs of your sites, search for all the user agents containing the word “bot”. What do you find? Looking at the last 10,000 hits on my site, I see the following bots at the top of the list:

---------------Bandwidth-------Hits-------Actions--Delay
    Everybody       184M      10000
     All Bots        25M       2162   100%    38%
--------------------------------------------------------
    YandexBot      5242K        399    18%    74%    14s
      MJ12bot      2853K        395    18%    98%    13s
      bingbot      6951K        387    17%     7%    15s
    Googlebot      4335K        281    12%    29%    20s
     Applebot      1548K        132     6%     0%    45s
     Neevabot      1232K        123     5%     0%     7s
   SemrushBot       364K        101     4%     1%    58s
DataForSeoBot      1167K        100     4%     4%    59s
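
A tally like this can be approximated with a few lines of Python. This is only a sketch, not the script that produced the table above: it assumes the common “combined” log format, where the user agent is the last quoted field on each line, and the function name `count_bots` and the “first token containing bot” heuristic are made up for illustration. It counts hits only, not bandwidth or delays.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')
# Hypothetical heuristic: the first token in the user agent containing "bot".
BOT_NAME = re.compile(r'[\w.-]*bot[\w.-]*', re.IGNORECASE)

def count_bots(lines):
    """Return (total hits, Counter mapping bot name to hits)."""
    total = 0
    bots = Counter()
    for line in lines:
        total += 1
        agent = USER_AGENT.search(line)
        if not agent:
            continue  # line has no trailing quoted field; counted, not classified
        name = BOT_NAME.search(agent.group(1))
        if name:
            bots[name.group(0)] += 1
    return total, bots

# Made-up sample lines; in practice, pass an open log file instead.
sample = [
    '203.0.113.1 - - [30/Nov/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.2 - - [30/Nov/2021:10:00:05 +0000] "GET /a HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.3 - - [30/Nov/2021:10:00:09 +0000] "GET /b HTTP/1.1" 200 700 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0"',
]
total, bots = count_bots(sample)
```

Calling `bots.most_common(8)` on the result gives the top of the list. Matching on the word “bot” misses crawlers that don’t announce themselves, so the real share is likely higher.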

OK, so I understand Yandex, Bing, Google, Apple. But what about the rest?

MJ12bot: “a UK based specialist search engine used by hundreds of thousands of businesses in 13 languages and over 60 countries to paint a map of the Internet independent of the consumer based search engines”
Neevabot: “the only ad-free, private search engine. Created by ex-Google execs …”
SemrushBot: “Do SEO, content marketing, competitor research, PPC and social media marketing from just one platform”
DataForSeoBot: “level the playing field by providing quality data to SEO enthusiasts and professionals around the world”

And look at the numbers. Over 20% of those 10,000 hits are bots. What a waste. And I’m not even counting the misbehaving bots that I have banned. And this waste is real, no matter the technology. Wherever you go, a search engine is already wasting your bandwidth – for ads, for analysis, for curiosity. They’re just scooping it all up.

The smaller my site and my hosting server, the more vulnerable I am to this resource drain. If my dynamic site doesn’t cache rendered pages, I not only waste 20% or more of my bandwidth, I also start to waste 20% or more of my computing cycles. For other people who don’t have my best interest in mind.

Sadly, the mindset is omnipresent. People are proud of their email archives going back decades, and they don’t realise that they are treating the people who wrote to them in a similar way. It’s not the same as keeping all the bills you’re paying for documentation, because those are sent by corporations, not people. It’s also not the same as keeping the love letters you’ve received, as keepsakes. You’re not going to reread all those emails. All you’re doing is keeping records. And now you can set the record straight when people change their mind. Those records give you power over them.

Of course you laugh at the idea. You’re not an abusive spouse. You’re not a controlling parent. You don’t work for the secret service, the police, or search engines. The power you exert is small. But it’s yet another pinprick. Imagine if somebody kept recordings of all the conversations, going back decades. Would this be a good friend to have? Imagine if somebody kept video recordings of every encounter, going back decades. Would this be a good family member to have? Of course not. It’s creepy.

And yet, we have built our tools such that being creepy is the simplest option. We are trained to archive all our mails instead of deleting them. Who knows, perhaps you’re going to need them to one-up your friends one day. Google is certainly going to use them to build that profile of yours.

This is happening because writing programs is hard. They are full of bugs. It’s easier to keep them simple. Here is that email. Do you want to save it, or do you want to delete it? The memories of our conversations don’t work that way. I remember the awkward words of that early morning in a coffee house where I confessed my love to my wife. I think it took me an hour until I said it. I remember the awkwardness, two or three sentences, and the rest is a blur. Memory fades. Some things stay: words, emotions, pictures, when important enough, they stay. But forgetting is hard. What to forget? How to determine what is going to be boring? What programmer would want to make such decisions? This is why our programs don’t forget stuff. It’s too difficult.

And so we keep tabs on everybody, because disk space is cheap and nobody knows we’re keeping records on them.

And it’s not just email, of course. Do you keep your chat logs? Back up your conversations in those chat messengers? Sure, maybe they are end-to-end encrypted. But you also never ever delete them unless you’re weird. And worse, if you are using technology where – in the name of security – each message is based on the previous one, you can’t delete the previous messages! Most programmers experience this when working with git. If they “change the history” by rebasing their code on a different commit, all their commit hashes change, and those who kept the “original” are now disconnected. You can disown this other history, but in a way, you can never really delete it unless everybody who has a copy agrees to delete it as well.
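
To make the “each message is based on the previous one” point concrete, here is a toy hash chain in Python. It is a simplified model, not git’s actual object format: the function name `chain` and the way the previous hash is mixed in are assumptions made for illustration. Each entry’s hash covers its own text plus the hash before it, so editing or removing an early entry changes every hash that follows.

```python
import hashlib

def chain(messages):
    """Return one hash per message; each covers the text and the previous hash."""
    hashes = []
    previous = ""  # the first entry has no predecessor
    for text in messages:
        digest = hashlib.sha256((previous + "\n" + text).encode("utf-8")).hexdigest()
        hashes.append(digest)
        previous = digest
    return hashes

original = chain(["first", "second", "third"])
# "Changing history": only the first message is edited ...
rewritten = chain(["first (edited)", "second", "third"])
# ... yet every later hash differs too, so the two histories are disconnected.
diverged = [a != b for a, b in zip(original, rewritten)]
```

Anyone still holding `original` has a complete, self-consistent history. Nothing in `rewritten` can make that copy go away.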

Now, you can say “isn’t this how copying of stuff happens online?” Well, it does – because our software is written that way. And if it is written to forget, then it’s very simple: you can delete all your old status messages on Mastodon, but it’s hard to implement the kind of forgetting that our brains perform. Just because it works that way doesn’t mean that it’s good. We built computers to be this way. We chose to keep them that way. Every one of us enjoys the power and control of having a copy. Except of course somebody is always more powerful than we are, and has more data about us than we have about them. Who watches the watchers? Certainly not us!

When we talked about these issues in the wiki days, what stayed with me was the power of “Forgive and Forget”. And on the other end of the spectrum, there are the “Record Keepers”. As you can tell, I don’t like record keepers. I was recently reminded of all of this after exchanging a few emails with Greta Goetz, whose talk I had seen at the Emacs Conference 2021. She mentioned Henry Zhu’s interviews on Maintainers Anonymous with Eric ’omnigamer’ Koziel where they talked about the difficulty of knowledge discoverability and she said that “wikis generally do seek to archive knowledge” and pointed out how “Forgive and Forget” might go both ways. Good point!

Omnigamer: Rather than making an account, jumping on a forum and posting their thoughts, a lot of people will find it easier to make a Discord and that’s going to be the place to go for all of the discussion … but it really does take a hit to discoverability. Only people basically in the know would know to get to that Discord, or [to] even have lasting records of things that were discussed there is not as straightforward, but such is what it is. … Henry: So it’s interesting to see how we’re all moving back and forth, what’s convenient, what’s helping new people, and also this idea of archival in general, just making sure the things that we talk about are saved so that people in the future can see the context. – MA 1: Omnigamer on Speedrunning as Research

This is exactly it. We live somewhere on the intersection between privacy and “data is toxic” on the one hand, and record keeping on the other hand. Sometimes I am challenged by people: what about archives? What about future historians? Our cultural legacy? Remember all those movies rotting away because copyright scares everybody until nobody dares restore them? Copyright is destroying our cultural legacy! Record keeping is not archiving.

Archiving is not about “keeping everything.” It’s about selecting the things to keep for the future. What kind of uses do we envision for the information that is kept? What are future historians going to care about? Then keep just that. At least that’s how the state-run archives work around here. You “present” files on topics you think are worth keeping, and they “choose” the subset they are interested in, and that’s that.

I treat my micro-blogging as ephemeral and my blog as more stable. And yet, I have also deleted stuff from there that I feel is wrong. As I said on that Record Keeping page: “I never read through years of tweeting history! This only benefits your enemies, never your friends. I want to expire my toots. We can always write a blog post about the good stuff.”

2017-04-27 Record Keeping

Record Keeper (Meatball Wiki)

Forgive and Forget (Community Wiki)

Forgive and Forget (Meatball Wiki)

Greta Goetz’s Digital Garden

#Bots #Philosophy #Cryptography #Forgetting

Comments

(Please contact me if you want to remove your comment.)

I very much agree on this.

– hyperreal 2021-12-01 03:32 UTC

---

Archives cut both ways—for all the ways they can be used to abuse someone, they can also be used to absolve someone. I started keeping copies of all my outgoing email back in college when a friend of mine started doing that to protect himself from an academically abusive professor. And yes, I’ve gone through old emails from time to time and find absolute gems (like the time another friend and I were playing word games with our emails).

It’s also hard to figure out what future historians will consider “important”. Today, archeologists find 2,000 year old garbage “valuable” since oftentimes, that’s all we have left of previous civilizations. I’m sure you don’t consider your mundane life worth preserving, but ask someone 1,500 years from now.

Yes, someone can dig up an old USENET post I made in the mid 1990s. So what? Yes, I wrote that. If what I said is bad, then fine, I was stupid then, I’ve changed since. If you can’t accept that, then that’s on you, not me. Yes, I know, easier said than done, but I’ve long since adopted that mentality.

I learned the hard way that information moves in mysterious ways and that if you don’t want the “wrong” people to get it, don’t publish it (and by “publish” I mean “write it down on the Internet”—*nothing* is private on the Internet, and if you believe otherwise, you don’t understand how it works).

– Sean Conner 2021-12-01 20:25 UTC

---

True, but at the same time, the way journalists comb through social media of the politicians they want to write about suggests that we as a society have not learned how to deal with it. If you can say that you don’t care, I suspect it’s because you are safe from such attacks.

Additionally, I don’t think that things that are possible ought to necessarily be legal or endured. The law places many limits on us, to protect us from each other. The fact that enforcement is weak in our online lives is simply a failure of the executive, not a misunderstanding of our reality. Our societies are constructed.

– Alex 2021-12-01 20:47 UTC

---

It seems like you come down to “it’s illegal unless otherwise noted,” whereas I come from the “it’s legal, unless otherwise noted.” I wonder if that comes down to the legal systems we grew up in—remember, the US started off distrusting a royal monarchy and threw off regressive governance, and we’ve never really trusted government since then.

– Sean Conner 2021-12-01 22:41 UTC

---

Assorted thoughts follow. I think the idea of software selectively deleting stuff to attempt to simulate memory is very user-unfriendly. I absolutely do not think that simplicity is the reason this hasn’t been implemented. A data store needs to have a predictable retention behavior because otherwise people will falsely rely on it. As such, deleting things with clear criteria is already a feature of some apps; notably Snapchat, Messenger, and Signal all have temporary chat features that delete messages after configurable time periods.

As Sean says above, it’s not possible to predict what will be important in the future. Whether that’s what’s important to you, as your life or your person/values may change in myriad ways, or what’s important to society. Archives work to extend the lifespan of finite things, so they must be selective out of practicality. There’s no practical reason we must be artificially scarce in our selection of our own record keeping.

Creepiness is as much social convention as it is acts/details. We know things are saved. Many people adjust themselves accordingly, with little inconvenience. Instead of thinking saving it is creepy, we consider going digging creepy.

That said, I would welcome more control over expiration of data. I’m not sure it’s within the domain of the law, though.

– Tom 2021-12-02 01:00 UTC

---

The Right to be Forgotten and the General Data Protection Regulation show, as far as I am concerned, that it can certainly be within the domain of the law to control how data is kept, and for how long it is kept. But I believe that irrespective of the law, there is a moral imperative to not accumulate the data.

I think that keeping data is creepy because “we keep it but never use it” is not an argument, except for the NSA, which argues that it can automatically keep records on everybody and that this is OK as long as nobody is looking at them. The power asymmetry remains because the potential is there. Nobody cares about your smoking of marijuana except if you’re president of the United States. That is the best example of what I mean.

The argument about the uncertainty of what future historians will have a use for is strange. Yes, we care about pottery shards because we have nothing else. But that is not what I’m talking about. We have endless amounts of data. Having it brings with it social cost, keeping it brings with it environmental cost, defending against its abuse brings with it cost. Perhaps, if people were interring data stores of Usenet archives and copies of the Internet Archive, I’d agree. But obsessively keeping everything does not help future historians. It inundates them in the trivial. And yes, these days archaeologists investigate the trivial traces of our ancestors because not much else is left of them. Where it is, we do care about it: buildings, books, manuscripts, papyri, frescoes, all forms of art – the carefully selected artifacts. And yes, sometimes also the wax tablet of a Roman student learning to write, but mostly we’d love to find more books.

I do think forgetting is both hard and necessary. Whenever people can’t find an email, the answer is not only that search needs to be improved but also that spam and all the other unnecessary mails need to be deleted. If you are getting automated notifications and deleting them, you’re starting down the path of choosing what to keep. When people at the office announce their birthday and cake at the cafeteria, the birth of their kids, and you delete those mails, one more step. Good morning messages, confirmations, into the bin they go. Reminders, confirmations? Delete. The organisation of a party years ago? Forget it. And so on. Wouldn’t it be great if there were software that would forget for you? The problem is that it doesn’t know what is important to you, because your emotions and your memories are not tied to it. And so you don’t trust it, whereas the working memory defines who you remember to be, so there is no alternative.

– Alex 2021-12-02 06:17 UTC

---

Going back over the replies again (interesting points!) I’d like to comment on my own record keeping. I used to keep all my mail. When I deleted my mail from Gmail, I kept a copy for myself. These days I’m trying to be selective about the incoming mail I keep. I keep receipts of things purchased and bills to pay, of course. I also keep mail I receive in a separate folder for a while when a topic is hot, but then I delete that folder when it’s done. I keep a copy of my sent mail. In my sent mail, I try to write as if I were writing a letter: reminding the recipient of something we talked about, and then commenting on it, instead of citing whole paragraphs or appending whole emails. This top-posting kind of email writing is what I use at the office, but at home I try to model it along letter writing. And I do keep a copy for myself. Who knows, perhaps some of the ideas might make it into a blog post one day. Then again, I hardly ever read those emails. At least I don’t think they give me power over anybody by remembering the things they said. I only keep what I said.

As for social media: I delete my Mastodon statuses after a while, but I keep an archive of the messages I wrote, and the posts that I liked. The point would be that I don’t make a copy of my whole feed, just of the things I might want to come back to (and I don’t trust the Mastodon instances to stay up for many years).

I think being selective about it is the key, at least for me.

– Alex 2021-12-02 07:47 UTC