💾 Archived View for gemi.dev › gemini-mailing-list › 000354.gmi captured on 2023-11-04 at 12:43:37. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Gemini Archiving and WARC

📧 Messages: 12
🗣️ Authors: 9
📅 First Message: 2020-09-01 23:43
📅 Last Message: 2020-09-04 19:45

Charles E. Lehner <cel (a) celehner.com>

📅 Sent: 2020-09-01 23:43
📧 Message 1 of 12

Hi Gemini List,

Has anyone thought about, or implemented, archiving of Gemini content/traffic?

WARC (Web ARChive)[1] is a standard format used for web archiving. It uses
text headers for metadata like in HTTP and email. It looks to me like WARC
could be adapted for Gemini. The WARC spec supports multiple URI schemes,
although it doesn't specify any other than http/https, ftp, and dns[2].
Bespoke formats could also be used, of course, or just downloading files
wget-style, but using a standard format could allow for interop with "the
WARC ecosystem"[3].

Archive Team[4] has also worked on archiving non-HTTP protocols like FTP[5] and Gopher[6].

I think there is an opportunity for people to maintain high-quality
archives of Gemini content, like what the Internet Archive[7] and
archive.today[8] do for the HTTP(S) Web. Now is a good time to start, while
many of the original Gemini hosts[9] are still online.

Regards,
Charles E. Lehner

[1] https://en.wikipedia.org/wiki/Web_ARChive
[2] https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#ftp-scheme
[3] https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[4] https://www.archiveteam.org/
    https://en.wikipedia.org/wiki/Archive_Team
[5] https://www.archiveteam.org/index.php?title=FTP
[6] https://www.archiveteam.org/index.php?title=Gopher
[7] https://en.wikipedia.org/wiki/Internet_Archive
    https://archive.org/
[8] https://archive.today
    https://en.wikipedia.org/wiki/Archive.today
[9] gemini://gemini.circumlunar.space/servers/
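
To make the "WARC could be adapted for Gemini" idea concrete, here is a minimal sketch of wrapping one raw Gemini response in a WARC 1.1 style record. The function name `build_warc_record`, the example URI, and especially the `Content-Type` value are assumptions for illustration only; the WARC spec does not define a media type for Gemini traffic, so a real archiver would have to agree on one.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, raw_response: bytes,
                      fetched_at: datetime) -> bytes:
    """Wrap one raw Gemini response (header line plus body) in a WARC record."""
    warc_date = fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", warc_date),
        ("WARC-Target-URI", target_uri),
        # Hypothetical media type: nothing is registered for Gemini in WARC.
        ("Content-Type", "application/gemini; msgtype=response"),
        ("Content-Length", str(len(raw_response))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Record layout: version line, named fields, blank line, block, two CRLFs.
    return head.encode("utf-8") + raw_response + b"\r\n\r\n"


# Example: a made-up response from a made-up capsule.
raw = b"20 text/gemini\r\n# Hello\n"
record = build_warc_record("gemini://example.org/", raw,
                           datetime(2020, 9, 1, 23, 43, tzinfo=timezone.utc))
```

Storing the response bytes untouched, status line included, mirrors how WARC keeps full HTTP responses, so nothing about the original exchange is lost.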


acdw <acdw (a) acdw.net>

📅 Sent: 2020-09-02 01:23
📧 Message 2 of 12

On 2020-09-01 (Tuesday) at 23:43, Charles E. Lehner <cel at celehner.com> wrote:

> Has anyone thought about, or implemented, archiving of Gemini content/traffic?
>
> [...]

I personally think this is a great idea, but I know some might not be so 
on-board with it. I'm thinking of solderpunk's post (in their gopherhole, 
actually):
gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/the-individual-archivist-and-ghosts-of-gophers-past.txt

So is there a way to opt-out of archiving for publishers? Some in the 
community might want to know about it, though I personally am of the 
opinion that if you've published it, it's now the property of the commons.

-- 
~ acdw
acdw.net | breadpunk.club/~breadw


alex wennerberg <alex (a) alexwennerberg.com>

📅 Sent: 2020-09-02 01:40
📧 Message 3 of 12

Quoting acdw (2020-09-01 18:23:22)
> So is there a way to opt-out of archiving for publishers? Some in the
> community might want to know about it, though I personally am of the
> opinion that if you've published it, it's now the property of the commons.

Perhaps via robots.txt? I block ia_archiver from pages that I don't want
archived on http(s), for example.

Alex
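
For reference, a rough sketch of how a well-behaved archiver could honor such a rule, using Python's standard `urllib.robotparser`. The robots.txt contents, the `/private/` path, and the helper name `may_archive` are made up for illustration; only the `ia_archiver` user agent comes from the message above.

```python
from urllib.robotparser import RobotFileParser

# Example rules: keep ia_archiver out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def may_archive(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


blocked = may_archive(ROBOTS_TXT, "ia_archiver", "/private/diary.html")
allowed = may_archive(ROBOTS_TXT, "ia_archiver", "/index.html")
```

This only works for crawlers that choose to check the file; nothing in it can stop one that does not.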


Tom <tgrom.automail (a) nuegia.net>

📅 Sent: 2020-09-02 05:08
📧 Message 4 of 12

On Wed, 02 Sep 2020 01:23:22 +0000
acdw <acdw at acdw.net> wrote:

> So is there a way to opt-out of archiving for publishers? Some in the
> community might want to know about it, though I personally am of the
> opinion that if you've published it, it's now the property of the
> commons.
> 

Once you publish something to the internet there is no retracting it.
This is one of the first things I was taught the first time I used the
net. Alongside never using your real name on the net unless you're
publishing something.

-- 
 _______________________________________ 
/ Concentrate on th'cute, li'l CARTOON  \
| GUYS! Remember the SERIAL NUMBERS!!   |
| Follow the WHIPPLE AVE. EXIT!! Have a |
| FREE PEPSI!! Turn LEFT at th'HOLIDAY  |
| INN!! JOIN the CREDIT WORLD!! MAKE me |
\ an OFFER!!!                           /
 --------------------------------------- 
\
 \
   /\   /\   
  //\\_//\\     ____
  \_     _/    /   /
   / * * \    /^^^]
   \_\O/_/    [   ]
    /   \_    [   /
    \     \_  /  /
     [ [ /  \/ _/
    _[ [ \  /_/


Brian Evans <bme (a) mailfence.com>

📅 Sent: 2020-09-02 22:28
📧 Message 5 of 12

I can appreciate the instinct to archive, but I fall into the camp that 
would generally prefer that it not be done (while respecting that with
the way the technology is built, there is not a reasonable way to
prevent it).

I think a great tragedy of the internet is the inability to be forgotten
and to retract and change and not have your past mistakes dictate
your present. I don't have a technical solution for that in gemini, or
for that matter in gopher... but think that community norms and
expectations should develop around it organically (which is
taking place currently in this discussion and will continue to do
so over time). I definitely support the commons for articles,
information, and "knowledge"... but hesitate to extend that to what
are sometimes the only personal outlets that some people may have.

I think if something like `robots.txt` were to be used for this
purpose I would recommend doing it at the directory level (and thus
break from how robots.txt works). In gemini many (most?) users are a
part of multiuser systems. If `robots.txt` at the root were used it 
would generally control the whole domain and not allow for
individual users to opt in or out. To that, I would also put in a vote for
an opt-in system rather than an opt-out system (like robots.txt). Opt-in
empowers all users to make choices whereas opt-out is often limited
to those that know to do so and have the technical know-how to do
so.

There are also environmental and energy arguments against full
protocol archiving, though those costs may be small while gemini is
at or around its current size.

Anyway, just a few thoughts.


Sotiris Papatheodorou <sotirisp (a) protonmail.com>

📅 Sent: 2020-09-03 09:17
📧 Message 6 of 12

On Wednesday, September 2, 2020, Brian Evans wrote:
> I think if something like `robots.txt` were to be used for this
> purpose I would recommend doing it at the directory level (and thus
> break from how robots.txt works).

On Wednesday, September 2, 2020, Brian Evans wrote:
> To that, I would also put in a vote for
> an opt-in system rather than an opt-out system (like robots.txt).

Agreed on both points! I was thinking of implementing a personal archiving 
system like the one mentioned by Solderpunk in
gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/the-individual-archivist-and-ghosts-of-gophers-past.txt


Caranatar <caranatar (a) riseup.net>

📅 Sent: 2020-09-04 03:54
📧 Message 7 of 12

This seems like an incredibly cynical and myopic take. It's also
expected that everything on the internet will track you, will be
constantly expanded for the purpose of commercialization instead of user
experience, etc.... Yet Gemini purposefully rejects those notions in
favor of something better. The idea that the same shouldn't apply here
is odd.

-caranatar


Tom writes:

> Once you publish something to the internet there is no retracting it.
> This is one of the first things I was taught the first time I used the
> net. Alongside never using your real name on the net unless you're
> publishing something.


-- 
sent from emacs using mu4e


Sean Conner <sean (a) conman.org>

📅 Sent: 2020-09-04 05:22
📧 Message 8 of 12

It was thus said that the Great Caranatar once stated:
> Tom writes:
> 
> > Once you publish something to the internet there is no retracting it.
> > This is one of the first things I was taught the first time I used the
> > net. Alongside never using your real name on the net unless you're
> > publishing something.
>
> This seems like an incredibly cynical and myopic take. 

  I also think it's an incredibly realistic take.

> It's also
> expected that everything on the internet will track you, will be
> constantly expanded for the purpose of commercialization instead of user
> experience, etc.... Yet Gemini purposefully rejects those notions in
> favor of something better. The idea that the same shouldn't apply here
> is odd.

  Even though Gemini (and gopher to an extent) reject those ideas, it
doesn't mean privacy or control over the content.  I wrote about this last
year:

	http://boston.conman.org/2019/10/29.2
	gopher://gopher.conman.org/0Phlog:2019/10/29.2
	gemini://gemini.conman.org/boston/2019/10/29.2

(take your pick of format)

  I even quote the same solderpunk article (and another one not by
solderpunk) about how they're ... well ... "wrong" is the wrong word here,
but it's close ... perhaps "misguided" is what I'm thinking of.  Information
that is publicly available (and by any measure, most of Gemini is public)
can, and will, travel in mysterious ways, which I discuss in my post above.

  I can find stuff I posted to USENET in 1993 *today*.  I can still find my
first website from 1997.

  -spc (I think I seriously just dated myself ... )


Dr. Otto Skrzyk <drskrzyk (a) tilde.team>

📅 Sent: 2020-09-04 05:23
📧 Message 9 of 12

On Thu, Sep 03, 2020 at 11:54:08PM -0400, Caranatar wrote:
> This seems like an incredibly cynical and myopic take. It's also
> expected that everything on the internet will track you, will be
> constantly expanded for the purpose of commercialization instead of user
> experience, etc.... Yet Gemini purposefully rejects those notions in
> favor of something better. The idea that the same shouldn't apply here
> is odd.
> 
> -caranatar


Calling it myopic is a bit harsh and probably misses a point that you
put forward as a support - one of the selling points of gemini is that
it rejects complexity and some of the concerns of a more commercialized
internet. One of those concerns is the potential for misuse of the
information or infrastructure beyond the intent of the content creator
or host. That or the right to retract that information. 

You'll have to forgive me seeing some irony that someone with a
riseup.net email address would speak against someone putting forth advice
about taking caution in what you post on the internet. Riseup exists
in large part because others share this "cynical and myopic take."

Regardless, the issues being brought up here seem to circle around
content control and archival ethics and less about the protocol. 

-- 
      Dr . Otto Skrzyk
  gemini : gemini://tilde.team/~drskrzyk
     web : https://drskrzyk.tilde.team/
mastodon : @docskrzyk at hackers.town


acdw <acdw (a) acdw.net>

📅 Sent: 2020-09-04 16:43
📧 Message 10 of 12

I agree 100% with Sean's post (http://boston.conman.org/2019/10/29.2) -- 
the act of posting something to a gemini *is* publishing, so it's out 
there -- toothpaste-tube-style. That being said, I think any archiver or 
spider should also respect *robots.txt* files -- though them being opt-in 
vs. opt-out is kind of moot, since spiders gonna spider, you know? It's 
the very nature of the Internet to communicate.

However, I thought Dr. Otto brought up a vv good point as well:

> Regardless, the issues being brought up here seem to circle around
> content control and archival ethics and less about the protocol. 

Inasmuch as gemini is a technical specification/machine protocol, I think 
there's nothing to say about it vis-a-vis archiving. Socially, though, we 
have norms -- which are good to nail down in a nascent community.

-- 
~ acdw
acdw.net | breadpunk.club/~breadw


Brian Evans <bme (a) mailfence.com>

📅 Sent: 2020-09-04 18:43
📧 Message 11 of 12

acdw writes:
> I think any archiver or spider should also respect *robots.txt* files --
> though them being opt-in vs. opt-out is kind of moot, since spiders gonna
> spider, you know?

I think opt-in vs opt-out is definitely not moot. The web largely operates 
on an opt out basis (where there is an option at all). We are at a point 
where we can develop different norms for a different system, and I think we should.

I definitely agree that there is nothing that can be done about spiders 
that do not follow recommended community guidelines, and that when you post 
something that is not behind a client cert requirement or the like, it is public.

However, I do think that using robots.txt for spiders of all sorts is a 
bad idea for gemini and will create less user choice in the long run. 
robots.txt is suggested often because it exists and is there... but it is 
not designed for multi-user systems (the predominant form of system on 
gemini at present) and is explicitly designed to opt you out... meaning 
that if users don't even know that spiders are a thing (as many 
non-technical people do not) then they do not get to have a choice. My 
suggestion is simply about community norms and trying to push, at least 
for spiders that are willing to respect a community standard, an opt-in 
that works at the directory level and can be managed by users rather than 
by system administrators. The idea being that if someone does not have a 
document, let's call it `green-light.txt`, saying yes to various sorts of 
spidering, then a well-behaved spider should ignore content in that directory.

Having said all of that: I agree this is not a protocol issue and the 
conversation is more about philosophical/ethical preferences and could 
be moved over to gemini posts rather than here on the mailing list. So I 
will likely not post more on it here... but maybe I'll write something up 
on my gemlog tonight.
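
A rough sketch of the directory-level opt-in check described above, as a spider might implement it. Only the file name `green-light.txt` comes from this message; the helper names, the example URLs, and the rule that the marker must sit in the same directory as the document (no inheritance from parent directories, no file contents consulted) are assumptions made for illustration.

```python
from typing import Callable
from urllib.parse import urlsplit

MARKER = "green-light.txt"  # name floated above; its format is not specified


def marker_url(url: str) -> str:
    """Return the URL of the opt-in marker in the same directory as url."""
    parts = urlsplit(url)
    # Drop the last path segment to get the containing directory.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}{MARKER}"


def may_spider(url: str, exists: Callable[[str], bool]) -> bool:
    """Opt-in rule: spider url only if its directory carries the marker.

    exists is supplied by the crawler and reports whether a URL resolves.
    """
    return exists(marker_url(url))


# Stand-in for real network checks: alice opted in, bob did nothing.
published = {"gemini://example.org/~alice/green-light.txt"}
alice_ok = may_spider("gemini://example.org/~alice/post.gmi", published.__contains__)
bob_ok = may_spider("gemini://example.org/~bob/diary.gmi", published.__contains__)
```

Because the default answer is "no", a user who has never heard of spiders is left alone, which is the point of opt-in over opt-out.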


Tom <tgrom.automail (a) nuegia.net>

📅 Sent: 2020-09-04 19:45
📧 Message 12 of 12

On Fri, 04 Sep 2020 16:43:42 +0000
acdw <acdw at acdw.net> wrote:

I want to clarify something. What I said does not purely revolve around
inevitable misuse of the data. It is also based on freedom of the user.
The only way you could prevent the user from doing something on his own
machine once the data has been copied over the net is with some kind
of rootkit and spyware under the umbrella term Digital Restrictions
Management. I hope we can all agree that DRM is heinous and a key point
in the downfall of the web. https://www.defectivebydesign.org/

The most you could do is ask someone not to unlist their
archive for some TTL period. I feel this would be a good compromise
between archivists and authors. Archivists are going to archive because
without an immutable content addressable storage back-end like IPFS or
the LoC ARC Resolver everything is fickle and could disappear at any
moment, lost to entropy.
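
As a small illustration of what "content addressable" means here, the sketch below keys each stored document by a hash of its bytes, so the address itself proves the content has not changed. This is a toy in-memory model under assumed names (`ContentStore`, `put`, `get`) using plain SHA-256 hex digests; it is not how IPFS or the LoC resolver actually encode their identifiers.

```python
import hashlib


class ContentStore:
    """Toy in-memory store where the key is the SHA-256 of the content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so re-adding is a no-op.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = ContentStore()
first = store.put(b"# My capsule\n")
second = store.put(b"# My capsule\n")          # same bytes, same address
edited = store.put(b"# My capsule (edited)\n")  # new bytes, new address
```

An edited page gets a new address while the old one keeps resolving to the old bytes, which is why such a back-end is immutable by construction.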

-- 
 ________________________________________ 
/ telepression, n.:                      \
|                                        |
| The deep-seated guilt which stems from |
| knowing that you did not try           |
|                                        |
| hard enough to look up the number on   |
| your own and instead put the           |
|                                        |
| burden on the directory assistant.     |
|                                        |
\ -- "Sniglets", Rich Hall & Friends     /
 ---------------------------------------- 
\
 \
   /\   /\   
  //\\_//\\     ____
  \_     _/    /   /
   / * * \    /^^^]
   \_\O/_/    [   ]
    /   \_    [   /
    \     \_  /  /
     [ [ /  \/ _/
    _[ [ \  /_/

