💾 Archived View for gemini.complete.org › many-to-one-with-filespooler captured on 2024-12-17 at 10:00:46. Gemini links have been rewritten to link to archived content


Many-To-One with Filespooler

Since Filespooler[1] is an ordered queue processor by default, it normally insists on a tight mapping between the sequence numbers in job files and execution order in a queue.

1: /filespooler/

This poses challenges when you want to send data from multiple sources to a single destination.

There are two main strategies for doing this:

1. On the receiving side, maintain a separate queue for each origin. This is quite easy to do, since Filespooler queues are very lightweight. It can easily scale to thousands of origins, and preserves ordering of incoming data from each origin. It is usually the right answer.

2. Dump all the incoming data into a single queue, and use a different strategy to sort this out.

As mentioned above, option 1 is suitable for most people. For instance, when receiving backups on a backup server, you probably need the incrementals from each machine to be applied in order, and therefore option 1 is appropriate. That setup should be fairly straightforward.
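As a rough sketch of option 1, the receiving side could keep one queue directory per origin and process each in turn. The function names, the `FSPL` variable, and the one-directory-per-origin layout below are assumptions made for illustration; only `fspl queue-init -q` and `fspl queue-process -q` come from Filespooler itself.

```shell
# Path to the fspl binary; overridable, which is an assumption of this sketch.
FSPL="${FSPL:-fspl}"

# Create one queue per origin under a base directory.
# Usage: init_origin_queues BASE ORIGIN...
init_origin_queues() {
    base="$1"; shift
    mkdir -p "$base"
    for origin in "$@"; do
        "$FSPL" queue-init -q "$base/$origin"
    done
}

# Run the same command against every per-origin queue under BASE.
# Each queue keeps its own sequence numbers, so per-origin ordering is preserved.
# Usage: process_origin_queues BASE COMMAND [ARGS...]
process_origin_queues() {
    base="$1"; shift
    for q in "$base"/*/; do
        [ -d "$q" ] || continue
        "$FSPL" queue-process -q "${q%/}" "$@"
    done
}
```

For instance, `init_origin_queues queues host1 host2` followed by a periodic `process_origin_queues queues hd` would handle two origins independently. A failure in one queue does not stop the loop from visiting the others.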

So the rest of this page will discuss option 2.

The need for non-sequential processing with Option 2

Because option 2 combines data from many machines into a single queue, and each machine has its own seqfile (counter file) that we can't assume is kept in sync with the others, the sequence number becomes fairly useless. We may have duplicate sequence numbers, sequence numbers will certainly arrive out of order, and so forth.

Since we can't rely on sequence numbers, we have to give up the guarantee that packets will be executed in the order they were created regardless of the order in which they are delivered. One reason for this is that `fspl queue-process` may run when packets 1, 2, and 5 have been delivered, but not 3 and 4. Since the sequence numbers are no longer useful, in the best case, we can assume `fspl queue-process` will (attempt to) execute packets 1, 2, and 5. It'll have to get 3 and 4 on subsequent runs once they're delivered.

This would be appropriate for workloads that don't care about ordering (for instance, calculating sha256sums of input), or for workloads that do care about ordering but can detect out-of-order delivery and return error codes themselves (for instance, many uses of `zfs receive` or git[2] with gitsync-nncp over Filespooler[3]). It is not suitable for workloads that care about ordering but are unable to handle out-of-order delivery (for instance, incremental backups based on `tar`).

2: /git/

3: /gitsync-nncp-over-filespooler/

Non-sequential processing in Filespooler

There are two ways to accomplish this in Filespooler:

1. Manually, with shell scripts or something similar, which call `fspl stdin-process` in whatever manner they desire.

2. Automatically, using `fspl queue-process --order-by=Timestamp`.

Because option 2 is easiest and is suitable for most purposes, this page will focus on it.

The `--order-by=Timestamp` option changes the behavior of `queue-process` in several ways:

1. Instead of strictly using incrementing sequence numbers to find job files to process, it will sort the list of all job files according to the creation date embedded in each and use that. For many workloads, this is a useful replacement for guaranteed strict ordering.

2. The next job id (sequence number stored in the seqfile) associated with the queue is neither read nor incremented.

3. Since sequence numbers are not used at all, this option is capable of correctly processing queues that have multiple job files with the same sequence number. (In the default case, Filespooler will detect this condition and abort with an error.)

Example

Let's set up a single receiving queue and two separate senders, each with its own seqfile. For this example, we'll put everything on one machine, but the senders could just as well be on separate machines.

$ fspl queue-init -q queue

Let's send some packets to it.

$ echo date | fspl prepare -s sender1 -i - | fspl queue-write -q queue
$ echo date | fspl prepare -s sender1 -i - | fspl queue-write -q queue
$ echo date | fspl prepare -s sender1 -i - | fspl queue-write -q queue
$ echo date | fspl prepare -s sender2 -i - | fspl queue-write -q queue
$ echo date | fspl prepare -s sender2 -i - | fspl queue-write -q queue

OK, that will have sent 5 packets to the queue: three from sender 1 and two from sender 2. The packets we created will have sequence numbers 1, 2, 3, 1, 2 -- duplicates! Let's see what we have:

$ fspl queue-ls -q queue
ID                   creation timestamp          filename
1                    2022-05-10T05:26:37-05:00   fspl-ebe7f35a-0613-4680-ae34-057bc74bd555.fspl
1                    2022-05-10T05:26:13-05:00   fspl-9b00f6b5-b1ae-4ed8-9656-062750400abe.fspl
2                    2022-05-10T05:26:29-05:00   fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl
2                    2022-05-10T05:26:39-05:00   fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl
3                    2022-05-10T05:26:29-05:00   fspl-6b756d7a-7c00-4ade-af2e-94f173572a63.fspl

Indeed, look at that. Since `queue-ls` sorts by ID, we can see the duplicate IDs. `queue-process` will, by default, refuse to work with this:

$ fspl queue-process -q queue hd
Error: Attempted to process "fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl" with seq 2, which was already seen in "fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl"

If you look at this in more detail, you'll see it didn't actually process anything; it aborted before it could even start processing things:

$ fspl --log-level debug queue-process -q queue hd
DEBUG prepare_seqfile_lock{path="queue/nextseq"}: filespooler::seqfile: Attempting to prepare lock at "queue/nextseq.lock"
DEBUG open{path="queue/nextseq"}: filespooler::seqfile: Attempting to acquire write lock
DEBUG open{path="queue/nextseq"}: filespooler::seqfile: Attempting to open file "queue/nextseq"
DEBUG scanqueue_map{queuedir="queue" decoder=None}: filespooler::jobqueue: Reading header from "queue/jobs/fspl-ebe7f35a-0613-4680-ae34-057bc74bd555.fspl"
DEBUG scanqueue_map{queuedir="queue" decoder=None}: filespooler::jobqueue: Reading header from "queue/jobs/fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl"
DEBUG scanqueue_map{queuedir="queue" decoder=None}: filespooler::jobqueue: Reading header from "queue/jobs/fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl"
Error: Attempted to process "fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl" with seq 2, which was already seen in "fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl"

So, clearly, we need timestamp-based processing! Let's do it.

$ fspl --log-level debug queue-process -q queue --order-by=Timestamp hd
DEBUG prepare_seqfile_lock{path="queue/nextseq"}: filespooler::seqfile: Attempting to prepare lock at "queue/nextseq.lock"
DEBUG open{path="queue/nextseq"}: filespooler::seqfile: Attempting to acquire write lock
DEBUG open{path="queue/nextseq"}: filespooler::seqfile: Attempting to open file "queue/nextseq"
DEBUG filespooler::jobqueue: Reading header from "queue/jobs/fspl-ebe7f35a-0613-4680-ae34-057bc74bd555.fspl"
DEBUG filespooler::jobqueue: Reading header from "queue/jobs/fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl"
DEBUG filespooler::jobqueue: Reading header from "queue/jobs/fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl"
DEBUG filespooler::jobqueue: Reading header from "queue/jobs/fspl-6b756d7a-7c00-4ade-af2e-94f173572a63.fspl"
DEBUG filespooler::jobqueue: Reading header from "queue/jobs/fspl-9b00f6b5-b1ae-4ed8-9656-062750400abe.fspl"
DEBUG filespooler::cmd::cmd_exec: Proparing to execute job 1 from Some("fspl-9b00f6b5-b1ae-4ed8-9656-062750400abe.fspl")

There's a lot of output here, but if you run this experiment yourself, you'll see it did indeed sort by timestamp: jobs 1, 2, 3, 1, 2, in order of creation. Nice!
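One way to double-check that ordering without running any jobs is to sort the `queue-ls` listing shown earlier by its timestamp column. This is only an illustration using standard tools, not a description of how Filespooler works internally; the `listing` variable and the `sorted_ids` function are names made up for this sketch.

```shell
# Rows copied from the `fspl queue-ls` output above (header row omitted,
# columns separated by single spaces): ID, creation timestamp, filename.
listing='1 2022-05-10T05:26:37-05:00 fspl-ebe7f35a-0613-4680-ae34-057bc74bd555.fspl
1 2022-05-10T05:26:13-05:00 fspl-9b00f6b5-b1ae-4ed8-9656-062750400abe.fspl
2 2022-05-10T05:26:29-05:00 fspl-41a95ea8-f7cb-4ae5-ac9c-94510673d0da.fspl
2 2022-05-10T05:26:39-05:00 fspl-3ae378c1-64c6-4711-945a-91b3afd2bc26.fspl
3 2022-05-10T05:26:29-05:00 fspl-6b756d7a-7c00-4ade-af2e-94f173572a63.fspl'

# Sort on the timestamp column only and print the IDs in that order.
# Plain text sorting is enough here because every timestamp uses the same
# UTC offset; -s keeps the listing order for rows with equal timestamps.
sorted_ids() {
    printf '%s\n' "$listing" | LC_ALL=C sort -s -k2,2 | awk '{ print $1 }' | paste -sd' ' -
}

sorted_ids   # prints: 1 2 3 1 2
```

The job created first is the ID 1 file ending in `...400abe.fspl`, which matches the first "execute job 1" line in the debug output above.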

Putting it to use

Now that we've established this, you can have all your machines that are sending data to the queue just dump things in it, without worrying about ID collisions and such. Filespooler will sort it out and do the right thing!

One thing to note is that commands that take an explicit job ID (e.g., `fspl queue-info -j`) will not work when there are collisions among sequence numbers in the queue. But that's fine; you can use `fspl queue-ls` to get filenames and pipe the data to `fspl stdin-info`.
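A minimal sketch of that workaround follows. It assumes the `queue-ls` layout shown earlier (one header row, filename in the third column) and that job files live under `jobs/` inside the queue directory, as the debug output above suggests. The `info_all` function and the `FSPL` variable are hypothetical names for this example.

```shell
# Path to the fspl binary; overridable, which is an assumption of this sketch.
FSPL="${FSPL:-fspl}"

# Show header information for every job file in a queue, addressing jobs by
# filename rather than by (possibly duplicated) sequence number.
# Usage: info_all QUEUEDIR
info_all() {
    queue="$1"
    "$FSPL" queue-ls -q "$queue" |
        awk 'NR > 1 { print $3 }' |      # skip the header row, keep filenames
        while read -r name; do
            echo "== $name"
            # Feed the job file itself to stdin-info.
            "$FSPL" stdin-info < "$queue/jobs/$name"
        done
}
```

Running `info_all queue` against the example queue would print one block per job file, including both files that carry sequence number 1.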

--------------------------------------------------------------------------------

Links to this note

4: /gitsync-nncp-over-filespooler/

You can use gitsync-nncp[5] (a tool for Asynchronous[6] syncing of git[7] repositories) atop Filespooler[8]. This page shows how. Please consult the links in this paragraph for background on gitsync-nncp and Filespooler.

5: /gitsync-nncp/

6: /asynchronous-communication/

7: /git/

8: /filespooler/

9: /one-to-many-with-filespooler/

In some cases, you may want to use Filespooler[10] to send the data from one machine to many others. An example of this could be using gitsync-nncp over Filespooler[11] where you would like to propagate the changes to many computers.

10: /filespooler/

11: /gitsync-nncp-over-filespooler/

12: /feeding-filespooler-queues-from-other-queues/

Sometimes with Filespooler[13], you may wish for your queue processing to effectively re-queue your jobs into other queues. Examples may be:

13: /filespooler/

14: /parallel-processing-of-filespooler-queues/

Filespooler[15] is designed around careful sequential processing of jobs. It doesn't have native support for parallel processing; those tasks may be best left to the queue managers that specialize in them. However, there are some strategies you can consider to achieve something of this effect even in Filespooler.

15: /filespooler/

16: /using-filespooler-over-syncthing/

Filespooler[17] is a way to execute commands in strict order on a remote machine, and its communication method is by files. This is a perfect mix for Syncthing[18] (and others, but this page is about Filespooler and Syncthing).

17: /filespooler/

18: /syncthing/

19: /filespooler/

Filespooler lets you request the remote execution of programs, including stdin and environment. It can use tools such as S3, Dropbox, Syncthing[20], NNCP[21], ssh, UUCP[22], USB drives, CDs, etc. as transport; basically, a filesystem is the network for Filespooler.
Filespooler is particularly suited to distributed and Asynchronous Communication[23].

20: /syncthing/

21: /nncp/

22: /uucp/

23: /asynchronous-communication/

(c) 2022-2024 John Goerzen