💾 Archived View for dioskouroi.xyz › thread › 25010624 captured on 2020-11-07 at 00:56:55. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Mayastor: Lightning Fast Storage for Kubernetes

Author: botayhard

Score: 41

Comments: 13

Date: 2020-11-06 20:02:09

________________________________________________________________________________

blowfish721 wrote at 2020-11-07 05:48:13:

”This site uses cookies and other tracking technologies to assist with navigation, analyze your use of our products and services, assist with promotional and marketing efforts, allow you to give feedback, and provide content from third parties. If you do not want to accept cookies, adjust your browser settings to deny cookies or exit this site. Cookie Policy”

Great cookie policy there with no easy opt-out.

mgerdts wrote at 2020-11-06 22:30:31:

It's written in rust, on the off chance that is important to anyone on this site.

hhh wrote at 2020-11-07 00:33:55:

If that was in the title it'd be a great bingo card title.

DJBunnies wrote at 2020-11-06 22:47:12:

You just never know.

hardwaresofton wrote at 2020-11-07 04:04:09:

Just want to say for anyone running a cluster of machines (with kubernetes for example) -- OpenEBS provides one of the most approachable and easy to use solutions for storage out there. It scales from hobby grade to enterprise grade in my opinion, and although I can't point you to a 1000+ node cluster I run or anything (maybe OpenEBS can), they offer enterprise editions and support, which is usually enough to qualify as the minimum viable definition of "enterprise grade".

OpenEBS has many different drivers[0] (which is a good sign IMO). I only use one of them, but am planning a post (that may never actually get written) to expand on all of them if/when I can; a minimal usage sketch follows the list:

- Jiva[1] (based on Longhorn, iSCSI-based, bring your own disk driver)

- cStor[2] (backed by OpenEBS, zfs-based; used to be the golden child before MayaStor)

- MayaStor (written in rust, iSCSI/NVMe-oF based, bring your own lower layer)

- zfs-localpv[3]
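
To make that concrete, here's a minimal sketch of how a volume is requested from one of these engines: just a PVC pointing at a storage class. I'm assuming the stock `openebs-jiva-default` class that an OpenEBS install typically creates; the class name may differ by release, so check `kubectl get storageclass` first.

    # Hedged sketch: claim a Jiva-backed volume via the storage class a
    # stock OpenEBS install typically ships ("openebs-jiva-default" here
    # is an assumption).
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: demo-jiva-pvc
    spec:
      storageClassName: openebs-jiva-default
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
    ---
    # Any pod then mounts it like an ordinary PVC.
    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sleep", "3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: demo-jiva-pvc

Swapping engines is then mostly a matter of pointing storageClassName at a different class (cStor, zfs-localpv, etc.), which is what makes trying them side by side so painless.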

Funnily enough, I've been _really_ interested in zfs and btrfs lately, so I have been learning a lot about the field, but cStor didn't work when I first ran it.

I know "written in rust" has become somewhat of a buzzword, but don't forget that it has actual meaning -- it means that they will be able to get the raw performance possible without a language runtime or garbage collection to hold them back. This is critical for "systems" programming, and is the domain rust was built for. When I see rust, I basically think -- this could be as fast as average C (and safer) if you get even just pretty good developers.

Also a note on this -- one of the things that's nice about OpenEBS's offerings is that replicated writes are _synchronous_. This is a problem[4] for raw performance, of course, but it's a massive relief for building resiliency into systems. I'm not a classical sysadmin, but it puts me at ease that I can give up some throughput to know that if one of my servers goes completely kaput, another node can pick up _at the last write_, with zero guesswork on my end. The switchover is of course a little hairy, but this is what OpenEBS smooths out for you.

For this not to be completely an ad for OpenEBS, I do want to note that the other prominent option that scales from hobbyist to enterprise (and arguably more so) is Rook[5], which manages Ceph clusters. The Rook project seems to be branching out to position itself around Cassandra storage and a bunch of other stuff, but the key thing it brought to the table was a mostly hands-off operator for running Ceph[6], which is certified enterprise grade(tm) _FREE_ software. Ceph is basically the linux of the storage appliance world.

Ceph (via Rook) is another really good choice for getting storage up and running on a kubernetes cluster (or whatever else) because it offers similar benefits to OpenEBS's solutions, but goes about it slightly differently. Instead of container-attached storage, the features come at the filesystem layer -- you get a tunable Ceph installation (you should _definitely_ choose BlueStore btw) which gives you the following (a hedged Rook sketch follows the list):

- replication (like the iSCSI approaches of OpenEBS)

- checksumming (like ZFS and zfs-backed OpenEBS)[7]

- striping (like ZFS/btrfs and zfs-backed OpenEBS depending on setup)

- compression (like ZFS)[7]
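
To give a feel for the Rook/Ceph route, here's a hedged sketch of a replicated RBD pool plus a StorageClass. The field names and provisioner string are from memory of the Rook docs and may differ between releases, and the real examples also wire up CSI secret parameters that I'm omitting here:

    # Hedged sketch: a 3x replicated Ceph block pool managed by the Rook
    # operator, and a StorageClass that carves RBD volumes out of it.
    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: replicapool
      namespace: rook-ceph
    spec:
      failureDomain: host
      replicated:
        size: 3
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: rook-ceph-block
    provisioner: rook-ceph.rbd.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      pool: replicapool
      csi.storage.k8s.io/fstype: ext4
    reclaimPolicy: Delete

Checksumming and compression land in the BlueStore OSD layer rather than in the pool definition, hence the "definitely choose BlueStore" advice above.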

Anyway, I don't run any servers that use NVMe, but it's nice to see this article. I hope the people at OpenEBS are still doing well, and I love their entry into the space.

[EDIT] - I totally forgot and should mention: you can obviously _always_ bring your own lower layer, e.g. creating an xfs-formatted ZVOL on top of ZFS on the host and then exposing _that_ device to whatever. An excellent example is a recent talk from Japan which I've mentioned before[8][9], which uses this approach to run Ceph on ZFS.

[0]: https://docs.openebs.io/docs/next/casengines.html

[1]: https://docs.openebs.io/docs/next/jiva.html

[2]: https://docs.openebs.io/docs/next/alphafeatures.html#mayasto...

[3]: https://github.com/openebs/zfs-localpv/

[4]: https://github.com/longhorn/longhorn/issues/1242#issuecommen...

[5]: https://rook.io/docs/rook/

[6]: https://docs.ceph.com/en/latest/architecture/

[7]: https://ceph.io/community/new-luminous-bluestore/

[8]: https://speakerdeck.com/takutakahashi/ceph-on-zfs

[9]: https://news.ycombinator.com/item?id=24784596

mgerdts wrote at 2020-11-07 04:34:25:

"For the input you can choose which type of protocol you want through the storage classes. At the output, let's say, you can do iSCSI, NVMe, and local. But the iSCSI work was there mostly for us to get us going because by default we simply use NVMe Over Fabrics or local..."

https://youtu.be/_5MfGMf8PG4?t=2866

EDIT: Thanks for sharing your experience. I'm looking at Mayastor, but remembered this quote from a video I watched a while back suggesting that while iSCSI works, it is probably not the way to get the most from this software.
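
For reference, protocol (and replica count) selection is expressed as StorageClass parameters, roughly like the sketch below. I'm going from the Mayastor docs of around this time, so treat the parameter names and provisioner string as approximate:

    # Approximate sketch: a Mayastor StorageClass asking for NVMe-oF as
    # the transport and 2 synchronous replicas. Parameter names and the
    # provisioner string follow my reading of the early docs and may
    # have changed since.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: mayastor-nvmf
    parameters:
      repl: "2"
      protocol: "nvmf"
    provisioner: io.openebs.csi-mayastor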

dilyevsky wrote at 2020-11-06 21:41:31:

Trying to filter out all the marketing speak, since there is almost zero info on the actual technology of the Mayastor engine. Is this thing basically an NVMf array orchestrator for k8s? So if your NVMf target crashes, your PV goes bye-bye?

mgerdts wrote at 2020-11-06 22:37:08:

According to

https://blog.mayadata.io/openebs-mayastor-0.3.0-lands-soon-k...

> But the nexus is also capable of performing transformations on the I/O passing through it. For example, for reasons of availability and durability, we might wish to maintain more than one copy of the data contained by a PV. The nexus supports this by dispatching multiple copies of any writes which are received for the volume, to replicas hosted on other Mayastor Storage Nodes within the cluster (the actual replica count is defined by the Volume’s Storage Class). Only when all replicas have acknowledged their writes will the nexus signal completion of the transaction back to the consumer. That is to say, policy-based workload protection in Mayastor is based on synchronous replication.

Presumably life can go on so long as a replica survives.

dilyevsky wrote at 2020-11-06 23:49:38:

Ah, I see, so it's kind of like network raid1... I'm guessing this still implies that you can't write while a backup target is offline?

mgerdts wrote at 2020-11-06 23:58:54:

Later in the same blog post:

> MOAC’s response to the available replica count of a Volume falling below the desired count defined within its Storage Class is to instruct a Mayastor Storage Node to create a new, “empty” replica, and to share it with the degraded nexus. The introduction of a new replica to the nexus will cause it to start a rebuild process, bringing it into synchronization with the other replicas. This process can successfully complete without having to suspend workload I/O to the affected PV.

gravypod wrote at 2020-11-06 22:19:04:

Your PV would likely still exist but the `mount` command on the pod's cgroup will fail and you'll get into a crashloop with a NotReady (I think).

dilyevsky wrote at 2020-11-06 23:50:42:

Yes, by "bye bye" I meant your workload writes or reads are going to fail/stall and all sorts of bad things happen.

spicyponey wrote at 2020-11-07 04:30:31:

From the above: "This process can successfully complete without having to suspend workload I/O to the affected PV"