The website doesn't seem to mention that several of the papers on the filesystem won best-paper awards at major conferences. In particular, Optimizing Every Operation in a Write-Optimized File System won the best-paper award at FAST '16.
Also, if you're interested in learning more about B^\epsilon trees, here's a talk given by Rob Johnson a few years ago at Microsoft Research:
BetrFS: A Right-Optimized Write-Optimized File System
https://www.youtube.com/watch?v=fBt5NuNsoII
In general, I think it's really cool that there is a file system that exists today (i.e., BetrFS) that uses data structures which _didn't_ exist 25 years ago. It's a great example of theoreticians and systems researchers working together.
> here's a talk given by Rob Johnson a few years ago at Microsoft Research: BetrFS: A Right-Optimized Write-Optimized File System
https://www.youtube.com/watch?v=fBt5NuNsoII
Interesting talk, I wonder how much of the perf. advantage diminishes in a finished, production-ready implementation though.
Comparing an 80%-complete R&D prototype mule against crash-resilient posix-compliant production filesystems is basically never a fair perf. comparison.
You might find that just implementing rename and hard-links properly alone is going to kill your perf. since you dispensed with on-disk inode equivalents.
Nice to see people poking at these issues nonetheless, Linux needs better filesystem options.
On the other hand, I can easily imagine a lot of applications that don't care about full POSIX compliance and are perfectly happy to trade handling of some obscure feature for improved performance.
Renames and hard-links are not obscure features.
And most of the existing production filesystems offer myriad mount options for tailoring performance vs. crash-resilience/POSIX-compliance to the application. Which was honestly another aspect of the talk that was somewhat lacking: what journaling modes were used? barrier/nobarrier? Was the comparison even made equivalent to what betrfs achieves? We don't even know if a betrfs instance can successfully mount after a mid-write hard reboot.
Maybe a viable option for single-application container images. On the one hand they offer tight control over which functions are used (to allow for missing features), but they could also target exactly one FS and be optimized for it.
Containers don't control the FS that they're written to. A container image is in the tar format, and at runtime the underlying FS is defined by the host, which is why containers only run on hosts with union filesystems.
The reason I posted it was that I saw it in the FAST21 playlist:
https://www.youtube.com/watch?v=6KueHK9i8lE
As one of the authors of this project, first off, I appreciate the interest.
For those who are curious, our initial goal is indeed to build a PoC and understand whether the data structures actually deliver, in a realistic implementation, the performance gains one might expect on paper. I see a long arc from a new idea to a production-quality implementation, and several iterations of increasingly thorough evaluation and hardening.
Our current prototype is not production-ready; this is a long-term goal, but we appreciate how much work this is. More of our focus at the moment is on exploring other ways these algorithmic techniques may be useful in a storage system or how to address current problems---i.e., understanding the best way to design such a system before trying to build a production-quality version. Each of our papers has yielded significant overhauls to the design.
We would also consider it a success if other file systems adopted any ideas from our papers, or a new file system were designed by someone else that adopted these techniques.
The commenters are right that there is a gap between the phase when an idea is exciting new research, fundable via grants, and the "maturing" phase of the prototype. I will hasten to say that the NSF has been supportive of maturing this system, for which we are most grateful. Nonetheless, like many projects, we could use more resources, and I would be happy to engage in constructive conversations out-of-band about how to address this gap.
Since you're answering questions here, what is the impact of the patents on the fractal tree for other implementations? Are other projects legally allowed to implement their own fractal/B^epsilon trees?
Looking forward to the results of your work, good luck and thank you!
(This comment was originally a reply to
https://news.ycombinator.com/item?id=29404038
but I've detached it so that more people will see it; the other thread has been moderated since it's a tedious flamewar.)
Way back in my final year undergrad project, I put together a LSM-based filesystem that was write-optimised. I even thought that I had invented the notion of an LSM tree, but the original LSM tree paper pre-dated my invention by 3 years. I applied to take the idea and run with it for a PhD afterwards, but no joy.
The criticism of LSM in the FAQ, that it can't have as good read performance, is perhaps a little over-egged. A fair proportion of the work I did on my project was on how to optimise the read performance. The biggest problem with an LSM tree was working out which level of the hierarchy your entry was in, which involves looking in one, then the next, until you find it. When your data is larger than your RAM, this becomes a disc access for each level. I was working on structures that could answer that question while being very small, so they were more likely to fit in RAM.
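For illustration, a minimal sketch of that lookup problem, assuming one small in-RAM membership structure per level (modern LSM stores such as LevelDB use Bloom filters for this; the structures described above may have differed, and a Python set stands in for the filter here):

```python
# Sketch: an LSM point read naively costs one disk access per level,
# because the entry could live in any of them. A small in-RAM filter
# per level lets most levels be ruled out without touching the disk.

class Level:
    def __init__(self, entries):
        self.data = dict(entries)     # simulated on-disk sorted run
        self.filter = set(self.data)  # small in-RAM membership structure

def lsm_get(levels, key):
    disk_reads = 0
    for level in levels:              # newest level first
        if key not in level.filter:   # answered from RAM: no disk access
            continue
        disk_reads += 1               # this check would be one disk access
        if key in level.data:
            return level.data[key], disk_reads
    return None, disk_reads

levels = [Level({"b": 2}), Level({"a": 1}), Level({"c": 3})]
print(lsm_get(levels, "c"))           # (3, 1): found with one simulated read
```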
The other difference between an LSM and a B-epsilon tree is that with an LSM, the merging is done in bulk as a single operation, whereas with a B-epsilon tree it is done on a node-by-node basis as the node buffers fill up. Therefore an LSM could potentially perform more of its housekeeping in long sequential disc operations than a B-epsilon tree, which is likely to have a more random-access pattern.
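A toy sketch of that node-by-node flushing, under simplifying assumptions (a real B-epsilon tree keeps nodes and buffers on disk, and typically flushes only to the child with the most pending messages; here every buffered message is pushed at once):

```python
# Sketch: B-epsilon-tree housekeeping happens per node as buffers fill,
# rather than merging a whole level at once as an LSM compaction does.

BUF_MAX = 4  # tiny buffer for demonstration

class Node:
    def __init__(self, children=None):
        self.buffer = {}                  # pending insert messages
        self.children = children or []    # (split_key, node) pairs, sorted

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) > BUF_MAX and self.children:
            self.flush()

    def flush(self):
        # Push each buffered message down to the child owning its key range.
        for key, value in sorted(self.buffer.items()):
            for split_key, child in self.children:
                if key <= split_key:
                    child.insert(key, value)   # may recursively flush
                    break
        self.buffer.clear()

leaf_a, leaf_b = Node(), Node()
root = Node(children=[("m", leaf_a), ("~", leaf_b)])
for k in "qbfzca":
    root.insert(k, k.upper())
print(leaf_a.buffer, leaf_b.buffer, root.buffer)
```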
Random access patterns don't matter so much for SSDs, which is where LSMs make the most sense, if I recall correctly.
This is not completely true.
An LSM does not just convert random writes into sequential writes.
It also reduces the number of write IOs when the dataset is larger than memory.
With a B-tree, inserting a random entry requires reading B-tree pages from disk until the right page is found, then writing the updated page (4KB) back to disk.
With an LSM, if you are adding a new entry that is 300 bytes, you only need to append 300 bytes to the top-level file on disk.
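Rough arithmetic on that claim, using the illustrative numbers above:

```python
# Back-of-envelope write amplification for one random 300-byte insert.
PAGE = 4096          # bytes written by a B-tree leaf update
ENTRY = 300          # bytes actually inserted

btree_write = PAGE   # read-modify-write of the whole page
lsm_write = ENTRY    # append to the top-level run

print(f"B-tree writes {btree_write / ENTRY:.1f}x the entry size")  # ~13.7x
# Caveat: LSM compaction rewrites each entry again at every level later,
# so total LSM write amplification grows with the number of levels --
# but as sequential, batched I/O rather than random 4 KiB writes.
```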
Right. So it's more a problem of data dependency than of the access being random/sequential. E.g. if an LSM needed to do one large sequential read and then jump around in RAM the same way, it would still largely have the same basic problem, no?
With an LSM (leveldb, rocksdb) it's just a different tradeoff.
LSM writes are faster than B-tree writes (half the write amplification),
while B-tree reads are faster than LSM reads.
The reason is that with a warm cache a B-tree requires at most one disk read per query, since the internal nodes will be in cache, but with an LSM you need one read per level.
This advantage disappears if you don't have enough memory to cache all the internal nodes.
Another advantage of an LSM is that it has only half the space amplification of a B-tree.
This is because every level except the top contains long runs of sorted data that compress very well. In a B-tree, pages are small, most pages are not full, and you need to store metadata in each page.
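Putting illustrative numbers on the read-side claim (level count assumed for the sketch; seven is RocksDB's default maximum):

```python
# Illustrative point-read cost with a warm cache (numbers assumed).
levels = 7           # e.g., RocksDB's default maximum number of levels
btree_reads = 1      # internal nodes cached in RAM; one read for the leaf
lsm_reads = levels   # worst case: probe one sorted run per level
print(btree_reads, lsm_reads)  # 1 vs 7
# Bloom filters pull the LSM number back toward 1 for point lookups,
# which is why real LSM stores keep a filter per sorted run.
```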
If you want this to succeed, I suggest you change the name.
This is like starting a band named "The Beetles". Word about your band will never spread, because people who hear the name in passing will subconsciously read or hear this as "The Beatles". When you talk to strangers about whether they've heard of "The Beetles", even if you put in the extra effort of "no, I mean Beetles with two 'E's", in most cases they will nod their head and say yes, because the association in their head to "Beatles" has already been made and they will simply assume you or they have the spelling wrong.
The effect is that they'll not mentally register that there's something new here, so they won't put time into learning about it. Even if they tried, search results for "The Beetles" will auto-correct to "The Beatles".
Looks very promising. Anybody tested it yet? The benchmarks look phenomenal!
https://www.betrfs.org/faq.html
Even if it's far from production-ready for important data, I can see its immediate uses for certain kinds of software, where the disk is used as a large scratch pad for example. Lots of random writes are common in photogrammetry on large datasets, where I imagine BetrFS could be used during compute and the final output stored on ZFS.
I'd be very cautious about the benchmarks. For example, betrfs was measured performing 1000 4-byte writes into a 1GB file. It isn't clear whether there were any sync operations - there certainly wasn't a sync after each write, although there might have been a sync after the whole set of 1000. That speed up is a simple characteristic of a filesystem that is log-structured (so it is writing those 1000 events as a single sequential disc access) and doesn't store data in 4kB blocks (so it doesn't have to load the other 4092 bytes in the block before writing it). The filesystem I wrote in 1999 for my undergrad project would have done the same thing. One of the benchmarks I wrote for my system showed exactly the same amazing performance benefit. (My benchmark had me generate a tree of a thousand small files in a ridiculously short time - ext2 thrashed all over the disc doing the same thing.) Unfortunately it is unrealistically optimistic because that isn't a write pattern that is going to happen very often. Usually each small write will have an fsync after it. Unless you actually have a thousand writes without a separating sync, then this speedup isn't going to be realised.
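A minimal sketch of the two write patterns at issue (a hypothetical microbenchmark, not the one used in the paper; it assumes `path` is a pre-created ~1 GiB file on the filesystem under test):

```python
import os, random, time

def small_writes(path, n=1000, size=4, sync_each=False):
    fd = os.open(path, os.O_RDWR)
    t0 = time.monotonic()
    for _ in range(n):
        off = random.randrange(0, 1 << 30)
        os.pwrite(fd, b"x" * size, off)
        if sync_each:
            os.fsync(fd)   # one forced disk operation per write
    if not sync_each:
        os.fsync(fd)       # a single sync: the FS is free to batch everything
    os.close(fd)
    return time.monotonic() - t0

# A log-structured FS can turn the sync_each=False case into one sequential
# write; with sync_each=True every filesystem pays per-write latency.
```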
I'm struggling to see how the find/grep benchmark could possibly have such a fantastic performance benefit for betrfs, given the fact that all those filesystems are effectively reading a tree or known-location structure. The only conclusion I can reach is that maybe the betrfs test had a hot cache and the others didn't. I could possibly be persuaded if betrfs keeps all its metadata in a small easily-cached part of the disc, but there are disadvantages to that too. I don't think this test is valid.
It seems like you may be jumping to conclusions a bit prematurely. The paper (
https://www.cs.unc.edu/~porter/pubs/fast15-final.pdf
) is very explicit that they start with a _cold_ cache. They also go into detail for why they do well on grep. As I understand it (but I'm not an expert), betrfs's advantage here comes from the fact that it stores files lexicographically by their full names (and metadata), meaning that related files are stored nearby each other on disk. This gives better locality than what you would get with a standard inode structure.
Based on that, it seems like the outcomes of the tests are pretty reasonable.
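A toy illustration of that full-path keying, as I read the paper (my simplification, not BetrFS's actual key schema):

```python
# Toy sketch of full-path keying: file data/metadata keyed by full path
# means a recursive scan (find, grep -r) walks the key space in order.
fs = {
    "/home/a/notes.txt":      b"...",
    "/home/a/project/main.c": b"...",
    "/home/a/project/util.c": b"...",
    "/var/log/syslog":        b"...",
}
# Everything under /home/a/project/ is contiguous in sorted key order,
# so on disk those reads are (mostly) sequential rather than inode-chasing:
for path in sorted(fs):
    if path.startswith("/home/a/project/"):
        print(path)
```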
I'll concede on the hot cache suggestion. Storing the files lexicographically is an interesting thing - it means that grep/find (or anything else that reads through the files/directories in order) would perform well. But this makes the test to some extent contrived to specifically run fast on this particular system.
I do agree that this kind of filesystem mechanism should give good performance benefits. But in the general case they won't be quite as fantastic as these benchmarks make out.
Please note that the benchmark sources are also available, e.g.;
https://github.com/oscarlab/betrfs/blob/master/benchmarks/mi...
Indeed, sounds very interesting. However, from their github [1]:
> NOTE: The BetrFS prototype currently only works on the 3.11.10 kernel.
This is a tad limiting; hopefully they will port it to a more recent kernel...
[1]
https://github.com/oscarlab/betrfs
It's also currently stacked on top of ext4, and the tree data has to sit on some other filesystem as well. So promising design, but quite a long way from ready for production.
Not only that, but they need some fun patches...
> Our design minimizes changes to the kernel. The current code requires a few kernel patches, such as enabling direct I/O from one file system to another. We expect to eliminate most of these patches in future versions.
I prefer half-baked projects that are honest about their status over overpromised vaporware, personally
How large are those random writes?
For video editing something like XFS performs very well.
The reason is that you usually read a whole image at once and each image is very big.
Wasn't Tokutek's fractal tree (now part of Percona) open source _but_ patented?
edit: yes, it seems [0]
[0]
https://github.com/Tokutek/ft-index/blob/master/PATENTS
Be-trees were invented in 2003, predating the Tokutek patent: see citation [1] in
http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf
. I think Tokutek was originally not aware of Be-trees and only discovered the similarity later.
This paper (from 2015) seems to be saying that:
"(...) The BΔ-tree has since been used
by both the high-performance, commercial TokuDB
database [4] and the BetrFS research file system [5]. (...)"
First citation from the paper links to
http://perso.ens-lyon.fr/loris.marchal/docs-data-aware/broda...
Is this the paper you're referring to as prior art for Bε-trees?
Yeah, though I'm not a lawyer and I have not personally studied the Tokutek patent in detail. I did consult at some point on this with somebody experienced in database-related software patents, but please do not take legal advice from me :).
My understanding is that Be-trees are not covered by a valid patent, either because of clear prior art or because Tokutek patented a different data structure. Be-trees are simpler than the original fractal tree afaik (they were not aware of Be-trees at the time of their innovation).
If this filesystem "has comparable random-write performance to an LSM tree", would it be viable to use this filesystem _directly_ as the storage for a key-value store (i.e. to swap out LevelDB/RocksDB for a simple library that just creates each key as its own file, expecting to be backed by this filesystem)?
If not, why not? I'm guessing mainly because of kernel context-switching overhead?
And if that's why, then could use of this filesystem be _made_ competitive with [or better-performing than!] e.g. "LevelDB writing to ext4", if that context-switch overhead was removed? E.g., if it were used by a kernel-mode application (i.e. a unikernel approach), or if the driver itself were moved into userspace as a library, with the expectation that you'd compile it into a single daemon process which would own and have write access to a raw block device?
(I ask because part of my job involves tending to blockchain archive-nodes, and the operational management of LevelDB at scale sometimes makes me want to pull out my hair. A million little 2MB files all in one directory, constantly being created and deleted. If I could 1. work with the keys in those databases directly as a mounted [perhaps read-only] filesystem, and 2. get for free the BetrFS equivalent of Btrfs's incremental subvolume send/receive for them, rather than trying to organize parallel rsync(1) for a million tiny files, those factors alone would be worth dealing with an experimental FS.)
I think the BeTree operation model is simpler than a BeTree-based file system. It might be better to use a BeTree database directly.
Yeah, that's true. Analogous to using LMDB/BoltDB directly instead of using Btrfs.
And I guess, as long as the BeTree library could understand the concept of a non-expandable database file, you could just point it directly at a raw block device as its "database file" and it would be happy.
The only concern I'd have in this case is that userspace database libraries usually don't worry about the possibility of interrupted partial-block writes, since they're usually writing to files on a filesystem, and the filesystem usually handles that possibility for them; while the filesystem itself, or a library working with a raw direct-write block device, _does_ have to worry about partial-block writes.
I think, in the case of raw "everything is a tree" storage (B-tree or Be-tree), you'd only need to ensure that 1. there's a journal for root-node-page offsets, and that 2. there's a separate freelist for root-node pages, not hanging _off of_ the root node, but rather attached to the journal; such that root-node pages are only freed for overwrite once the journal entry for the new page is guaranteed flushed to disk.
Of course, if you were lazy, you could get the semantics of a journal and rootnode-freelist, by making this database library write its root-node pages as regular sequence-numbered files in a directory backed by a real filesystem, relying on the real filesystem to declare those files fsync(2)'ed before it's willing to delete previous root-node-page files; and then considering the effective freelist to consist only of the intersection of the freelists from all currently-visible root-node-page files.
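A minimal sketch of that lazy scheme (file naming and layout here are hypothetical; the point is only the ordering of fsyncs and unlinks):

```python
import os

def commit_root(dirpath, seq, root_bytes):
    # Write the new root-node page as a sequence-numbered file.
    name = os.path.join(dirpath, f"root.{seq:08d}")
    with open(name, "wb") as f:
        f.write(root_bytes)
        f.flush()
        os.fsync(f.fileno())      # new root durable before old ones go away
    dfd = os.open(dirpath, os.O_RDONLY)
    os.fsync(dfd)                 # persist the new directory entry as well
    os.close(dfd)
    # Only now is it safe to free (unlink) older root pages:
    for old in sorted(os.listdir(dirpath)):
        if old.startswith("root.") and old < f"root.{seq:08d}":
            os.unlink(os.path.join(dirpath, old))
```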
Wonder how it compares to bcachefs which, I believe, uses similar data structures.
Bcachefs has gotten a lot more development time and seems close to ready for mainstream use, and this seems like it's much more in very early stages.
It'd be cool to hear a conversation on the overall design of each project from the authors of both, though
Bcachefs seems like it needs the attention of a team, not just one good dev. Since it's not even merged, I guess it's still 5 years out before we can use it...
I think it's already usable in a way that BetrFS is not. Like it can be installed on modern kernels, and there are a handful of people using it as their root filesystem today.
I don't think it being out-of-tree is a huge deal per se. ZFS is also out-of-tree. For use on personal systems, I think the bigger thing is that the on-disk format is not officially stable/permanent yet. But if that comes before the thing is merged to the Linux kernel, I'd be willing to try it on a personal system.
Try it at your own risk, of course, but BCacheFS doesn't look like any extra work to set up on NixOS if you wanna try it there: if you tell NixOS that you wanna use bcachefs it'll just transparently pull in the required kernel for you.
Idk about filesystems development, but I agree that eventually it would be ideal for BCacheFS to have a sizeable development and maintenance team. Maybe in the early stages, though, it's good for it to have the kind of coherence and simplicity required to fit all in one person's head. Time will tell, I guess!
I have used it on NixOS for over a year on my main desktop and NAS box. The experience is .. flaky .. sometimes Kent does not have a kernel version available that is new enough for NixOS, there have been several major breakages where some background tasks spin at 100% CPU forever and the file system slows down to a crawl, and sometimes you need to run fsck from a compat branch to get your fs back into shape. At the moment my desktop is broken because the NixOS config forgot how to unlock my root volume. But when it works, it mostly stays out of my way. I think I will move to ZFS for my desktop; there has been just too much faff with my setup. The claim about there not being any on-disk data loss, IDK; I have read from disk some large media files that were broken .. written during a slow crawl while the fs processes were spinning at 100%. So the jury is still out there.
I totally agree with some parent commenter here that it needs a team to work with Kent. Documentation is almost nonexistent (tho ArchWiki saves the day a little).
Good points, but bcachefs does not have releases (ZFS has versioned releases) or a development team.
(Obviously I'm not comparing anything to Bepsilon - they are irrelevant until implemented as an actual linux filesystem)
Oh yeah. ZFS is mature on a whole different level than BCacheFS, too. As a bystander and potential user, if I have a hope for BCacheFS it's once it makes it into the mainline kernel, it attracts more developers and grows into a community project with versioned releases and all that. I imagine that its author hopes the same.
I'm not super familiar with bcachefs, but from what I can find it seems like it is based mostly on a standard (but I guess very well implemented) B-tree. Am I missing something?
Bcachefs appends log entries into large leaf blocks instead of updating the sorted block data on insert the way a standard B+tree would.
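A cartoon of that idea (my paraphrase in Python, not bcachefs's actual on-disk format):

```python
import bisect

# Append-style leaf updates: new entries go into an unsorted log region
# of the leaf, so an insert never rewrites the sorted area.
class Leaf:
    def __init__(self, entries):
        self.sorted = sorted(entries)  # (key, value) pairs, kept sorted
        self.log = []                  # recent inserts, append-only

    def insert(self, key, value):
        self.log.append((key, value))  # cheap: no rewrite of sorted data

    def lookup(self, key):
        for k, v in reversed(self.log):    # newest entries win
            if k == key:
                return v
        i = bisect.bisect_left(self.sorted, (key,))
        if i < len(self.sorted) and self.sorted[i][0] == key:
            return self.sorted[i][1]
        return None

    def compact(self):
        # Occasionally fold the log back into the sorted region.
        self.sorted = sorted(dict(self.sorted + self.log).items())
        self.log = []

leaf = Leaf([("a", 1), ("c", 3)])
leaf.insert("b", 2)
print(leaf.lookup("b"), leaf.lookup("c"))  # 2 3
```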
One previous thread, if anyone's curious:
_BetrFS: An in-kernel file system that uses Bε trees to organize on-disk storage_ -
https://news.ycombinator.com/item?id=18202935
- Oct 2018 (46 comments)
I wish there was a file system that, when the CPU is unused, trawled the disk for unused files and transparently compressed them.
bcachefs does that, with the background_compression option
Why is it nice to have that behavior as an option in addition to/instead of just trying to compress everything as it's written?
aside: gotta love that the commenter who said they wished for this feature got downvoted before a respected filesystem developer casually dropped by to mention that it was a feature he thought was worth implementing in his cool new filesystem.
aren't the existing forms of transparent compression for filesystems better than that?
such as? Do you have a link?
Last time I looked at this, it was using the TokuDB B-epsilon-tree MySQL code. I wonder how much has changed.
Is the name designed to be intentionally confusing?
Looks like they are starting with the popular BTRFS, making the pun of this being "better," and also alluding to the Bε tree data structure they use.
I bet it's intended to be pronounced "Better Eff Ess."
> Amanda: To clear this up, once and for all: is it pronounced BetterFS or ButterFS?
> Chris: <Grin> Definitely both.
https://web.archive.org/web/20120627065427/http://www.linuxf...
But that's already how a lot of people pronounce Btrfs...
I thought everyone said butterface.
I'm definitely calling it that starting today.
I personally find it really douchey. BTRFS definitely had this first and using such a name in full knowledge of BTRFS is just in poor taste.
They should call it Bepsilon FS, or Bepsi for short.
Bepsi Max for when you need large volume support
Oh no, we're kindling the Bepis vs Conk debate now.
OMAN BEPIS
Letting programmers name their projects was always a mistake.
letting C-level name projects is always a mistake too. "project Crossbow!" (Microsoft, Sun, EU, ...)
After a former colleague of mine, Stuart's rule of system naming: the whizzier the name, the crappier the system. Inside corporates the correlation is uncanny.
for operating systems, it's the opposite: the crappier the name, the better the OS. "Plan 9", "Fiasco", ...
That's the thing with correlations, they work both ways :-)
WSL would like a word with you.
UUID4 or go home?
Is it that bad? There are plenty of technical shortcuts which differ in a single letter; that should not be an issue. Plus one is pronounced _/ˈbiːtriː/_ and the other likely _/ˈbɛtə/_ (or _/ˈbɛtəɹ/_ in the US :)).
I think that this is particularly bad since there are many different pronunciations for btrfs. E.g. Wikipedia says
> Btrfs (pronounced as "better F S", "butter F S", "b-tree F S", or simply by spelling it out)
and I heard all of them in practice (except for spelling it out).
While you can hear the difference for "b-tree F S", the other ones are much harder to distinguish.
Swedish person here, thinking I'll just pronounce it "bee tee arr FS". I mean, that's what it says, and it isn't a tongue warper, so...
Same with SQL: I never got the sequel/whatever pronunciation, I just say the darn letters.
Oh, thx for pointing that out. I was not aware of the other possible pronunciations.
It's the single letter in a relatively complex acronym where the single letter doesn't distinguish the underlying name.
Only if you don't know how to pronounce btrfs.
NOTE: The BetrFS prototype currently only works on the 3.11.10 kernel.
No thank you.
Why is it called ftfs in the kernel? To confuse potential users? I mean it was probably called fractaltreefs before and they just renamed it to jab at btrfs and get publicity? I don't know, but it seems weird to me.
PS I would have to create a completely new system from scratch just to test this, since my systems won't boot with such a prehistoric kernel. Also, many improvements that "recently" went into the Linux kernel will be moot. The last commit was from March... why was this posted now, and is there any interest/activity left?
The README [1] is enough to understand: this is a really funky implementation and does nothing to bring useful functionality into the kernel.
It relies on TokuDB, which is a database server meant for userland, already a complex piece of code, patent-encumbered, probably well tweaked but heavy. It does not port that to the kernel; rather, it reimplements userland interfaces in-kernel. For example, the file-based interfaces TokuDB expects are proxied to files in a different filesystem.
B_epsilon trees are a useful data structure, and they may have a place in filesystems, but it will take a from-scratch implementation to prove it.
Repackaging TokuDB's patented fractal trees with extra duct tape does not address any needs outside of superficial marketing.
[1]:
https://github.com/oscarlab/betrfs/blob/master/README.md
> Last commit was from march... why was this posted now and is there any interest/activity left?
I work in an adjacent research group at UNC, and I can assure you that this is a very active project. Unfortunately, because most venues now use double-blind review, the updated code can't be posted until after the associated paper(s) are accepted.
I'd encourage any potentially interested parties to star/watch the GitHub repo to keep an eye on development. I've seen some very impressive benchmark improvements from work currently in the pipeline.
> PS I would have to create a completly new system from scratch just to test this since my systems won't boot with such a prehistoric kernel.
This is most certainly a use case for running this in a virtual machine.
Using a VM to benchmark disk IO is a whole other can of worms.
When the limiting factor for the IO perf that you're trying to measure with your benchmark is the filesystem, rather than the disk, you don't actually want a real disk backing your filesystem; that would just introduce noise to your measurements. Instead, you want your filesystem to be backed by an entirely in-memory block-device loopback image or some equivalent, which VMs are perfectly capable of (and in fact better at) providing.
Think of it by analogy to e.g. GPU benchmarking: you'd never use anything slower than the fastest CPU you can get your hands on, because you want to benchmark the GPU on its own as a single system bottleneck, not how well the GPU idles when held back by a bottleneck somewhere else in the system.
I don't think you want to benchmark data structures designed to be efficient on high-latency disks using low-latency memory. Many file systems can be efficient with low-latency disks; that's not very impressive. What's impressive is being efficient with high-latency storage.
Who said anything about low latency? The thing you get from a simulated block device is _predictable, controllable_ latency. Latency that can be deterministically replayed. Latency that can be precisely normalized away in the benchmark results, without worrying about inter-batch anomalies (if using separate disks) or the effects of wear (on the same disk.)
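For what it's worth, a quick way to see whether a backing store gives that kind of stable latency is to measure the per-write fsync distribution on the mount under test (sketch; the path is hypothetical, and you'd compare a RAM-backed loopback device against a real disk):

```python
import os, statistics, time

# Measure per-write + fsync latency on a filesystem under test. On a
# RAM-backed loopback device the distribution should be tight and
# replayable, so differences you measure come from the filesystem itself.
def fsync_latencies(path, n=200):
    lat = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    for i in range(n):
        t0 = time.monotonic()
        os.pwrite(fd, b"x" * 4096, i * 4096)
        os.fsync(fd)
        lat.append(time.monotonic() - t0)
    os.close(fd)
    return statistics.mean(lat), statistics.stdev(lat)

print(fsync_latencies("/mnt/test/probe"))  # hypothetical mount point
```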
If it's backed by RAM, I would expect the performance characteristics to differ from SSD or HDD, especially with respect to latency.
That's all fine and dandy, but it only means that there is some good/great research on the topic. Winning best-paper awards doesn't say a thing about the implementation and handling of various edge cases.
Like many research projects, this one will also probably last as long as there is funding. Remember, the goal of PhD students is to publish papers, not develop and maintain software. Thus, without skin in the game, I couldn't trust my data/workloads to such systems.
Please don't take HN threads on generic flamewar tangents. They're repetitive, shallow, and tedious, and usually get nastier as they go along. We're trying for the opposite sort of conversation here.
https://news.ycombinator.com/newsguidelines.html
We detached this subthread from
https://news.ycombinator.com/item?id=29403597
.
The goal of academic research is to explore ideas, which are judged by submitting papers to conferences for review, and to train the next generation of academics (i.e., graduate students) in coming up with ideas, proving them, and then writing up said ideas and the proof that they work well. It is not to create production-quality software, which is an orthogonal set of goals and skills.
The key thing to remember is that THERE IS NOTHING WRONG WITH THIS ACADEMIC PROCESS. I go to the Filesystems and Storage Technology (FAST) conference, where many of these BetrFS papers were published, to harvest ideas which I might use in my production systems, and of course, to see whether any of the graduate students who have decided that the academic life is not for them might come to work for my company[1]. I personally find the FAST conference incredibly useful on both of these fronts, and I think the BetrFS papers are super useful if you approach them from the perspective of being a proving ground for ideas, not as a production file system.
So it's unfortunate that people seem to be judging BetrFS on whether they should "trust my data/workloads to such systems", and complaining that the prototype is based on the 3.11 kernel. That's largely irrelevant from the perspective of proving such ideas. Now, I'm going to be much more harshly critical when someone proposes a new file system for inclusion in the upstream kernel, and claiming that it is ready for prime time, and then when I run gce-xfstests on it, we see it crashing right and left[2][3]. But that's a very different situation. You will notice that no one is trying to suggest that BetrFS is being submitted upstream.
A good example of how this works is the iJournaling paper[4], where the ideas were used as the basis for ext4 fast commits[5]. We did not take their implementation; indeed, we simplified their design for robustness and deployment reasons. This is an example of academic research creating real value, and shows the process working as intended. It did NOT involve taking the prototype code from the iJournaling research effort and slamming it into ext4; we reimplemented the key ideas from that paper from scratch. And that's as it should be.
[1] Obligatory aside: if you are interested in working on file systems and storage in the Linux kernel, reach out to me --- we're hiring! My contact information should be very easily found if you do a Google search, since I'm the ext4 maintainer
[2]
https://lore.kernel.org/r/YQdlJM6ngxPoeq4U@mit.edu
[3]
https://lore.kernel.org/all/YQgJrYPphDC4W4Q3@mit.edu/
[4]
https://www.usenix.org/conference/atc17/technical-sessions/p...
[5]
https://lwn.net/Articles/842385/
Looks like it only works with Linux kernel 3.11?
https://github.com/oscarlab/betrfs/blob/master/README.md
, so it definitely has not been updated for a while.
I am not even sure it wants to be production-ready; maybe it is a playground for ideas.
Indeed, we are behind on releases. We do anticipate a major release, including 4.19 kernel support, in the coming months.
Part of our challenge is that we are also exploring non-standard extensions to the VFS API - largely supported by kallsyms + copied code to avoid kernel modifications. This makes rolling forward more labor intensive, but we are working to pay down this technical debt over time, or possibly make a broader case for a VFS API change.
Extricating something from specific kernel API calls won't be fun. Might be a good learning experience, tho. I may take a crack at this in my spare time (I'm not good at C. At all. So this will be more learning for me, and much less functional).
I hope you are being downvoted for the harshness and not the content.
> Like many research projects, this one will also probably last as long as there is funding. Remember, the goal of PhD students is to publish papers, not develop and maintain software. Thus, without skin in the game, I couldn't trust my data/workloads to such systems.
Sadly true. For-profit companies only care about $$. Academia only cares about publishing to get funding.
Neither option is ideal for developing trusted and user-focused software in the long term. OpenSSL is a good example.
Non-profits really struggle to get funding. Government grants are a mess.
The world really needs a new approach to R&D.
> Academia only cares about publishing to get funding.
That's just not true. To do well in academia you have to be truly invested in your field. You can just about get by if you're only in it for the papers, but it's just like getting by in a job that you're only in for the money. At the end of the day, though, in a world where everyone is forced to be productive or be homeless, there are times when publishing becomes a necessity. This doesn't mean they only care about publishing, though.
It is, however, very much a follow-the-incentives kind of situation. Just as the monetary incentive can bring about plenty of unwanted behaviour, the pride- and publication-based incentives introduce their own flavour of dysfunction.
> To do well in academia you have to be truly invested in your field
I never said the opposite. For individuals it takes a whole lot of dedication.
But academia, as a whole entity, is being forced into the publish-or-die mindset.
Tokutek's fractal tree was quite well known when they built a MongoDB backend on it with record-breaking perf; from what I recall it was patented, and that was the reason people didn't dive into it.
Where did you get the impression that this is the product of PhD students?
Not OP, but the majority of people involved have .edu homepages (stints in industry, still research emphasis) and many of the alums appear to have become alums contemporaneously with the end of their academic career, most of them via Stony Brook, and finally there are a bunch of academic papers with authors clearly acting in their academic capacity (and typically prior to their stints in industry), so, IDK, seems like a reasonable assertion that this has a strong academic emphasis and a lot of the work was done by academic students. Whether it's actually unreliable is a different question, but it seems pretty reasonable to suggest that it's a research project and not a production filesystem.
The sibling comment described it well. In addition, the majority of GitHub commits were made by the people listed in the alumni section, while they were PhD students. There aren't many commits from people listed as current members, and the last significant commits are from the past year.
Nobody is asking you to deploy this in your production system. This is about an experimental filesystem which supports exactly one version of the Linux kernel. It's neat to see progress in this field -- maybe try and learn something new?
And, the way to get production-ready code is to write a kernel module, with hopes that others in the kernel community will pick it up. Linux certainly didn't start out mature, but you're probably using it now.
It is both kind of hilarious and kind of terrifying to see this sort of anti-academic, anti-expert nonsense bleeding into software development.
All your written-in-production, battle-hardened code with no effete book-larnin' algorithms aren't going to run very well without a functional electricity grid.
What is going on? The grandparent comment is merely noting the novelty of a filesystem utilizing a recently invented data structure. The parent is weirdly mentioning how they wouldn't trust a research filesystem for real work (who would...?). Now THIS comment is claiming the parent comment is anti-academic and anti-expert when it's actually mainly raising common concerns about the disconnect between theory and practice (then this comment mentions the electricity grid, as if that's of any relevance??). Just a really strange series of disconnects between the arguments.
The grandparent comment is an example of the kind of "middlebrow dismissal" [1] that isn't really welcome here.
It sets the tone with dismissive snark ("fine and dandy"), then implicitly asserts that the project is not interesting because it's not production-ready.
Of course it's obvious to everyone here that version 0.2 beta software is not production-ready, so obvious that comments to that effect are at best superfluous, at worst annoying.
But its production-readiness is clearly not the focus of the discussion; rather, its novelty and potential are. That's what makes it interesting and worth discussing here.
[1]:
https://news.ycombinator.com/item?id=4693920
I'm extremely pro-academic, but I think you're taking the least charitable interpretation of the parent. While I fully disagree with the parent on the value proposition here, they are quite correct that (at least most) PhDs aren't concerned with implementation problems like corner cases and long-term maintenance. There are of course exceptions, but having worked on quite a bit of academic code, I can say that anecdatally maintainability is not a high priority. It's very much like a typical PoC in a startup.
Why is this relevant? Just looking at the title and abstract, it is clearly among the most implementation-focused computer science papers ever written. It's the paper that accompanies BetrFS 0.2, incorporating that source code (which clearly has to handle edge cases), many measurements and discussions of tradeoffs. What more are you people asking for?
> _looks at extremely practical paper, which rightly won a best paper award, probably for the very reason that it was extremely practical_
> _decides to have the thread descend into a dismissal of the value of best paper awards on the basis that they do not reward practicality_
It's possible to regard the paper as high value, appreciate its value, _and_ recognize that this is not a production filesystem.
To be fair, FAT32 is a production filesystem...
Why was the fact that it's not a production file system even brought up? Was it advertised as a production file system? Does the paper say that it is one? What was the rhetorical purpose of that statement?
I don't think I dismissed the value of the paper. I pointed out that the implementation may not be that good, and that best-paper awards and the quality of the implementation are, most of the time, not correlated.
You dismissed the value of the paper about ten times in a row, barely stopping for breath. Some of the tone is found in which thing goes on what side of the word "but", some in other words, but generally you really messed it up if you wanted to avoid insulting the authors and dismissing the value of research like this. There is an enormous gulf between a comment like yours
> _That's all fine and dandy, but [...] Winning best paper awards doesn't say a thing about the implementation [...] the goal of PhD students is to publish papers [...] I couldn't trust [this]_
and a comment like this
> _This is a really impressive project. Obviously this is deeply academic, but since I am so impressed, I wonder what the plans are for this (or the same idea in a new fs) to reach the kind of commercial quality where I can use it in a production system._
If you were so aware of the general nature of academic research vs. battle-tested implementations, then you would also know that filesystems are so incredibly complicated that the latter invariably takes a big team on the order of 10+ years to create. When you forget this fact and say that a few-years-old implementation probably sucks because it's from academia, you're ignoring that NOBODY could have made it production-ready in that time, not even Microsoft or Apple or Oracle. Why would you criticise it on this basis? Choosing to do that was the biggest dismissal of the value of the work. Instead, you buried what was in effect a compliment (this would be useful for my production systems) under ten layers of insults.
The iJournaling paper was published in 2017 (and like many papers, it took multiple rounds of paper submissions before it was finally accepted; the academic process is rigorous, and many program committees are especially picky).
The iJournaling ideas hit the upstream kernel in 2021 as ext4 fast commits, and no, I don't consider them production-ready yet. If the fast commits journal gets corrupted, it's possible that the file system will not be automatically recoverable, and may even lead to kernel crashes. I'd give it another year or so before it's completely ready for prime time.
But the other reason for the four-year delay between 2017 and 2021 is that I had to find the business justification (and after that, the budget and head count) to invest the SWE time to actually implement the idea. A lot of people want sexy new file system features, but very few people are willing to PAY for them. So part of the job of an open source maintainer is not just to perform quality control and create a technical roadmap, but also to help the developers working on that subsystem make business cases to their respective employers for a particular investment. The dirty little secret is that most people are pretty happy with file systems as they currently exist; the bottleneck is often not the file system, and while certain file system features are _nice_, they very much aren't critical --- or at least not enough that people are willing to pay the SWE cost for them.
> anecdatally
Is this an anecdote vs. data pun? :D
Yes haha, sorry I use it way too much and it's become part of my vocabulary
I understand some of the frustration though. I was trying to do some audio processing work once. Found the paper(s), which promised code available from websites that no longer exist. Dug through the Internet Archive to find the zip files with the Matlab code; managed to tweak it to run with the Matlab version I have; found it works as described with the sample inputs, but crashes horribly on my inputs.
Source code availability for academic papers is important for reproducibility, so other people can run additional experiments and demonstrate that the performance numbers in the paper weren't fudged.
It's not necessarily going to be useful for production use. There are exceptions to this; for example, there are papers where the authors claim that a flash/SSD emulator is suitable for use by other academics to experiment with their FTL ideas, or grab network traces of NFS traffic so they can be replayed to test file system performance using real-world data. In those cases, the point of the paper is to share a tool that can be used by other researchers (as well as the team that created the tool in the first place), and in that case, the code had better d*mned well work. (But even then, there might be buffer overrun bugs in the SSD emulator; which is fine, since the FTL is intended to be provided by an academic researcher, and it is not expected to accept arbitrary inputs including from a malicious attacker.)
I don't know whether the papers in your case were meant to be documentation for code that was meant to be shared, or whether they explored a particular research idea and the code was only meant for that particular purpose. Even if it was the former, there's an awful lot of bitrotted, unreliable "abandonware" on SourceForge or GitHub that can be pretty skanky; that's a problem which is not restricted to academically published papers.
I've been debating where the anti-science behavior stems from. From reasonable people, at least. The best I can come up with is that most reasonable people recognize how the modern age is an information war. Product sales, articles on the economy, articles on politics, even some well-advertised missteps like the sugar industry pushing/funding pro-sugar anti-fat papers way back _(which may or may not be true, but it is a common trope parroted)_.
I assert that all this leads to people being paranoid about information on subjects well outside their expertise. Which is a really scary place to be. The answer seems non-obvious to me, but is likely nuanced, and the public doesn't do well with propagating nuance, in my experience.
I'm really interested in tooling to help disseminate information.
When I encounter it, I feel it's often a hatred of the "doers vs the thinkers"
My career path was 25 years of engineering, before migrating into a hybrid EE/PM role as a natural progression from being "the engineer who knew how to run the project". Once I started learning the more formal approaches to PM, it uncovered an entire world of engineers who have an incredible hatred of any sort of planning, because all planning time is wasted and we should all just be doing.
The parent comment here feels the same way. Hatred towards research because it's all theoretical (I guess?). It seems clear as day that the best approach is a marriage between the two.
I think it's summed up by the mantra "those who can, do. Those who can't, teach."
I suspect "hatred of any sort of planning of any kind" is actually hatred for planning that fails to embody any actual strategy (and is therefor a waste of time because it doesn't help solve any actual problem except maybe alleviate non-technical vips anxiety with false hope). "formal approaches to PM" evokes just that sort of thing in my mind (kpis/goals masquerading as strategy, gantt charts, etc)
When I worked in a university lab that builds stellarators, I learned that misgivings towards researchers are all bullshit. There are engineers and there are pencil pushers. Pencil pushers burn money and bark loudly. Real engineers can plan and execute on time and under budget.
I think one source of anti-science behaviour might be betrayed expectations. I know that I personally have stronger expectations of academia and scientists than of other people. So when someone from academia, a scientist, an engineer betrays these expectations, I feel worse than if a regular person did it. There's a feeling of "If we can't even count on those people, what are we even supposed to do?". For example, at the beginning of the COVID pandemic (around January 2020), I read a lot about it. Lots of very smart people were saying that this could be a big pandemic. I talked about it with a doctor in a non-professional setting who told me to basically not worry about it, that it wasn't going to be anything huge. This time I was right and he was wrong. Was it because I searched more about it? Was it just luck? I don't know. But I know that it made me lose a bit of trust in that person.
I think the origin of this might be on how I (or we) see those people. You're supposed to follow what the doctor tells you, what the scientists tell you. But in a way, since you're supposed to follow what they say, they have some kind of responsibility towards you. And when they say something wrong, it's way worse than when a regular person says something wrong. It's like when you're young and your teacher or your parents are wrong, it's very frustrating.
Your example about the sugar industry is also a great one. Try to understand a bit more about nutrition, and soon you'll hear all kinds of conflicting advice and explanations from very different experts.
I know that personally I have to work on myself and accept that those people are humans, and make mistakes, just like me. But just like telling people to eat less and move more didn't solve the obesity epidemic, I'm not sure that this solution will scale to a large population.
This is interesting, and I've seen a bit of this sort of behavior, too.
Some people seem to confuse expertise for a claim of infallibility, and when some expert gets something wrong, the reaction is to conclude that expert advice is worth no more than the guy on the teevee hawking vitamins and anti-expert bile.
It is a sort of Leveler belief wrapped in a search for an Oracle.
Something you may be more familiar with is people's notion that someone who "knows computers" is familiar with any and every sort of task that involves a computer, whereas in fact this could encompass a wide variety of different skills that each require an individual investment of time.
The same can be said of medicine, which encompasses a very broad set of skills. Your doctor may have been an expert in sports medicine or brain surgery, but that doesn't automatically make him competent in epidemiology. It also doesn't force him to pay attention to current developments in the news, which is likely what informed your opinion. Personally, I found it completely obvious in January that we would be dealing with a crisis because I followed the situation, and I strongly suspect that your doctor friend did not.
There is also the issue of survivorship bias. We worry about many things, and we will absolutely recall the times our worry was justified and forget the times we were mistaken. If Yellowstone ever blows, there will be many people who knew it was just around the corner, and this will be true whether it blows now or in a century, whether or not we have any scientific basis for the thought process.
TLDR: A singular doctor of unknown specialty getting it wrong in January isn't a flaw in science. Science isn't expected to be very good at ensuring a single expert of only tangential expertise gives you the right answer, whereas it is reasonably good at groups sometimes slowly arriving at increasingly correct answers. If you want a more correct answer, consider consulting or reading what several people of relevant expertise who are up to the minute on current information have to say.
Readers of Nassim Taleb have had doubts raised over our blind trust in institutions: their reaction to coronavirus, mainstream nutritional advice (fat is bad and causes heart disease; "The China Study" was written by a nutritional biochemist), papers that p-hack to get published, the replication crisis in psychology, statistics and using it to lie.
There is no paranoia. An allegory: an ivory tower of studies about a language by non-native speakers with PhDs, versus a native speaker who uses it daily but gets no recognition because he doesn't have a fancy diploma or credentials; so someone who has never been to Spain but speaks "proper" Spanish from Spain is more "credible" than the Mexican speaking "improper" Spanish daily.
Academia is to be ignored unless it's relevant. Fritz Haber didn't need the Nobel prize to have real-world effects in nitrogen fixation, helping farmers grow and sustain our population; Obama wasn't more relevant because of his Nobel prize; and Perelman's refusal of the Fields Medal doesn't change his contributions.
Readers of Nassim will also recognize his critiques of academia are mainly targeted at social sciences and similar fields that can only "prove" their findings statistically, where p-hacking and incorrect use of models and wrong distributions and the like result in bad findings passed off as good.
That's not the case with computer science, at least in systems subfields like filesystems, where theories can be implemented in isolation and shown to either work or not.
Disagree. Artificial benchmarks and p-hacking to show good performance exist in CS too; those results are also "proven" statistically. The best result in Geekbench doesn't mean anything to me.
Benchmarks and geekbench are not "where theories can be implemented in isolation and shown to either work or not."
Maybe I don't understand your point; in psychology, for example, sterile lab tests are isolated and can be shown to work or not. Is it not the same idea here?
CompSci theories are essentially mathematical proofs. You create a proof of something, then you build it to test it, to make sure your math is actually correct, and that the theory works in implementation.
Proof of correctness doesn't rely on having a large cohort of test subjects undergoing an experimental trial of some sort, and then interpreting the results with statistical models, distributions, p-values, etc.
I don't know psychology in depth, but if there are similar kinds of proofs without requiring statistical analysis of a large experimental cohort, then I don't think Taleb's criticisms are aimed at those either.
It's the fundamental problem of knowledge - can truth be known via logic and reason, or via empiricism and observation? The answer to both is, sometimes, but with caveats.
Peter Norvig also wrote a good take on all the ways studies using experimental cohorts can go wrong:
https://norvig.com/experiment-design.html
You realize that most of the core software we depend on was built by graduate students, right? Idk why the average programmer assumes that PhDs in freaking computer science can't code. Implementation and edge cases are the easy part; the hard part is design and algorithms. One just requires some focused work, the other requires real skill and intelligence.
Sorry, but this is nonsense. Look at the Chubby implementation and the subsequent paper: implementation and edge cases were the hard part, and they took a lot of skill to get right. The algorithm is important, but labeling one as easy is far away from real-world experience.
I never assumed that PhD students can't code. They can, and they are pretty good at it. My point is that their incentives are in writing papers and running experiments that support the claims in their papers, not producing reliable software. It might be reliable, but mostly it's not. When we use tools built by PhD students, it's usually when there are companies/startups built around them, and that is what I refer to as having skin in the game.
Fair enough, I misunderstood your point then.
_> PhDs in freaking computer science can't code_
I've known people with PhDs in computer science (from a top-tier school) who couldn't code. Their research was all done in Matlab for simulations, modeling a biological process. It was a very specific set of skills that was required. And at the time, this person couldn't have written a web front end to a database to save their lives.
Just because one is good at the theory behind CS doesn't mean they understand software engineering. Similarly, just because one is good at the theory doesn't mean they can't code.
They are two related, but different, skill sets.
> > PhDs in freaking computer science can't code
and many graduates in freaking computer science can't read or write proofs, either
> They are two related, but different, skill sets.
exactly.
I think it would be valuable to enforce more crossover in our educational institutions, though. We should have clearer boundaries between computer science and software engineering, and then also require students (at every level) in each to do some study in the other.
Researchers should be in touch with the concerns and needs of ordinary programmers, and ordinary programmers should be capable of looking at the output of researchers to take good ideas and make them practicable and polished.
Sometimes the disconnect means research effort gets wasted and practical technology lingers on designs that serve it relatively poorly.
But of course software engineering and computer science are distinct and deep specializations.
> average programmer assumes that PhDs in freaking computer science can't code.
Average programmer here. PhDs in computer science can't code.
Ok, it's an overgeneralization. And it's probably based on a flawed sample of job applicants that make it past HR screening to get to me. The base rate of applicants who can't code is disturbingly high, probably around 20%. (Not that high numerically, but given that they've passed pre-screening and have something impressive-sounding on their resume, it's too high.) The rate of applicants with a PhD in CS who can't code is way higher, probably around 60%.
Note that these tend to be fresh graduates. And it even makes sense -- most theses require just enough coding to prove a point or test a hypothesis. In fact, the people who like to code tend to get sucked into the coding and have trouble finishing the rest of their thesis work, which may start out interesting but soon gets way less fun than the coding part. Often such people bail out with an MS instead of a PhD.
(Source: personal experience, plus talking to people I've worked with, plus pulling stuff out of my butt.)
At the same time, many of the _best_ coders I know have PhDs.
> Implementation and edge cases are the easy part, the hard part is design and algorithms.
Hahahaha. <Snarky comment suppressed, with difficulty.>
I agree that design and algorithms can be hard. (Though they usually aren't; the vast majority of things being built don't require a whole lot.) But the entire history of our field shows that even a working implementation Just Isn't Good Enough. _Especially_ when what you're writing is exposed in a way that security vulnerabilities matter.
Though it's a bit of a false dichotomy. Handling the edge cases and the interaction with the rest of the system requires design, generally much more so than people give it credit for. Algorithms sometimes too, to avoid spending half your memory or code size on the 1% of edge cases.
PhD CS scientists shouldn't _have_ to be able to code. They are exploring the _science_ of computing, not the implementation.
Developing queueing theory doesn't make you a great coder for Kafka environments.
Working on file system design and new innovative data structures (a persistence and retrieval environment) has nothing to do with writing kernel drivers.
On the other hand, a lot of SE graduates don't _want_ to code, because they think there should be code-writing tools and frameworks and infrastructure and process.
They'll spend endless hours talking about those things and focusing on their own needs instead of the actual purpose of the code they're supposed to be developing.
I have a feeling that software, along with hardware, has gotten a lot more complicated today than it was 30-40 years ago.
Most production software (esp. low-level stuff like kernels and filesystems) today is written and maintained by people who have that work as their job. I wish it were any other way. Also, what users expect from production software is way different from the situation 30-40 years ago. An operating system _must_ work with different CPUs and GPUs. A bare-bones OS is basically a non-starter. I mean, look at Haiku-OS or any of the other operating system projects; for the most part they have gone nowhere.
A filesystem is also a fairly complicated piece, and what we expect from a filesystem is different. Speed is good, but that is not the only criterion, and I am afraid it does take serious _engineering_ effort (edge cases and all) to get it usable on today's hardware.
I think it's more that people believe (in my opinion rightfully) that good design is a skill which comes with experience. That's why I expect great algorithms and small software from graduate students, and awesome design from established teams working on large-scale problems.
That doesn't really apply here, obviously. The BetrFS team has experienced members.