About DSLs and custom major-modes
Written while listening to “New Millennium” by Dream Theater.
Published: 2021-08-06
As part of the regression suite for a project I’m working on, I designed a simple scripting language (which is not even Turing-complete, by the way) to create specific situations and test how the program responds. I’ve almost finished the interpreter for it, so it’s time to start writing tests. How do you edit a file if you don’t have a proper major mode available? You write one!
A major mode is a lisp program that manages how the user interacts with the content of a buffer. (Friendly reminder that a buffer may or may not be an actual file; things like dired or elpher are major modes after all, but they’re not the kind of modes I’m interested in now.)
Major modes for text files usually do at least three things: fontification (i.e. syntax highlighting), defining a syntax table, and indentation; and probably more, like providing useful keybindings and interactions with other packages.
I’ve never had to deal with fontification or syntax tables, nor realised how difficult indentation can be, so it’s been lots of fun.
The difficulty of writing a major mode seems to be at least proportional to the “complexness” of the target language. In my case, the grammar of the language is dead simple and so the major mode is simple too. cc-mode, on the other hand, is probably at the other end of the spectrum (well, after all it manages C, C++, Java, AWK and more…)
Before describing the elisp implementation, here’s a look at the custom DSL, “nps”:
```
include "lib.nps"

# consts comes in two flavors
const (
	one = 1
	two = 2
)

const foo = "hello there"

# procedures works as expected, … is for the rest argument
proc message(type, ...) {
	send(type:u8, ...) # type casts
}

# it's a DSL for regression tests after all
testing "cooking skills" {
	message(Make, "me", "a", "sandwitch")
	m = recv()

	# asserts comes in two flavors too
	assert (
		m.type == What
		m.content == "Make it yourself."
	)
	assert m.id = 5
}
```
Now let’s jump into the mode implementation.
The elisp file starts with the usual header. I’m enabling lexical-binding explicitly: since Emacs 27 it’s the default for interactively evaluated code, but files still need the cookie.
;;; nps-mode.el --- major mode for nps -*- lexical-binding: t; -*-
I’ll also make use of the rx library to write regexps, so
(eval-when-compile (require 'rx))
First comes font-lock, i.e. syntax highlighting. There are probably different ways of doing this, but I’ll stick with the simplest one: a bunch of regexps.
```elisp
(defconst nps--font-lock-defaults
  (let ((keywords '("assert" "const" "include" "proc" "testing"))
        (types '("str" "u8" "u16" "u32")))
    `(((,(rx-to-string `(: (or ,@keywords))) 0 font-lock-keyword-face)
       ("\\([[:word:]]+\\)\s*(" 1 font-lock-function-name-face)
       (,(rx-to-string `(: (or ,@types))) 0 font-lock-type-face)))))
```
Yes, I got the number of parentheses wrong (multiple times) at first.
This value will later be assigned to the buffer-local font-lock-defaults variable. I’ve not yet wrapped my head around the different levels mentioned in the documentation, but the code seems to work. We’re using rx to build a regexp that matches the keywords and using the face “font-lock-keyword-face” for the matches. The zero is there because the regexp doesn’t have any sub-groups, so the whole match gets highlighted.
The second entry is slightly more complex and interesting. It matches a symbol followed by an open paren and applies the face “font-lock-function-name-face” to it. The regexp has a sub-group (the \\( and \\) bit) that matches only the symbol, and the number 1 tells font-lock to highlight only the first sub-group and not the whole match.
The third one is like the first: it highlights the “types”.
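To see what the sub-group number selects, the function-name regexp can be tried outside font-lock with string-match. This is only a sketch: the names prefixed with my/ are made up for the example and are not part of nps-mode.

```elisp
;; The same regexp used for function names in the font-lock keywords.
(defconst my/nps-funcall-regexp "\\([[:word:]]+\\)\s*("
  "Regexp matching a symbol followed by an open paren (example only).")

(defun my/nps-funcall-parts (line)
  "Return the whole match and sub-group 1 of a call found in LINE.
Hypothetical helper, only for this example."
  (when (string-match my/nps-funcall-regexp line)
    (list (match-string 0 line)     ; whole match, paren included
          (match-string 1 line))))  ; sub-group 1, only the symbol

(my/nps-funcall-parts "message(Make, \"me\")")
;; => ("message(" "message")
```

With 0 the open paren would be painted too; with 1 only the symbol is.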
The syntax table is pure black magic, I can assure you. Nah, just kidding. But it looks like it.
It’s a very important piece of the major mode. Various lisp functions inspect the current syntax table to find out what kind of text point is on. It also interacts with font-lock and various other parts of Emacs.
This is also the part I’m least confident about. Some major modes I’ve seen add explicit entries for the braces and the quotes, others don’t. I’ve decided to be explicit and list all the characters I’m using, just to be sure.
The idea is to specify some properties for each character (or range of characters). These properties are expressed in a very terse notation using a string. To add entries to the syntax table you need to use “modify-syntax-entry”: it takes the character (or range), the string description of the properties and the syntax table.
The format of the specification is better explained in the elisp manual, but the gist is that it is a sequence of characters with a special interpretation. The first character identifies the “class” (punctuation, word component, comment delimiter, parenthesis, …), the second, if not a space, specifies the matching character, and then there are further fields that I won’t use.
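Just to provide an example before showing the code, in a programming language the syntax entry for the character “(” probably looks like "()": the first character, “(”, is the open-parenthesis class, and the second, “)”, is the matching closing character.
The syntax entry for “)” instead will look like ")(", because “)” is the close-parenthesis class and “(” is its matching opener.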
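As a quick check of how descriptors are read, string-to-syntax turns one into the raw pair Emacs stores, and char-syntax reports the class of a character in a given table. A small sketch that can be evaluated in any scratch buffer:

```elisp
;; A descriptor is parsed into a raw (CLASS-CODE . MATCHING-CHAR) pair.
(string-to-syntax "()")   ; => (4 . 41): class 4 is "open", 41 is ?\)
(string-to-syntax ")(")   ; => (5 . 40): class 5 is "close", 40 is ?\(

;; The class of a character can be queried from a specific table.
(let ((st (make-syntax-table)))
  (modify-syntax-entry ?# "<" st)   ; comment starter, as nps uses it
  (with-syntax-table st
    (char-syntax ?#)))              ; => 60, i.e. the character ?<
```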
So, here’s the syntax table for nps in all its glory:
```elisp
(defvar nps-mode-syntax-table
  (let ((st (make-syntax-table)))
    (modify-syntax-entry ?\{ "(}" st)
    (modify-syntax-entry ?\} "){" st)
    (modify-syntax-entry ?\( "()" st)

    ;; - and _ are word constituents
    (modify-syntax-entry ?_ "w" st)
    (modify-syntax-entry ?- "w" st)

    ;; both single and double quotes make strings; the string-quote
    ;; class is written as a double quote ("'" would mean "expression
    ;; prefix" instead)
    (modify-syntax-entry ?\" "\"" st)
    (modify-syntax-entry ?' "\"" st)

    ;; add comments.  lua-mode does something similar, so it shouldn't
    ;; be *too* wrong.
    (modify-syntax-entry ?# "<" st)
    (modify-syntax-entry ?\n ">" st)

    ;; '==' as punctuation (note the table argument: without it the
    ;; entry would go into the current buffer's table)
    (modify-syntax-entry ?= "." st)
    st))
```
Indentation at first doesn’t seem like a difficult thing. After all, when we’re staring at code we don’t have the slightest doubt on how a certain line needs to be indented. Turns out, like most other “obvious” things, that coming up with a program that decides how to indent is not that straightforward.
In my case, fortunately, the logic is pretty simple. The level of indentation is how deeply nested we are in parentheses multiplied by the tab-width (because yes, nps uses hard tabs), with the exception of a closing parenthesis, which gets indented one level less. Take this snippet for instance:
```
proc foo(x) {
	y = bar(x.id)
	assert (
		y.thingy = 3
	)
}
```
The first line, the “proc” declaration, is indented at the zeroth column because we aren’t inside a nested pair of parentheses. The “y” variable is indented one tab level because it’s inside the curly braces. The body of the assert is inside two nested pairs of parentheses, so it’s indented twice. The closing parenthesis of the assert is indented by only one level because of the special case: it should be two, but since it’s a closer we drop one indentation level.
The code for “nps-indent-line” is probably not the prettiest, but seems to work nonetheless:
```elisp
(defun nps-indent-line ()
  "Indent current line."
  (let (indent
        boi-p                           ;begin of indent
        move-eol-p
        (point (point)))                ;lisps-2 are truly wonderful
    (save-excursion
      (back-to-indentation)
      (setq indent (car (syntax-ppss))
            boi-p (= point (point)))
      ;; don't indent empty lines if they don't have the point in them
      (when (and (eq (char-after) ?\n)
                 (not boi-p))
        (setq indent 0))
      ;; check whether we want to move to the end of line
      (when boi-p
        (setq move-eol-p t))
      ;; decrement the indent if the first character on the line is a
      ;; closer.
      (when (or (eq (char-after) ?\))
                (eq (char-after) ?\}))
        (setq indent (1- indent)))
      ;; indent the line
      (delete-region (line-beginning-position)
                     (point))
      (indent-to (* tab-width indent)))
    (when move-eol-p
      (move-end-of-line nil))))
```
The real workhorse is “syntax-ppss”, which tells us how deep in parens we are. A better real-world example is probably the indent-line of go-mode: it’s obviously more complex, but it’s still manageable.
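To make the role of syntax-ppss concrete, here is a small sketch that reports the paren depth at the start of each line of the snippet used earlier. The helper name is made up for the example, and it builds its own tiny syntax table instead of relying on nps-mode being loaded.

```elisp
(defun my/paren-depth-at-line (text line)
  "Return the paren depth at the first non-blank char of LINE in TEXT.
LINE is 1-based.  Hypothetical helper, only for this example."
  (let ((st (make-syntax-table)))
    ;; Same brace pairing as in the nps syntax table.
    (modify-syntax-entry ?\{ "(}" st)
    (modify-syntax-entry ?\} "){" st)
    (with-temp-buffer
      (set-syntax-table st)
      (insert text)
      (goto-char (point-min))
      (forward-line (1- line))
      (back-to-indentation)
      ;; The first element of the parse state is the nesting depth.
      (car (syntax-ppss)))))

(let ((src "proc foo(x) {\ny = bar(x.id)\nassert (\ny.thingy = 3\n)\n}\n"))
  (mapcar (lambda (n) (my/paren-depth-at-line src n))
          '(1 2 3 4 5 6)))
;; => (0 1 1 2 2 1)
```

The two closers on lines 5 and 6 report a depth one higher than the column they belong in, which is exactly why the indent function decrements the level when the line starts with a closer.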
An abbrev table is not strictly needed, but it’s nice to have. I’m using abbrev tables for various languages to automatically correct some small typos (like “inculde” instead of “include”).
```elisp
(defvar nps-mode-abbrev-table nil
  "Abbreviation table used in `nps-mode' buffers.")

(define-abbrev-table 'nps-mode-abbrev-table
  '())
```
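The table above is empty. As an illustration of how it could be filled, this sketch adds the “inculde” typo as an entry; the entry is only an example and is not part of the original nps-mode.el, and expansion while typing assumes abbrev-mode is enabled in the buffer.

```elisp
;; Each definition is a list of the form (ABBREV EXPANSION ...).
(define-abbrev-table 'nps-mode-abbrev-table
  '(("inculde" "include")))

;; Look the abbrev up directly, without typing it into a buffer.
(abbrev-expansion "inculde" nps-mode-abbrev-table)
;; => "include"
```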
Now that we have all the pieces, let’s define the mode:
```elisp
;;;###autoload
(define-derived-mode nps-mode prog-mode "nps"
  "Major mode for nps files."
  :abbrev-table nps-mode-abbrev-table
  (setq font-lock-defaults nps--font-lock-defaults)
  (setq-local comment-start "#")
  (setq-local comment-start-skip "#+[\t ]*")
  (setq-local indent-line-function #'nps-indent-line)
  (setq-local indent-tabs-mode t))
```
nps-mode derives from prog-mode, a generic mode used for programming languages. This way, users can easily define keybindings and options only for programming-related buffers and have a consistent experience. The body of the “define-derived-mode” macro is just some code that gets executed when the mode is activated. There, we set the font-lock-defaults that was computed previously, define comment-start and comment-start-skip so functions like “comment-dwim” (M-;) work as expected, and set up the “indent-line-function”. Then we also enable indent-tabs-mode, because nps uses real hard tabs. That’s it.
Registering this mode for the “nps” file extension ensures that Emacs will enable nps-mode automatically:
```elisp
;;;###autoload
;; \\' anchors the match at the end of the file name
(add-to-list 'auto-mode-alist '("\\.nps\\'" . nps-mode))
```
Sidebar: what are those “autoload” comments? It’s a trick used by Emacs to cheat and not load all the code in a file until it’s needed. Emacs will only evaluate the “add-to-list” and register an “nps-mode” autoload, but won’t evaluate anything else until “nps-mode” is called. The first time “nps-mode” is called, Emacs loads the whole “nps-mode.el” file and then calls “nps-mode” again. This is how Emacs can start so quickly and still load TONS of emacs-lisp files.
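Written by hand, the registration that the cookie stands for looks roughly like the sketch below. The form actually generated by the package machinery differs in details (the docstring, for instance), so take this as an approximation rather than the real output.

```elisp
;; Tell Emacs that `nps-mode' is an interactive command living in the
;; "nps-mode" library, without loading that library yet.
(autoload 'nps-mode "nps-mode" "Major mode for nps files." t)

;; Until the first call, the function cell only holds a placeholder.
(autoloadp (symbol-function 'nps-mode))
;; => t
```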
Major modes usually also define some keys and/or integration with other packages (flymake for example). I’m not going to do either, but it’s still pretty easy. To provide some keys, all you have to do is declare a “$mode-map” variable that holds a keymap, then “define-derived-mode” will take care of enabling it:
```elisp
(defvar nps-mode-map
  (let ((map (make-sparse-keymap)))
    ;; `kbd' is needed to turn the "C-c c" notation into a key sequence
    (define-key map (kbd "C-c c") #'do-stuff)
    ...
    map)) ; don't forget to return the map here!
```
Writing a major mode from scratch this way was really interesting in my opinion. The knowledge of how major modes work and how to write one will probably come in handy in the future, either to write more major modes for (hopefully) real programming languages or to tweak existing ones.
In retrospect, I ended up choosing the hardest possible way to build a major mode. For a project like this, where I’m only interested in basic font-locking, there were at least two other options to choose from:
generic-mode provides an easy, but limited, way to write major modes.
cc-mode is the mode that powers C, C++, Java and (at least) AWK. It’s pretty flexible and it was designed to handle “all” C-like programming languages.
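For comparison, here is a rough sketch of what the generic-mode route could look like for nps. The mode name is made up, the keyword and type lists are copied from the font-lock section, and the auto-mode list is left empty so that it doesn’t compete with nps-mode for the file extension. Indentation is not covered at all.

```elisp
(require 'generic)

(define-generic-mode my/nps-generic-mode          ; hypothetical name
  '("#")                                          ; comment starters
  '("assert" "const" "include" "proc" "testing")  ; keywords
  ;; extra font-lock rules: highlight the type names
  '(("\\_<\\(?:str\\|u8\\|u16\\|u32\\)\\_>" . font-lock-type-face))
  nil                                             ; no auto-mode-alist entries
  nil                                             ; no extra setup functions
  "Generic mode for nps files (example only).")
```

Calling M-x my/nps-generic-mode in a buffer should then give comment and keyword highlighting, which is about as far as this approach goes.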
However, writing nps-mode from scratch was a pleasant experience and I had some fun hacking in emacs lisp. The implementation is also not too bad and still pretty simple, so it has been worth the time.
I’m not sharing the code separately from this post because it’s part of the aforementioned project, which is still being heavily worked on. The code in this post is everything I wrote in nps-mode.el anyway.
Some useful links:
A makefile for Emacs Packages
-- text: CC0 1.0; code: public domain (unless specified otherwise). No copyright here.