2021-06-14
The tree-sitter ecosystem is divided up across a large number of components, each in its own repository, which can be quite overwhelming at first. This post tries to provide a map of sorts.
Say you're interested in the tree-sitter project, so you decide to check out the "tree-sitter" organization on GitHub, browsing through its repositories to determine how the ecosystem is structured. The list of repositories spills over onto a second page, and you see entries that seem redundant. Why is there both "tree-sitter-python" and "py-tree-sitter"? Are they competing with each other? Is one deprecated?
You might instead decide to check out the project homepage. The landing page lists (as of June 2021) over 40 different programming language parsers that various folks have implemented, as well as a handful of language bindings.
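This, at least, points to an answer. The tree-sitter ecosystem is complicated because when we write a code analysis tool, we want to support different programming languages in two separate, orthogonal ways: the languages that the tool can parse and analyze, and the language that the tool itself is written in.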
That at least explains why "Python support" in tree-sitter might mean two different things. But why have we separated everything out into distinct repositories? The main reason is to make it as clear as possible that all of these pieces are truly independent of each other. There shouldn't be any way for the Python language bindings to influence the design or release process of the Haskell bindings, for instance, nor of _any_ of the language grammars.
True, it adds complexity to the ecosystem, but we've tried to get around this with careful naming conventions, and tree-sitter-specific tooling to make it easy to find and work with whatever pieces you need.
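So, given the above, you will encounter all of the following on your journey: language grammars, the runtime library, language bindings for the runtime library, and language bindings for each grammar.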
You must have a tree-sitter grammar for each language that you want to parse. Each language grammar is typically implemented in its own repository, named "tree-sitter-$LANGUAGE".
There are some exceptions. For instance, the "tree-sitter-javascript" repository lets you parse JavaScript _and_ JSX, although in this case, this is handled with a single grammar that treats "plain JavaScript" as a file that happens to not have any JSX expressions in it. Similarly, the "tree-sitter-typescript" repository lets you parse TypeScript and TSX, though in this case, they're handled with distinct grammars. All of these grammars share enough structure, and are a coherent enough family of languages, that it would be overkill to separate them out further.
The generated parsers only contain some state tables describing the language being parsed. The "meat" of the parsing logic is implemented in the "tree-sitter" runtime library, which each parser depends on. This runtime library is also where tree-sitter's query language is implemented.
The runtime library is implemented in the "tree-sitter/tree-sitter" repository on GitHub, under the "lib/include" and "lib/src" directories.
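The split between per-language tables and a shared runtime can be sketched with a toy example. This is only an illustration of the division of labor, not tree-sitter's actual design: the real generated parsers are C code, and the real parsing algorithm is far more capable than the finite-state recognizer below. The table layout and the names BINARY_NUMBERS, AB_PAIRS, and run are all hypothetical.

```python
# Hypothetical "generated" tables: each language contributes only data.
# A table maps (state, character) -> next state; a missing entry is an error.
BINARY_NUMBERS = {  # accepts one or more binary digits
    "start": 0,
    "accepting": {1},
    "transitions": {(0, "0"): 1, (0, "1"): 1, (1, "0"): 1, (1, "1"): 1},
}
AB_PAIRS = {  # accepts "", "ab", "abab", ...
    "start": 0,
    "accepting": {0},
    "transitions": {(0, "a"): 1, (1, "b"): 0},
}


def run(tables: dict, text: str) -> bool:
    """Shared 'runtime': the same loop drives every language's tables."""
    state = tables["start"]
    for ch in text:
        state = tables["transitions"].get((state, ch))
        if state is None:  # no transition defined for this input
            return False
    return state in tables["accepting"]


print(run(BINARY_NUMBERS, "1011"))
print(run(AB_PAIRS, "aba"))
```

Adding a new "language" here means adding a new table, with no change to run. That is the property the paragraph above describes: the logic lives in one place, and each parser only supplies data.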
The runtime library and each generated parser are implemented in C. Assuming that you aren't writing your analysis tool in C, you will need _bindings_ for the language that you are using. These bindings use your language's FFI mechanism to link in the tree-sitter C code and make it available using more idiomatic constructs.
The Rust and WASM bindings are considered "tier 1", and are implemented directly in the "tree-sitter/tree-sitter" repository.
Rust binding implementation (tree-sitter/tree-sitter)
WASM binding implementation (tree-sitter/tree-sitter)
Other bindings (such as for Python or Haskell) are implemented in separate repositories, typically named "$LANGUAGE-tree-sitter".
Haskell binding implementation (tree-sitter/haskell-tree-sitter)
Python binding implementation (tree-sitter/py-tree-sitter)
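The two naming conventions can be turned into a rough heuristic for reading the repository list. The sketch below is an assumption-laden illustration: the function name classify_repo and its labels are hypothetical, and because the conventions are only "typical", a repository that follows neither pattern (or matches one by coincidence) would be misclassified.

```python
def classify_repo(name: str) -> str:
    """Guess the role of a tree-sitter repository from its name alone."""
    if name == "tree-sitter":
        return "runtime"  # core repo: runtime library plus tier-1 bindings
    if name.startswith("tree-sitter-"):
        return "grammar"  # tree-sitter-$LANGUAGE
    if name.endswith("-tree-sitter"):
        return "binding"  # $LANGUAGE-tree-sitter
    return "unknown"


for repo in ["tree-sitter", "tree-sitter-python", "py-tree-sitter", "haskell-tree-sitter"]:
    print(f"{repo}: {classify_repo(repo)}")
```

Read this way, "tree-sitter-python" and "py-tree-sitter" are not competitors: the first is the grammar for parsing Python code, and the second is the binding for using tree-sitter from Python.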
Complicating things even more, you need both the runtime library and the generated parser for each language that you want to parse, and in particular, you need _bindings_ for both! The language bindings described above only include the runtime library, since they can't know in advance which languages you will want to parse. The bindings should include instructions for how to build and include your desired parsers.
For some language bindings, we can lean on the language's package manager for this. For instance, for the Rust bindings, we publish packages to crates.io both for the language binding itself (the "tree-sitter" crate) and for most of the supported grammars (e.g. the "tree-sitter-python" crate). So if you are writing a tool that is implemented in Rust and analyzes Python code, you would add both "tree-sitter" and "tree-sitter-python" to your "Cargo.toml" file. Wherever possible, we follow this approach for other language bindings, too.