💾 Archived View for dioskouroi.xyz › thread › 24931985 captured on 2020-10-31 at 00:58:51. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
________________________________________________________________________________
It's great to see more tools adopting tree-sitter [1].
Having a (fast) single tool that can accurately parse most commonly used programming languages is incredibly useful, but it requires the maintenance of dozens of grammars, which is difficult without a large community effort. Hopefully increased adoption means more accurate parsers and support for even more languages.
Tree-sitter powers syntax highlighting on GitHub.com and (soon) neovim and OniVim 2. Hopefully regex-based syntax highlighting is a thing of the past soon. If you haven't seen the Strange Loop conference talk on tree-sitter [2] yet, it's worth a watch.
I think a Prettier-like code formatter using tree-sitter would be cool, both in terms of potentially broader language support and native performance.
[1]:
https://tree-sitter.github.io/tree-sitter/
[2]:
https://www.youtube.com/watch?v=Jes3bD6P0To
We've been working with the r2c folks for a while, and have been using semgrep since before it was called semgrep.
If you can write code in a language, you can use semgrep. It also has a feature I have learned to love every time I find it in any kind of auditing tool: it’s ruthlessly effective as an exploratory and experimental tool, but it takes no effort at all to turn that into a persistent check. By comparison: ripgrep finds anything fast, but nobody uses it to write linters. Other off the shelf linters do a great job finding (simple) issues, but bandit doesn’t help me one bit to build a mental map of how a codebase works.
Hey HN, I’m the author of this post and a contributor to Semgrep. Happy to answer questions and hear feedback! I’m excited to try to lower the barrier to writing a simple lint (or more complex program analysis) that previously only a static analysis expert could do; we’ve gotten contributions from people who don’t know what an abstract syntax tree is! The userbase for Semgrep is almost evenly split between security engineers using it for hunting/enforcement and developers looking for bugs; we’ve tried to collect examples for both use cases at
.
Is Semmle, offering CodeQL language and LGTM service, and recently acquired by Github, doing a similar thing (
)? If so, how does Semgrep compare to CodeQL?
_Edit:_ There is a help entry:
https://semgrep.dev/docs/faq/#how-is-semgrep-different-from-...
First of all, I love the idea of semgrep, but can't use it since we're using C++. Is there any chance for C++ support in the future?
The good news is that we’ve replaced almost all the homegrown parsers that were written while the tool was at Facebook, and we’re now using the tree-sitter project, which already has parsers for 40+ languages. There is a tree-sitter-cpp project we can and will eventually integrate! The bad news is that this requires the code to not use macros heavily in order to be parseable as-is. So really the difficulty is not C++ but rather the preprocessor.
I guess you could run the preprocessor and then run tree-sitter-cpp / semgrep on the preprocessed output, but the problem would then be trying to tie any findings from that to the original source.
Do gcc/clang/any other preprocessor create "source maps" that could facilitate that? GCC looks like it has a `-fdebug-cpp` that "[...] dumps debugging information about location maps. Every token in the output is preceded by the dump of the map its location belongs to."
Right; but this turns out to be pretty tricky in practice. I've attempted to do this for even relatively straightforward code (libsodium--complex in implementation, though not in API) with libclang and it was not particularly pleasant.
Some prior art for reference:
https://github.com/bytedeco/javacpp/issues/51
The preprocessed output of GCC and Clang usually contains the file names.
Better than that, compilers can be told to include line numbers too.
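To make the file name and line number point concrete, here is a minimal sketch in Python of mapping preprocessed output back to the original source. It assumes the usual GCC/Clang linemarker form `# <line> "<file>" <flags>` in the output of `-E`; the function name `map_preprocessed_lines` is made up for illustration. It only recovers line-level positions, so it does not help with locating a finding inside a macro expansion, and it ignores file names containing escaped quotes.

```python
import re

# Linemarker as emitted by GCC/Clang with -E: '# <line> "<file>" <optional flags>'
LINEMARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"((?:\s+\d+)*)\s*$')


def map_preprocessed_lines(preprocessed_text):
    """Map each line of preprocessed output to (file, line) in the original source.

    Returns a list with one entry per output line; entries are None for
    the linemarker lines themselves and for lines seen before any marker.
    """
    mapping = []
    current_file = None
    current_line = 0
    for raw in preprocessed_text.splitlines():
        match = LINEMARKER.match(raw)
        if match:
            # A marker says: the NEXT output line came from this file and line.
            current_line = int(match.group(1))
            current_file = match.group(2)
            mapping.append(None)
            continue
        mapping.append((current_file, current_line) if current_file else None)
        current_line += 1
    return mapping


if __name__ == "__main__":
    sample = '# 1 "main.c"\nint x;\n# 10 "util.h" 1\nint y;\n# 3 "main.c" 2\nint w;\n'
    for out_no, origin in enumerate(map_preprocessed_lines(sample), start=1):
        print(out_no, origin)
```

A tool could run its analysis on the preprocessed text and then use this table to report findings against the original files, at line granularity only.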
This thread would love to learn about Compiler Explorer
which works for C++ and many other languages.
Line numbers are a good start, but JS source maps go from output source byte ranges to input source byte ranges.
I don't write or even read a lot of C++ these days but I recall from when I did that a major pain point was deciphering compiler warnings/errors when there are a lot of templates, macros, or both. Seems like the problem has been around forever.
That sounds really cool, would it be possible to ignore the macros or can the code not be parsed at all then?
Not GP, but on the one hand semgrep has a real honest to goodness parser at its core; on the other hand I'd expect C++ to have sufficiently complicated semantics that it needs some understanding of C++ specific mechanics to be useful. Furthermore, you'd need preprocessor and template expansion magic to really get to the bottom of it. Effectively this is the same problem e.g. javacpp has.
Interesting. You can try it out here:
It doesn't appear to catch the call when searching for exec(...) in the following Python code:
not_exec = exec
not_exec('rm -rf /')
Edited to include language
Good catch. Currently we only support constant propagation for literals. Here's a working example:
$ semgrep -e "not_exec('somestr')"
will match
foo = "somestr"
not_exec(foo)
Here's a more complete example:
https://semgrep.dev/s/ievans:const-python
In your example, we don't propagate exec because it's not seen as a literal -- that's a TODO for sure. See
https://github.com/returntocorp/semgrep/issues/1645
for a longer discussion!
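For readers wondering what "constant propagation for literals" means here, below is a toy sketch using Python's standard `ast` module. It is not Semgrep's implementation; the helper `find_calls_with_constant` is hypothetical and only handles straight-line, top-level code. It tracks names bound to string literals, so it matches the `foo = "somestr"` case, but it does not track function aliases, so the `not_exec = exec` case is still missed.

```python
import ast


def find_calls_with_constant(source, func_name, wanted):
    """Return line numbers of calls func_name(x) where x is the string
    `wanted`, either written literally or via a name bound to that literal.

    Toy analysis: top-level statements only, processed in source order.
    """
    constants = {}  # variable name -> string literal last assigned to it
    hits = []
    for stmt in ast.parse(source).body:
        # Check calls in this statement against what is known so far.
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == func_name
                    and len(node.args) == 1):
                arg = node.args[0]
                if isinstance(arg, ast.Constant):
                    value = arg.value
                elif isinstance(arg, ast.Name):
                    value = constants.get(arg.id)
                else:
                    value = None
                if value == wanted:
                    hits.append(node.lineno)
        # Then record (or forget) what this statement assigns.
        if (isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            name = stmt.targets[0].id
            if isinstance(stmt.value, ast.Constant) and isinstance(stmt.value.value, str):
                constants[name] = stmt.value.value
            else:
                constants.pop(name, None)
    return hits


if __name__ == "__main__":
    literal_case = 'foo = "somestr"\nnot_exec(foo)\n'
    alias_case = "not_exec = exec\nnot_exec('rm -rf /')\n"
    print(find_calls_with_constant(literal_case, "not_exec", "somestr"))
    print(find_calls_with_constant(alias_case, "exec", "rm -rf /"))
```

The source is only parsed, never executed. Catching the alias case would need a second table mapping names to the functions they refer to, which is the part described above as a TODO.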
The CI use case is cool, and probably makes more money. But I would really love to see a CLI for optimized search and replace. It seems that they have search available on the CLI, but I can't see any replace. And most of the options are focused on running the rule config instead of ad hoc replacements.
The CLI does have an --autofix flag, but the replacement it uses has to be specified through a local config file rather than as a command line arg. There is a ticket for that though!
https://github.com/returntocorp/semgrep/issues/840
Here are docs for what exists currently
https://semgrep.dev/docs/experiments/#autofix
I would love this in my editor: if I search for
day = 'friday'
I want it to find
day="friday"
also!
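The reason a syntax-aware search can treat those two spellings as the same is that they parse to the same tree. A small sketch with Python's standard `ast` module, purely as an illustration and not how Semgrep itself is built (the helper name `same_syntax` is made up):

```python
import ast


def same_syntax(a, b):
    """True if two Python snippets parse to the same syntax tree.

    ast.dump() leaves out line and column positions by default, so
    whitespace and quote style do not affect the comparison.
    """
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


if __name__ == "__main__":
    print(same_syntax("day = 'friday'", 'day="friday"'))
    print(same_syntax("day = 'friday'", "day = 'monday'"))
```

A text search compares characters, so it sees two different strings; a tree comparison sees one assignment of the same string constant to the same name.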
If you use VSCode you can get that today; if you use something else, it doesn't look too hard to write:
https://semgrep.dev/docs/integrations/#editor
I'd expect latency might be juuust in the range where it doesn't feel interactive yet? But honestly any search that isn't ripgrep or --omg-optimized-etags feels like that to me now, and people use symbol rename features in IDEs all the time that take multiple seconds, so maybe I'm just unreasonably picky.
I created a semgrep rule for this:
https://semgrep.dev/s/GRD6/?version=develop
This is actually a great idea.