💾 Archived View for dioskouroi.xyz › thread › 24931985 captured on 2020-10-31 at 00:58:51. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Introducing Semgrep and r2c

Author: pabloest

Score: 111

Comments: 21

Date: 2020-10-29 16:07:32


________________________________________________________________________________

rtsao wrote at 2020-10-29 18:55:31:

It's great to see more tools adopting tree-sitter [1].

Having a (fast) single tool that can accurately parse most commonly used programming languages is incredibly useful, but it requires the maintenance of dozens of grammars, which is difficult without a large community effort. Hopefully increased adoption means more accurate parsers and support for even more languages.

Tree-sitter powers syntax highlighting on GitHub.com and (soon) neovim and OniVim 2. Hopefully regex-based syntax highlighting is a thing of the past soon. If you haven't seen the Strange Loop conference talk on tree-sitter [2] yet, it's worth a watch.

I think a Prettier-like code formatter using tree-sitter would be cool, both in terms of potentially broader language support and native performance.

[1]:

https://tree-sitter.github.io/tree-sitter/

[2]:

https://www.youtube.com/watch?v=Jes3bD6P0To

lvh wrote at 2020-10-29 16:10:35:

We've been working with the r2c folks for a while, and been using semgrep since before it was called semgrep.

If you can write code in a language, you can use semgrep. It also has a feature I have learned to love every time I find it in any kind of auditing tool: it’s ruthlessly effective as an exploratory and experimental tool, but it takes no effort at all to turn that into a persistent check. By comparison: ripgrep finds anything fast, but nobody uses it to write linters. Other off the shelf linters do a great job finding (simple) issues, but bandit doesn’t help me one bit to build a mental map of how a codebase works.

ievans wrote at 2020-10-29 16:33:54:

Hey HN, I’m the author of this post and a contributor to Semgrep. Happy to answer questions and hear feedback! I’m excited to try to lower the barrier to writing a simple lint (or more complex program analysis) that previously only a static analysis expert could do; we’ve gotten contributions from people who don’t know what an abstract syntax tree is! The userbase for Semgrep is almost evenly split between security engineers using it for hunting/enforcement and developers looking for bugs; we’ve tried to collect examples for both use cases at

https://semgrep.dev/explore

dti wrote at 2020-10-29 17:58:02:

Is Semmle, offering CodeQL language and LGTM service, and recently acquired by Github, doing a similar thing (

https://semmle.com/

)? If so, how does Semgrep compare to CodeQL?

_Edit:_ There is a help entry:

https://semgrep.dev/docs/faq/#how-is-semgrep-different-from-...

carlmr wrote at 2020-10-29 17:29:38:

First of all, I love the idea of semgrep, but can't use it since we're using C++. Is there any chance for C++ support in the future?

ievans wrote at 2020-10-29 17:48:20:

The good news is that we’ve replaced almost all the homegrown parsers that were written while the tool was at Facebook, and we’re now using the tree-sitter project, which already has parsers for 40+ languages. There is a tree-sitter-cpp project we can and will eventually integrate! The bad news is that this requires the code not to use macros heavily if it is to be parseable as-is. So really the difficulty is not C++ but rather the preprocessor.

philsnow wrote at 2020-10-29 17:55:59:

I guess you could run the preprocessor and then run tree-sitter-cpp / semgrep on the preprocessed output, but the problem would then be trying to tie any findings from that to the original source.

Do gcc/clang/any other preprocessor create "source maps" that could facilitate that? GCC looks like it has a `-fdebug-cpp` that "[...] dumps debugging information about location maps. Every token in the output is preceded by the dump of the map its location belongs to."

lvh wrote at 2020-10-29 18:12:47:

Right; but this turns out to be pretty tricky in practice. I've attempted to do this for even relatively straightforward code (libsodium--complex in implementation, though not in API) with libclang and it was not particularly pleasant.

Some prior art for reference:

https://github.com/bytedeco/javacpp/issues/51

carlmr wrote at 2020-10-29 18:04:16:

The preprocessed output of GCC and Clang usually contains the file names.

inetknght wrote at 2020-10-29 18:44:02:

Better than that, compilers can be told to include line numbers too.

This thread would love to learn about Compiler Explorer

https://gcc.godbolt.org/

which works for C++ and many other languages.
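To make the file-name and line-number point concrete: GCC and Clang emit linemarkers of the form `# <line> "<file>" <flags>` in their preprocessed output (`-E`). The sketch below is only an illustration of how a finding on a preprocessed line could be traced back to a source file and line. The function name `map_preprocessed_lines` and the sample input are hypothetical, and it assumes file names without escaped quotes.

```python
import re

# GCC/Clang linemarker: '# <line> "<file>" <optional flags>'
LINEMARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"')


def map_preprocessed_lines(preprocessed_text):
    """Map each 1-based output line number to (source_file, source_line)."""
    mapping = {}
    current_file = None
    current_line = 0
    for out_no, text in enumerate(preprocessed_text.splitlines(), start=1):
        marker = LINEMARKER.match(text)
        if marker:
            # A linemarker says the NEXT output line comes from this file/line.
            current_line = int(marker.group(1))
            current_file = marker.group(2)
            continue
        if current_file is not None:
            mapping[out_no] = (current_file, current_line)
        current_line += 1
    return mapping


# Hypothetical preprocessed output for a main.c that includes util.h.
sample = "\n".join([
    '# 1 "main.c"',
    'int a;',
    '# 1 "util.h" 1',
    'int helper(void);',
    '# 3 "main.c" 2',
    'int main(void) { return helper(); }',
])
line_map = map_preprocessed_lines(sample)
```

This only recovers line granularity. It says nothing about which macro produced a given token, so it would not replace a real source map.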

philsnow wrote at 2020-10-29 18:50:23:

Line numbers are a good start, but JS source maps go from output source byte ranges to input source byte ranges.

I don't write or even read a lot of C++ these days but I recall from when I did that a major pain point was deciphering compiler warnings/errors when there are a lot of templates, macros, or both. Seems like the problem has been around forever.

carlmr wrote at 2020-10-29 17:53:12:

That sounds really cool, would it be possible to ignore the macros or can the code not be parsed at all then?

lvh wrote at 2020-10-29 17:34:54:

Not GP, but on the one hand semgrep has a real honest to goodness parser at its core; on the other hand I'd expect C++ to have sufficiently complicated semantics that it needs some understanding of C++ specific mechanics to be useful. Furthermore, you'd need preprocessor and template expansion magic to really get to the bottom of it. Effectively this is the same problem e.g. javacpp has.

scanr wrote at 2020-10-29 18:03:51:

Interesting. You can try it out here:

https://semgrep.dev/editor/

It doesn't appear to catch the following when searching for exec(...) in the following python code:

  not_exec = exec
  not_exec('rm -rf /')

Edited to include language

ievans wrote at 2020-10-29 18:21:45:

Good catch. Currently we only support constant propagation for literals. Here's a working example:

  $ semgrep -e "not_exec('somestr')"

will match

  foo = "somestr"
  not_exec(foo)

Here's a more complete example:

https://semgrep.dev/s/ievans:const-python

In your example, we don't propagate exec because it's not seen as a literal -- that's a TODO for sure. See

https://github.com/returntocorp/semgrep/issues/1645

for a longer discussion!
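For readers who want a feel for what "constant propagation for literals" means, here is a rough sketch using Python's standard `ast` module. It is not how Semgrep is implemented; the function name `find_calls_with_literal` is made up for illustration, and it only handles names that are assigned exactly once to a literal.

```python
import ast


def find_calls_with_literal(source, func_name, literal):
    """Return line numbers of func_name(arg) calls where arg is the literal,
    or a name assigned exactly once to that literal."""
    tree = ast.parse(source)

    # Collect every value assigned to each simple name.
    assigned_values = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned_values.setdefault(target.id, []).append(node.value)

    # Keep only names with a single assignment whose value is a literal.
    constants = {
        name: values[0].value
        for name, values in assigned_values.items()
        if len(values) == 1 and isinstance(values[0], ast.Constant)
    }

    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name
                and len(node.args) == 1):
            arg = node.args[0]
            if isinstance(arg, ast.Constant):
                value = arg.value
            elif isinstance(arg, ast.Name) and arg.id in constants:
                value = constants[arg.id]
            else:
                continue
            if value == literal:
                hits.append(node.lineno)
    return sorted(hits)


code = 'foo = "somestr"\nnot_exec(foo)\nalias = not_exec\nalias("somestr")\n'
matches = find_calls_with_literal(code, "not_exec", "somestr")
```

The call through `foo` on line 2 is found because `foo` resolves to a string literal. The call through `alias` on line 4 is missed because the sketch does not follow function aliases, which is the same kind of gap as the `not_exec = exec` case.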

kevincox wrote at 2020-10-29 17:53:21:

The CI use case is cool, and probably makes more money. But I would really love to see a CLI for optimized search and replace. It seems that they have search available on the CLI; however, I can't see any replace. And most of the options are focused on running the rule config instead of ad hoc replacements.

ievans wrote at 2020-10-29 18:18:18:

The CLI does have an --autofix flag, but the replacement it uses has to be specified through a local config file rather than as a command line arg. There is a ticket for that though!

https://github.com/returntocorp/semgrep/issues/840

Here are docs for what exists currently

https://semgrep.dev/docs/experiments/#autofix

magicseth wrote at 2020-10-29 16:31:14:

I would love this in my editor: if I search for

day = 'friday'

I want it to find

day="friday"

also!
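The reason a syntax-aware search can treat those two lines as the same thing is that they parse to the same tree. A small illustration with Python's standard `ast` module follows; it says nothing about Semgrep's own matching, and the helper name `same_syntax` is made up.

```python
import ast


def same_syntax(snippet_a, snippet_b):
    """True when both snippets parse to structurally identical Python ASTs."""
    # ast.dump leaves out line/column attributes by default, so spacing and
    # quote style do not affect the comparison.
    return ast.dump(ast.parse(snippet_a)) == ast.dump(ast.parse(snippet_b))


quotes_differ = same_syntax("day = 'friday'", 'day="friday"')
values_differ = same_syntax("day = 'friday'", 'day = "monday"')
```

Here `quotes_differ` comes out `True` and `values_differ` comes out `False`: quote style and whitespace disappear at the tree level, while a different string value does not.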

lvh wrote at 2020-10-29 17:31:02:

If you use VSCode you can get that today, if you use something else it doesn't look too hard to write:

https://semgrep.dev/docs/integrations/#editor

I'd expect latency might be juuust in the range where it doesn't feel interactive yet? But honestly any search that isn't ripgrep or --omg-optimized-etags feels like that to me now, and people use symbol rename features in IDEs all the time that take multiple seconds, so maybe I'm just unreasonably picky.

daghan wrote at 2020-10-30 18:47:51:

I created a semgrep rule for this:

https://semgrep.dev/s/GRD6/?version=develop

daghan wrote at 2020-10-29 17:03:32:

This is actually a great idea.