Confusing Grep Mistakes I've Made

Author: r4um

Score: 90

Comments: 46

Date: 2020-11-05 06:26:54

________________________________________________________________________________

throwaway373438 wrote at 2020-11-05 23:44:01:

Others have commented that many of these are general shell or terminal quoting problems.

Something that stood out for me is that the author did not mention ^V, which is very useful in quoting metacharacters. Take the tab example: The author seems to imply that PCRE is needed to match a tab because there is no \t escape sequence in BRE/ERE. Presumably he cannot just type in a tab because he's using a shell like bash, and tab has a special interpretation and cannot be typed in as a string literal.

The way around this is to use ^V as a _terminal_ escape sequence, followed by simply pressing the tab key. This technique can be used to insert other control characters as string literals in arguments. Want to grep for EOF? "grep ^V^D" will get you there.

saurik wrote at 2020-11-06 02:06:46:

Or $'\t' can be used in bash to get some C-style escapes. (The ^V thing is epic though!)
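For anyone who wants to try the tab case without typing control characters, here is a minimal sketch. The sample text and variable names are made up for illustration. It uses printf to produce the tab so that it should also run under plain sh; in bash, `grep $'\t'` would be the shorter spelling.

```shell
# Sample input: only the first line contains a tab.
sample=$(printf 'a\tb\nc d')

# Portable way to get a literal tab into a variable.
tab=$(printf '\t')

# grep receives a real tab character as its pattern,
# so no \t support is needed in the regex flavour.
tab_matches=$(printf '%s\n' "$sample" | grep "$tab")
printf '%s\n' "$tab_matches"
```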

eru wrote at 2020-11-06 04:26:09:

By the way, that technique also works in vim and some other editors, I think.

Jestar342 wrote at 2020-11-06 09:09:03:

Yep. Allows for some seriously cool programmatic editing:

    g/^abc/norm ^3wciwHello^V^[2ei!

Any line starting with abc, replace the 3rd word with Hello, and append an exclamation mark to the end of the 5th word. ^[ is the control char for escape.

justinsaccount wrote at 2020-11-06 03:13:01:

My favorite grep mistake is actually from a related tool pgrep and pkill.

    pgrep foo -> finds things running matching foo
    pkill foo -> kills things running matching foo

except every year or so, I do something like this

    $ pkill foo
    $ echo nothing happened?
    $ pkill -9 foo
    $ echo nothing happened still? huh?
    $ echo ok, let's run this in verbose mode..
    $ pkill -9 foo -v

But -v isn't verbose. Since pkill is part of pgrep, and pgrep is like grep, -v means 'invert the match': that last command goes after everything that does NOT match foo.

hackyhacky wrote at 2020-11-06 05:50:54:

According to the man page, related to the -v option:

> In pkill's context the short option is disabled to avoid accidental usage of the option.

So at least in the version that I have installed (procps-ng 3.3.15), pkill -v results in an error message, not the catastrophic situation suggested by the parent comment.

zwp wrote at 2020-11-06 08:04:22:

It was fixed in 2012. The debian bug report (2009) cites numerous users that fell into the trap before the fix.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=558044

    commit 1af18c260a87dc38f0e33bfeb6de6163f91be4ad
    Author: Sami Kerola <kerolasa@iki.fi>
    Date:   Sat Feb 11 20:33:17 2012 +0100

        pkill: remove -v match inversion option
    
        The option -v does not make much sense in pkill context.

alanbernstein wrote at 2020-11-06 03:15:21:

oof

colanderman wrote at 2020-11-06 04:18:17:

On many systems,

grep '[A-Z]'

will match 'y' but not 'z' (note the case). This is due to collation of the system's locale, which intersperses upper- and lowercase letters.

Usually what you want instead is

LC_ALL=C grep '[A-Z]'

(to match ASCII uppercase letters), or

grep '[[:upper:]]'

(to match your locale's uppercase letters).
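A small sketch of the two safer forms, using made-up sample letters. With plain ASCII input like this, both should select only the uppercase lines whatever the locale is; the bare '[A-Z]' form is left out because its result depends on the system.

```shell
letters=$(printf 'y\nZ\nq\nA')

# Force the C locale so [A-Z] means the ASCII range A through Z.
ascii_upper=$(printf '%s\n' "$letters" | LC_ALL=C grep '[A-Z]')

# Use the POSIX character class, which follows the current locale.
class_upper=$(printf '%s\n' "$letters" | grep '[[:upper:]]')

printf 'C locale range:\n%s\n' "$ascii_upper"
printf 'upper class:\n%s\n' "$class_upper"
```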

FWIW I cannot reproduce this on my system any longer; it seems to vary by distribution. See e.g. [1].

[1]

https://unix.stackexchange.com/questions/15980/does-should-l...

dehrmann wrote at 2020-11-06 07:03:51:

3) Confusing '.' with '\.'

Because of how different languages handle escaping within strings and not wanting to have to think about it, I've started using [.] to get a literal dot because it always means what I want. I still don't like it.
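To make the difference concrete, a quick sketch with two invented file names: the unescaped dot matches any character, while [.] matches only a literal dot.

```shell
names=$(printf 'file.exe\nfile_exe')

# '.' is a wildcard here, so both lines match.
loose=$(printf '%s\n' "$names" | grep -c 'file.exe')

# '[.]' is a bracket expression containing only a dot.
strict=$(printf '%s\n' "$names" | grep -c 'file[.]exe')

echo "loose=$loose strict=$strict"
```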

schoen wrote at 2020-11-06 02:53:10:

Another I've run into:

If you meant to search in *, but somehow completely forgot to type the * at the end of the command line, you might do something like

grep foo

and then wait for a while while grep searches your standard input, instead of files on disk, until you notice your mistake.

(I don't find this conceptually confusing -- I expect many Unix tools, including grep, to act on their standard input -- but I've still sometimes simply forgotten the * and not noticed right away.)

unhammer wrote at 2020-11-06 08:54:53:

With (g)awk it can be even worse.

  $ awk 'BEGIN{while(getline<"phonebook")p[$1]=$2} $1 in p{print $0,p[$1]}' names.tsv

If the file `phonebook` doesn't exist, this will just sit there politely waiting for you to create the file in your other terminal. (And then when you do, it could crash with `fatal: cannot open file `names.tsv' for reading (No such file or directory)` if that file also doesn't exist. Consistency yay.)
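One way to sidestep the hang, as a sketch: compare getline's return value with 0. getline returns -1 when the file cannot be opened, and a bare while treats that as true. The path below is a hypothetical one that is assumed not to exist.

```shell
# Hypothetical path, assumed to be missing.
missing="/nonexistent/phonebook.$$"

result=$(awk -v pb="$missing" 'BEGIN {
    # Loop only while getline actually read a record (return value 1).
    # A missing file gives -1, end of file gives 0; both stop the loop.
    while ((getline line < pb) > 0) n++
    print n + 0
}')

echo "records read: $result"
```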

axaxs wrote at 2020-11-06 06:51:03:

Don't feel bad, I've done this probably a thousand times. Probably because I normally am using -r which doesn't have that behavior. I usually wonder 'wow this directory was bigger than I thought' until realizing my mistake. In the modern SSD era at least, you don't waste -minutes-, usually.

nieve wrote at 2020-11-06 14:48:33:

To add to the confusion grep -r without a path will recursively search the . directory.

bowmessage wrote at 2020-11-05 22:39:37:

Often my biggest mistake is using grep instead of rg.

MaxBarraclough wrote at 2020-11-05 22:10:09:

Tripping over the escaping rules is a continuing pain. GNU sed does things differently, using _\|_ for the regex _or_ operator. [0] It's fiddly enough to baffle the occasional StackOverflow answerer. [1]

[0]

https://www.gnu.org/software/sed/manual/sed.html#BRE-syntax

[1]

https://stackoverflow.com/a/6388042/
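For grep itself, a sketch of the forms that should not depend on the GNU \| extension. In basic mode an unescaped | is just a literal character; -E or repeated -e options give alternation. The sample words are invented.

```shell
animals=$(printf 'cat\ndog\nbird')

# Extended regex: | is the alternation operator.
ere=$(printf '%s\n' "$animals" | grep -cE 'cat|dog')

# Several -e patterns also act as "or", in any regex mode.
multi=$(printf '%s\n' "$animals" | grep -c -e cat -e dog)

# Basic regex: the unescaped | only matches a literal pipe character.
bre_literal=$(printf 'cat|dog\ncat\n' | grep -c 'cat|dog')

echo "ere=$ere multi=$multi bre_literal=$bre_literal"
```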

rattray wrote at 2020-11-06 00:21:11:

Related, the ack website has a comparison chart of various grep competitors:

https://beyondgrep.com/feature-comparison/

sillysaurusx wrote at 2020-11-05 22:48:09:

Just use egrep instead of grep. Much easier to remember / less surprising behavior, and it’s supported on every system out of the box (unlike rg).

gnagatomo wrote at 2020-11-05 22:04:43:

Most of the mistakes are not exclusive/related to grep, they are actually shell mistakes.

mturmon wrote at 2020-11-06 00:22:12:

This is true, and I agree. But this kind of "language-within-language" problem (regex within sh) comes up a lot when formulating grep and sed command lines.

Come to think of it, I think this is the most frequently-encountered class of "this line didn't do what I thought it would" type errors that I get as a near-daily user of (ba)sh for a few decades now.

Question for the group: When you encounter this kind of issue, e.g. the shell is stealing a single or double quote meant to be in the regex, do you diagnose the problem w/r/t the shell precedence rules for quotes and backslashes, or do you just blindly put the opposite type of quote around the whole thing and re-run?

Because probably half the time I use a basic blind strategy, and that's not usually how I approach programming errors!

asicsp wrote at 2020-11-06 05:07:47:

agree, and the author continues to use double quotes around search pattern despite showing an example where single quote was needed!

As a good practice, I always try to single quote the expression, even if it is not needed. Use double quote only when needed and even then, use it only for the portion required, not for the entire expression.

https://mywiki.wooledge.org/Quotes

is a must read.
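A sketch of what "double quote only the portion required" can look like. The variable name and sample lines are assumptions for illustration: the regex metacharacters stay inside single quotes and only the variable is double quoted.

```shell
user=alice

# Single quotes protect ^, [, ], + and $ from the shell;
# the double-quoted piece in the middle is the only part that is expanded.
pattern='^'"$user"':[0-9]+$'

matches=$(printf 'alice:42\nbob:7\nalice:x\n' | grep -E "$pattern")
echo "$matches"
```

This assumes the variable holds no regex metacharacters of its own; if it might, -F or escaping would be needed as well.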

Jenz wrote at 2020-11-05 22:42:58:

And simply regex mistakes.

sethammons wrote at 2020-11-05 23:34:45:

I like `grep -F` - it treats the search as a literal; no more escaping regex when you are really wanting a ".".
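A quick illustration with invented input: as a regex, '1.5' also matches 125, while -F only matches the literal string.

```shell
numbers=$(printf '1.5\n125')

# Regex mode: the dot matches any character, so both lines count.
regex_count=$(printf '%s\n' "$numbers" | grep -c '1.5')

# Fixed-string mode: the pattern is taken literally.
fixed=$(printf '%s\n' "$numbers" | grep -F '1.5')

echo "regex_count=$regex_count fixed=$fixed"
```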

asicsp wrote at 2020-11-06 07:24:27:

I have a list of gotchas and tricks here:

https://learnbyexample.github.io/learn_gnugrep_ripgrep/gotch...

As pointed out in other comments, many of the issues in the post are due to the shell, not specific to grep. Especially quoting. Always use single quotes to specify the search pattern, unless other forms of shell quoting are needed. Otherwise, you'll face issues with commands like

grep ; ip.txt

Another example is searching for a pattern that starts with a hyphen, which causes issues even with quoting

    $ echo '5*3-2=13' | grep '-2'
    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

You'll need to either escape the hyphen or use -- before the search pattern to prevent it from being treated as a command option. This is needed if a filename starts with a hyphen too.
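Two forms that should work for the example above, as a sketch: -- ends option parsing, and -e marks the next argument as a pattern.

```shell
# "--" tells grep that everything after it is not an option.
with_dashes=$(echo '5*3-2=13' | grep -- '-2')

# "-e" explicitly introduces the pattern, even if it starts with a hyphen.
with_e=$(echo '5*3-2=13' | grep -e '-2')

echo "$with_dashes"
echo "$with_e"
```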

xorcist wrote at 2020-11-06 10:17:39:

That's a really good article!

One of the most common mistakes I see is missing escaping. Not strange when the rules change between the various regexp modes.

Something like 'grep "file.exe"' is probably not what you want, and sometimes it can be easier to turn off regexp processing with -F in the cases that don't use them.

zmmmmm wrote at 2020-11-06 05:19:22:

Another basic problem I run into is that grep returns whether it found a match or not as its exit status. This means if you, for example, run a script with bash option

set -e

Then the session will exit unceremoniously on any grep that doesn't match. This often catches you out when you say, develop a script without -e and use it for a while, and then one day someone deploys the same thing with -e enabled because they think it will be more robust if the script terminates if a command fails - and boom, now suddenly your script is randomly broken depending on text matches of the files it is processing. It is even worse if you are sourcing the script somehow from within an existing session and it terminates your interactive shell!

Anthony-G wrote at 2020-11-06 13:13:16:

I run scripts using both dash and bash and always use `set -eu`. Returning a false exit status is not a problem if you wrap the `grep` inside an `if` statement:

  if grep pattern file; then echo y; fi
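A slightly fuller sketch of the same idea, run in a subshell so that `set -e` does not leak into the calling session. The sample input and variable names are made up. `|| true` absorbs the "no match" status where a count is wanted, and the `if` form is exempt from `set -e` anyway.

```shell
result=$(
    set -e

    # grep -c prints 0 and exits with status 1 when nothing matches;
    # "|| true" keeps set -e from aborting here.
    count=$(printf 'alpha\nbeta\n' | grep -c 'gamma' || true)

    # A command tested by "if" never triggers set -e.
    if printf 'alpha\nbeta\n' | grep -q 'gamma'; then
        found=yes
    else
        found=no
    fi

    echo "count=$count found=$found"
)
echo "$result"
```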

jrockway wrote at 2020-11-05 23:15:09:

Quoting and parsing continue to surprise people, and I don't blame them -- you're embedding one programming language inside another (regex inside bash), and they use some of the same reserved symbols and have slightly different quoting rules. And every language is "inspired" by the others but has its own special rules, so the more you learn, the less sure of anything you'll ever be. (For example, '(' matches a literal parenthesis in Emacs Lisp regexes, and '\(' starts a capture group!)

For matching literal periods, I personally have gotten into the habit of using "[.]" instead of "\.". Less to go wrong in this double-embed scenario, and I have never ever regretted adding the extra byte to my regexp. (Of course, character classes have their own weirdness. Your editor that matches bracket pairs will love the syntax for matching a literal '['.)
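For the curious, a sketch of that bracket weirdness with invented sample lines: '[[]' matches a literal opening bracket, and '[]]' matches a closing one, because a ] placed first inside a bracket expression is taken literally.

```shell
lines=$(printf 'a[1\nb]2\nab')

# Bracket expression containing only "[".
open=$(printf '%s\n' "$lines" | grep '[[]')

# Bracket expression containing only "]" (it must come first).
close=$(printf '%s\n' "$lines" | grep '[]]')

echo "open=$open close=$close"
```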

userbinator wrote at 2020-11-06 04:53:09:

As confusing as multi-layered escaping can be, I still vastly prefer the situation in Linux and other Unix-likes to that of Windows where each process is given nothing more than a mildly-processed command line from the shell (e.g. no globbing) as a string, and parses it however it wants to. Escaping in batch files is slightly different from interactive commands, each utility may have its own escaping conventions on top of those the shell has, etc. Linux and the like have a far more predictable and consistent experience, since argument splitting and globbing is always done once by the shell, and each process gets the already-split and expanded argument list.

disown wrote at 2020-11-05 23:50:27:

If you are interested in where the name "grep" came from:

g/re/p

g: global

re: regular expression

p: print.

https://tldp.org/LDP/abs/html/textproc.html

ineedasername wrote at 2020-11-05 23:32:37:

Well, not so confusing, but accidentally printing negative matches. Once it locked up my shell instance, and I had to ssh in again to kill it.

rattray wrote at 2020-11-06 00:17:18:

How many of these gotchas exist with ripgrep?

NathanOsullivan wrote at 2020-11-06 03:14:44:

Most of them are shell gotchas, and so would apply to rg

jftuga wrote at 2020-11-06 04:08:06:

On Windows, if you want to match a line ending with $, then you have to use the --crlf option.

When enabled, ripgrep will treat CRLF (\r\n) as a line terminator instead of just \n.

I really wish this was enabled by default on Windows builds.

burntsushi wrote at 2020-11-06 04:34:39:

ripgrep at least won't suffer from the UTF-16 problem. It will automatically handle those files correctly by transcoding to UTF-8.

maest wrote at 2020-11-05 23:18:29:

What's a reasonable grep alias I should add to my .bashrc?

ben509 wrote at 2020-11-06 03:37:47:

The only flag I'd say I want 99% of the time is -E (same as calling egrep) because extended regular expressions are what you almost always want.

A sibling comment suggested -i, other useful flags are -w (find whole words) and -C (change the amount of context around matches), but I don't think I always use those.

A handy function for searching a tree:

    greptree() {
        find "$1" -name "$2" -type f -exec egrep "$3" '{}' +
    }

    greptree . '*.java' 'foo[123]'

That would search all java files in the current tree. One thing people often don't know is that -exec ... {} + in modern 'find' works like piping to xargs.

All that said... when I'm searching a tree, I can usually use 'git grep'.

wahern wrote at 2020-11-06 04:43:26:

-exec {} + is better than piping to xargs as it handles filenames with whitespace properly. Otherwise you have to use -print0 and xargs -0 extensions.
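A throwaway sketch to see it in action, assuming `mktemp -d` is available. The file names and contents are invented; one name contains a space on purpose.

```shell
tmp=$(mktemp -d)
printf 'foo1\n' > "$tmp/has space.java"
printf 'bar\n'  > "$tmp/other.java"

# find passes each file name to grep as a single argument,
# so the space in the name needs no special handling.
hits=$(find "$tmp" -name '*.java' -type f -exec grep -l 'foo[123]' {} +)

rm -rf "$tmp"
echo "matched: ${hits##*/}"
```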

Scarbutt wrote at 2020-11-06 04:02:05:

Grep can search recursively:

    grep -Er 'foo[123]' --include='*.java' .

blibble wrote at 2020-11-06 04:57:15:

grep -P to use pcre (i.e. what everyone knows)

xorcist wrote at 2020-11-06 10:19:43:

grep -r --exclude-dir=.git --exclude-dir=.svn --color=auto

opan wrote at 2020-11-06 01:10:33:

I have `gi` aliased to `grep -i`.

m000 wrote at 2020-11-05 21:44:11:

Better title: grep newbie mistakes

chrisshroba wrote at 2020-11-05 22:41:54:

For some of the mistakes (like .* vs '.*') maybe, but I doubt even experienced grep users are aware of all these things (e.g. how \t is interpreted in BRE vs. extended vs. PCRE modes).

gpvos wrote at 2020-11-06 14:37:52:

On a linked page, he recommends using -P in all cases without mentioning that the BSD greps don't support it. I think "newbie" is a proper description.

m000 wrote at 2020-11-06 14:38:13:

If you need to be told that grep supports different flavours of regular expressions, you are a grep newbie. And although even experienced users may be bitten by errors related to the flavour of regular expressions in use, they surely don't need TFA to know the source of the errors.