2016-07-25 IP number range to regular expression

I can’t believe how long it took me to develop a little script to do the following: given an IP number, look up the network range this particular number falls in and create a regular expression that matches every IP number in that range.

Example:

1. the IP number of a spam wiki edit is `45.58.42.34`

2. `whois 45.58.42.34` says the range is `45.58.42.0 - 45.58.42.255`

3. the regular expression we want is `^45\.58\.42\.`

Harder example:

1. the IP number of a spam wiki edit is `46.101.109.194`

2. `whois 46.101.109.194` says the range is `46.101.0.0 - 46.101.127.255`

3. the regular expression we want is `^46\.101\.([0-9]|[1-9][0-9]|1[0-1][0-9]|12[0-7])\.`
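
To make the goal concrete, here is a minimal sketch that applies the two expressions from the examples to a few addresses. The helper name `is_banned` is made up for this illustration and is not part of the script below. Note how the trailing `\.` keeps `46.101.128.1` from matching: without it, the `[0-9]` alternative would happily match the leading `1` of `128`.

```perl
use strict;
use warnings;

# The expressions for the simple and the harder example.
my $simple = qr/^45\.58\.42\./;
my $harder = qr/^46\.101\.([0-9]|[1-9][0-9]|1[0-1][0-9]|12[0-7])\./;

# Hypothetical helper: return 1 if the address matches the expression,
# 0 otherwise.
sub is_banned {
  my ($ip, $regexp) = @_;
  return $ip =~ $regexp ? 1 : 0;
}

# The first two addresses are inside 46.101.0.0 - 46.101.127.255,
# the third one is just outside.
for my $ip (qw(46.101.109.194 46.101.127.255 46.101.128.1)) {
  printf "%-16s %s\n", $ip, is_banned($ip, $harder) ? 'banned' : 'ok';
}
```

The first two addresses are reported as banned, the third one as ok.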

It took me way too long to arrive at the following solution written in Perl 5. Let me know if you have a more elegant solution. It doesn’t have to be written in Perl.

The script I wrote is called with an IP number or the string “test” to run its tests.

banned-host-regexp

#!/usr/bin/env perl

use Modern::Perl;
use Net::Whois::Parser qw(parse_whois);
use Test::More;

die "Usage: banned-host-regexp (IP-NUMBER|test)\n" unless $ARGV[0];

sub get_groups {
  my ($from, $to) = @_;
  my @groups;
  if ($from < 10) {
    my $to = $to >= 10 ? 9 : $to;
    push(@groups, [$from, $to]);
    $from = $to + 1;
  }
  while ($from < $to) {
    my $to = int($from/100) < int($to/100) ? $from + 99 - $from % 100 : $to;
    if ($from % 10) {
      push(@groups, [$from, $from + 9 - $from % 10]);
      $from += 10 - $from % 10;
    }
    if (int($from/10) < int($to/10)) {
      if ($to % 10 == 9) {
	push(@groups, [$from, $to]);
	$from = 1 + $to;
      } else {
	push(@groups, [$from, $to - 1 - $to % 10]);
	$from = $to - $to % 10;
      }
    } else {
      push(@groups, [$from - $from % 10, $to]);
      last;
    }
    if ($to % 10 != 9) {
      push(@groups, [$from, $to]);
      $from = 1 + $to; # jump from 99 to 100
    }
  }
  return \@groups;
}

if ($ARGV[0] eq 'test') {
  is_deeply(get_groups('2', '5'), [[2, 5]], "2-5");
  is_deeply(get_groups('9', '15'), [[9, 9], [10, 15]], "9-15");
  # diag explain get_groups('80', '90');
  is_deeply(get_groups('80', '90'), [[80, 89], [90, 90]], "80-90");
  is_deeply(get_groups('85', '99'), [[85, 89], [90, 99]], "85-99");
  # diag explain get_groups('80', '110');
  is_deeply(get_groups('80', '110'), [[80, 99], [100, 109], [110, 110]], "80-110");
  # diag explain get_groups('0', '127');
  is_deeply(get_groups('0', '127'), [[0, 9], [10, 99], [100, 119], [120, 127]], "0-127");
  # diag explain get_groups('0', '255');
  is_deeply(get_groups('0', '255'), [[0, 9], [10, 99], [100, 199], [200, 249], [250, 255]], "0-255");
}

sub get_regexp_range {
  my @chars;
  for my $group (@{get_groups(@_)}) {
    my ($from, $to) = @$group;
    my $char;
    for (my $i = length($from); $i >= 1; $i--) {
      if (substr($from, - $i, 1) eq substr($to, - $i, 1)) {
	$char .= substr($from, - $i, 1);
      } else {
	$char .= '[' . substr($from, - $i, 1) . '-' . substr($to, - $i, 1). ']';
      }
    }
    push(@chars, $char);
  }
  return join('|', @chars);
}

if ($ARGV[0] eq 'test') {
  is(get_regexp_range('2', '2'), '2', "2-2");
  is(get_regexp_range('2', '5'), '[2-5]', "2-5");
  is(get_regexp_range('2', '15'), '[2-9]|1[0-5]', "2-15");
  is(get_regexp_range('9', '15'), '9|1[0-5]', "9-15");
  is(get_regexp_range('2', '20'), '[2-9]|1[0-9]|20', "2-20");
  is(get_regexp_range('2', '25'), '[2-9]|1[0-9]|2[0-5]', "2-25");
  is(get_regexp_range('2', '35'), '[2-9]|[1-2][0-9]|3[0-5]', "2-35");
  is(get_regexp_range('80', '99'), '[8-9][0-9]', "80-99");
  is(get_regexp_range('85', '99'), '8[5-9]|9[0-9]', "85-99");
  is(get_regexp_range('80', '110'), '[8-9][0-9]|10[0-9]|110', "80-110");
  is(get_regexp_range('0', '127'), '[0-9]|[1-9][0-9]|1[0-1][0-9]|12[0-7]', "0-127");
}

sub get_regexp_ip {
  my ($from, $to) = @_;
  my @start = split(/\./, $from);
  my @end = split(/\./, $to);
  my $regexp = "^";
  for my $i (0 .. 3) {
    if ($start[$i] eq $end[$i]) {
      $regexp .= $start[$i];
    } elsif ($start[$i] eq '0' and $end[$i] eq '255') {
      last;
    } elsif ($start[$i + 1] > 0) {
      $regexp .= '(' . $start[$i] . '\.('
	  . get_regexp_range($start[$i + 1], '255') . ')|'
	  . get_regexp_range($start[$i] + 1, $end[$i + 1]) . ')';
      $regexp .= '\.';
      last;
    } else {
      $regexp .= '(' . get_regexp_range($start[$i], $end[$i]) . ')';
      $regexp .= $i < 3 ? '\.' : '$';
      last;
    }
    $regexp .= '\.' if $i < 3;
  }
  return $regexp;
}

if ($ARGV[0] eq 'test') {
  is(get_regexp_ip('88.0.0.0', '88.15.255.255'),
     '^88\.([0-9]|1[0-5])\.',
     '88.0.0.0 - 88.15.255.255');
  is(get_regexp_ip('77.56.180.0', '77.57.70.255'),
     '^77\.(56\.(1[8-9][0-9]|2[0-4][0-9]|25[0-5])|5[7-9]|6[0-9]|70)\.',
     '77.56.180.0 - 77.57.70.255');
  is(get_regexp_ip('46.101.0.0', '46.101.127.255'),
     '^46\.101\.([0-9]|[1-9][0-9]|1[0-1][0-9]|12[0-7])\.',
     '46.101.0.0 - 46.101.127.255');
}

sub get_range {
  my $ip = shift;
  my $response = parse_whois(domain => $ip);
  my ($start, $end);
  my $ip_regexp = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}';
  for (sort keys(%{$response})) {
    if (($start, $end)
	= $response->{$_} =~ /($ip_regexp) *- *($ip_regexp)/) {
      last;
    }
  }
  die "Did not find an IP range in the response:\n"
      . join("\n", map { "$_: $response->{$_}" } sort keys(%{$response}))
      . "\n" unless $start and $end;
  return $start, $end;
}

if ($ARGV[0] eq 'test') {
  is_deeply([get_range('46.101.109.194')],
	    ['46.101.0.0', '46.101.127.255'],
	    "46.101.109.194");
}

if ($ARGV[0] eq 'test') {
  done_testing();
  exit 0;
}

print get_regexp_ip(get_range($ARGV[0])) . "\n";

​#Perl

Comments

Hi Alex, love your blog.

Why not turn it into an integer and do the math as if it were a subnet? I’m pretty sure the output from whois will always be a subnet range.

If you don’t want to roll your own, http://search.cpan.org/~luismunoz/NetAddr-IP-4.007/IP.pm. The trick is to turn the range form into a CIDR form.

– Ben Bennett 2016-07-26 11:12 UTC
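
A minimal sketch of the suggestion above, using only core Perl instead of NetAddr::IP and assuming a Perl built with 64-bit integers. The function names `ip_to_int` and `range_to_cidr` are made up for this illustration. It turns both ends of the range into integers and returns CIDR notation if the range is exactly one subnet.

```perl
use strict;
use warnings;

# Convert dotted decimal notation to a 32-bit integer.
sub ip_to_int {
  my $ip = shift;
  return unpack('N', pack('C4', split(/\./, $ip)));
}

# Convert a range as printed by whois into CIDR notation.
# Returns undef if the range is not exactly one subnet.
sub range_to_cidr {
  my ($from, $to) = @_;
  my ($start, $end) = (ip_to_int($from), ip_to_int($to));
  my $size = $end - $start + 1;
  # The number of addresses in a subnet is a power of two...
  return if $size < 1 or $size & ($size - 1);
  # ...and the first address is a multiple of that size.
  return if $start % $size;
  # Count the host bits: the smallest $bits with 2 ** $bits == $size.
  my $bits = 0;
  $bits++ while (1 << $bits) < $size;
  return "$from/" . (32 - $bits);
}

print range_to_cidr('46.101.0.0', '46.101.127.255'), "\n"; # 46.101.0.0/17
print range_to_cidr('45.58.42.0', '45.58.42.255'), "\n";   # 45.58.42.0/24
```

One caveat: the range `77.56.180.0 - 77.57.70.255` used in the script's tests contains 37632 addresses, which is not a power of two, so this sketch returns undef for it. Such a range would have to be split into several CIDR blocks.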

---

Thanks! Yeah, I thought about that and truly, it would have been so much easier. I guess I was lured along by two things:

1. the current solution to banning IP numbers in Oddmuse wiki involves regular expressions because in the old days, it used to resolve IP numbers and get hostnames, and regular expressions worked well enough for hostnames. In recent years, however, we’re more concerned with dedicated spammers than with lazy vandals, and as it turns out, regular expressions are a lousy way to match IP numbers in dotted decimal notation

2. when I realized how tricky it turned out to be, I started thinking of it as a personal challenge – which is stupid but there you go 😄

What I should do is change how BannedHosts is parsed and if the regular expression is in fact a CIDR, then I should use math instead of regular expression to check IP numbers.

– Alex Schroeder 2016-07-26 13:07 UTC
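
The idea of using math for CIDR entries could look roughly like the following sketch. It is not how Oddmuse parses BannedHosts: the entry format (a list of strings, each either a CIDR block or a regular expression) and the names `cidr_matches` and `is_banned` are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Return 1 if $ip falls into $cidr (e.g. "46.101.0.0/17"), 0 otherwise.
sub cidr_matches {
  my ($cidr, $ip) = @_;
  my ($network, $bits)
    = $cidr =~ m{^((?:[0-9]{1,3}\.){3}[0-9]{1,3})/([0-9]{1,2})$}
    or return 0;
  return 0 if $bits > 32;
  # The netmask consists of $bits ones followed by zeroes.
  my $mask = $bits == 0 ? 0 : (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF;
  # Turn both addresses into 32-bit integers and compare the network parts.
  my ($n, $i) = map { unpack('N', pack('C4', split(/\./, $_))) } $network, $ip;
  return ($n & $mask) == ($i & $mask) ? 1 : 0;
}

# Hypothetical entry check: entries that look like CIDR blocks are
# compared using math, everything else is used as a regular expression.
sub is_banned {
  my ($ip, @entries) = @_;
  for my $entry (@entries) {
    if ($entry =~ m{^[0-9.]+/[0-9]+$}) {
      return 1 if cidr_matches($entry, $ip);
    } else {
      return 1 if $ip =~ /$entry/;
    }
  }
  return 0;
}

my @entries = ('46.101.0.0/17', '^45\.58\.42\.');
for my $ip (qw(46.101.109.194 45.58.42.34 46.101.128.1)) {
  printf "%-16s %s\n", $ip, is_banned($ip, @entries) ? 'banned' : 'ok';
}
```

Existing regular expressions keep working this way, and only new entries would need to be written as CIDR blocks.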

---

+1 for no regular expressions in BannedHosts, but don’t forget about ipv6… 😄

– AlexDaniel 2016-07-26 21:55 UTC

---

Yeah, I definitely need to add IPv6 support to the existing solution if I don’t rewrite it.

– AlexSchroeder 2016-07-29 22:06 UTC