2012-05-16 Blog Napping

Ok, so I wanted a local copy of Metal Earth in order to better prepare for my game. Based on previous work I had done, this proved to be fairly easy and I improved my scripts along the way. Yay!

To identify the blog, look at the source of any page. The HTML header will contain a line like the following: `<link rel="service.post" type="application/atom+xml" title="..." href="http://www.blogger.com/feeds/XXX/posts/default" />` – this is where you get the number from. In this case, the number is 2248254789731612355.
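If you don't feel like reading the page source by hand, grepping the front page for the feed URL works too; the blog address below is just a placeholder:

# the address is a placeholder: use the blog you want to copy
curl -s http://SOMEBLOG.blogspot.com/ | grep -o 'blogger.com/feeds/[0-9]*'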

#! /bin/sh
# fetch the posts feed in pages of 25 entries each
for i in `seq 40`; do
  start=$((($i-1)*25+1))
  curl -o foo-$i.atom "http://www.blogger.com/feeds/2248254789731612355/posts/default?start-index=$start&max-results=25"
done

You’ll find that you only need to keep the first four of them.
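To see which ones you can throw away, grep can list the feed files that contain no entries at all:

grep -L '<entry' foo-*.atom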

A small wrapper, extract.sh, runs the extraction script over every feed file:

#! /bin/sh
for f in *.atom; do
    perl extract.pl "$@" < "$f"
done

And the extraction script itself, extract.pl:

#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;
our $opt_f;
getopts('f');
undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
# write each entry in the feed to its own HTML file
foreach my $entry ($context->findnodes('//atom:entry')) {
  my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
  my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
  # make the title usable as a file name
  $title =~ s!/!_!gi;
  $title =~ s!&amp;!&!gi;
  $title =~ s!&#(\d+);!chr($1)!ge;
  # some posts have no title: guess one from the first heading or span in the content
  if (not $title) {
    if (not $html_parser) {
      $html_parser = HTML::HTML5::Parser->new;
    }
    my $html_doc = $html_parser->parse_string($content);
    # we don't know the HTML namespace for certain
    my $html_ns = $html_doc->documentElement->namespaceURI();
    my $html_context = XML::LibXML::XPathContext->new($html_doc);
    $html_context->registerNs('html', $html_ns);
    $title = $html_context->findnodes('//html:h1')->[0];
    $title = $html_context->findnodes('//html:span')->[0] unless $title;
    $title = $title->to_literal if $title;
    warn "Guessed missing title: $title\n";
  }
  # use the post's last update time as the file's timestamp
  my $f = DateTime::Format::W3CDTF->new;
  my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
  my $file = $title . ".html";
  if (-f $file and ! $opt_f) {
    warn "$file exists\n";
  } else {
    open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
    print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
    close F;
    utime $dt, $dt, $file;
  }
}
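To run it, call the wrapper from the directory containing the feed files; the -f option is passed through to extract.pl and forces files that already exist to be overwritten:

sh extract.sh        # skip posts that already have a file
sh extract.sh -f     # overwrite existing files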

#Blogs

Comments

(Please contact me if you want to remove your comment.)

Recently I wanted a copy of *Elfmaids & Octopi* because the owner announced on Reddit that they were going to move elsewhere.

The directory structure I used:

┬ Elfmaids & Octopi
├ feed
└ html
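I created it with something like this (the folder name obviously depends on the blog):

mkdir -p "Elfmaids & Octopi/feed" "Elfmaids & Octopi/html"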

This is how I got a copy of the feed; I saved it as download.sh in the top folder:

#! /bin/sh
for i in `seq 80`; do
  start=$((($i-1)*25+1))
  curl -o foo-$i.atom "https://www.blogger.com/feeds/737809845612070971/posts/default?start-index=$start&max-results=25"
done
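The 80 in the loop is just a generous upper bound. If you want the actual number of posts first, the feed itself should say, in an OpenSearch element near the top of each file (at least Blogger's feeds included one back then):

grep -o '<openSearch:totalResults>[0-9]*' foo-1.atom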

This downloads eighty Atom files; a bit more than seventy of them contain posts, and the rest are nearly empty. I moved them all into the first subdirectory:

mv *.atom feed

I installed the missing dependency for my Perl script. Depending on your setup you might have more dependencies missing, and you might have to use cpan instead of my favourite, cpanm:

cpanm HTML::HTML5::Parser
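If more than one module is missing, cpanm takes them all in one go; these are simply the names from the use lines of the script:

cpanm XML::LibXML HTML::HTML5::Parser DateTime DateTime::Format::W3CDTF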

I saved the Perl script as extract.pl. It is the same script as above, except that it writes into the html subdirectory and, when a file with the same title already exists, appends a counter to the name instead of skipping the post:

#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;
our $opt_f;
getopts('f');
undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
foreach my $entry ($context->findnodes('//atom:entry')) {
  my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
  my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
  $title =~ s!/!_!gi;
  $title =~ s!&amp;!&!gi;
  $title =~ s!&#(\d+);!chr($1)!ge;
  if (not $title) {
    if (not $html_parser) {
      $html_parser = HTML::HTML5::Parser->new;
    }
    my $html_doc = $html_parser->parse_string($content);
    # we don't know the HTML namespace for certain
    my $html_ns = $html_doc->documentElement->namespaceURI();
    my $html_context = XML::LibXML::XPathContext->new($html_doc);
    $html_context->registerNs('html', $html_ns);
    $title = $html_context->findnodes('//html:h1')->[0];
    $title = $html_context->findnodes('//html:span')->[0] unless $title;
    $title = $title->to_literal if $title;
    warn "Guessed missing title: $title\n";
  }
  my $f = DateTime::Format::W3CDTF->new;
  my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
  # write into the html subdirectory; if a file with this title already exists
  # (and -f was not given), append a counter instead of skipping the post
  my $file = "html/$title.html";
  if (-f $file and ! $opt_f) {
    warn "$file exists\n";
    my $i = 2;
    $i++ while -f "html/$title ($i).html";
    $file = "html/$title ($i).html";
  }
  open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
  print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
  close F;
  utime $dt, $dt, $file;
}

And I saved a simple wrapper as extract.sh:

#! /bin/sh
for f in feed/*.atom; do
    perl extract.pl "$@" < "$f"
done

And that was it: this version of extract.pl writes the HTML files straight into the html subdirectory, so there was nothing left to move.

Done!

– Alex