2010-05-31 Blognapping

For *Blogger*, using *bash*, *perl*, and *curl*. You need to replace XXX with the magic number you get when you look at the blog’s source. The HTML header will contain a line like the following: `<link rel="service.post" type="application/atom+xml" title="..." href="http://www.blogger.com/feeds/XXX/posts/default" />` – this is where you get the number from.
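
If you don’t feel like hunting through the source by hand, something like the following should print the feed URL for you (just a sketch: `example.blogspot.com` stands in for the blog you’re after, and it assumes the head of the page contains a line like the one quoted above):

curl -s "http://example.blogspot.com/" \
  | grep -o 'http://www.blogger.com/feeds/[0-9]*/posts/default' \
  | head -n 1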

Once you have it:

for i in `seq 40`; do
  start=$(((10#$i-1)*25+1))
  curl -o foo-$i.atom "http://www.blogger.com/feeds/XXX/posts/default?start-index=$start&max-results=25"
done

This should get you 40 files called `foo-1.atom` to `foo-40.atom` with 25 articles each in your current directory. Delete the ones that don’t contain any results, or increase the number if the blog you’re interested in has more than 1000 posts.
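
If you’d rather not open each file to check, `grep -L` lists the files that contain no match at all; assuming every file with results contains at least one `<entry>` element, this prints the candidates for deletion (the same check works for the WordPress feeds below):

grep -L '<entry' foo-*.atom

Once you’ve looked the list over, you can pipe it to `xargs rm`.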

For a *Wordpress* blog, we try to do the same thing. First, get the atom pages:

for i in `seq 100`; do
  curl -o foo-$i.atom "http://foo.wordpress.com/feed/atom/?paged=$i"
done

This should get you 100 files called `foo-1.atom` to `foo-100.atom` in your current directory. Delete the ones that don’t contain any results, or increase the number if the blog has more posts.

Now, unless the author has disabled it somehow, the atom feeds already include the complete articles. It’s certainly possible to fetch them all again, but it’s not necessary. Save the following in a Perl script called `extract.pl`.

#!/usr/bin/perl
use strict;
use XML::LibXML;
undef $/; # slurp: read the whole input at once
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
foreach my $entry ($context->findnodes('//atom:entry')) {
  my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
  $title =~ s!/!_!gi;
  $title =~ s!&amp;!&!gi;
  $title =~ s!&#(\d+);!chr($1)!ge;
  my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
  open(F, ">:raw" . $title . ".html") or die $! . ' ' . $title;
  $content = utf8::decode($content);
  print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
  close F;
}

Run it on the Atom files:

for f in *.atom; do
    perl extract.pl < $f
done

You should end up with a ton of HTML files in your current directory.

If that doesn’t work, perhaps the author only has links to the actual articles in their atom files. Here is how to extract the HTML links from these Atom feeds: save the following in a Perl script called `url.pl`.

#!/usr/bin/perl
use XML::LibXML;
undef $/;
$data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
foreach ($context->findnodes('//atom:entry'
			     . '/atom:link[@rel="alternate"][@type="text/html"]'
			     . '/attribute::href')) {
  print $_->to_literal() . "\n";
}

Now you can extract all the URLs and fetch them:

for f in *.atom; do
    for url in `perl url.pl < $f`; do
        curl -O "$url";
    done;
done

The use of the `-O` option assumes that the file names given by the URLs are unique. This is not necessarily true: `http://localhost/2010/05/test.html` and `http://localhost/2010/06/test.html` would both be saved as `test.html`, one overwriting the other.
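
If that worries you, here is a sketch of a collision-free variant of the loop above: instead of `-O`, build the local file name from the whole URL path (the `sed` expression strips the scheme and turns the remaining slashes into underscores):

for f in *.atom; do
    for url in `perl url.pl < $f`; do
        curl -o "$(echo "$url" | sed -e 's!^[a-z]*://!!' -e 's!/!_!g')" "$url"
    done
done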

You should end up with a ton of HTML files in your current directory.

This doesn’t get any required extra files like CSS or images, but it might be good enough for a *blog backup*.

If you want to get the images, here’s a way to extract the image URLs and download them. Save the following as `img.pl`.

#!/usr/bin/perl
use XML::LibXML;
undef $/;
$data = <STDIN>;
# munging
$data =~ s!<colgroup>.*?</colgroup>!!gs;
$data =~ s!<class western="">.*?</class>!!gs;
# parsing
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($data);
die $@ if $@;
# extracting
my $context = XML::LibXML::XPathContext->new($doc);
foreach ($context->findnodes('//img/attribute::src[starts-with(.,"http")]')) {
  print $_->to_literal() . "\n";
}

Use it:

for f in *.html; do
    echo "$f"
    for img in $(perl img.pl < "$f"); do
        echo "$img"
        curl -s -O "$img"
    done
done

Watch out for parser errors!
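
If you want to know which files trip it up, run the extractor once per file and check its exit status (a rough check; it assumes `img.pl` dies, and thus exits non-zero, when the parse fails):

for f in *.html; do
    perl img.pl < "$f" > /dev/null 2>&1 || echo "parse failed: $f"
done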

What’s still missing is fixing the image links in the HTML files so that they point to the local copies. How about this? Save the following as `img-replace.pl`.

#!/usr/bin/perl
use XML::LibXML;
undef $/;
$data = <STDIN>;
# munging
$data =~ s!<colgroup>.*?</colgroup>!!gs;
$data =~ s!<class western="">.*?</class>!!gs;
# parsing
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($data);
die $@ if $@;
# extracting
my $context = XML::LibXML::XPathContext->new($doc);
for my $attr ($context->findnodes('//img/attribute::src[starts-with(.,"http")]')) {
  my $url = $attr->getValue();
  $url =~ s!.*/!!;
  $attr->setValue($url);
}
print $doc->toString();

Use:

for f in *.html; do
    echo "$f";
    perl img-replace.pl < "$f" > "${f}_"
    mv "${f}_" "$f"
done

This seems to work well enough. Note, however, that if an HTML file cannot be parsed, it will be overwritten with an empty file.
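
A small guard avoids that (a sketch): only replace the original if the rewritten file is non-empty, which is what `test -s` checks.

for f in *.html; do
    echo "$f"
    perl img-replace.pl < "$f" > "${f}_"
    if [ -s "${f}_" ]; then
        mv "${f}_" "$f"
    else
        echo "skipping $f (could not be parsed?)"
        rm -f "${f}_"
    fi
done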

#Blogs #Atom

Comments

(Please contact me if you want to remove your comment.)

I’m surprised (and somewhat impressed) that it works for Wordpress blogs too.

Good work!

– greywulf 2010-06-01 05:51 UTC

---

Now that the coding has been done, I need to do actual text assembly. Yikes! 🙂

– Alex Schroeder 2010-06-01 17:52 UTC

---

If you’re wondering how to do this… Assume you want to pull a copy of A Hamsterish Hoard of Dungeons and Dragons. Examine the source code and you’ll find a link to the Atom feed on *blogger*. This is important, because it’ll provide us with the blog ID! In this case:

`<link rel="alternate" type="application/atom+xml" title="A Hamsterish Hoard of Dungeons and Dragons - Atom" href="http://hamsterhoard.blogspot.com/feeds/posts/default" /> <link rel="alternate" type="application/rss+xml" title="A Hamsterish Hoard of Dungeons and Dragons - RSS" href="http://hamsterhoard.blogspot.com/feeds/posts/default?alt=rss" /> <link rel="service.post" type="application/atom+xml" title="A Hamsterish Hoard of Dungeons and Dragons - Atom" href="http://www.blogger.com/feeds/5373792969086619654/posts/default" />` ← that’s the one we’re looking for!

Start with a small set: the last 100 entries:

for i in `seq 4`; do
  start=$((($i-1)*25+1))
  curl -o taichara-$i.atom "http://www.blogger.com/feeds/5373792969086619654/posts/default?start-index=$start&max-results=25"
done

Save it in a script such as `download-atom.sh` and run it using `bash download-atom.sh`. You’ll end up with the files `taichara-1.atom taichara-2.atom taichara-3.atom taichara-4.atom`.

Now take the Perl script from the main page and save it as `url.pl`. It will extract the page URLs from the Atom files.

for f in *.atom; do
  for p in `perl url.pl < $f`; do
    wget $p
  done
done

Once you’ve verified it, you can fetch more Atom pages.

– Alex Schroeder 2011-11-16 19:53 UTC