Ok, so I wanted a local copy of Metal Earth in order to better prepare for my game. Based on previous work I had done, this proved to be fairly easy and I improved my scripts along the way. Yay!
To identify the blog, look at the source of any page. The HTML header will contain a line like the following: `<link rel="service.post" type="application/atom+xml" title="..." href="http://www.blogger.com/feeds/XXX/posts/default" />` – this is where you get the number from. In this case, the number is 2248254789731612355.
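If you'd rather not scan the page source by hand, a `sed` one-liner can pull the number out of that header line. This is a sketch run against a sample of the line rather than a live page; for a real blog you'd pipe `curl -s` of any page through the same `sed` command.

```shell
# The kind of line you will find in the page source:
header='<link rel="service.post" type="application/atom+xml" title="..." href="http://www.blogger.com/feeds/2248254789731612355/posts/default" />'
# Extract the digits between "feeds/" and "/posts":
echo "$header" | sed -n 's!.*blogger\.com/feeds/\([0-9]*\)/posts.*!\1!p'
# → 2248254789731612355
```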
```sh
#! /bin/sh
for i in `seq 40`; do
    start=$((($i-1)*25+1))
    curl -o foo-$i.atom "http://www.blogger.com/feeds/2248254789731612355/posts/default?start-index=$start&max-results=25"
done
```
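The start-index arithmetic deserves a quick sanity check: the feed returns at most 25 posts per request, so batch *i* has to begin at post (i−1)×25+1.

```shell
# First post index for batches 1, 2 and 3:
for i in 1 2 3; do
    echo $((($i-1)*25+1))
done
# → 1, 26, 51
```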
You’ll find that you only need to keep the first four of them.
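One way to see which batches are worth keeping is to look for files with no `<entry>` element: `grep -L` lists the files that do *not* match a pattern. A small sketch, demonstrated on two throwaway sample files rather than real feed downloads:

```shell
tmp=$(mktemp -d)
printf '<feed><entry/></feed>' > "$tmp/foo-1.atom"   # a batch with posts
printf '<feed></feed>'         > "$tmp/foo-2.atom"   # a near-empty batch
# List the files containing no entries, i.e. the ones you can delete:
grep -L '<entry' "$tmp"/*.atom
rm -r "$tmp"
```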
```sh
#! /bin/sh
for f in *.atom; do
    perl extract.pl "$@" < "$f"
done
```
```perl
#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;

our $opt_f;
getopts('f');

undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
foreach my $entry ($context->findnodes('//atom:entry')) {
  my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
  my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
  $title =~ s!/!_!gi;
  $title =~ s!&amp;!&!gi;
  $title =~ s!&#(\d+);!chr($1)!ge;
  if (not $title) {
    if (not $html_parser) {
      $html_parser = HTML::HTML5::Parser->new;
    }
    my $html_doc = $html_parser->parse_string($content);
    # we don't know the HTML namespace for certain
    my $html_ns = $html_doc->documentElement->namespaceURI();
    my $html_context = XML::LibXML::XPathContext->new($html_doc);
    $html_context->registerNs('html', $html_ns);
    $title = $html_context->findnodes('//html:h1')->[0];
    $title = $html_context->findnodes('//html:span')->[0] unless $title;
    $title = $title->to_literal if $title;
    warn "Guessed missing title: $title\n";
  }
  my $f = DateTime::Format::W3CDTF->new;
  my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
  my $file = $title . ".html";
  if (-f $file and ! $opt_f) {
    warn "$file exists\n";
  } else {
    open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
    print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
    close F;
    utime $dt, $dt, $file;
  }
}
```
#Blogs
(Please contact me if you want to remove your comment.)
⁂
Recently I wanted a copy of *Elfmaids & Octopi* because the owner announced on Reddit that they were going to move elsewhere.
The directory structure I used:
┬ Elfmaids & Octopi
├ feed
└ html
This is how I got a copy of the feed, download.sh in the top folder:
```sh
#! /bin/sh
for i in `seq 80`; do
    start=$((($i-1)*25+1))
    curl -o foo-$i.atom "https://www.blogger.com/feeds/737809845612070971/posts/default?start-index=$start&max-results=25"
done
```
This downloads a bit more than seventy Atom files plus a few nearly empty Atom files. I moved these into the first subdirectory:
mv *.atom feed
I installed the missing dependency for my Perl script (depending on your setup you might have more dependencies missing, and you might have to use cpan instead of my favourite, cpanm):
cpanm HTML::HTML5::Parser
I saved the Perl script as extract.pl:
```perl
#!/usr/bin/perl
use strict;
use XML::LibXML;
use HTML::HTML5::Parser;
use Getopt::Std;
use DateTime::Format::W3CDTF;
use DateTime;

our $opt_f;
getopts('f');

undef $/;
my $data = <STDIN>;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($data);
die $@ if $@;
my $encoding = $doc->actualEncoding();
my $context = XML::LibXML::XPathContext->new($doc);
$context->registerNs('atom', 'http://www.w3.org/2005/Atom');
my $html_parser;
foreach my $entry ($context->findnodes('//atom:entry')) {
  my $content = $entry->getChildrenByTagName('content')->[0]->to_literal;
  my $title = $entry->getChildrenByTagName('title')->[0]->to_literal;
  $title =~ s!/!_!gi;
  $title =~ s!&amp;!&!gi;
  $title =~ s!&#(\d+);!chr($1)!ge;
  if (not $title) {
    if (not $html_parser) {
      $html_parser = HTML::HTML5::Parser->new;
    }
    my $html_doc = $html_parser->parse_string($content);
    # we don't know the HTML namespace for certain
    my $html_ns = $html_doc->documentElement->namespaceURI();
    my $html_context = XML::LibXML::XPathContext->new($html_doc);
    $html_context->registerNs('html', $html_ns);
    $title = $html_context->findnodes('//html:h1')->[0];
    $title = $html_context->findnodes('//html:span')->[0] unless $title;
    $title = $title->to_literal if $title;
    warn "Guessed missing title: $title\n";
  }
  my $f = DateTime::Format::W3CDTF->new;
  my $dt = $f->parse_datetime($entry->getChildrenByTagName('updated')->[0]->to_literal)->epoch;
  my $file = "html/$title.html";
  if (-f $file and ! $opt_f) {
    warn "$file exists\n";
    my $i = 2;
    $i++ while -f "html/$title ($i).html";
    $file = "html/$title ($i).html";
  }
  open(F, ">:encoding($encoding)", $file) or die $! . ' ' . $file;
  print F <<EOT;
<html>
<head>
<meta content='text/html; charset=$encoding' http-equiv='Content-Type'/>
</head>
<body>
$content
</body>
</html>
EOT
  close F;
  utime $dt, $dt, $file;
}
```
And I saved a simple wrapper as extract.sh:
```sh
#! /bin/sh
for f in feed/*.atom; do
    perl extract.pl "$@" < "$f"
done
```
And finally I moved all the HTML files into the second subdirectory:
mv *.html html
Done!
– Alex