💾 Archived View for chirale.org › 2016-01-22_1388.gmi captured on 2024-09-29 at 00:01:30. Gemini links have been rewritten to link to archived content


From Drupal to Django: how to migrate contents

In a recent article I explained the motivations for upgrading from a no-longer-maintained Drupal 6 installation to Django. Here I cover in more detail the migration techniques adopted in the upgrade, and I look more closely at the models and their relationships.

recent article

Structure

If you’re a drupaler, you’re familiar with the node/NID/edit and the node/add/TYPE pages:

A-New-Page-Drupal-6-Sandbox

Here we have two visible fields: Title and Body. One is an input of type text and the other a textarea. The handy Form API provided by Drupal calls these two types textfield and textarea. However, if you use the content type creation interface you don’t deal with either of them directly: you just declare some field types, and the form is populated with the new fields as you add them.

Form API

textfield

textarea

It’s similar in Django, but you don’t have to go through a graphical interface to do this: the structure is code-driven, and a side effect is that almost anything can be put under revision control. You can choose between different field types, which are reflected both in the database and in the user interface.

choose between different field types

Here is what the Drupal Body and Title fields look like in a model called Article:

    # models.py
    from django.db import models
    from tinymce import models as tinymce_models


    # Articles
    class Article(models.Model):
        title = models.CharField(max_length=250, null=False, blank=False)
        body = tinymce_models.HTMLField(blank=True, default='')

The TinyMCE part requires the TinyMCE app to be installed and configured. If you’re new to Django, read and follow the great Writing your first Django app tutorial to understand the basics, e.g. the difference between a project and an app; otherwise the following sections will sound pretty obscure.

TinyMCE app

Writing your first Django app

After editing your projectname/appname/models.py file you can now apply the changes in your app via makemigrations (create a migration file for the changes in the database) and migrate (apply the migrations inside the migration files).

makemigrations

migrate

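In a real-world scenario these two fields alone aren’t enough, neither in Drupal nor in Django: authoring information such as the author and the publishing date is needed as well. This information is presented by default for any content type on Drupal 6: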

authoring-info

Drupal 6 treats authors as entities you can search through an autocomplete field, and the date as a pseudo-ISO 8601 date field. The author field is a link to the User table in Drupal. In Django a similar user model exists, but if you want to decouple access to the admin backend from authorship, it’s simpler to create a custom author model and later associate it with the real user model.

ISO 8601

similar user model

E-R diagram of the app the Drupal contents will be migrated to.

    from django.db import models
    from tinymce import models as tinymce_models


    # Authors
    class Author(models.Model):
        alias = models.CharField(max_length=100)
        name = models.CharField(max_length=100, null=True, blank=True)
        surname = models.CharField(max_length=100, null=True, blank=True)


    # Articles
    class Article(models.Model):
        # Many-to-one: each Article points to exactly one Author.
        # Django 1.8 signature; from Django 2.0 on, ForeignKey also requires on_delete.
        author = models.ForeignKey('Author', verbose_name='Authored by')
        title = models.CharField(max_length=250, null=False, blank=False)
        body = tinymce_models.HTMLField(blank=True, default='')
        publishing_date = models.DateTimeField(auto_now=False, auto_now_add=False,
                                               verbose_name='First published on')

As you can see in the Entity-Relationship diagram, one Article must have one and only one Author, but many Articles can have the same Author. This is called a many-to-one relationship, and in Django it is represented as a foreign key from the “many” model (e.g. Article) to the “one” model (Author).

Many-to-one relationship

The Article.publishing_date field is where publishing date and time are stored and, clicking on the text field, a calendar popup is presented to choose the day and hour, with a useful “now” shortcut to populate the field with the current time.

How a calendar is represented in a DateTime field.

Now that the basic fields are in place you can run makemigrations / migrate again to update your app, then restart the web server to apply the changes.

Attachments and images

Drupal ships with the ability to upload files and images to nodes. Django has two different fields for this: FileField and ImageField. Before continuing we have to rethink our E-R model to allow attachments.

FileField

ImageField

sport3_uml

 

The models.py code is:

    from django.db import models
    from tinymce import models as tinymce_models


    # Authors
    class Author(models.Model):
        alias = models.CharField(max_length=100)
        name = models.CharField(max_length=100, null=True, blank=True)
        surname = models.CharField(max_length=100, null=True, blank=True)


    # Articles
    class Article(models.Model):
        author = models.ForeignKey('Author', verbose_name='Authored by')
        title = models.CharField(max_length=250, null=False, blank=False)
        body = tinymce_models.HTMLField(blank=True, default='')
        publishing_date = models.DateTimeField(auto_now=False, auto_now_add=False,
                                               verbose_name='First published on')


    # Attachments
    class Attachments(models.Model):
        description = models.CharField(max_length=255, default='', blank=True)
        list = models.BooleanField(default=True)
        file = models.FileField(upload_to='attachments_directory', max_length=255)

Images are similar: if you want to enrich your model with images you can create another model like Attachments but with an ImageField instead. Remember to use a different upload_to directory in order to keep the attachments and images separated.

One last field is missing to complete our models: path. Django comes with a useful SlugField that, as of Django 1.8, allows only ASCII characters and can be mapped to another field, the title for example.

SlugField

    from django.db import models
    from tinymce import models as tinymce_models


    # Articles
    class Article(models.Model):
        author = models.ForeignKey('Author', verbose_name='Authored by')
        title = models.CharField(max_length=250, null=False, blank=False)
        body = tinymce_models.HTMLField(blank=True, default='')
        publishing_date = models.DateTimeField(auto_now=False, auto_now_add=False,
                                               verbose_name='First published on')
        # Path stored as a slug (no slashes); max_length and unique are assumed values
        path = models.SlugField(max_length=250, unique=True)

Keep in mind that a SlugField differs from a Drupal path field because it doesn’t allow slashes. Consider a path like this:

news/news-title

In Drupal you would have either A) a view with the path news and the argument news-title, or B) a fake path generated by pathauto or similar modules. After years of Drupal development, I can say that option B is the typical easy way that turns into a maintenance nightmare. Django core, as far as I know, allows only option A, so if you want a news view you have to declare it in urls.py and then in views.py, as stated in the official documentation.

pathauto

declare it in urls.py and then in views.py as stated in official documentation
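To make the difference concrete, here is a minimal sketch of how a Drupal path could be split into the part that belongs to a Django view and the part that fits in a SlugField. The helper name split_drupal_path and the example URL are hypothetical, and only Python’s standard library is used.

```python
from urllib.parse import urlparse


def split_drupal_path(nodepath):
    """Split a Drupal node URL into (view prefix, slug).

    The prefix is what a Django URL pattern would match (e.g. "news"),
    the slug is the last segment, which contains no slashes.
    """
    path = urlparse(nodepath).path.strip("/")
    prefix, _, slug = path.rpartition("/")
    return prefix, slug


prefix, slug = split_drupal_path("http://example.com/news/news-title")
print(prefix, slug)
```

A path made of a single segment yields an empty prefix, so the same helper works for nodes that live at the root of the site.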

Categories

And what about categories? If you have a category named Section, and an article can be associated with only one Section, you have to create a many-to-one relationship. As seen before, the foreign key goes on the N side of the relation, in this case Article, so the Article model will have a ForeignKey field referencing a specific section.

On the other hand, if you have tags to associate with your article, you have to create a Tag model with a many-to-many relationship to Article. Django will create an intermediate model storing the Article-Tag relationships.

Many-to-many

Do not abuse M2M relationships: each relation needs a separate table, and the number of JOINs on database tables will increase, with side effects on performance that may not even be perceivable at first, since the Django ORM is very efficient. Event handling is also more difficult for a beginner, since many-to-many events occur only when the parent models are saved, and it takes some experience to attach a custom action to an M2M event. If you design your E-R model wisely, you have nothing to be scared of.

many to many events occurs only when the parent models are saved

Migration techniques

Now that we have the destination models, fields and relationships, we can import the content from Drupal. In the previous article I suggested using the Views Datasource module to create a JSON view to export content. Please read the Exporting the data from Drupal section of that article before continuing.

previous article

Views Datasource

Exporting the data from Drupal section inside the article

The obtained row is something like:

    {
      [
        {
          {
            nid: '30004',
            domainsourceid: '2',
            nodepath: 'http://example.com/path/here',
            postdate: '2014-09-17T22:18:42+0200',
            nodebody: 'HTML TEXT HERE',
            nodetype: 'drupal type',
            nodetitle: 'Title here',
            nodeauthor: 'monty',
            nodetags: 'Drupal, dragonball, paintball'
          }
        },
        ...
      ]
    }

If you don’t have a multi-site Drupal you can ignore the domainsourceid field. The nodetags field lists some Tag names of a many-to-many relationship not covered here.

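All the other values are useful for the import:

* nid: the original content id, used for pagination and retrieval. Destination: parsing
* nodepath: content path. Destination: Article.path
* nodebody: content body. Destination: Article.body
* nodetype: type of the node. Destination: parsing
* nodetitle: title of the node. Destination: Article.title
* nodeauthor: author of the content. Destination: Article.author -> Author.alias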
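The mapping above can be sketched as a plain function that turns one exported row into the values destined for the Article fields. This is only an illustration: map_row is a hypothetical helper, reducing nodepath to its last segment is an assumption (a SlugField cannot hold slashes), and so is feeding postdate into Article.publishing_date, which the list does not mention.

```python
from datetime import datetime
from urllib.parse import urlparse


def map_row(row):
    """Translate one exported Drupal row into Article field values."""
    # Assumption: only the last path segment is kept, since slugs have no slashes
    slug = urlparse(row["nodepath"]).path.rstrip("/").rsplit("/", 1)[-1]
    return {
        "path": slug,
        "body": row["nodebody"],
        "title": row["nodetitle"],
        # Used later to look up Author.alias
        "author_alias": row["nodeauthor"],
        # Assumption: postdate becomes Article.publishing_date
        "publishing_date": datetime.strptime(row["postdate"], "%Y-%m-%dT%H:%M:%S%z"),
    }


row = {
    "nid": "30004",
    "nodepath": "http://example.com/path/here",
    "postdate": "2014-09-17T22:18:42+0200",
    "nodebody": "HTML TEXT HERE",
    "nodetype": "drupal type",
    "nodetitle": "Title here",
    "nodeauthor": "monty",
}
fields = map_row(row)
print(fields["path"], fields["publishing_date"].isoformat())
```

Keeping the mapping in one small function makes it easy to check against a handful of exported rows before touching the database.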

In the previous article you can find how to build the View on Drupal (the source), and now you have a rough idea of the field mapping. How do you fetch the data from Django?

previous article

Management command and paged view

To start a one-time import you can write a custom management command for your Django application named project/app/management/commands/myimport.py.

    from __future__ import unicode_literals
    from django.core.management.base import BaseCommand, CommandError
    from django.core.exceptions import ValidationError, MultipleObjectsReturned, ObjectDoesNotExist
    import json, urllib
    import urlparse
    from shutil import copyfile
    from django.conf import settings
    from os import sep
    from django.core.files.storage import default_storage
    from django.utils.text import slugify
    import requests
    import grequests
    import time
    from md5 import md5


    class Command(BaseCommand):
        help = 'Import data from Drupal 6 Json view'

        def add_arguments(self, parser):
            parser.add_argument('start', nargs=1, type=int)
            parser.add_argument('importtype', nargs=1)
            # Named (optional) arguments
            # Crawl
            parser.add_argument('--crawl',
                                action='store_true',
                                dest='crawl',
                                default=False,
                                help='Crawl data.')

        def handle(self, *args, **options):
            # process data
            pass

This management command can be launched with

 python manage.py myimport 0 article --crawl 

Here 0 is the nid to start from (the import begins at that nid + 1), “article” is the type of content to import (e.g. the destination model) and --crawl is the import option. Let’s add the import logic to the Command.handle method:

    def handle(self, *args, **options):
        try:
            assert options['crawl'] and options['importtype']
            # start to import or store data
            sid = int(options['start'].pop())
            reading = True
            while reading:
                importazioni = []
                articoli = []
                url = 'http://www.example.com/json-path-verylongkey?nid=%d' % (sid,)
                print url
                response = urllib.urlopen(url)
                data = json.loads(response.read())
                data = data['']
                # no data received, quit
                if not data:
                    reading = False
                    break
                for n, record in enumerate(data):
                    sid = int(record['']['nid'])
                    title = record['']['nodetitle']
                    # continue to process data, row after row
                    # ...
        except AssertionError:
            raise CommandError('Invalid import command')

This example fetches /json-path-verylongkey starting from the nid passed on the command line + 1. Then it processes the JSON row after row, keeping in memory the id of the last item. When no more content is available, the loop stops. It’s a common method and it’s lightweight on the source server, because only one request at a time is sent and then the response is processed. However, this method can also be slow, because the waiting times add up: (request 1 + response 1 + parse 1) + (request 2 + response 2 + parse 2) etc.
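The paging logic can be isolated from the HTTP call and tried out on its own. Below is a minimal sketch under that assumption: iter_rows, fake_fetch and FAKE_NODES are hypothetical names, and the fake fetcher stands in for the request to the Drupal view, returning the rows whose nid is greater than the one passed in.

```python
def iter_rows(fetch_page, start_nid):
    """Yield rows page after page until the source returns nothing."""
    sid = start_nid
    while True:
        rows = fetch_page(sid)
        if not rows:
            # no data received, quit
            return
        for row in rows:
            sid = int(row["nid"])  # remember the last nid seen
            yield row


# Stand-in for the HTTP call: five fake nodes, served two per page
FAKE_NODES = [{"nid": str(n), "nodetitle": "Title %d" % n} for n in range(1, 6)]


def fake_fetch(after_nid, per_page=2):
    newer = [r for r in FAKE_NODES if int(r["nid"]) > after_nid]
    return newer[:per_page]


titles = [row["nodetitle"] for row in iter_rows(fake_fetch, 0)]
print(titles)
```

Passing the fetcher as an argument keeps the loop testable without a running Drupal site; in the management command it would be replaced by the real request.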

Multiple, asynchronous requests

We can speed up the retrieval by using grequests. First you have to find the last element: clone the Drupal JSON data source view so that it shows only the last item, then fetch its id.

grequests

    def handle(self, *args, **options):
        try:
            assert options['crawl'] and options['importtype']
            # start to import or store data
            sid = int(options['start'].pop())
            # find last node id to create an url list
            url = 'http://www.example.com/json-path-verylongkey-last-nid'
            response = requests.get(url, timeout=50)
            r = response.json()
            last_nid = int(r[''].pop()['']['nid'])

You can then create a from-to range starting from the first element passed by command line to the last.

            url_pattern = "http://www.example.com/json-path-verylongkey-last-nid?fromnid=%d&tonid=%d"
            urls = []
            per_page = 20
            # e.g. [0, 20, 40, 60] when sid is 0
            relements = list(range(sid, last_nid, per_page))
            if relements[-1] < last_nid:
                relements.append(last_nid + 1)
            # pair each boundary with the next one: (0, 20), (20, 40), ...
            for fromx, toy in zip(relements, relements[1:]):
                u = url_pattern % (fromx, toy)
                urls.append(u)
            rs = (grequests.get(u) for u in urls)
            # blocking request: stay here until the last response is received
            async_responses = grequests.map(rs)
            # all responses fetched
        except AssertionError:
            raise CommandError('Invalid import command')

per_page is the number of elements per page specified in the Drupal JSON view. Instead of a single nid parameter, fromnid and tonid are the “greater than” and “less than or equal to” parameters specified in the Drupal view.
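The range construction is easy to get wrong at the edges, so here is a small standalone sketch of it. The function name build_page_urls and the URL in the usage example are hypothetical; the last boundary is last_nid + 1, as in the listing above, and an empty list is returned when there is nothing left to fetch.

```python
def build_page_urls(url_pattern, start_nid, last_nid, per_page):
    """Build one URL per (fromnid, tonid) window between start_nid and last_nid."""
    bounds = list(range(start_nid, last_nid, per_page))
    if not bounds:
        # start_nid is already at or past the last node
        return []
    bounds.append(last_nid + 1)
    # Pair each boundary with the next one: (0, 20), (20, 40), ...
    return [url_pattern % (lo, hi) for lo, hi in zip(bounds, bounds[1:])]


pattern = "http://www.example.com/json-view?fromnid=%d&tonid=%d"
urls = build_page_urls(pattern, 0, 65, 20)
for u in urls:
    print(u)
```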

The core of the asynchronous, multiple requests is grequests.map(). It takes a collection of unsent requests and sends them concurrently. The responses may arrive in any order, but async_responses will be populated with all of them, in the same order as the requests; a request that fails yields None.

At that point you can treat the response list like before, parsing the response.json() of each element of the list.
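The same fetch-everything-then-parse pattern can be tried without grequests. The sketch below swaps in ThreadPoolExecutor from Python’s standard library as a stand-in, not as the method used in this import; fetch_all, collect_rows and fake_fetch are hypothetical names, and fake_fetch replaces the real HTTP request plus JSON decoding.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def collect_rows(pages):
    """Flatten the pages into one list of rows, skipping failed (None) pages."""
    return [row for page in pages if page is not None for row in page]


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP GET followed by JSON decoding
    if url.endswith("broken"):
        return None
    return [{"nid": url.rsplit("=", 1)[-1]}]


pages = fetch_all(
    [
        "http://www.example.com/v?p=1",
        "http://www.example.com/v?p=2",
        "http://www.example.com/broken",
    ],
    fake_fetch,
)
rows = collect_rows(pages)
print(rows)
```

Skipping None entries matters with either approach, since a page that failed to download would otherwise break the row-by-row parsing.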

With these hints you can now create JSON views within Drupal, ready to be fetched and parsed in Django. In a future article I will cover the conversion between the data and Django using the Django ORM.

Django ORM

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2015/10/10/guide-to-migrate-a-drupal-website-to-django-after-the-release-of-drupal-8/

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2016/01/22/from-drupal-to-django-how-to-migrate-contents/a-new-page-drupal-6-sandbox/

https://web.archive.org/web/20160122000000*/https://api.drupal.org/api/drupal/developer%21topics%21forms_api_reference.html/6

https://web.archive.org/web/20160122000000*/https://api.drupal.org/api/drupal/developer%21topics%21forms_api_reference.html/6#textfield

https://web.archive.org/web/20160122000000*/https://api.drupal.org/api/drupal/developer%21topics%21forms_api_reference.html/6#textarea

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/models/fields/

https://web.archive.org/web/20160122000000*/http://django-tinymce.readthedocs.org/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/intro/tutorial01/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/django-admin/#django-admin-makemigrations

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/django-admin/#migrate-app-label-migrationname

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2016/01/22/from-drupal-to-django-how-to-migrate-contents/authoring-info/

https://web.archive.org/web/20160122000000*/https://it.wikipedia.org/wiki/ISO_8601

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/topics/auth/customizing/#substituting-a-custom-user-model

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2016/01/22/from-drupal-to-django-how-to-migrate-contents/sport3_uml/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/topics/db/examples/many_to_one/

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2016/01/22/from-drupal-to-django-how-to-migrate-contents/calendario/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/models/fields/#django.db.models.FileField

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/models/fields/#django.db.models.ImageField

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2016/01/22/from-drupal-to-django-how-to-migrate-contents/sport3_uml-2/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/models/fields/#slugfield

https://web.archive.org/web/20160122000000*/https://www.drupal.org/project/pathauto

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/topics/http/urls/#example

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/models/fields/#unique

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/topics/db/examples/many_to_many/

https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/ref/signals/#m2m-changed

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2015/10/10/guide-to-migrate-a-drupal-website-to-django-after-the-release-of-drupal-8/

https://web.archive.org/web/20160122000000*/https://www.drupal.org/project/views_datasource

https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2015/10/10/guide-to-migrate-a-drupal-website-to-django-after-the-release-of-drupal-8/


https://web.archive.org/web/20160122000000*/https://chirale.wordpress.com/2015/10/10/guide-to-migrate-a-drupal-website-to-django-after-the-release-of-drupal-8/


https://web.archive.org/web/20160122000000*/https://pypi.python.org/pypi/grequests



https://web.archive.org/web/20160122000000*/https://docs.djangoproject.com/en/1.8/topics/db/