sylvaindurand.org

Outlook emails analytics with Python

Microsoft Outlook uses a proprietary data format, "PST" (Personal Storage Table), to store emails, appointments or tasks. It was not until 2010 that the specifications of this format were made public, which probably explains why few tools are available to open it and use its data.

In Python, however, the libpff library can export most of the metadata. The following sections show how to use it to graph incoming emails.

Installation of libpff

Like most Python libraries, libpff can be installed with `pip`:

pip install libpff-python

However, at the time of writing, the version offered by pip (20161119) cannot retrieve the time of the messages. This has been fixed in a more recent version (20190725), available with:

pip install libpff-python-ratom

Retrieving emails

File opening

First we load the library:

import pypff

Then we open our file. This can take quite a while, depending on the size of your archive:

pst = pypff.file()
pst.open("mails.pst")
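Since the archive can be large, it may be worth making sure the file is closed again even if an error occurs later. The following is only a sketch: `opened` is a hypothetical helper (it is not part of pypff), and it assumes the object passed in offers `open(path)` and `close()` methods, as `pypff.file()` is used above.

```python
from contextlib import contextmanager


@contextmanager
def opened(pst_file, path):
    """Open `path` on `pst_file` and always close it afterwards.

    `pst_file` is assumed to expose open(path) and close().
    """
    pst_file.open(path)
    try:
        yield pst_file
    finally:
        # Runs even if the body of the `with` block raises.
        pst_file.close()


# Intended use with the real library (not executed here):
#     import pypff
#     with opened(pypff.file(), "mails.pst") as pst:
#         root = pst.get_root_folder()
```

Everything that reads from the archive then has to happen inside the `with` block.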

Metadata extraction

It is possible to navigate through the structure using the functions offered by the library, from the root:

root = pst.get_root_folder()

To extract the data, a recursive function can then be used:

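def parse_folder(base):
    """Collect metadata for the messages in every folder below `base`."""
    # Note: messages stored directly in `base` itself are not collected.
    messages = []
    for folder in base.sub_folders:
        # Descend first, so that nested folders are handled as well
        if folder.number_of_sub_folders:
            messages += parse_folder(folder)
        # Print the folder name to follow the progress
        print(folder.name)
        for message in folder.sub_messages:
            messages.append({
                "subject": message.subject,
                "sender": message.sender_name,
                "datetime": message.client_submit_time
            })
    return messages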

messages = parse_folder(root)

This function can be quite slow depending on the number of messages (about 300 messages are processed per second on my computer).
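The same traversal can also be written as a generator, which avoids building intermediate lists and also picks up messages stored directly in the starting folder. This is a sketch, not a replacement for the function above: `iter_messages` and the `fake_*` objects are hypothetical names, and the stand-ins only mimic the attributes used in this article (`sub_folders`, `sub_messages`, `subject`, `sender_name`, `client_submit_time`), so that the example runs without a PST file.

```python
from types import SimpleNamespace


def iter_messages(folder):
    """Yield one dict per message in `folder` and in all its descendants."""
    for message in folder.sub_messages:
        yield {
            "subject": message.subject,
            "sender": message.sender_name,
            "datetime": message.client_submit_time,
        }
    for child in folder.sub_folders:
        yield from iter_messages(child)


def fake_message(subject):
    # Stand-in for a pypff message, with only the attributes read above.
    return SimpleNamespace(subject=subject, sender_name="Alice",
                           client_submit_time=None)


def fake_folder(messages=(), folders=()):
    # Stand-in for a pypff folder.
    return SimpleNamespace(sub_messages=list(messages),
                           sub_folders=list(folders))


archive = fake_folder(folders=[fake_folder([fake_message("b")])])
inbox = fake_folder([fake_message("a")], [archive])
fake_root = fake_folder(folders=[inbox])

subjects = [m["subject"] for m in iter_messages(fake_root)]
print(subjects)
```

With a real archive, `list(iter_messages(root))` would give a list of dictionaries of the same shape as the one returned by `parse_folder`.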

Once this is done, we can import the resulting list into Pandas:

import pandas as pd
df = pd.DataFrame(messages)

Time conversion

The timestamps extracted from the file are in UTC but carry no time zone information, so they must be converted to the correct time zone. First, the column is declared as UTC, then it is converted to the desired time zone:

df['datetime'] = df['datetime'].dt.tz_localize(tz='UTC')
df['datetime'] = df['datetime'].dt.tz_convert(tz='Europe/Paris')
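To see what these two steps do on a single value, here is a small sketch using only the standard library (`zoneinfo` requires Python 3.9 or later and time zone data available on the system). The helper name `utc_naive_to_paris` is made up for the example, and it assumes, as above, that the timestamps are naive datetimes expressed in UTC.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

PARIS = ZoneInfo("Europe/Paris")


def utc_naive_to_paris(naive_utc):
    """Declare a naive datetime as UTC, then express it in Paris time."""
    aware_utc = naive_utc.replace(tzinfo=timezone.utc)  # like tz_localize
    return aware_utc.astimezone(PARIS)                  # like tz_convert


# The same UTC hour gives a different local hour in winter and in summer.
winter = utc_naive_to_paris(datetime(2019, 1, 15, 12, 0))
summer = utc_naive_to_paris(datetime(2019, 7, 15, 12, 0))
print(winter.hour, summer.hour)  # 13 14
```

This is why a fixed offset would not be enough: daylight saving time shifts the local hour by one more hour in summer.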

Plot example

We will draw a scatter plot showing the arrival time of emails by date. To do this, two columns are created with the coordinates of the points to be plotted:

df['hour'] = df['datetime'].dt.hour + df['datetime'].dt.minute / 60
df['date'] = df['datetime'].dt.year + df['datetime'].dt.dayofyear / 365
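Note that dividing by 365 is an approximation: on 31 December of a leap year, `dayofyear` is 366, so the value slightly exceeds the next year. This is invisible at the scale of the plot, but a variant that takes leap years into account could look like the sketch below. The helpers `decimal_year` and `decimal_hour` are hypothetical names, written for plain `datetime` objects.

```python
import calendar
from datetime import datetime


def decimal_year(ts):
    """Return the year plus the elapsed fraction of that year (whole days)."""
    days_in_year = 366 if calendar.isleap(ts.year) else 365
    day_index = ts.timetuple().tm_yday - 1  # 0 on 1 January
    return ts.year + day_index / days_in_year


def decimal_hour(ts):
    """Return the time of day as a decimal number of hours."""
    return ts.hour + ts.minute / 60


print(decimal_year(datetime(2019, 1, 1)))         # 2019.0
print(decimal_hour(datetime(2019, 1, 1, 9, 30)))  # 9.5
```

Since Pandas timestamps behave like `datetime` objects, these helpers should also be usable with `df['datetime'].apply(decimal_year)`, at the cost of being slower than the vectorized expressions above.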

Then we plot:

import matplotlib.pyplot as plt
import seaborn as sns

plt.clf()
ax = sns.scatterplot(x="date", y="hour", s=3, alpha=.3, linewidth=0, marker=".", data=df)
ax.set(xlim=(2014.5,2020), ylim=(7,25))
ax.invert_yaxis()
sns.despine()
ax.get_figure().savefig("plot.png", dpi=400)

This gives us:

Image: Scatter plot of emails depending on the hour and date