💾 Archived View for bacaliu.de › analyzing_gadgetbridge_data_in_python.txt captured on 2023-07-22 at 17:11:07.
           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                ANALYZING GADGETBRIDGE-DATA WITH PYTHON
                 Amazfit Neo → Gadgetbridge → Sqlite →
                             Python-Pandas
           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


                               2023-03-15





1 Some helpful imports
══════════════════════

  ┌────
  │ import hsluv
  │ import matplotlib.pyplot as plt
  │ plt.rcParams["figure.figsize"] = (8, 4)
  │ import seaborn as sns
  │ from datetime import datetime
  └────
  Listing 1: some helpful imports


2 Getting the Data
══════════════════

  The FOSS application Gadgetbridge (“Gadgetbridge for android” 2022)
  supports exporting the collected data into an SQLite file. Loading
  it into Python is not that difficult. In my case the file is
  automatically mirrored into my `~/Sync' folder through Syncthing
  (“Syncthing” 2019) and named `ggb.sqlite'.
  ┌────
  │ import pandas as pd
  │ import sqlite3
  │ conn = sqlite3.connect("/home/adrian/Sync/ggb.sqlite")
  │ df = pd.read_sql_query(
  │     """SELECT TIMESTAMP, RAW_INTENSITY, STEPS, RAW_KIND, HEART_RATE
  │     FROM MI_BAND_ACTIVITY_SAMPLE;""",
  │     conn
  │ )
  │ 
  │ df.describe().to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
            TIMESTAMP  RAW_INTENSITY    STEPS  RAW_KIND  HEART_RATE 
  ──────────────────────────────────────────────────────────────────
   count       331004         331004   331004    331004      331004 
   mean   1.66892e+09        24.3221  4.33082   125.133     76.3764 
   std     5.7575e+06        28.2967  15.5202   84.1054     33.7458 
   min    1.65897e+09             -1        0         1          -1 
   25%    1.66394e+09              0        0        80          60 
   50%     1.6689e+09             17        0        90          71 
   75%    1.67387e+09             38        0       240          81 
   max    1.67895e+09            198      144       251         255 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  I did not include the columns `DEVICE_ID' and `USER_ID'. They always
  have the same value if you use only one device as one user, so I
  leave them out to keep the tables smaller; otherwise a `SELECT * FROM
  MI_BAND_ACTIVITY_SAMPLE' would be sufficient.
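
  If the table or column names are not known in advance, SQLite can
  list them itself. The following is only a sketch: it builds an
  in-memory stand-in whose schema is an assumption reduced to the
  columns named in this text, then queries `sqlite_master' and `PRAGMA
  table_info'. Against the real export, only the `connect' call would
  differ.

```python
import sqlite3

# In-memory stand-in for the exported file (assumed, simplified schema).
# The real file would be opened with sqlite3.connect(<path to ggb.sqlite>).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MI_BAND_ACTIVITY_SAMPLE ("
    "TIMESTAMP INTEGER, DEVICE_ID INTEGER, USER_ID INTEGER, "
    "RAW_INTENSITY INTEGER, STEPS INTEGER, RAW_KIND INTEGER, "
    "HEART_RATE INTEGER)"
)

# sqlite_master holds one row per table stored in the database.
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(MI_BAND_ACTIVITY_SAMPLE)")
]

print(tables)
print(columns)
conn.close()
```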


3 Preparation
═════════════

3.1 Datetime
────────────

  But what is this strange `TIMESTAMP' column? Oh, maybe just a
  Unix timestamp. Throw it into `pd.to_datetime':

  ┌────
  │ # datetime_is_numeric is a pandas 1.x argument; pandas 2.0 removed it
  │ pd.to_datetime(df.TIMESTAMP) \
  │   .describe(datetime_is_numeric=True) \
  │   .to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          TIMESTAMP                     
  ──────────────────────────────────────
   count  331004                        
   mean   1970-01-01 00:00:01.668917075 
   min    1970-01-01 00:00:01.658969040 
   25%    1970-01-01 00:00:01.663935105 
   50%    1970-01-01 00:00:01.668900930 
   75%    1970-01-01 00:00:01.673866575 
   max    1970-01-01 00:00:01.678946580 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Hmmm… This doesn't look right. I ran into this type of problem last
  year when analyzing the Deutsche Bahn (the results, not the process:
  [Momentane Pünktlichkeit der Deutschen Bahn]). To save memory and
  network capacity, the timestamps are stored in seconds or
  milliseconds, while `pd.to_datetime' reads plain numbers as
  nanoseconds. The values are therefore too small by a factor of `1e9'
  or `1e6'.

  ┌────
  │ pd.to_datetime(df.TIMESTAMP * 1e9) \
  │   .describe(datetime_is_numeric=True) \
  │   .to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          TIMESTAMP                     
  ──────────────────────────────────────
   count  331004                        
   mean   2022-11-20 04:04:35.349853696 
   min    2022-07-28 00:44:00           
   25%    2022-09-23 12:11:45           
   50%    2022-11-19 23:35:30           
   75%    2023-01-16 10:56:15           
   max    2023-03-16 06:03:00           
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Yes! This matches the span in which I used Gadgetbridge with my
  watch. Let's make some useful columns out of this. By using the
  `.dt'-accessor Object (“pandas.Series.dt — pandas 1.5.3 documentation”
  2023) attributes like `date', `hour', etc. can be used easily:

  ┌────
  │ df["utc"] = pd.to_datetime(df.TIMESTAMP * 1e9)
  │ df["date"] = df.utc.dt.date
  │ df["weekday"] = df.utc.dt.day_name()
  │ df["hour"] = df.utc.dt.hour
  │ df["hourF"] = df.utc.dt.hour + df.utc.dt.minute/60
  └────

  `date' and `weekday' can be used for grouping data; `hour' and
  especially `hourF' (the hour as a floating-point number) for x/y
  diagrams.
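
  As an alternative to multiplying by `1e9', `pd.to_datetime' accepts a
  `unit' argument. The sketch below uses the smallest and largest raw
  values from the tables above. The conversion to `Europe/Berlin' is
  purely an assumption for illustration; the analysis in this text
  stays in UTC.

```python
import pandas as pd

# Smallest and largest raw TIMESTAMP values from the tables above
# (seconds since 1970-01-01 UTC).
raw = pd.Series([1658969040, 1678946580], name="TIMESTAMP")

# unit="s" tells pandas that the numbers are seconds, not nanoseconds.
utc = pd.to_datetime(raw, unit="s")

# Assumed timezone, only for illustration: attach UTC, then convert.
local = utc.dt.tz_localize("UTC").dt.tz_convert("Europe/Berlin")

print(utc.tolist())
print(local.dt.hour.tolist())
```

  Local hours may be easier to read in plots of daily routines, but
  they shift by one hour at the daylight-saving changes.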


[Momentane Pünktlichkeit der Deutschen Bahn] See file
momentane_puenktlichkeit_der_deutschen_bahn_in_nrw.org


3.2 Heart Rate
──────────────

  ┌────
  │ plt.hist(
  │     df.HEART_RATE,
  │     color=hsluv.hsluv_to_hex((0, 75, 25)),
  │     bins=256
  │ )
  │ plt.title("Histogram: Heart rate")
  │ plt.yscale("log")
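  │ # `file' is not defined in this listing; presumably the org-babel
  │ # block header passes in the output path
  │ plt.savefig(file)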
  │ plt.close()
  │ file
  └────

  <file:./images/20230315-01.png>

  Looking at the histogram of the heart rate, it's obvious that the
  values of `255' and of `0' or below are errors or failed
  measurements. Therefore I set them to `None'.

  ┌────
  │ df["heartRate"] = df.HEART_RATE.astype(float)  # float, so None (NaN) fits
  │ df.loc[df.heartRate<=0, "heartRate"] = None
  │ df.loc[df.heartRate>=255, "heartRate"] = None
  └────

  To avoid strange problems when executing the org-babel blocks in the
  wrong order, I follow the best practice of copying and *not
  overwriting* the original data.
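
  The same cleaning step can be written as a single expression with
  `Series.where', which also leaves the original column untouched. This
  is a sketch on a handful of made-up values, not the real data; for
  integer readings, `between(1, 254)' keeps exactly the values that the
  two `.loc' assignments above do not blank out.

```python
import pandas as pd

# Made-up stand-in for the HEART_RATE column, including the error codes.
df = pd.DataFrame({"HEART_RATE": [-1, 0, 59, 71, 178, 255]})

# where() keeps values where the condition holds and puts NaN elsewhere.
valid = df.HEART_RATE.between(1, 254)  # both bounds inclusive
df["heartRate"] = df.HEART_RATE.where(valid)

print(df)
```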

  ┌────
  │ df[
  │     ["HEART_RATE", "heartRate"]
  │ ].describe().to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          HEART_RATE  heartRate 
  ──────────────────────────────
   count      331004     321445 
   mean      76.3764    71.0781 
   std       33.7458    14.0401 
   min            -1         39 
   25%            60         59 
   50%            71         71 
   75%            81         81 
   max           255        178 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  This is much better!


4 Plotting some Data
════════════════════

  Now I want to see how the data looks. Today is a good day for this,
  because
  ⁃ I slept (no surprise there)
  ⁃ I worked at the computer (doing /this/)
  ⁃ I rode 47 km on my bike

  ┌────
  │ fig, ax = plt.subplots(figsize=(8, 4))
  │ span = df[
  │     (
  │         df.utc > datetime(2023, 3, 15, 3)
  │     ) & (  # & is the bitwise AND
  │         df.utc < datetime(2023, 3, 15, 15)
  │     )
  │ ]
  │ ax.plot(
  │     span.utc, span.RAW_INTENSITY,
  │     label="Intensity",
  │     color=hsluv.hsluv_to_hex((240, 80, 20)),
  │     linewidth=0.75
  │ )
  │ ax.plot(
  │     span.utc, span.RAW_KIND,
  │     label="Kind",
  │     color=hsluv.hsluv_to_hex((120, 80, 40)),
  │     linewidth=0.5
  │ )
  │ bx = ax.twinx()
  │ bx.plot(
  │     span.utc, span.heartRate,
  │     label="Heart Rate",
  │     color=hsluv.hsluv_to_hex((0, 80, 60)),
  │     linewidth=0.25
  │ )
  │ ax.set_ylim([0, 256])
  │ ax.set_yticks(list(range(0, 256, 32)))
  │ bx.set_ylim([0, 160])
  │ ax.set_xlim([span.utc.min(), span.utc.max()])
  │ fig.legend()
  │ ax.grid()
  │ fig.autofmt_xdate()  # tilting the x-labels
  │ fig.tight_layout()  # less space around the plot
  │ fig.savefig(file)
  │ plt.close(fig)
  │ file
  └────

  <file:./images/20230315-02.png>

  You can't clearly see what's going on because of the wiggly-wobbliness
  of the lines. Try using a rolling mean:


  ┌────
  │ fig, ax = plt.subplots(figsize=(8, 4))
  │ span = df[
  │     (
  │         df.utc > datetime(2023, 3, 15, 3)
  │     ) & (
  │         df.utc < datetime(2023, 3, 15, 15)
  │     )
  │ ]
  │ ax.plot(
  │     span.utc,
  │     span.RAW_INTENSITY.rolling(5, min_periods=1).mean(),
  │     label="Intensity",
  │     color=hsluv.hsluv_to_hex((240, 80, 20)),
  │     linewidth=0.75
  │ )
  │ ax.plot(
  │     span.utc,
  │     span.RAW_KIND.rolling(5, min_periods=1).median(), # !
  │     label="Kind",
  │     color=hsluv.hsluv_to_hex((120, 80, 40)),
  │     linewidth=0.5
  │ )
  │ bx = ax.twinx()
  │ bx.plot(
  │     span.utc,
  │     span.heartRate.rolling(5, min_periods=1).mean(),
  │     label="Heart Rate",
  │     color=hsluv.hsluv_to_hex((0, 80, 60)),
  │     linewidth=0.25
  │ )
  │ ax.set_ylim([0, 256])
  │ ax.set_yticks(list(range(0, 256, 32)))
  │ bx.set_ylim([0, 160])
  │ ax.set_xlim([span.utc.min(), span.utc.max()])
  │ fig.legend()
  │ ax.grid()
  │ fig.autofmt_xdate()
  │ fig.tight_layout()
  │ fig.savefig(file)
  │ plt.close(fig)
  │ file
  └────

  <file:./images/20230315-02-rolling.png>

  /Now/ you can clearly see
  ⁃ Low activity and pulse while sleeping until 06:30 UTC
  ⁃ Normal activity while working until 11:00 UTC
  ⁃ High activity and pulse from 11:00-13:30 UTC

  For `RAW_KIND' I used the rolling *median*, because the values look
  more discrete than continuous. There might be some strange encoding
  happening: sleep is very high, and the spikes towards roughly 100 are
  short occurrences of me waking up and turning around; while working
  the value is around 80, and during sport it drops below 20.
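
  Why the median suits a discrete code better than the mean can be seen
  on a tiny made-up series: a constant `240' with one short dip to
  `90'. The numbers are hypothetical and only loosely modelled on the
  plot above.

```python
import pandas as pd

# Hypothetical RAW_KIND values: constant 240 with a single dip to 90.
kind = pd.Series([240, 240, 90, 240, 240])

# Same call pattern as in the plot, with a window of 3 samples.
median = kind.rolling(3, min_periods=1).median()
mean = kind.rolling(3, min_periods=1).mean()

print(median.tolist())
print(mean.tolist())
```

  The median stays at `240' throughout and drops the one-sample dip,
  while the mean smears it into `190', a value that never occurs in the
  input and would look like a code of its own.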


4.1 x/y - combining features!
─────────────────────────────

  Now combine features. And to add a color-dimension let's assume
  `RAW_KIND' above 192 means sleep; below 32 activity.

  ┌────
  │ df["assumption"] = [
  │     "sleep" if r>192 else "normal" if r>32 else "activity"
  │     for r in df.RAW_KIND
  │ ]
  │ fig, ax = plt.subplots(figsize=(6, 6))
  │ sns.scatterplot(
  │     ax=ax,
  │     data=df.sample(2048), # use not /all/ but only 2048 data-points
  │     x="heartRate",
  │     y="RAW_INTENSITY",
  │     hue="assumption",
  │     palette={
  │         "sleep": hsluv.hsluv_to_hex((240, 60, 60)),
  │         "normal": hsluv.hsluv_to_hex((120, 80, 40)),
  │         "activity": hsluv.hsluv_to_hex((0, 100, 20)),
  │     }
  │ )
  │ ax.set_xlim([30, 130])
  │ ax.set_ylim([0, None])
  │ fig.savefig(file)
  │ plt.close(fig)
  │ file
  └────

  <file:./images/20230315-03.png>

  It seems intuitive that intensity and heart rate are lower while
  sleeping. But do you see something strange? There are lines of
  frequently occurring heart rates when awake, but not while asleep.
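
  This visual impression could be checked numerically by measuring how
  strongly the readings of each group cluster on a few values. The
  following is a sketch with made-up numbers; `top_share' is a
  hypothetical helper and not part of the original analysis.

```python
import pandas as pd

# Made-up stand-in for df[["assumption", "heartRate"]].
df = pd.DataFrame({
    "assumption": ["sleep"] * 6 + ["normal"] * 6,
    "heartRate": [52, 53, 54, 55, 56, 57, 70, 70, 70, 75, 75, 80],
})


def top_share(values, n=2):
    """Fraction of samples that fall on the n most common values."""
    counts = values.value_counts()  # sorted, most frequent first
    return counts.head(n).sum() / counts.sum()


# One number per group: a high share means few values dominate.
share = df.groupby("assumption")["heartRate"].apply(top_share)
print(share)
```

  A clearly higher share in the awake groups than in the sleep group
  would match the lines visible in the scatter plot.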

  I assume my watch has high precision but only medium accuracy.
  Randall Munroe made a useful table to keep in mind the difference
  between them:

  <https://imgs.xkcd.com/comics/precision_vs_accuracy.png>

  Maybe it's like this: while sleeping I don't move much (as the
  position on the y-axis implies), so the precision is as high as
  possible. But when I move around, the watch measures only in the
  moments it can and estimates the pulse with lower precision.
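
  The three-way split on `RAW_KIND' from the start of this subsection
  can also be expressed with `pd.cut' instead of a list comprehension.
  This is a sketch on a few hand-picked values around the two
  thresholds; the thresholds `32' and `192' are the ones assumed above.

```python
import numpy as np
import pandas as pd

# Hand-picked RAW_KIND values around the thresholds 32 and 192.
raw_kind = pd.Series([1, 32, 33, 80, 192, 193, 251])

# Bins are right-inclusive by default: (-inf, 32], (32, 192], (192, inf].
assumption = pd.cut(
    raw_kind,
    bins=[-np.inf, 32, 192, np.inf],
    labels=["activity", "normal", "sleep"],
)

print(assumption.tolist())
```

  The result is a categorical column, which keeps the three labels in a
  fixed order for grouping and plotting.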


Bibliography
════════════

  “Gadgetbridge for android.” 2022. September 10, 2022, URL:
  <https://www.gadgetbridge.org>.

  Munroe, R. 2022. “Precision vs Accuracy.” /Xkcd/, November 9, 2022,
  URL: <https://xkcd.com/2696>.

  “pandas.Series.dt — pandas 1.5.3 documentation.” 2023. January 19,
  2023, URL:
  <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html>.

  “Syncthing.” 2019. September 5, 2019, URL: <https://syncthing.net>.


Footer
══════

  License: CC BY-4.0