
           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                ANALYZING GADGETBRIDGE-DATA WITH PYTHON
                 Amazfit Neo β†’ Gadgetbridge β†’ Sqlite β†’
                             Python-Pandas
           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


                               2023-03-15





1 Some helpful imports
══════════════════════

  β”Œβ”€β”€β”€β”€
  β”‚ import hsluv
  β”‚ import matplotlib.pyplot as plt
  β”‚ plt.rcParams["figure.figsize"] = (8, 4)
  β”‚ import seaborn as sns
  β”‚ from datetime import datetime
  └────
  Listing 1: some helpful imports


2 Getting the Data
══════════════════

  The FOSS application Gadgetbridge (β€œGadgetbridge for android” 2022)
  supports exporting the collected data into an SQLite file. Loading
  it into Python is not that difficult. In my case the file is
  automatically mirrored into my `~/Sync'-folder through Syncthing
  (β€œSyncthing” 2019) and named `ggb.sqlite'.
  β”Œβ”€β”€β”€β”€
  β”‚ import pandas as pd
  β”‚ import sqlite3
  β”‚ conn = sqlite3.connect("/home/adrian/Sync/ggb.sqlite")
  β”‚ df = pd.read_sql_query(
  β”‚     """SELECT TIMESTAMP, RAW_INTENSITY, STEPS, RAW_KIND, HEART_RATE
  β”‚     FROM MI_BAND_ACTIVITY_SAMPLE;""",
  β”‚     conn
  β”‚ )
  β”‚ 
  β”‚ df.describe().to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
            TIMESTAMP  RAW_INTENSITY    STEPS  RAW_KIND  HEART_RATE 
  ──────────────────────────────────────────────────────────────────
   count       331004         331004   331004    331004      331004 
   mean   1.66892e+09        24.3221  4.33082   125.133     76.3764 
   std     5.7575e+06        28.2967  15.5202   84.1054     33.7458 
   min    1.65897e+09             -1        0         1          -1 
   25%    1.66394e+09              0        0        80          60 
   50%     1.6689e+09             17        0        90          71 
   75%    1.67387e+09             38        0       240          81 
   max    1.67895e+09            198      144       251         255 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  I did not include the columns `DEVICE_ID' and `USER_ID', which are
  useless here: they always have the same value if you use only one
  device as one user. I don't load them, to keep the tables smaller;
  otherwise a `SELECT * FROM MI_BAND_ACTIVITY_SAMPLE' would be
  sufficient.
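
  If you are unsure which columns your export contains, you can list
  them first. This is just a small sketch (not part of the original
  analysis), reusing the `conn' from above and SQLite's `PRAGMA
  table_info':
  β”Œβ”€β”€β”€β”€
  β”‚ # list name and type of every column in the table, so you can
  β”‚ # decide which ones are worth loading
  β”‚ columns = pd.read_sql_query(
  β”‚     "PRAGMA table_info(MI_BAND_ACTIVITY_SAMPLE);",
  β”‚     conn
  β”‚ )
  β”‚ print(columns[["name", "type"]])
  └────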


3 Preparation
═════════════

3.1 Datetime
────────────

  But what is this strange `TIMESTAMP'-column? Oh, maybe it's just a
  unix timestamp. Throw it into `pd.to_datetime':

  β”Œβ”€β”€β”€β”€
  β”‚ pd.to_datetime(df.TIMESTAMP) \
  β”‚   .describe(datetime_is_numeric=True) \
  β”‚   .to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          TIMESTAMP                     
  ──────────────────────────────────────
   count  331004                        
   mean   1970-01-01 00:00:01.668917075 
   min    1970-01-01 00:00:01.658969040 
   25%    1970-01-01 00:00:01.663935105 
   50%    1970-01-01 00:00:01.668900930 
   75%    1970-01-01 00:00:01.673866575 
   max    1970-01-01 00:00:01.678946580 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Hmmm… This doesn't look right. `pd.to_datetime' interprets bare
  integers as nanoseconds since the epoch, while these values are
  plainly seconds. I ran into this type of problem last year when
  analyzing Deutsche Bahn data (the results, but not the process:
  [Momentane PΓΌnktlichkeit der Deutschen Bahn]). To save memory and
  network capacity they had divided the unix timestamps by a factor of
  `1e6' or `1e9'.

  β”Œβ”€β”€β”€β”€
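  β”‚ # the values are seconds; pd.to_datetime treats bare integers as
  β”‚ # nanoseconds, so scale by 1e9 (passing unit="s" would be equivalent)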
  β”‚ pd.to_datetime(df.TIMESTAMP * 1e9) \
  β”‚   .describe(datetime_is_numeric=True) \
  β”‚   .to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          TIMESTAMP                     
  ──────────────────────────────────────
   count  331004                        
   mean   2022-11-20 04:04:35.349853696 
   min    2022-07-28 00:44:00           
   25%    2022-09-23 12:11:45           
   50%    2022-11-19 23:35:30           
   75%    2023-01-16 10:56:15           
   max    2023-03-16 06:03:00           
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Yes! This matches the span in which I used Gadgetbridge with my
  watch. Let's make some useful columns out of this. By using the
  `.dt'-accessor object (β€œpandas.Series.dt β€” pandas 1.5.3 documentation”
  2023), attributes like `date', `hour', etc. can be accessed easily:

  β”Œβ”€β”€β”€β”€
  β”‚ df["utc"] = pd.to_datetime(df.TIMESTAMP * 1e9)
  β”‚ df["date"] = df.utc.dt.date
  β”‚ df["weekday"] = df.utc.dt.day_name()
  β”‚ df["hour"] = df.utc.dt.hour
  β”‚ df["hourF"] = df.utc.dt.hour + df.utc.dt.minute/60
  └────

  `date' and `weekday' can be used for grouping data; `hour' and
  especially `hourF' (the hour as a floating-point number) for x/y
  diagrams.
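
  As a quick illustration of how these columns can be used (this
  aggregation is just a sketch, not part of the original analysis):
  β”Œβ”€β”€β”€β”€
  β”‚ # total steps per calendar day
  β”‚ daily_steps = df.groupby("date").STEPS.sum()
  β”‚ # mean raw intensity per weekday
  β”‚ weekday_intensity = df.groupby("weekday").RAW_INTENSITY.mean()
  β”‚ print(daily_steps.tail())
  β”‚ print(weekday_intensity)
  └────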


[Momentane PΓΌnktlichkeit der Deutschen Bahn] See file
momentane_puenktlichkeit_der_deutschen_bahn_in_nrw.org


3.2 Heart Rate
──────────────

  β”Œβ”€β”€β”€β”€
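  β”‚ # `file' (used below as the output path) is presumably supplied via
  β”‚ # an org-babel :var header; set it to any image path when running
  β”‚ # this outside of org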
  β”‚ plt.hist(
  β”‚     df.HEART_RATE,
  β”‚     color=hsluv.hsluv_to_hex((0, 75, 25)),
  β”‚     bins=256
  β”‚ )
  β”‚ plt.title("Histogram: Heart rate")
  β”‚ plt.yscale("log")
  β”‚ plt.savefig(file)
  β”‚ plt.close()
  β”‚ file
  └────

  <file:./images/20230315-01.png>

  Looking at the histogram of the heart rate it's obvious that values
  of `255' and of `0' or below are errors or failed measurements.
  Therefore I set them to `None'.

  β”Œβ”€β”€β”€β”€
  β”‚ df["heartRate"] = df.HEART_RATE
  β”‚ df.loc[df.heartRate<=0, "heartRate"] = None
  β”‚ df.loc[df.heartRate>=255, "heartRate"] = None
  └────

  To avoid strange problems when executing the org-babel blocks in the
  wrong order, I follow the best practice of copying and *not
  overwriting* the original data.

  β”Œβ”€β”€β”€β”€
  β”‚ df[
  β”‚     ["HEART_RATE", "heartRate"]
  β”‚ ].describe().to_markdown(tablefmt="orgtbl")
  └────

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          HEART_RATE  heartRate 
  ──────────────────────────────
   count      331004     321445 
   mean      76.3764    71.0781 
   std       33.7458    14.0401 
   min            -1         39 
   25%            60         59 
   50%            71         71 
   75%            81         81 
   max           255        178 
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  This is much better!


4 Plotting some Data
════════════════════

  Now I want to see how the data looks. Today is a good day, because
  ⁃ I slept (not that surprising)
  ⁃ I worked at the computer (doing /this/)
  ⁃ I rode 47 km by bike

  β”Œβ”€β”€β”€β”€
  β”‚ fig, ax = plt.subplots(figsize=(8, 4))
  β”‚ span = df[
  β”‚     (
  β”‚ 	df.utc > datetime(2023, 3, 15, 3)
  β”‚     ) & (  # & is the bitwise AND
  β”‚ 	df.utc < datetime(2023, 3, 15, 15)
  β”‚     )
  β”‚ ]
  β”‚ ax.plot(
  β”‚     span.utc, span.RAW_INTENSITY,
  β”‚     label="Intensity",
  β”‚     color=hsluv.hsluv_to_hex((240, 80, 20)),
  β”‚     linewidth=0.75
  β”‚ )
  β”‚ ax.plot(
  β”‚     span.utc, span.RAW_KIND,
  β”‚     label="Kind",
  β”‚     color=hsluv.hsluv_to_hex((120, 80, 40)),
  β”‚     linewidth=0.5
  β”‚ )
  β”‚ bx = ax.twinx()
  β”‚ bx.plot(
  β”‚     span.utc, span.heartRate,
  β”‚     label="Heart Rate",
  β”‚     color=hsluv.hsluv_to_hex((0, 80, 60)),
  β”‚     linewidth=0.25
  β”‚ )
  β”‚ ax.set_ylim([0, 256])
  β”‚ ax.set_yticks(list(range(0, 256, 32)))
  β”‚ bx.set_ylim([0, 160])
  β”‚ ax.set_xlim([span.utc.min(), span.utc.max()])
  β”‚ fig.legend()
  β”‚ ax.grid()
  β”‚ fig.autofmt_xdate()  # tilting the x-labels
  β”‚ fig.tight_layout()  # less space around the plot
  β”‚ fig.savefig(file)
  β”‚ plt.close(fig)
  β”‚ file
  └────

  <file:./images/20230315-02.png>

  You can't clearly see what's going on because of the wiggly
  wobbliness of the lines. Let's try a rolling mean:


  β”Œβ”€β”€β”€β”€
  β”‚ fig, ax = plt.subplots(figsize=(8, 4))
  β”‚ span = df[
  β”‚     (
  β”‚ 	df.utc > datetime(2023, 3, 15, 3)
  β”‚     ) & (
  β”‚ 	df.utc < datetime(2023, 3, 15, 15)
  β”‚     )
  β”‚ ]
  β”‚ ax.plot(
  β”‚     span.utc,
  β”‚     span.RAW_INTENSITY.rolling(5, min_periods=1).mean(),
  β”‚     label="Intensity",
  β”‚     color=hsluv.hsluv_to_hex((240, 80, 20)),
  β”‚     linewidth=0.75
  β”‚ )
  β”‚ ax.plot(
  β”‚     span.utc,
  β”‚     span.RAW_KIND.rolling(5, min_periods=1).median(), # !
  β”‚     label="Kind",
  β”‚     color=hsluv.hsluv_to_hex((120, 80, 40)),
  β”‚     linewidth=0.5
  β”‚ )
  β”‚ bx = ax.twinx()
  β”‚ bx.plot(
  β”‚     span.utc,
  β”‚     span.heartRate.rolling(5, min_periods=1).mean(),
  β”‚     label="Heart Rate",
  β”‚     color=hsluv.hsluv_to_hex((0, 80, 60)),
  β”‚     linewidth=0.25
  β”‚ )
  β”‚ ax.set_ylim([0, 256])
  β”‚ ax.set_yticks(list(range(0, 256, 32)))
  β”‚ bx.set_ylim([0, 160])
  β”‚ ax.set_xlim([span.utc.min(), span.utc.max()])
  β”‚ fig.legend()
  β”‚ ax.grid()
  β”‚ fig.autofmt_xdate()
  β”‚ fig.tight_layout()
  β”‚ fig.savefig(file)
  β”‚ plt.close(fig)
  β”‚ file
  └────

  <file:./images/20230315-02-rolling.png>

  /Now/ you can clearly see
  ⁃ Low activity and pulse while sleeping until 06:30 UTC
  ⁃ Normal activity while working until 11:00 UTC
  ⁃ High activity and pulse from 11:00-13:30 UTC

  For `RAW_KIND' I used the rolling *median*, because this value looks
  discrete rather than continuous. There might be some strange encoding
  happening: sleep is very high, the spikes towards around 100 are
  short occurrences of me waking up and turning around; while working
  the value is around 80, and during sport it drops to below 20.
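
  To get a feeling for which codes actually occur, a quick frequency
  count helps. This is only an inspection sketch; the meaning of the
  codes is device-specific and only guessed at above:
  β”Œβ”€β”€β”€β”€
  β”‚ # the ten most common RAW_KIND codes and how often they appear
  β”‚ print(df.RAW_KIND.value_counts().head(10))
  └────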


4.1 x/y - combining features!
─────────────────────────────

  Now let's combine features. To add a color dimension, assume that
  `RAW_KIND' above 192 means sleep and 32 or below means activity.

  β”Œβ”€β”€β”€β”€
  β”‚ df["assumption"] = [
  β”‚     "sleep" if r>192 else "normal" if r>32 else "activity"
  β”‚     for r in df.RAW_KIND
  β”‚ ]
  β”‚ fig, ax = plt.subplots(figsize=(6, 6))
  β”‚ sns.scatterplot(
  β”‚     ax=ax,
  β”‚     data=df.sample(2048), # use not /all/ but only 2048 data-points
  β”‚     x="heartRate",
  β”‚     y="RAW_INTENSITY",
  β”‚     hue="assumption",
  β”‚     palette={
  β”‚ 	"sleep": hsluv.hsluv_to_hex((240, 60, 60)),
  β”‚ 	"normal": hsluv.hsluv_to_hex((120, 80, 40)),
  β”‚ 	"activity": hsluv.hsluv_to_hex((0, 100, 20)),
  β”‚     }
  β”‚ )
  β”‚ ax.set_xlim([30, 130])
  β”‚ ax.set_ylim([0, None])
  β”‚ fig.savefig(file)
  β”‚ plt.close(fig)
  β”‚ file
  └────

  <file:./images/20230315-03.png>

  It seems intuitive that intensity and heart rate are lower while
  sleeping. But do you see some strangeness? There are lines of
  frequently occurring heart-rate values when awake, but not while
  asleep.

  I assume my watch has a high precision, but only a medium
  accuracy. Randall Munroe made a useful table to keep in mind the
  difference between them:

  <https://imgs.xkcd.com/comics/precision_vs_accuracy.png>

  Maybe it's like the following: while sleeping I don't move that much
  (as the position on the y-axis implies), so the precision is as high
  as possible. But when I am moving around, the watch measures just in
  the moments it can and estimates the pulse with a lower precision.
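
  One way to check this β€œlines” impression numerically is to see how
  concentrated the heart-rate values are while asleep versus awake. A
  small sketch, reusing the `assumption' labels defined above:
  β”Œβ”€β”€β”€β”€
  β”‚ # share of samples covered by the ten most common heart-rate values;
  β”‚ # a higher share means the watch snaps to a few preferred values
  β”‚ for asleep, group in df.groupby(df.assumption == "sleep"):
  β”‚     top10 = group.heartRate.value_counts(normalize=True).head(10).sum()
  β”‚     print("asleep" if asleep else "awake", round(top10, 3))
  └────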


Bibliography
════════════

  β€œGadgetbridge for android”. 2022. September 10, 2022, URL:
  <https://www.gadgetbridge.org>.

  Munroe, R. 2022. β€œPrecision vs Accuracy”, /Xkcd/, November 9, 2022,
  URL: <https://xkcd.com/2696>.

  β€œpandas.Series.dt β€” pandas 1.5.3 documentation”. 2023. January 19,
  2023, URL:
  <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html>.

  β€œSyncthing”. 2019. September 5, 2019, URL: <https://syncthing.net>.



