Posted on 2024-03-07
2 or 3 days ago, I had a crazy idea: « What about adding content statistics on this small website? ». And a few days and many lines of bash / python scripts later, here we are…
For more context, this site is generated via [hugo], a static site generator. This means all my content, whether [blog posts], [gemlog entries] or [bookmarks], lives in markdown files. If you want an example, you can see all the blog posts on the [git repository].
Having all the content in markdown files meant the statistics could be extracted directly from those files.
I tried searching for software that would let me extract data from markdown / frontmatter, but couldn't find anything usable. All I found were libraries to work with markdown and / or frontmatter content.
I was so resigned to having to write something from scratch using those libs that I almost gave up and just added this idea to the "longer term todolist" (aka probably never).
But [Alex] saved the day by giving me a brilliant shell one-liner that removed the need to write complex code analysing the frontmatter in files. He shared the following command with me:
```
for file in *.md; head -n 5 $file | grep 'date:' | sed 's/date.*\([[:digit:]]\{4\}\).*/\1/' >> count; end ; cat count | sort | uniq -c
```
It seems ugly and complicated, but it simply looks at all the `*.md' files in a directory, reads the header of each one, greps the line starting with `date:', and keeps the 4 digits of the year. That info is appended to a temporary file, which is then sorted and deduplicated (the `-c' option of `uniq' displays the count for each year).
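To see what this pipeline does, it can be tried in a throwaway directory. This is a sketch in plain bash syntax (the command above uses fish's `for … end' form); the file names and dates below are made up for illustration:

```shell
# Create a few fake markdown files with a YAML-style frontmatter date
tmp=$(mktemp -d)
cd "$tmp"
printf -- '---\ntitle: "a"\ndate: 2023-01-05\n---\n' > a.md
printf -- '---\ntitle: "b"\ndate: 2023-06-10\n---\n' > b.md
printf -- '---\ntitle: "c"\ndate: 2024-02-01\n---\n' > c.md

# Same idea as the one-liner: extract the year of each 'date:' line,
# then count occurrences per year
for file in *.md; do
  head -n 5 "$file" | grep 'date:' | sed 's/date.*\([[:digit:]]\{4\}\).*/\1/' >> count
done
sort count | uniq -c
```

Running this prints one line per year with its count (here, 2 entries for 2023 and 1 for 2024).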
Let's just say that from there I fell down a deep rabbit hole… Read the next chapter for the links to check these stats out, and the following one for the unnecessarily complicated process of building those simple pages with bash and python (for generating the graph images) :).
Before jumping into the "how", let's talk about the "what". The newly available pages showing different types of stats are:
If you look at these pages, they are not the best and lots of info could be added, but I feel it is a very good start for two evenings of work. I can always add more with time. For example, I'm planning to add some stats about word counts per article type. I also plan on keeping the css and styling very light in general, so I may sometimes present things in a sub-optimal way to avoid adding unnecessary css.
Now let's go into the ugly details. And let me start by saying: they are indeed ugly! The python script is just a mess, written with speed in mind, absolutely not optimization or even "good common sense"! I usually don't care and don't say much about the code quality of scripts I'm sharing, but I feel like I really MUST warn everyone before they open this one :).
The big steps of the process are:
All of that is then incorporated within my CI/CD process based on sourcehut CI. I have an (already very long) blog post in draft detailing this CI process, so here I'm only focusing on the part that generates the stats pages, not the CI/CD-related stuff. It means that if you look at the scripts on sourcehut, there might be things in there not explained in this post.
Retrieve the complete [bash script] or [python code] on sourcehut.
It may be important to explain here how my content is organized. In a nutshell:
```
<hugoRoot>/content
├── bookmarks
├── gemlog
├── pages
├── posts
└── tags
```
In each directory, I have all the `*.md' files directly; I don't have subdirectories. If your setup differs, you will need to adapt everything below.
The magic part here is to focus only on the date parameter within the frontmatter area. I modified the command shared above a bit:
```
for file in *.md; do head -n 10 $file | grep 'date =' | sed 's/date.*\([[:digit:]]\{4\}-[[:digit:]]\{2\}\).*/\1/' >> "${temp}/_tmpCount" ; done
```
I had to use `-n 10' with head to read more lines, as some of my frontmatters are long, and to make `sed' retrieve not only the year but the `YYYY-MM' pair. The result is pushed into a temporary file.
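The modified `sed' expression can be checked in isolation on a sample TOML-style frontmatter line (the date value below is made up):

```shell
# The capture group grabs the first YYYY-MM pair on the line
line='date = 2024-03-07T10:05:00+01:00'
echo "$line" | sed 's/date.*\([[:digit:]]\{4\}-[[:digit:]]\{2\}\).*/\1/'
# prints: 2024-03
```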
Then the tricky part was creating a loop that reads the data line by line while still being able to reuse the variables outside of the while loop[1]:
```
total=0
while read -r line
do
    nb=$(echo "${line}" | awk -F " " '{print $1}')
    […]
    total=$((total + nb))
done < <(cat "${temp}/_tmpCount" | sort | uniq -c)
```
Notice the part after `done'? This is process substitution, and it prevents the while loop from running in a subshell (which would discard any variables set inside the loop).
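The difference is easy to demonstrate side by side. This sketch requires bash (process substitution is not POSIX sh):

```shell
# With a pipe, the while loop runs in a subshell:
# the updates to 'total' are lost when the loop ends
total=0
printf '1\n2\n3\n' | while read -r n; do total=$((total + n)); done
echo "after pipe: $total"                   # prints: after pipe: 0

# With process substitution, the loop runs in the current shell,
# so 'total' keeps its value afterwards
total=0
while read -r n; do total=$((total + n)); done < <(printf '1\n2\n3\n')
echo "after process substitution: $total"   # prints: after process substitution: 6
```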
Let's talk about generating the json file now…
I don't know how to explain it better than by sharing almost the entire bash script…
```
[…]
# Init variables
global_total=0
res_json="{\"articles\": {"

for type in "${stats_content_types[@]:?}"
do
    […]
    res_json="${res_json}\"$type\": {\"entries_per_month\": ["
    […]
    for file in *.md; do head -n 10 $file | grep 'date =' | sed 's/date.*\([[:digit:]]\{4\}-[[:digit:]]\{2\}\).*/\1/' >> "${temp}/_tmpCount" ; done

    total=0
    while read -r line
    do
        nb=$(echo "${line}" | awk -F " " '{print $1}')
        date=$(echo "${line}" | awk -F " " '{print $2}')
        res_json="${res_json}{\"date\": \"${date}\", \"count\": \"${nb}\"},"
        total=$((total + nb))
    done < <(cat "${temp}/_tmpCount" | sort | uniq -c)

    res_json="${res_json::-1}], \"total\": ${total}},"
    global_total=$((global_total + total))
done

res_json="${res_json::-1}}, \"total_articles\": ${global_total}}"
[…]
```
I tried to reduce the noise to a minimum. As you can see, it is just an ugly way to create the full json string in loops… But hey, it works!
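One pitfall of building JSON by string concatenation is that a single stray comma breaks the whole file silently. A cheap safeguard is to pipe the final string through a validator at the end of the script; this sketch uses python3's stdlib `json.tool' module (assuming python3 is available, which it is here since the image script needs it anyway):

```shell
# A short stand-in for the real generated string
res_json='{"articles": {"posts": {"total": 3}}, "total_articles": 3}'

# json.tool exits non-zero on invalid JSON, so this doubles as a sanity check
if echo "$res_json" | python3 -m json.tool > /dev/null; then
  echo "valid JSON"
else
  echo "invalid JSON" >&2
  exit 1
fi
```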
The json generated looks like this:
```
{
  "articles": {
    "posts": {
      "entries_per_month": [
        { "date": "2013-01", "count": "1" },
        { "date": "2013-02", "count": "2" },
        […]
      ],
      "total": 127
    },
    "gemlog": {
      "entries_per_month": [
        { "date": "2021-02", "count": "5" },
        […]
      ],
      "total": 42
    },
    "bookmarks": {
      "entries_per_month": [
        { "date": "2023-02", "count": "18" },
        […]
      ],
      "total": 68
    }
  },
  "total_articles": 237
}
```
`count' contains the number of posts of that type published in the given month (format `YYYY-MM').
The generated json file is used by hugo directly to display some stats (eg, on the [stats overview summary]) and by the python script to generate the images. So once created, it is the "source of truth".
Let's look first at the graph and images generation, and then at the hugo setup.
I almost went with [MatPlotLib] as the library of choice for building the graphs, but then found [Pygal] which seemed easier to start with, and more than enough for anything I wanted to do on the stats area of this site.
I'm putting the warning here again: the python script is as ugly as can be! It needs a lot of love, but for now, everything was made with speed of delivery in mind, not the love of well-thought-out work :D.
You can find the [script on sourcehut].
I'm not going to display and explain the script here. It is ugly, but in the end it generates different pie and bar charts with `Pygal':
Replace `<YEAR>' with the different years since the creation of this website (ignoring empty years), and `<TYPE>' with one of the existing content types (`posts', `gemlog' and `bookmarks'; `pages' are ignored).
Right now, it creates 38 files in total… Once generated, they are moved to the right place within the hugo structure (in my case, `<hugoRoot>/static/images/pages/stats/').
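The move step itself is just standard shell plumbing. A minimal sketch with hypothetical paths (the `out/' directory and the file name are made up for this demo; the real script's paths may differ):

```shell
# Stand-ins for the real locations
workdir=$(mktemp -d)
cd "$workdir"
hugo_root="$workdir/site"   # stand-in for <hugoRoot>

# Fake a generated image, as the python script would produce
mkdir -p out
: > out/stats-monthly_posts_in_2024-bar.png

# Move all generated PNGs into hugo's static directory
dest="${hugo_root}/static/images/pages/stats"
mkdir -p "$dest"
mv out/*.png "$dest/"
```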
This may seem complicated and / or messy because there are many markdown and html files… But it is simple in reality: I just split the markdown into multiple files to have lighter pages, and used small shortcode templates to be able to reuse them easily.
Let's start with the easiest part. I created 6 markdown files in `<HugoRoot>/content/pages/'. I'm not going to copy their content here; instead, here are links to their sourcehut pages if you want to see them in full:
In the previous pages, I call some custom hugo `shortcodes' to avoid repeating myself (and be able to create custom html called from the markdown files).
You can find all of these `shortcodes' [on sourcehut], so I'm not going to go into all the details. But to give an idea, here is an example.
This is the code for the `contentstats-articles-per-type' (in `<HugoRoot>/layouts/shortcodes/'):
```
{{ $articleType := $.Get 0 }}
{{ $currentYear := $.Get 1 }}
<div class="stats-item">
  <p>All {{ strings.FirstUpper $articleType }} per month in {{ $currentYear }}</p>
  <figure class="statsimg">
    <img src="{{ print "/images/pages/stats/stats-monthly_" $articleType "_in_" $currentYear "-bar.png" }}" alt="{{ $articleType }} in {{ $currentYear }}" />
    <figcaption>{{ $articleType }} in {{ $currentYear }}</figcaption>
  </figure>
</div>
```
It receives 2 arguments: the type of articles (`posts', `gemlog' or `bookmarks') and a given year. Based on these 2 arguments, it displays the right info: in this case, a single image whose path is built from the 2 arguments. Other shortcodes may do a lot more.
This allows me to call it multiple times from different markdown files with different arguments depending on the context. For example, I'm calling this one from the `stats-posts.md' file:
```
## 2024

{{</* contentstats-articles-per-type "posts" "2024" */>}}

## 2023

{{</* contentstats-articles-per-type "posts" "2023" */>}}

[…]
```
So for each year, I can just call it with a different year argument. Then I can do the same within `stats-gemlog.md' to display the gemlog graphs:
```
## 2024

{{</* contentstats-articles-per-type "gemlog" "2024" */>}}

## 2023

{{</* contentstats-articles-per-type "gemlog" "2023" */>}}

[…]
```
Look into the files themselves to understand more :).
One last thing about hugo: it can read json (and yaml, toml or csv) files directly and display data from them. For example, the shortcode `contentstats-summary-graph' doesn't load images but reads the json file directly to display the number of articles per type:
```
{{ $data := index .Site.Data "content_stats" }}
<div class="stats-summary">
  <strong>{{ $data.total_articles }}</strong> pieces of content have been published in total on this website:
  <ul>
    {{ range $k, $v := $data.articles }}
    <li>{{ $v.total }} {{ $k }}</li>
    {{ end }}
  </ul>
</div>
```
The `index .Site.Data "content_stats"' part loads the file (called `content_stats.json'), and then I can loop over it using the `range' function.
Well, this has been a fun couple of evenings: working on the ugly bash and python scripts, toying with the generated pages and images, and writing this article! Now I have [content stats] generated during the CI process, so I know they will always be up to date :].
Let me know if you think about more interesting information to display or want to discuss how to apply this to your own website.
Footnotes
_________
[1] : See