💾 Archived View for alpha.lyk.so › systems › food › recipe-bot.gmi captured on 2023-12-28 at 15:22:54. Gemini links have been rewritten to link to archived content


recipe bot

Updated 2023-05-12

nutrition data

I spent some time gathering nutrition data from the USDA and massaging it into a usable form.

The code and source data

The resulting 186 megabyte SQLite database

The process was a lot less straightforward than I'd thought it would be, so hopefully this will help others trying to make similar use of the USDA's datasets.

recipe gathering

A shoutout to Robin for sending me a link to some public domain recipes served up on a clean, lightweight website:

based.cooking

I'll be pulling those recipes into my collection as well. They're available on GitHub as regularly formatted markdown files, so converting them to YAML should be much, much easier!

I've also managed to pull in the recipes off Grim Grains and have begun the process of adding nutrition sources to them.

--

I've completed a preliminary conversion of the downloaded HTML to YAML recipes. I may have to repeat this step as I discover errors through use, so I'm holding on to the original HTML for now. It's only 2GB of raw data and 31MB (yes, thirty-one megabytes) as an `xz -9`ed tarball, so I'm in no hurry to delete it.
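For the curious, the compression is nothing fancy; a sketch, assuming the cache directory is named `html-cache` as in the scripts below:

```shell
# Pack the HTML cache into a tarball at maximum xz compression.
# (xz -9 is slow, but pays off on this kind of repetitive HTML.)
tar -cf - html-cache | xz -9 > html-cache.tar.xz
```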

I made a good start with grep and sed, successfully converting some recipes with only those tools. But there was enough irregularity that it became apparent that sticking to them alone would be more trouble than I was willing to put up with. I incorporated `pup` into my toolset for this problem:

pup: Parsing HTML at the command line

First I flattened the HTML cache, incorporating the unique ID segment of each recipe's URL into the filenames, because I discovered that 372 recipe names would otherwise collide. This also lets me reconstruct the original URL if I need to, since the name and the ID are the only unique parts of each recipe's URL.
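Reconstructing a URL from a flattened `<name>-<id>` filename is then a matter of splitting on the final hyphen; a sketch, assuming URLs of the form `…/recipe/<id>/<name>` and IDs that never contain hyphens (the ID below is made up):

```shell
# Split "<name>-<id>" back into its parts with parameter expansion.
f="chocolate-cake-6WLJMVSV"
id="${f##*-}"    # text after the last hyphen
name="${f%-*}"   # everything before it
echo "https://www.foodista.com/recipe/$id/$name"
```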

My method for figuring out the number of collisions, for those wondering:

$ find html-cache -type f -exec basename {} \; > recipe-names
$ sort recipe-names | uniq > recipe-names-unique
$ expr $(wc -l < recipe-names) - $(wc -l < recipe-names-unique)
372
$

And the flattening script:

#!/usr/bin/env sh

set -e

find "$1" -type f | while IFS= read -r f; do
  tmp="$(echo "$f" | rev | cut -d/ -f-2 | rev)"
  new="$(echo "$tmp" | cut -d/ -f2)-$(echo "$tmp" | cut -d/ -f1)"
  mv "$f" "$1/$new"
  rmdir "$(dirname "$f")"
done

scraping-foodista/flatten-html-cache.sh
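A quick way to sanity-check the renaming on a throwaway directory (the ID and recipe name below are made up):

```shell
# Build a fake nested cache entry, then run the flattening
# script above over it and inspect the result.
mkdir -p demo-cache/6WLJMVSV
touch demo-cache/6WLJMVSV/chocolate-cake
./flatten-html-cache.sh demo-cache
ls demo-cache   # chocolate-cake-6WLJMVSV
```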

Then I ran the `convert-to-yaml.sh` script on the contents of the HTML cache directory:

$ mkdir recipes
$ find html-cache -type f -exec "./convert-to-yaml.sh" recipes {} \;

The script:

#!/usr/bin/env sh

# Dependency: pup

set -e

[ "$2" ] || { echo "usage: $0 <recipe dir> <html source>"; exit 1; }
echo "Converting $2"

mkdir -p "$1/images" || true
img="$1/images/$(basename "$2").jpg"
imgurl="$(pup -f "$2" 'div.featured-image img attr{src}')"

[ -f "$img" ] || curl -s -o "$img" "$imgurl"

title="$(pup -f "$2" '#page-title text{}')"
author="$(pup -f "$2" '.username text{}')"
imgcredit="$(pup -f "$2" 'div.featured-image a text{}')"

if [ "$imgcredit" ]; then
  imgcrediturl="$(pup -f "$2" 'div.featured-image a attr{href}' | tail -n1)"
else
  imgcrediturl=""
  imgcredit="$author"
fi

description="$(pup -f "$2" 'div.field-type-text-with-summary text{}' \
  | sed -z 's/\n\n\+/\n\n/g')"

ingredients="$(pup -f "$2" 'div[itemprop="ingredients"]' \
  | tr -d "\n" \
  | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
  | sed 's/^ \+//g; s/^/- /g' | tr -s ' ')"

directions="$(pup -f "$2" 'div[itemprop="recipeInstructions"].step-body' \
  | tr -d "\n" \
  | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
  | sed 's/^ \+//g; s/^[0-9]\+\. \+//g; s/^/- /g' | tr -s ' ')"

tags="$(pup -f "$2" 'div.field-type-taxonomy-term-reference a text{}' \
  | tr "\n" "," | sed 's/,$//g; s/,/, /g;')"

cat > "$1/$(basename "$2").yml" <<EOF
---

layout: recipe
title: $title
author: $author
license: https://creativecommons.org/licenses/by/3.0/
image: $img
image_credit: $imgcredit
image_credit_url: $imgcrediturl
tags: $tags

ingredients:
$ingredients

directions:
$(echo "$directions" | sed 's/&nbsp;/ /g')

---

$(echo "$description" | sed 's/&nbsp;/ /g')
EOF

scraping-foodista/convert-to-yaml.sh

Still not sure about those `image_credit` and `image_credit_url` keys. Might convert to kebab-case later.
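If I do, the rename is a one-liner over the generated files; a sketch using GNU sed's in-place flag, anchored to line starts so underscores in values are left untouched:

```shell
# Convert the two snake_case front-matter keys to kebab-case.
# The trailing colon keeps "image_credit:" from also matching
# the "image_credit_url:" line.
sed -i 's/^image_credit:/image-credit:/; s/^image_credit_url:/image-credit-url:/' recipes/*.yml
```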

At this point, the recipes directory weighs in at 733MB, with 564MB of it being images.

previously

downloading urls

I downloaded all the URLs I'd scraped previously into an "HTML cache" directory, rate limiting the script to no faster than 10 requests per second.

#!/usr/bin/env sh

[ "$2" ] || { >&2 echo "usage: $0 <cache directory> <url list file>"; exit 1; }

export CACHE_DIR="$1"

tmp="$(mktemp)"
trap 'rm "$tmp"' EXIT INT HUP

cat > "$tmp" <<"EOF"
#!/usr/bin/env sh
url="$1"
path="$CACHE_DIR/$(echo "$url" | sed 's|https\?://||')"

if [ -f "$path" ]; then
  echo "Already exists, skipping: $path"
else
  echo "Caching to $path"

  dir="$(dirname "$path")"
  mkdir -p "$dir"
  curl -s -o "$path" "$url"

  # rate limit, don't be *too* obnoxious
  sleep 1
fi
EOF

chmod +x "$tmp"

xargs -P 10 -n 1 "$tmp" < "$2"

scraping-foodista/cache-html.sh

gathering urls

I compiled the list of recipe URLs for the script above using this script:

#!/usr/bin/env sh

set -e

# The "pause" indicates how many seconds to wait between pages.
# Pages are 0-indexed. To start from the beginning, pass "0" as the start.
# The "end" is exclusive. To pull through page 282, pass "282" as the end.
# (Page 282 is at index 281.)

[ "$3" ] || { >&2 echo "usage: $0 <pause> <start> <end>" && exit 1; }

pause="$1"
page="$2"

while [ "$page" != "$3" ]; do
  # only show diagnostic output if this is an interactive terminal
  [ ! -t 1 ] || echo "Fetching page $(expr $page + 1)..."

  curl -s "https://www.foodista.com/browse/recipes?page=$page" \
  | grep -oP '<a href="/recipe/\K[^\"]+' \
  | sed 's|^|https://www.foodista.com/recipe/|'

  sleep $pause

  page=$(expr $page + 1)
done

scraping-foodista/recipe-urls.sh
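The grep/sed extraction stage can be checked in isolation against a canned fragment (the link below is invented, but matches the shape of the listing markup the grep targets; `-P` needs GNU grep):

```shell
# Feed one fake listing link through the same extraction pipeline.
printf '<a href="/recipe/6WLJMVSV/chocolate-cake">Cake</a>\n' \
| grep -oP '<a href="/recipe/\K[^\"]+' \
| sed 's|^|https://www.foodista.com/recipe/|'
# https://www.foodista.com/recipe/6WLJMVSV/chocolate-cake
```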

relevant links:

Foodista

Chowdown

todo: