Converting HTML into other forms has a lot of complexity and nuance. Here are some of the things I have learned from my various projects that parse HTML and convert it into other formats such as gemtext, Markdown, or more simplified HTML.
What URL should you use for an image? This matters because many of the output formats you can convert HTML into want just a single URL for the image. At first, this seems straightforward: just look at the `src` attribute. However, it can be more complicated.
First, some images define a `src` but also use `srcset` to define other URLs for different sizes or pixel densities, so you should check `srcset` as well. Typically I try to extract the highest quality image source possible and use that in the output. This way I have the URL for the "best" representation of the image, and other parts of my project can downsize or scale it as necessary.
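As a rough illustration of picking the "best" candidate, here is a minimal Python sketch. The function names `parse_srcset` and `best_image_url` are hypothetical, not taken from any of my projects, and the parsing is deliberately simplified: it assumes candidates are separated by a comma followed by whitespace and that URLs contain no whitespace, which is looser than the algorithm in the HTML specification.

```python
import re

def parse_srcset(srcset):
    """Return a list of (url, kind, value) tuples, where kind is 'w' or 'x'.

    Simplified parsing: candidates are assumed to be separated by a comma
    followed by whitespace. Candidates with an unrecognized descriptor
    are skipped.
    """
    candidates = []
    for part in re.split(r",\s+", srcset.strip()):
        tokens = part.strip().rstrip(",").split()
        if not tokens:
            continue
        url = tokens[0]
        # A candidate without a descriptor is treated as 1x.
        descriptor = tokens[1] if len(tokens) > 1 else "1x"
        match = re.fullmatch(r"(\d+(?:\.\d+)?)([wx])", descriptor)
        if match:
            candidates.append((url, match.group(2), float(match.group(1))))
    return candidates

def best_image_url(src, srcset):
    """Pick the largest srcset candidate, falling back to src."""
    candidates = parse_srcset(srcset or "")
    if not candidates:
        return src
    # Assumption: if width and density descriptors are mixed (they should
    # not be), prefer width descriptors, then the largest value.
    url, _, _ = max(candidates, key=lambda c: (c[1] == "w", c[2]))
    return url
```

For example, `best_image_url("a.jpg", "a-480.jpg 480w, a-1200.jpg 1200w")` picks `a-1200.jpg`, and an image with no `srcset` simply keeps its `src`.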
Another complexity is when the image in the `src` attribute really isn't the image you want. An example of this is lazy loading, where you don't want a web browser downloading high resolution images before they are needed. You might not want to load any image if it is not in the browser's viewport. Or you may want to load a low resolution image first, then load a higher resolution image.
Before modern HTML supported lazy loading via the `loading` attribute, many designers hacked this in with JavaScript. They would set the `src` attribute to a small image, or to a single-pixel image via a `data:` URL. They would specify the real image source via an attribute on the `<img>` tag such as `data-url` or `lazy-url`. So you should check for these attributes and use those URLs if present.
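A check like this can be sketched in a few lines of Python. The function name `resolve_image_url` is hypothetical, and the attribute list is an assumption: `data-url` and `lazy-url` are the examples mentioned above, while `data-src` is one more commonly seen name that I have added for illustration. Real sites use many variations, so treat the list as something to extend.

```python
# Attribute names to check, in priority order. "data-src" is an added
# assumption; extend this tuple as you encounter other conventions.
LAZY_ATTRS = ("data-url", "lazy-url", "data-src")

def resolve_image_url(attrs):
    """Return the most likely real image URL from an <img> tag's attributes.

    attrs is a dict mapping attribute names to values, for example
    dict(attrs) from html.parser.HTMLParser.handle_starttag.
    """
    for name in LAZY_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value:
            return value
    # No lazy-loading attribute found, so fall back to the plain src.
    src = (attrs.get("src") or "").strip()
    return src or None
```

With this, an `<img>` whose `src` is a `data:` placeholder and whose `data-url` holds the real location resolves to the `data-url` value, while an ordinary `<img src="photo.jpg">` is left alone.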
ARIA, or Accessible Rich Internet Applications, is a standard to make it easier for people with disabilities to access web content.
ARIA works by defining HTML attributes that help the user agent understand the purpose of the markup, so it can be presented differently. ARIA helps when converting HTML since it provides additional clues about the markup.
For example, ARIA roles will tell you that a `<div>` tag and its children are a search box. They can tell you when an anchor tag is really a button.
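One way to use this during conversion is to let an explicit `role` override the tag name when deciding what an element is. The following is a minimal sketch of that idea, not a complete treatment of ARIA; the function name `effective_kind` is hypothetical.

```python
def effective_kind(tag, attrs):
    """Return the ARIA role of an element if one is given, else its tag name.

    attrs is a dict of attribute names to values. The role attribute may
    hold a space-separated list of fallback roles; this sketch simply
    takes the first one.
    """
    roles = (attrs.get("role") or "").lower().split()
    if roles:
        return roles[0]
    return tag.lower()
```

So `effective_kind("a", {"role": "button"})` gives `"button"` and `effective_kind("div", {"role": "search"})` gives `"search"`, while a plain `<a href="/">` is still reported as `"a"`. A converter could then decide per kind whether to keep, transform, or drop the element.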
Primarily you can use ARIA attributes to:
Some tags are invisible or hidden from view, and most likely should be ignored. There are many ways tags can be hidden or invisible:
All of these should be ignored.
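A check for hidden elements might look like the sketch below. The signals it tests are my assumptions about common cases, chosen for illustration: the `hidden` attribute, `aria-hidden="true"`, `<input type="hidden">`, and an inline `style` containing `display: none` or `visibility: hidden`. It only inspects the element's own attributes, so it cannot see rules applied from stylesheets. The name `is_hidden` is hypothetical.

```python
import re

# Matches "display: none" or "visibility: hidden" as a declaration in an
# inline style attribute.
_HIDDEN_STYLE = re.compile(
    r"(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)\b",
    re.IGNORECASE,
)

def is_hidden(tag, attrs):
    """Return True if the element's own attributes mark it as hidden.

    attrs is a dict of attribute names to values; boolean attributes
    such as hidden may map to None.
    """
    if "hidden" in attrs:
        return True
    if (attrs.get("aria-hidden") or "").strip().lower() == "true":
        return True
    if tag.lower() == "input" and (attrs.get("type") or "").strip().lower() == "hidden":
        return True
    return bool(_HIDDEN_STYLE.search(attrs.get("style") or ""))
```

When walking the document, a converter would skip any element for which `is_hidden` returns true, along with all of its children.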