2022-07-26 | #search #kennedy | @Acidus
I have added image search to Kennedy, my Gemini search engine. Now you can search across all of Gemini space for images that match a query.
Screenshot: Kennedy image search results.
When you do a search on Kennedy, I run searches against both the text index and the image index. If I find image results, a link to them appears at the top of the results page:
Screenshot: Link to image results at top of search results
If you know you want to do an image search, I have a dedicated URL for that:
The results for an image search include metadata like the image type, dimensions, and file size. Image type is determined by actually parsing the image using an image library. This is needed because about 1% of responses with an "image/*" MIME type aren't actually valid images.
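As a rough illustration of why parsing beats trusting the MIME type, here is a minimal sketch that reads the type and dimensions straight from the file header. It is not Kennedy's code: it swaps the full image library for a simple header check, only recognizes PNG and GIF, and the function names are made up for this example.

```python
import struct
from typing import Optional, Tuple


def sniff_image(data: bytes) -> Optional[Tuple[str, int, int]]:
    """Return (type, width, height) for PNG or GIF data, or None if unrecognized."""
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR", then big-endian width and height.
    if len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        return ("png", width, height)
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        width, height = struct.unpack("<HH", data[6:10])
        return ("gif", width, height)
    return None


def header_matches_mime(mime_type: str, body: bytes) -> bool:
    """True only when an image/* MIME type is backed by a recognizable header.

    Hypothetical helper: formats this sketch does not know (JPEG, WebP, ...)
    are reported as mismatches, which a real image library would not do.
    """
    return mime_type.startswith("image/") and sniff_image(body) is not None
```

A response that claims "image/png" but carries an HTML error page fails the check, which is the kind of mismatch described above. A header check is weaker than a full decode, since a truncated or corrupt file can still have a valid header.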
Kennedy's stats:
Also included in the result is a link to a revamped "page info" page. This includes more data about the image, such as a list of pages that link to the image. This is a great way to get to the text content when you find an interesting image.
Info about Canon Cat image, linking you to the original article
I have been using Kennedy's image search myself for the last week or so, and found a lot of cool things like custom Doom levels, plans for renovating a van, retro PC builds, and an endless supply of cute cat and dog pictures.
There are a lot of great papers written about different ways to build an image search engine. I chose to go the simple-yet-reliable route: indexing both meaningful text in the URL and the link text used in links to the image. I mix these into an FTS engine with Porter stemming, which takes care of handling different plurals, suffixes, and word endings.
Collection of early search engine academic papers
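To make "meaningful text in the URL" concrete, here is a minimal sketch of one way to pull words out of an image URL and combine them with link text into a single searchable document. The noise-word list, the choice to use only the URL path, and the function names are assumptions for illustration, not a description of Kennedy's actual rules.

```python
import re
from typing import Iterable, List
from urllib.parse import unquote, urlparse

# Hypothetical set of URL tokens that say nothing about the image's subject.
NOISE = {"gemini", "www", "img", "image", "images", "jpg", "jpeg", "png", "gif", "webp"}


def url_terms(url: str) -> List[str]:
    """Split the URL path into lowercase words, dropping noise and very short tokens."""
    path = unquote(urlparse(url).path)
    words = re.findall(r"[a-z]+", path.lower())
    return [w for w in words if len(w) > 2 and w not in NOISE]


def build_document(url: str, link_texts: Iterable[str]) -> str:
    """Join URL terms and every link label into one string for a full-text index."""
    return " ".join(url_terms(url) + list(link_texts))
```

For example, `url_terms("gemini://example.org/photos/canon-cat_01.jpg")` yields `["photos", "canon", "cat"]`, while a camera-style name like `IMG_0042.png` contributes nothing, so only the link text describes that image.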
To see how this approach helps, let's look at an example. Here are the search results for "sound card":
The results include 2 images of sound cards from someone building a retro gaming PC. The image files themselves have no valuable metadata to help identify that they are images of sound cards. The URLs also don't indicate the subject of these images. Here is the link text for one of the images:
Photo of the eMachines 740 interior with the graphics and sound cards fitted
The use of "sound card" in the link text is what allows this query to match this image. In fact, both images only mention sound cards in their link text. If I hadn't included link text as part of my image search algorithm, the query "sound card" would have returned no results!
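Collecting that link text is straightforward in gemtext, because a link line has the form "=> URL label". The following is a minimal sketch under a few assumptions: images are recognized by file extension (a crawler would more likely use the MIME type of the response), link targets are kept exactly as written rather than resolved against the page URL, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption for this sketch: guess "is an image" from the file extension.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def image_link_texts(gemtext: str) -> Dict[str, List[str]]:
    """Map each image link target in a gemtext page to the labels used for it."""
    texts: Dict[str, List[str]] = defaultdict(list)
    preformatted = False
    for line in gemtext.splitlines():
        # Lines starting with ``` toggle preformatted mode; links inside it are ignored.
        if line.startswith("```"):
            preformatted = not preformatted
            continue
        if preformatted or not line.startswith("=>"):
            continue
        parts = line[2:].strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # empty link line, or a link with no label to index
        target, label = parts[0], parts[1].strip()
        if target.lower().endswith(IMAGE_EXTENSIONS):
            texts[target].append(label)
    return dict(texts)
```

Because the labels are grouped per target, an image linked from several pages accumulates every description written for it.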
Another good example is this search for "cats".
Here we can see the benefits of stemming (finding "cat" when I search for "cats"), as well as results where the word "cat" only appears in the URL, or in the link text.
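Both effects can be reproduced with a small full-text index. This sketch uses SQLite's FTS5 extension with its built-in "porter" tokenizer, accessed through Python's sqlite3 module. That choice is an assumption for illustration (the post only says "an FTS engine"), and the table name, sample rows, and helper functions are invented. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3
from typing import List

# Invented sample data: (image URL, URL terms plus link text).
SAMPLE_ROWS = [
    ("gemini://example.org/a/img_0042.jpg",
     "Photo of the eMachines 740 interior with the graphics and sound cards fitted"),
    ("gemini://example.org/pets/cat-nap.png", "pets cat nap"),
    ("gemini://example.org/van/plan.png", "Floor plan for the van renovation"),
]


def build_index() -> sqlite3.Connection:
    """Create an in-memory FTS5 table whose tokenizer applies Porter stemming."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE images USING fts5("
        "url UNINDEXED, terms, tokenize = 'porter unicode61')"
    )
    conn.executemany("INSERT INTO images (url, terms) VALUES (?, ?)", SAMPLE_ROWS)
    return conn


def search(conn: sqlite3.Connection, query: str) -> List[str]:
    """Return matching image URLs, best match first. Plain words only in this sketch."""
    rows = conn.execute(
        "SELECT url FROM images WHERE images MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

With this data, `search(conn, "cats")` finds the image whose only matching word is "cat" from its URL, and `search(conn, "sound card")` finds the image whose link text says "sound cards". Queries containing FTS5 syntax characters, such as quotes or hyphens, would need escaping first.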
Image files do have metadata, but it's not very useful for text-based image search. After all, most people are not putting tags inside their images, even though metadata formats like XMP support that.
In fact, some metadata can be harmful. Location data embedded in the file, recording where the photo was taken, can create a privacy or personal safety issue. Some editors embed image thumbnails and edit history in the image file itself as metadata. This can backfire if the photo contained sensitive content before it was cropped. In the early 2000s, TechTV's Cat Schwartz inadvertently posted naked pictures of herself to TechTV's website because her risqué photos were just cropped versions of her posing topless. Thumbnails of the original uncropped images were stored inside the cropped image file as metadata and were quickly discovered.
For these and other reasons, metadata is often stripped out of images before publishing. The metadata that remains, such as the color space or ICC profile used for the photo, or the name and version of the program that created it, is of little value for search engines.
It would be easy to extend Kennedy's approach to image search to other media like audio or video files. As an added benefit, audio and video files usually *DO* have meaningful metadata like title, artist, album, chapters, etc. That metadata could be indexed and complemented with terms from the URL and link text information. Of course:
It might make sense to add audio and video search to Kennedy for completeness, but it's not a priority.
As always, I love to hear feedback on my projects. If you have ideas on how to make Kennedy better, or just have a comment, please drop me a line.