💾 Archived View for gmn.clttr.info › sources › geminispace.git › commits captured on 2022-01-08 at 13:47:58. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Author: René Wagner <rwa@clttr.info>
Date: Wed Dec 29 10:59:46 2021 +0100
Message: don't delete excluded pages from the pages table
Otherwise we lose external backlinks to these pages as well,
which might be useful.
Author: René Wagner <rwa@clttr.info>
Date: Wed Dec 29 10:57:01 2021 +0100
Message: update poetry version
Author: René Wagner <rwa@clttr.info>
Date: Wed Nov 24 20:45:10 2021 +0100
Message: show 30 latest hosts
Author: René Wagner <rwa@clttr.info>
Date: Sat Nov 20 17:13:32 2021 +0100
Message: exclude antenna filters
Author: René Wagner <rwa@clttr.info>
Date: Fri Nov 19 16:06:37 2021 +0100
Message: don't crash on URIs with non-number port
closes #37
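A minimal sketch of how a non-numeric port can be detected without crashing, assuming the URL is parsed with Python's standard `urllib.parse`. The helper name `has_valid_port` is hypothetical and not taken from the repository.

```python
from urllib.parse import urlsplit


def has_valid_port(url):
    """Return True if the URL has no port or a numeric port in range."""
    try:
        # The .port property raises ValueError for ports such as "abc" or "99999".
        urlsplit(url).port
    except ValueError:
        return False
    return True
```

A crawler could call such a check before queueing a link and skip the link when it returns False.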
Author: René Wagner <rwa@clttr.info>
Date: Tue Nov 16 16:21:25 2021 +0100
Message: update excludes
Author: René Wagner <rwa@clttr.info>
Date: Thu Nov 11 18:28:13 2021 +0100
Message: update contact
Author: René Wagner <rwa@clttr.info>
Date: Tue Nov 09 18:41:24 2021 +0100
Message: dependency update
Author: René Wagner <rwa@clttr.info>
Date: Sun Nov 07 17:27:36 2021 +0100
Message: cleanup excludes
Author: René Wagner <rwa@clttr.info>
Date: Mon Oct 25 20:45:57 2021 +0200
Message: save first_seen_at if a page is created through a link
Author: René Wagner <rwa@clttr.info>
Date: Thu Oct 14 20:22:12 2021 +0200
Message: add link to source in geminispace
Author: René Wagner <rwa@clttr.info>
Date: Thu Oct 14 18:54:12 2021 +0200
Message: more meta data for index cleanup
Author: René Wagner <rwa@clttr.info>
Date: Mon Oct 11 20:03:08 2021 +0200
Message: avoid crash when normalized_url is not set
fixes #34
Author: René Wagner <rwa@clttr.info>
Date: Mon Oct 11 19:45:42 2021 +0200
Message: use cronjob for automated start
Author: René Wagner <rwa@clttr.info>
Date: Thu Sep 16 19:53:51 2021 +0200
Message: some cleanup
- remove some unused code
- remove outdated excludes
- news update 2021-09-15
Author: René Wagner <rwa@clttr.info>
Date: Mon Sep 06 08:19:03 2021 +0200
Message: fix broken link to source code
Author: René Wagner <rwa@clttr.info>
Date: Sat Sep 04 09:03:14 2021 +0200
Message: do not add every single domain to the statistics file
Author: René Wagner <rwa@clttr.info>
Date: Wed Aug 18 17:23:23 2021 +0200
Message: news 2021-08-18
Author: René Wagner <rwa@clttr.info>
Date: Tue Aug 17 21:00:10 2021 +0200
Message: some minor changes
- update docs about indexing
- show historical stats in reverse order (newest first)
- some exclude cleanup
Author: René Wagner <rwa@clttr.info>
Date: Tue Aug 10 18:43:19 2021 +0200
Message: ensure that scheme is given when searching for backlinks
Author: René Wagner <rwa@clttr.info>
Date: Tue Aug 10 18:37:46 2021 +0200
Message: update 2021-08-07
Author: René Wagner <rwa@clttr.info>
Date: Fri Aug 06 16:50:59 2021 +0200
Message: ensure that seed-requests use absolute URIs
Author: René Wagner <rwa@clttr.info>
Date: Fri Aug 06 16:41:53 2021 +0200
Message: more excludes
Author: René Wagner <rwa@clttr.info>
Date: Fri Jul 23 13:11:09 2021 +0200
Message: implemented deletion of outdated data
- pages that never had a successful crawl
- pages whose last successful crawl was more than 30 days ago
closes #24
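The two deletion rules above could be expressed as a single SQL statement. This is only a sketch using the standard `sqlite3` module; the column name `last_success_at` and the function `delete_outdated_pages` are assumptions made for illustration, not the project's actual schema or code.

```python
import sqlite3
from datetime import datetime, timedelta


def delete_outdated_pages(conn, now, max_age_days=30):
    """Delete pages that were never crawled successfully or whose last
    successful crawl is older than max_age_days. Returns the row count."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM page WHERE last_success_at IS NULL OR last_success_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Small in-memory demonstration with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT PRIMARY KEY, last_success_at TEXT)")
now = datetime(2021, 7, 23, 12, 0, 0)
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [
        ("gemini://a.example/", None),  # never crawled successfully
        ("gemini://b.example/", (now - timedelta(days=45)).isoformat()),
        ("gemini://c.example/", (now - timedelta(days=2)).isoformat()),
    ],
)
deleted = delete_outdated_pages(conn, now)
remaining = [row[0] for row in conn.execute("SELECT url FROM page")]
```

The timestamps are stored as ISO 8601 strings in one consistent format, so a plain string comparison orders them correctly.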
Author: René Wagner <rwa@clttr.info>
Date: Tue Jul 20 19:14:39 2021 +0200
Message: small fixes and doc adjustments
Author: René Wagner <rwa@clttr.info>
Date: Sat Jul 17 19:40:20 2021 +0200
Message: remove obsolete code
- threads in serve/views.py, serve/models.py and gus/lib/db_model.py
- run_index_statistics in gus/lib/index_statistics.py
Author: Hannu Hartikainen <hannu@hrtk.in>
Date: Sat Jul 17 12:06:19 2021 +0300
Message: support prioritized robots.txt user-agents
Reimplement the can_fetch() function of RobotFileParser such that it
prioritizes multiple user-agents. Add unit test for said functionality
and set the user-agents this crawler uses to ["gus", "indexer", "*"] (as
they were in the past, though with bugs).
This was heavily inspired by the earlier discussion at
https://lists.sr.ht/~natpen/gus/%3C20210212070534.14511-1-rwagner%40rw-net.de%3E
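To illustrate the idea of prioritized user-agents, here is a small self-contained sketch. It does not reproduce the reimplemented `can_fetch()` from the commit; it uses a deliberately simple parser that only understands `User-agent` and `Disallow` lines, and the function names are hypothetical.

```python
def parse_robots(text):
    """Map each lower-cased user-agent to its list of Disallow path prefixes."""
    rules = {}
    current_agents = []
    in_rules = False  # True once a Disallow line has followed the User-agent lines
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow means "allow everything"
                for agent in current_agents:
                    rules[agent].append(value)
    return rules


def can_fetch(rules, user_agents, path):
    """Apply the rules of the first user-agent in the priority list that has
    its own section; allow the path if no listed agent has one."""
    for agent in user_agents:
        if agent in rules:
            return not any(path.startswith(prefix) for prefix in rules[agent])
    return True


robots = """User-agent: *
Disallow: /

User-agent: indexer
Disallow: /private/
"""
rules = parse_robots(robots)
agents = ["gus", "indexer", "*"]
```

With this priority list, the `indexer` section wins over `*`, so only `/private/` is off limits even though the default section disallows everything.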
Author: René Wagner <rwa@clttr.info>
Date: Sat Jul 17 12:35:01 2021 +0200
Message: more excludes and less logging
Author: René Wagner <rwa@clttr.info>
Date: Wed Jul 14 21:01:05 2021 +0200
Message: treat schemeless links as non-gemini links
a scheme is mandatory per spec
https://lists.orbitalfox.eu/archives/gemini/2020/003646.html
closes #12
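A minimal sketch of the scheme check described above, assuming link targets are inspected with the standard `urllib.parse`. The helper `is_gemini_link` is hypothetical, and how relative links are resolved before this check is not covered here.

```python
from urllib.parse import urlsplit


def is_gemini_link(link):
    """Treat a link as a Gemini link only if it explicitly carries the scheme."""
    # urlsplit() returns an empty scheme for "//host/path" and "host/path".
    return urlsplit(link).scheme == "gemini"
```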
Author: René Wagner <rwa@clttr.info>
Date: Wed Jul 14 20:56:50 2021 +0200
Message: remove pikkulog separation
Author: René Wagner <rwa@clttr.info>
Date: Wed Jul 14 08:36:25 2021 +0200
Message: minor code cleanup in db_model
Author: René Wagner <rwa@clttr.info>
Date: Wed Jul 14 08:32:13 2021 +0200
Message: update to some templates
most notably:
remove the footer on pages where it's not useful
Author: René Wagner <rwa@clttr.info>
Date: Tue Jul 13 17:20:53 2021 +0200
Message: remove Search model
We don't store search queries. Although they are not personalized,
this is not information we want to have.
Author: René Wagner <rwa@clttr.info>
Date: Tue Jul 13 13:21:28 2021 +0200
Message: enable 'newest-hosts' and 'newest-pages' sites again
closes #26
Author: René Wagner <rwa@clttr.info>
Date: Tue Jul 13 09:21:06 2021 +0200
Message: remove raw data from excluded capsules
first part of #24
Author: René Wagner <rwa@clttr.info>
Date: Mon Jul 12 21:37:55 2021 +0200
Message: index text files up to 5 MB
fix flagging pages as indexed
Author: René Wagner <rwa@clttr.info>
Date: Mon Jul 12 19:27:57 2021 +0200
Message: commit search index only when indexing is complete
unnecessary commits during indexing are time-consuming
remove dead "feedparser" code from crawl
Author: René Wagner <rwa@clttr.info>
Date: Mon Jul 12 16:57:33 2021 +0200
Message: store document id in whoosh index
Author: René Wagner <rwa@clttr.info>
Date: Mon Jul 12 14:58:33 2021 +0200
Message: some tweaks to indexing
- simplify backlinks counter query
- only count successfully crawled domains as known domains
- increase default root recrawl time
Author: René Wagner <rwa@clttr.info>
Date: Sun Jul 11 19:03:15 2021 +0200
Message: restructure crawl data
The "crawl" table is now obsolete and removed, all required
information is stored in the `page` table which simplifies
queries and will make data cleanup easier.
All relevant queries have been adjusted to honor this change.
Author: René Wagner <rwa@clttr.info>
Date: Sun Jul 11 09:05:01 2021 +0200
Message: remove Crawl table, all info is stored in page table now
Author: René Wagner <rwa@clttr.info>
Date: Sat Jul 10 09:08:50 2021 +0200
Message: don't persist robots.txt over multiple crawls
Instead, fetch them again on every crawl run and only
cache them for the crawl session.
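A session-scoped cache of this kind can be sketched as a plain in-memory dictionary that is created at the start of a crawl run and discarded at the end. The class `RobotsCache` and the `fetch` callable are assumptions for illustration only.

```python
class RobotsCache:
    """Cache robots.txt content per host for the lifetime of one crawl run."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a host and returning robots.txt text
        self._cache = {}  # lives only in memory, never written to disk

    def get(self, host):
        # Fetch at most once per host and crawl session.
        if host not in self._cache:
            self._cache[host] = self._fetch(host)
        return self._cache[host]
```

Creating a fresh `RobotsCache` for every crawl run means each robots.txt is fetched again on the next run, while repeated lookups within one run stay cheap.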
Author: René Wagner <rwa@clttr.info>
Date: Fri Jul 09 22:05:55 2021 +0200
Message: improve indexing speed via optimized backlinks query
The query to calculate backlinks caused massive delays during indexing.
An unused join to the `crawl` table caused this behavior.
After removing the join, indexing is fast again.
Author: René Wagner <rwa@clttr.info>
Date: Fri Jul 09 17:38:45 2021 +0200
Message: again a new exclude
Author: René Wagner <rwa@clttr.info>
Date: Fri Jul 09 17:37:39 2021 +0200
Message: move gusmobile to new home
gusmobile was hosted on natpen's git which is not available
anymore.
The source is now mirrored on src.clttr.info and codeberg.org
Author: René Wagner <rwa@clttr.info>
Date: Sun Jul 04 21:49:27 2021 +0200
Message: update 2021-07-04 & more excludes
Author: René Wagner <rwa@clttr.info>
Date: Mon Jun 28 09:31:39 2021 +0200
Message: additional filter
Author: René Wagner <rwa@clttr.info>
Date: Sat Jun 26 13:16:35 2021 +0200
Message: update 2021-06-26
Author: René Wagner <rwa@clttr.info>
Date: Wed Jun 16 21:18:53 2021 +0200
Message: exclude godocs.io
Author: René Wagner <rwa@clttr.info>
Date: Mon Jun 14 09:13:51 2021 +0200
Message: error handling on page crawl save
Author: René Wagner <rwa@clttr.info>
Date: Fri Jun 04 11:40:44 2021 +0200
Message: update 2021-06-04
Author: René Wagner <rwa@clttr.info>
Date: Sat May 29 10:56:34 2021 +0200
Message: more exception handling on link update
Author: René Wagner <rwa@clttr.info>
Date: Thu May 27 15:24:13 2021 +0200
Message: fix wrong embedding of excludes
Author: René Wagner <rwa@clttr.info>
Date: Wed May 26 13:06:36 2021 +0200
Message: unify capitalisation of charset in statistics
Author: René Wagner <rwa@clttr.info>
Date: Tue May 25 22:05:40 2021 +0200
Message: move exclude definition to own file
closes #18
Author: René Wagner <rwa@clttr.info>
Date: Tue May 25 21:13:28 2021 +0200
Message: news 2021-05-25
Author: René Wagner <rwa@clttr.info>
Date: Fri May 21 21:58:18 2021 +0200
Message: some exception handling and updated service files
Author: René Wagner <rwagner@rw-net.de>
Date: Sun May 16 09:59:42 2021 +0200
Message: fix last wrong exception in crawl
Author: René Wagner <rwagner@rw-net.de>
Date: Fri May 14 20:59:54 2021 +0200
Message: fix wrong exception handling in crawl
Author: René Wagner <rwagner@rw-net.de>
Date: Wed May 12 17:46:33 2021 +0200
Message: update 2021-05-12
Author: René Wagner <rwagner@rw-net.de>
Date: Mon May 10 17:41:06 2021 +0200
Message: rewrite statistics gathering to pure sql
the peewee functions lead to a stupid error
because too many variables are generated
fixes #21
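As a rough illustration of gathering statistics in plain SQL, the sketch below lets the database do the aggregation in one statement with no bound variables at all. The `content_type` column and the function name are assumptions; the actual statistics queries are not shown in this log.

```python
import sqlite3


def count_by_content_type(conn):
    """Return (content_type, count) rows, most frequent first."""
    rows = conn.execute(
        "SELECT content_type, COUNT(*) FROM page "
        "GROUP BY content_type "
        "ORDER BY COUNT(*) DESC, content_type"
    )
    return rows.fetchall()


# Small in-memory demonstration with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT PRIMARY KEY, content_type TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [
        ("gemini://a.example/", "text/gemini"),
        ("gemini://b.example/", "text/gemini"),
        ("gemini://c.example/notes.txt", "text/plain"),
    ],
)
stats = count_by_content_type(conn)
```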
Author: René Wagner <rwagner@rw-net.de>
Date: Sat May 08 21:51:48 2021 +0200
Message: exception handling on page save
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Apr 14 21:33:27 2021 +0200
Message: news 2021-04-14
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Apr 05 08:07:46 2021 +0200
Message: delete tmp files of whoosh
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Mar 25 21:33:31 2021 +0100
Message: use .fromisoformat for getting timestamp from db
tentative fix for #17
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Mar 25 21:10:54 2021 +0100
Message: various corrections
Author: René Wagner <rwagner@rw-net.de>
Date: Sat Mar 20 20:58:58 2021 +0100
Message: hack: index update in separate dir
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Mar 08 19:21:29 2021 +0100
Message: skip a capsule after 5 consecutive failed requests
This state is reset after the current crawl
closes #16
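The skip rule can be sketched with a per-host failure counter that is created fresh for each crawl, so the state resets automatically afterwards. The class and method names below are hypothetical.

```python
MAX_CONSECUTIVE_FAILURES = 5


class FailureTracker:
    """Track consecutive failed requests per host during one crawl."""

    def __init__(self, limit=MAX_CONSECUTIVE_FAILURES):
        self.limit = limit
        self.failures = {}

    def record(self, host, success):
        if success:
            # Any successful request breaks the streak.
            self.failures[host] = 0
        else:
            self.failures[host] = self.failures.get(host, 0) + 1

    def should_skip(self, host):
        return self.failures.get(host, 0) >= self.limit
```

Because the counter lives only in the tracker object, starting the next crawl with a new `FailureTracker` gives every capsule a clean slate.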
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Mar 08 18:59:55 2021 +0100
Message: workaround for "index update blocks searches"
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Mar 08 18:59:09 2021 +0100
Message: news update 2021-03-08
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Mar 08 18:51:28 2021 +0100
Message: Merge branch 'master' of git://natpen.net/gus
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Mar 05 19:02:58 2021 +0100
Message: update poetry deps
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Feb 26 18:52:51 2021 +0100
Message: gsi specific updates 2021-02-26
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Feb 22 19:06:02 2021 +0100
Message: robots.txt sections "*" and "indexer" are honored
We no longer use the "gus" section for ease of implementation.
It's probably barely used anyway.
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Feb 12 08:05:34 2021 +0100
Message: correctly handle robots.txt
Honor the robots.txt entries of "indexer" and "gus" as well
as the default * section.
The robot_file_map.p must be deleted on a live instance
after this change has been applied to refetch all robots
files, as previously only empty files have been stored.
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Feb 12 08:53:20 2021 +0100
Message: add verbose search to robots.txt
This was missing in the first place.
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Feb 10 19:05:47 2021 +0100
Message: Merge branch 'master' of git://natpen.net/gus
Author: René Wagner <rwagner@rw-net.de>
Date: Mon Feb 08 17:43:19 2021 +0100
Message: add some forbidden URIs & set max_crawl_depth
Author: René Wagner <rwagner@rw-net.de>
Date: Sun Feb 07 19:11:45 2021 +0100
Message: remove seed-requests from repo
Author: René Wagner <rwagner@rw-net.de>
Date: Sun Feb 07 17:48:36 2021 +0100
Message: Merge branch 'master' of git://natpen.net/gus
Author: René Wagner <rwagner@rw-net.de>
Date: Tue Feb 02 18:38:00 2021 +0100
Message: update python deps
Author: René Wagner <rwagner@rw-net.de>
Date: Tue Feb 02 17:39:42 2021 +0100
Message: updates geminispace.info 2021-02-02
Author: René Wagner <rwagner@rw-net.de>
Date: Sun Jan 31 21:08:02 2021 +0100
Message: introduce systemd-unit for indexer
The indexer is launched by systemd when the crawler finishes.
Author: René Wagner <rwagner@rw-net.de>
Date: Sun Jan 31 15:04:10 2021 +0100
Message: gsi specific updates
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jan 30 07:15:18 2021 -0800
Message: Make README heading lines more consistent
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jan 30 07:05:46 2021 -0800
Message: Fix trailing whitespace and reformat long string
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Jan 29 14:43:11 2021 +0100
Message: gsi specific updates 2021-01-29
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Jan 28 20:59:02 2021 +0100
Message: add systemd-units for automatic crawling
The template runs the crawler once a week on saturday afternoon.
If other launch times are wanted, gus-crawl.timer needs to be
modified.
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Jan 27 13:35:54 2021 +0100
Message: add "/robots.txt" route to views.py
It's a hard coded approach to serve a robots.txt to other crawlers.
No crawler may access /add-seed & /threads and all relevant virtual agents
may not access /search and /backlinks
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Jan 27 10:23:05 2021 +0100
Message: modify views to match geminispace.info
Author: Gogs <gogs@fake.local>
Date: Thu Jan 21 21:08:39 2021 +0100
Message: add seeds & update ignored urls
Author: ugla <ugla@u8.is>
Date: Sat Dec 26 18:30:35 2020 +0100
Message: Defer search requests to threads
Author: Remco <me@rwv.io>
Date: Tue Dec 22 12:46:04 2020 +0100
Message: Health test script and systemd service
Just for reference, it's already running elsewhere.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Dec 22 10:00:58 2020 -0500
Message: [serve] Fix copy-paste error in status endpoint function name
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Dec 21 12:04:41 2020 -0500
Message: [serve] Add status endpoint
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Dec 08 10:10:58 2020 -0500
Message: [serve] Improve formatting of statistics page
The right-alignment of numbers stopped working since the number of
things in Geminispace got too big, so adding two extra spaces to each
alignment block.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Dec 06 11:29:38 2020 -0500
Message: [build_index] Import should_skip
Otherwise it breaks :)
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Dec 06 11:28:56 2020 -0500
Message: Refactor change frequency constants
Put the increments in the constants file, and standardize naming.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Dec 05 09:04:23 2020 -0500
Message: [crawl] Abort robots.txt parsing attempt if not text/plain
Python's built-in robots.txt parsing functionality breaks if the
content type of the robots.txt is not correctly set to text/plain. If
this is the case, simply abort the parsing attempt and allow all.
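A sketch of this guard, assuming the crawler has the response's MIME type (the Gemini header meta field) and the body at hand. `build_robots_parser` is a hypothetical helper; only `urllib.robotparser.RobotFileParser` and its `parse()` and `can_fetch()` methods come from the standard library.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(meta, body):
    """Parse robots.txt only if it was served as text/plain; otherwise
    return a parser without rules, which allows everything."""
    parser = RobotFileParser()
    media_type = meta.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        parser.parse([])  # no rules: every path is allowed
        return parser
    parser.parse(body.splitlines())
    return parser


body = "User-agent: *\nDisallow: /\n"
strict = build_robots_parser("text/plain; charset=utf-8", body)
lenient = build_robots_parser("text/gemini", body)
```

Here `strict` honors the `Disallow: /` rule, while `lenient` ignores the body entirely because of the wrong content type.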
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Nov 26 14:56:11 2020 -0500
Message: [serve] Update contributions list on about page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Nov 26 14:47:53 2020 -0500
Message: Bind to both IPv4 and IPv6
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 22 21:50:41 2020 -0500
Message: [crawl] Ignore another radio stream
Author: Remco <me@rwv.io>
Date: Fri Nov 20 23:37:03 2020 +0100
Message: Speed up get_newest_hosts
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Nov 17 09:09:05 2020 -0500
Message: Add some more tests of GeminiResource
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Nov 17 08:32:02 2020 -0500
Message: Add regex-based url exclusion support & refactor tests
This commit adds support for excluding URLs by regex, which is more
powerful than the prefix- and suffix-based exclusions GUS has so far
supported. There have been a number of cases, primarily involving
wiki-type sites, where it would be useful to match a URL by a pattern
that occurs in the middle of the URL, which is now possible. An
example of this is twinwiki's "_history" and "_revert" pages.
This commit refactors the existing test file to a more native pytest
style, from the previous unittest style. Additionally, it adds a new
set of tests for the URL exclusion functionality, covering both the
new regex-based exclusion functionality described above, as well as
the older style of prefix/suffix-based exclusion.
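The combination of prefix, suffix, and regex exclusions could look roughly like the sketch below. All listed prefixes, suffixes, and patterns are made-up examples; the real exclusion lists and the exact shape of twinwiki's URLs are not given in this log.

```python
import re

# Hypothetical exclusion lists for illustration only.
EXCLUDED_PREFIXES = ["gemini://example.org/cgi-bin/"]
EXCLUDED_SUFFIXES = [".mp3"]
EXCLUDED_PATTERNS = [re.compile(r"/_(history|revert)\b")]


def is_excluded(url):
    """Return True if the URL matches any prefix, suffix, or regex exclusion."""
    if any(url.startswith(prefix) for prefix in EXCLUDED_PREFIXES):
        return True
    if any(url.endswith(suffix) for suffix in EXCLUDED_SUFFIXES):
        return True
    # search() matches anywhere in the URL, which prefixes and suffixes cannot do.
    return any(pattern.search(url) for pattern in EXCLUDED_PATTERNS)
```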
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 08:50:53 2020 -0500
Message: Add TODO to README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 08:44:31 2020 -0500
Message: Take exclusions into account when generating statistics
This will ensure accuracy of the statistics - it's relatively common
that index-excluded content ends up in the database, so this will make
sure the db-based calculations are generally more harmonious with the
index-based calculations/searches.
Note that it's not perfect, since I didn't address the calculations by
content_type/charset/etc. Those are a bit trickier to fix, so I will
have to think a bit more about the best way to deal with that. I
suspect it might warrant a bit of rearchitecting how exclusions work
generally. One idea I currently have for that is to keep the exclusion
list in the database, instead of in code like it currently is - that
would allow for inner joining against an exclusion table in db
queries, which would be really convenient.
Also, this commit removes the superfluous query for getting
domain_count - it's more performant just to count the list of domains
that were already constructed from the previous query.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 08:01:10 2020 -0500
Message: [serve] Fix formatting of dates on statistics page
Similar to the footer, these dates just need to be passed to the
datetime formatter GUS has defined for Jinja templates.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 07:50:07 2020 -0500
Message: Add two new TODOs to README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 07:49:37 2020 -0500
Message: [build_index] Only index text pages <= 1KB in size
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 07:49:19 2020 -0500
Message: More exclusions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 16 07:47:53 2020 -0500
Message: [serve] Fix index closing when program is killed
This got broken during the recent commit to put search functionality
in the search.py Index class
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 15 10:56:01 2020 -0500
Message: [crawl] Increase increment to temp error change frequency
The crawls seem to be spending too much time on these, and there seems
to be a steady stream of new ones that all look like a common word
followed by a common TLD. Each one of these causes a long-running,
ultimately failed DNS lookup, so it ends up taking a long time. This
change should help naturally filter them out of crawls more quickly.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 15 09:19:14 2020 -0500
Message: [serve] Update indexing documentation
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 15 08:41:44 2020 -0500
Message: [serve] Update about page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 15 08:30:29 2020 -0500
Message: Bump rolling writer's batch size back up to 5000
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 15 08:30:01 2020 -0500
Message: More exclusions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Nov 14 11:06:51 2020 -0500
Message: Add systemd config
This is how GUS is already being run, so now checking the config into
the repository to start version controlling it.
Author: Remco <me@rwv.io>
Date: Fri Nov 13 14:24:36 2020 +0100
Message: Move all whoosh related stuff into separate module
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Remco <me@rwv.io>
Date: Thu Nov 12 21:03:07 2020 +0100
Message: A friend for the other duck
The second duck should acknowledge the first duck, don't you think?
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Nov 11 07:27:50 2020 -0500
Message: Bump dependencies
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Nov 11 07:18:28 2020 -0500
Message: [build_index] Fix logging statement
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Nov 11 07:17:25 2020 -0500
Message: [serve] Add statistics_overall_historical template
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 08:56:01 2020 -0500
Message: Add .git-blame-ignore-revs file
As of Git 2.23, this can be used to exclude commits in Git blame
calculations. This is really helpful in excluding bulk
change/reformatting commits that don't do anything to affect code
functionality, but touch lots of files, and can make the commit
history more difficult to follow.
Per the official documentation
[here](https://www.git-scm.com/docs/git-blame), you can take advantage
of this manually like so:
    git blame --ignore-revs-file .git-blame-ignore-revs foo.py
Additionally, you can set this up as a persistent repo-level
configuration setting like so, if so desired:
    git config blame.ignoreRevsFile .git-blame-ignore-revs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 08:44:51 2020 -0500
Message: [crawl] Make logging message slightly clearer
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 08:44:20 2020 -0500
Message: Check for null input in new strip_control_chars function
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 08:43:16 2020 -0500
Message: Update default logging config to log to both console and file
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 08:42:57 2020 -0500
Message: Reformat code with Black
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Nov 06 07:22:02 2020 -0500
Message: [crawl] Strip control chars from URLs in crawl logging
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Nov 03 08:38:23 2020 -0500
Message: Add exclusion improvement TODO to README
Author: Remco van 't Veer <remco@remworks.net>
Date: Sun Nov 01 15:39:26 2020 +0100
Message: Ignore link-like lines in preformatted text blocks
Blocks of text between ``` lines should not be interpreted as markup.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
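The behavior described in the commit above can be sketched as a line-by-line scan of gemtext that toggles a flag on every fence line. This is not the project's parser; `extract_links` is a hypothetical function written for illustration.

```python
FENCE = "`" * 3  # the preformatted toggle marker in gemtext


def extract_links(gemtext):
    """Collect link targets, ignoring anything inside preformatted blocks."""
    links = []
    preformatted = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            # A fence line switches preformatted mode on or off.
            preformatted = not preformatted
            continue
        if preformatted:
            continue  # link-like lines in here are plain text, not markup
        if line.startswith("=>"):
            parts = line[2:].split(maxsplit=1)
            if parts:
                links.append(parts[0])
    return links


sample = "\n".join(
    [
        "=> gemini://example.org/real.gmi A real link",
        FENCE,
        "=> gemini://example.org/fake.gmi Just sample text",
        FENCE,
        "=>gemini://example.org/other.gmi",
    ]
)
found = extract_links(sample)
```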
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 02 08:39:15 2020 -0500
Message: Add contributors section to about page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Nov 02 08:38:46 2020 -0500
Message: Fix the index build
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Nov 01 11:05:07 2020 -0500
Message: Clean up todo list in README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Oct 31 10:06:14 2020 -0400
Message: [build_index] Flush index segments to disk periodically
Author: Remco van 't Veer <remco@remworks.net>
Date: Sat Oct 31 16:53:41 2020 +0100
Message: Logging
Replace all print statements in the crawler and indexer with log
statements. Use logging categories to distinguish between debug
information (level "debug"), progress (level "info"), and things that
might need attention at some point (level "warn").
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Remco van 't Veer <remco@remworks.net>
Date: Sat Oct 31 16:53:40 2020 +0100
Message: Drop unused imports
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Oct 31 07:23:29 2020 -0400
Message: Update gusmobile clone location in pyproject.toml
Author: Remco van 't Veer <remco@remworks.net>
Date: Tue Oct 27 20:26:59 2020 +0100
Message: Include notes on updating the index
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Remco van 't Veer <remco@remworks.net>
Date: Tue Oct 27 17:02:13 2020 +0100
Message: Describe procedure to get gus up and running
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Remco van 't Veer <remco@remworks.net>
Date: Tue Oct 27 17:02:12 2020 +0100
Message: Fix missing database column indexed_at on Page
It's used but never defined.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Oct 28 06:55:18 2020 -0400
Message: [crawl] Add a few new exclusions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Oct 28 06:50:06 2020 -0400
Message: [build_index] Perform prefix-based URL exclusion during index build
Previously this exclusion only happened while performing the crawl,
but for a number of reasons, pages have ended up in the database that
should be excluded from the index. Some due to user error, some due to
the exclusion list growing over time.
The fact that they're still in the database means they are probably
impacting db-based calculations, so longer-term there probably should
be some sort of pruning process or something to keep the db entries
pared down to only what we care about.
Even after adding such pruning functionality though, I think this
changeset would still be valuable to ensure the index only gets valid
entries.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Sep 16 08:56:40 2020 -0400
Message: [serve] Add "jump to page" functionality to search
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Sep 16 08:43:12 2020 -0400
Message: [serve] Upgrade to Jetforce v0.6.0
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Sep 16 07:02:39 2020 -0400
Message: [serve] Add more quotes
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Sep 06 06:21:49 2020 -0400
Message: [serve] Update documentation and links a bit
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Sep 04 08:21:41 2020 -0400
Message: [serve] Add dynamic quotes to footer
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Sep 04 07:50:54 2020 -0400
Message: [serve] Add newest pages endpoint, revamp documentation and index
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Sep 03 08:00:33 2020 -0400
Message: [serve] Add newest hosts route
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Aug 25 04:37:52 2020 -0400
Message: [serve] Remove extra quotation mark in add seeds template
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Aug 11 08:30:50 2020 -0400
Message: [crawl] Print change_frequency
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Aug 11 08:18:04 2020 -0400
Message: Fix bug in GeminiResource url construction
It wasn't adding the colon to the scheme of URLs that started with
"//".
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Aug 09 09:18:19 2020 -0400
Message: [threads] Only work with textual pages
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Aug 05 14:33:07 2020 -0400
Message: [serve] Add favicon.txt route
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Aug 05 09:03:56 2020 -0400
Message: [serve] Add IP addresses to about page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Aug 05 09:03:27 2020 -0400
Message: [threads] Add different sort orders for threads
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Aug 03 12:55:10 2020 -0400
Message: [serve] Improve feed matching
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Aug 02 09:51:17 2020 -0400
Message: Update naming
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Aug 02 09:46:53 2020 -0400
Message: [crawl] Improve handling of change_frequency
This change centralizes the logic into lib/gemini.py for a start.
Additionally it fixes a bug in that the crawl was incrementing the
change_frequency when the page *was* changed. And lastly, this now
adds some pikkulog detection, so those pages get crawled frequently as
well now (which will help them stay current in thread construction).
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Aug 02 05:45:28 2020 -0400
Message: [serve] Add Known Feeds page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Aug 02 05:42:59 2020 -0400
Message: [threads] Add collapsible log variations
Currently this does some work for both duplicated content (the last
two entries) as well as redirects (the first three entries). Fine for
now, but the redirect magic could and should be made more robust by
actually resolving the redirect chain in the index when attempting to
build threads.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 28 08:56:01 2020 -0400
Message: [threads] Fix thread ordering
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 28 07:04:45 2020 -0400
Message: [crawl] Index more errors
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 28 07:04:06 2020 -0400
Message: [crawl] Add change_frequency backoff
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 28 07:03:39 2020 -0400
Message: Bump dependencies
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 28 07:02:50 2020 -0400
Message: Add friendly authors and titles for threads
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 27 14:50:15 2020 -0400
Message: Threads v1
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jul 24 06:43:53 2020 -0400
Message: [serve] Save searches to db
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jul 23 14:40:17 2020 -0400
Message: [build_index] [serve] Distinguish cross-capsule backlinks
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jul 23 09:44:49 2020 -0400
Message: [crawl] Add is_cross_host_like field to db
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jul 23 08:35:09 2020 -0400
Message: Gitignore all the indexes
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jul 23 08:29:27 2020 -0400
Message: Bump dependencies
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jul 23 06:54:55 2020 -0400
Message: Create scripts directory
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jul 22 13:29:00 2020 -0400
Message: Add normalized url to db
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 21 15:43:56 2020 -0400
Message: [serve] Add cert change to news page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jul 21 14:49:36 2020 -0400
Message: [build_index] Account for per-page expiration
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 20 08:19:03 2020 -0400
Message: [build_index] Build index with backlink_count instead of backlinks
This works because all the actual fetching of backlinks is now handled
by database queries, so we can slim down the whoosh index a bit with
this change.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 20 07:56:52 2020 -0400
Message: [crawl] Start indexing errors
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 19 09:23:46 2020 -0400
Message: [crawl] Update db model, and delete links before recreating
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 19 08:18:31 2020 -0400
Message: [crawl] Ensure manual exclusions stay out of the database
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 19 07:35:19 2020 -0400
Message: [serve] minor formatting updates
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 19 07:32:05 2020 -0400
Message: [crawl] Support per-page expiration
This will allow crawls to intelligently decide which URLs to recrawl,
if any. Some pages, like site indexes, or gemlog pages, default to
expiring much more quickly than others. This way recrawls should pick
up links to e.g., new posts, fairly quickly. Conversely, existing
posts, and binary files, are considered to be more static, and will
expire much less frequently, and thus be recrawled less frequently.
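A minimal sketch of how such a per-page expiration policy could look. The function name, the two durations, and the "index-like path" heuristic are assumptions for illustration; the commit message does not state the actual values or rules.

```python
from datetime import timedelta
from urllib.parse import urlparse

# Hypothetical durations; the real values are not given in the commit message.
FAST_EXPIRY = timedelta(days=1)
SLOW_EXPIRY = timedelta(days=30)


def expiry_for(url: str, content_type: str) -> timedelta:
    """Return how long a crawled page stays fresh before it is recrawled."""
    path = urlparse(url).path
    # Binary (non-text) files are treated as static.
    if not content_type.startswith("text/"):
        return SLOW_EXPIRY
    # Site roots, directory-style paths and index files tend to gain new links.
    if path in ("", "/") or path.endswith("/") or path.endswith("/index.gmi"):
        return FAST_EXPIRY
    # Individual posts are assumed to change rarely.
    return SLOW_EXPIRY


print(expiry_for("gemini://example.org/gemlog/", "text/gemini"))
print(expiry_for("gemini://example.org/gemlog/2020-07-19-post.gmi", "text/gemini"))
```

A crawler could store `fetched_at + expiry_for(...)` next to each page and skip any URL whose stored expiry still lies in the future.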
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jul 15 09:09:39 2020 -0400
Message: [crawl] Rebuild link table completely and idempotently
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jul 15 08:20:03 2020 -0400
Message: [serve] Get backlinks from db instead of index
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 13 19:55:07 2020 -0400
Message: [crawl] Set cap on maximum redirect chain length
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 13 19:18:16 2020 -0400
Message: [crawl] Abort when detecting self-redirects
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 13 19:17:36 2020 -0400
Message: [crawl] Ignore 80h gopher proxy
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 12 09:27:35 2020 -0400
Message: [serve] Improve pager linking back to previous page
Specifically, if we're linking back to page #1, remove the page number
component of the URL path. This way, if you page forward and back,
then reload, you'll be prompted to enter a query. This improves the
user experience slightly in Elpher.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jul 11 08:33:56 2020 -0400
Message: [serve] Update backlinks links and presentation throughout GUS
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jul 11 06:56:49 2020 -0400
Message: [serve] Improve safety of backlinks code path
Before, it would throw an unhandled exception if the user entered an
invalid URL as their backlinks query.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jul 08 06:18:15 2020 -0400
Message: [crawl] Add feature to seed incremental crawl with atom feeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 06 06:22:01 2020 -0400
Message: Make incremental build_index work
Some of the idempotency was lost during the shuffle to split the crawl
into two phases.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jul 06 06:20:01 2020 -0400
Message: DRY up the sqlite model and init_db code
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 05 08:52:26 2020 -0400
Message: [serve] Improve handling of backlink searches
They were sensitive to trailing slashes before this. An alternative
approach to consider for the future would be to add "normalized_url"
to the index. This would increase the index size, but it normalizes
away trailing slashes, so would eliminate the need for two searches
here and improve performance of backlinks queries. If they turn out to
get a lot of use, this alternative approach will probably be the
better way to go.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 05 08:02:54 2020 -0400
Message: [serve] Add historical statistics page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jul 05 07:01:16 2020 -0400
Message: [crawl] [serve] Run statistics and domains from sqlite db
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jul 04 06:43:27 2020 -0400
Message: Improve discovery of backlinks
Specifically, make sure the query picks up backlinks pointing to both
the slashed and slashless version of the URL in question.
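The idea can be sketched as follows. This is an illustration only: the helper names and the in-memory list of (source, target) link pairs are hypothetical and stand in for whatever store the real query runs against.

```python
def url_variants(url):
    """Return the slashless and the slashed form of a URL."""
    slashless = url.rstrip("/")
    return slashless, slashless + "/"


def find_backlinks(links, target):
    """Collect every source page that links to either form of target."""
    wanted = set(url_variants(target))
    return sorted({src for src, dst in links if dst in wanted})


# Hypothetical sample data: (source page, link target) pairs.
links = [
    ("gemini://a.example/", "gemini://gus.guru"),
    ("gemini://b.example/", "gemini://gus.guru/"),
    ("gemini://c.example/", "gemini://other.example/"),
]
print(find_backlinks(links, "gemini://gus.guru/"))
```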
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jul 03 11:45:23 2020 -0400
Message: [serve] Fix minor bug in counting of backlinks
Empty backlink strings were getting counted as "1" instead of "0".
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jul 03 10:39:56 2020 -0400
Message: [crawl] [serve] Switch crawl to 2-phase with sqlite
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 30 08:57:53 2020 -0400
Message: [crawl] Ignore localhost
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 30 08:54:40 2020 -0400
Message: [serve] Add backlinks news and documentation
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 30 08:28:39 2020 -0400
Message: [serve] Improve verbose mode
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 30 08:24:56 2020 -0400
Message: [serve] Update header levels
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 30 07:07:36 2020 -0400
Message: [crawl] [serve] Add backlinks
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jun 22 16:57:03 2020 -0400
Message: [crawl] Ignore more bad content
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jun 18 07:16:13 2020 -0400
Message: Update README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jun 18 06:58:28 2020 -0400
Message: [serve] Rearchitect serve to use templates and MVC pattern
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 17 09:09:56 2020 -0400
Message: Add GUS licence
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 17 07:36:13 2020 -0400
Message: [serve] Make seed request handling async again for now
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 17 07:33:48 2020 -0400
Message: [crawl] Ignore some more alexschroeder pages
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jun 12 09:38:24 2020 -0400
Message: [serve] Sort domains on the known-hosts page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jun 12 06:40:33 2020 -0400
Message: [serve] Add size to result rendering
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jun 11 06:38:56 2020 -0400
Message: [crawl] Start indexing response sizes
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 10 08:09:33 2020 -0400
Message: [serve] Use preformatted blocks on the statistics page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 09 07:01:45 2020 -0400
Message: Bump dependencies
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue Jun 09 06:55:11 2020 -0400
Message: [crawl] Start indexing lang parameter
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jun 08 07:29:11 2020 -0400
Message: [serve] Update some copy on about page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Jun 08 07:28:39 2020 -0400
Message: Revert "[crawl] Index raw content for regex searches"
This reverts commit c127a0a2e9a03b60d8ea82447c27af6b12cc128b.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jun 07 08:32:16 2020 -0400
Message: [crawl] Ignore some more things
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Jun 07 07:05:02 2020 -0400
Message: [crawl] Add marmaladefoo's calculator to manual exclusions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jun 05 07:35:12 2020 -0400
Message: Add easy CLI way of removing domains from index
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jun 05 06:46:55 2020 -0400
Message: [crawl] Remove manual exclusions for alexschroeder.ch
They updated their robots.txt, so now the Disallow lines are parsing
correctly.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Jun 05 06:41:28 2020 -0400
Message: [crawl] Add custom crawl delays
And add the first one for alexschroeder's site, which still has a
robots.txt that doesn't parse properly.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jun 04 11:27:51 2020 -0400
Message: [crawl] Improve indexing performance
I was getting index out of bound issues for optimize calls before this
change - and when looking in the index/ dirs, there were over 30
thousand files. I think this caused issues with whoosh, so now I am
waiting to commit all the writes to the end of the crawl. It's more
unfortunate if the crawl dies, but c'est la vie. On the plus side, now
the optimize call is no longer really even necessary since the final
product is only a few index segments.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 19:37:37 2020 -0400
Message: Update some seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 16:28:13 2020 -0400
Message: [crawl] Start indexing the charset
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 12:50:59 2020 -0400
Message: [crawl] Only attempt to extract contained resources from text/gemini
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 12:50:39 2020 -0400
Message: [crawl] Ignore some troublesome content from alexschroeder.ch
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 12:50:10 2020 -0400
Message: [crawl] Fix default crawl delay when not specified explicitly
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 10:58:45 2020 -0400
Message: [crawl] Persist index & crawl statistics on non-destructive crawls
Also, add a flag to track which serialized statistics lines originated
from incremental crawls.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 10:53:30 2020 -0400
Message: Bump dependency versions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 10:49:16 2020 -0400
Message: [crawl] Index raw content for regex searches
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Jun 03 10:47:33 2020 -0400
Message: [serve] Use "OR" as the default connector for queries
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 29 14:40:56 2020 -0400
Message: [serve] Make sure two closely-timed seed requests don't break
This will prevent seed requests' incremental crawls from stomping on
each other, but due to the way in which incremental crawls
resolve (i.e., by restarting the entire GUS serve process via
systemctl), it also means any seed requests that came in after the
first will not be handled until either A) another seed request comes
in that ends up dealing with it, or B) a manual crawl is kicked off.
The situation is no worse than before however, so this is still an
improvement in the short-term.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 28 09:02:21 2020 -0400
Message: [crawl] Improve hierarchical handling of robots.txt entries
Give more priority to more specific entries - i.e., an entry for
user-agent "gus" should override an entry for user-agent "*".
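A rough sketch of that priority rule, assuming robots.txt has already been parsed into a dict that maps each user-agent to its list of Disallow prefixes. The dict layout, the function names, and the inclusion of "indexer" in the default order are assumptions for illustration, not the crawler's actual data structures.

```python
def disallowed_prefixes(rules, agents=("gus", "indexer", "*")):
    """Return the Disallow list of the most specific matching user-agent."""
    for agent in agents:  # ordered from most to least specific
        if agent in rules:
            return rules[agent]
    return []


def can_fetch(rules, path):
    """True if no Disallow prefix of the chosen entry matches the path."""
    return not any(path.startswith(p) for p in disallowed_prefixes(rules) if p)


# The "*" entry blocks everything, but the more specific "gus" entry wins.
rules = {"*": ["/"], "gus": ["/private/"]}
print(can_fetch(rules, "/public/page.gmi"))
print(can_fetch(rules, "/private/notes.gmi"))
```

Note that the specific entry replaces the generic one entirely rather than being merged with it, which is how "override" is read here.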
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 26 09:48:26 2020 -0400
Message: [serve] Update copy on known hosts page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 26 06:57:46 2020 -0400
Message: [crawl] Ignore some Geddit URL prefixes
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 25 21:44:46 2020 -0400
Message: [crawl] [serve] Add fetchable URL to the index
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 25 13:19:28 2020 -0400
Message: Bump version of Jetforce dependency
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 25 06:31:14 2020 -0400
Message: [crawl] Improve handling of quoting and unquoting URLs
Before, everything got unquoted at the very beginning of GeminiResource
instantiation. This was slightly errant. It was fine for the
normalized_url and the indexable_url, but resulted in fetchable_url
not being sent quoted, which it should be.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 23:05:00 2020 -0400
Message: Rename fully_qualified_url to fetchable_url
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 23:00:23 2020 -0400
Message: Rename fully_qualified_massaged_url to indexable_url
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 22:54:38 2020 -0400
Message: [crawl] Fix bug in fully_qualified_massaged_url
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 10:08:37 2020 -0400
Message: [crawl] Stop storing responses in GeminiResource objects
I think this was causing memory overflows, since we were storing
potentially a lot of response content in memory without being able to
clean it up during long chains of recursive calls to crawl() of
contained resources.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 10:10:04 2020 -0400
Message: Bump version of gusmobile dependency
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 24 07:28:15 2020 -0400
Message: [crawl] Handle url fragments
Up to this point, fragments weren't being handled at all, so links to
two different fragments on the same page would both get indexed as
distinct results. With this change, we now strip fragments so the only
thing that ends up in the index is the fragmentless-URL one time.
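For illustration, fragment stripping can be done with the standard library's `urllib.parse.urldefrag`. The helper name and sample URLs are hypothetical; this is a sketch of the technique, not necessarily the code the crawler uses.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Drop the '#fragment' part so all fragments of a page map to one URL."""
    return urldefrag(url).url


urls = [
    "gemini://example.org/page.gmi#intro",
    "gemini://example.org/page.gmi#details",
    "gemini://example.org/page.gmi",
]
# All three collapse into a single indexable URL.
unique = sorted({strip_fragment(u) for u in urls})
print(unique)
```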
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 23 09:11:59 2020 -0400
Message: [crawl] Fix handling of robots.txt
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 23 07:19:09 2020 -0400
Message: [crawl] Exclude "rss.xml" paths
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 22 09:18:24 2020 -0400
Message: [crawl] Optimize the index after crawls
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 22 08:42:03 2020 -0400
Message: [serve] Update highlight scoring and rendering
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 22 07:31:18 2020 -0400
Message: [crawl] pickle and unpickle the robot_file_map
This way we don't have to re-request all the robots.txt files during
incremental crawls
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 22 07:20:20 2020 -0400
Message: Improve handling of unquoting URLs
Just do it once at the beginning of GeminiResource creation.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 16:07:42 2020 -0400
Message: [serve] Update documentation on filters
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 15:35:51 2020 -0400
Message: Update locked version of Gusmobile
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 10:59:33 2020 -0400
Message: [crawl] Add domain field to index
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 09:25:18 2020 -0400
Message: Remove outdated TODO
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 09:18:25 2020 -0400
Message: [serve] Update formatting of statistics page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 08:39:54 2020 -0400
Message: [serve] Fix bug with first/next/previous page link formatting
Previously, it wasn't url encoding the query, so if a query had more
than one term, like e.g. "gemini hosting", the link line would get
formatted like "=> /search/2?gemini hosting Next page" so clients
would show the text "hosting Next page".
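A small sketch of the fix using `urllib.parse.quote`. The function name is hypothetical; the link format follows the example quoted in the message above.

```python
from urllib.parse import quote


def next_page_line(query: str, page: int) -> str:
    """Build a gemtext link line with the query percent-encoded."""
    # Without quote(), the space in the query would end the URL early and the
    # rest of the query would be treated as part of the link label.
    return f"=> /search/{page}?{quote(query)} Next page"


print(next_page_line("gemini hosting", 2))
```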
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 07:57:13 2020 -0400
Message: [serve] Only highlight nice content types in search results
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 07:33:36 2020 -0400
Message: [crawl] Make path exclusions more robust
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 06:53:46 2020 -0400
Message: [serve] Remove broken URL count from stats page
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 06:45:49 2020 -0400
Message: Add houston to seeds, but ignore its search results
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 21 06:45:28 2020 -0400
Message: [crawl] [serve] Add search highlights
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed May 20 09:33:58 2020 -0400
Message: [crawl] Index massaged URLs
Up to this point, we were indexing the URL from the gemini response
object. Instead, let's index something that's been a bit more
normalized and cleansed. We want to keep the capitalization, but strip
unnecessary ports and trailing slashes.
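One possible shape of that normalization, as a sketch. The function name is a placeholder, and treating only Gemini's default port 1965 as "unnecessary" is an assumption about what the commit means.

```python
from urllib.parse import urlsplit, urlunsplit

GEMINI_DEFAULT_PORT = 1965


def massage_url(url: str) -> str:
    """Strip the default port and trailing slashes, keeping capitalization."""
    parts = urlsplit(url)
    # Work on netloc rather than hostname, because hostname is lowercased.
    netloc = parts.netloc
    if parts.port == GEMINI_DEFAULT_PORT:
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(massage_url("gemini://gus.guru:1965/"))
print(massage_url("gemini://Example.org/Docs/"))
```

Be aware that `parts.port` raises `ValueError` for a non-numeric port, so real input would need a guard around it.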
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed May 20 09:32:32 2020 -0400
Message: [crawl] Handle trailing slash redirects better
This was recently refactored out, and it resulted in duplicate entries
in the index, like e.g. gus.guru and gus.guru/. This change should
prevent that from happening any more.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed May 20 08:15:27 2020 -0400
Message: [serve] Update the loading of statistics
Do it more dynamically, so after users submit seed requests, they will
show up immediately on the /known-hosts page.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 19 17:08:53 2020 -0400
Message: [crawl] Fix lots of bugs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 19 06:47:51 2020 -0400
Message: [crawl] Crawl the seed requests after the main crawl
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 19 06:36:44 2020 -0400
Message: [crawl] Fix bug in relative URL parsing
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 18 15:52:48 2020 -0400
Message: [crawl] Fix bug with computing full_qualified_urls
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 18 09:12:31 2020 -0400
Message: [crawl] Use standardized print_index_statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 18 09:01:31 2020 -0400
Message: [no-op] Clean up comments in whoosh_extensions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 18 08:57:27 2020 -0400
Message: [serve] Crawl and index seed requests immediately
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 17 10:30:58 2020 -0400
Message: Update README TODOs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 17 10:20:11 2020 -0400
Message: [crawl] Implement GeminiResource
This commit should actually be somewhat close to a no-op, but brings
substantial refactoring of the code to consolidate both functionality
related to gemini URLs as well as the source of truth for crawler
information about them (including relevant metadata) to a new class
called `GeminiResource`.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 17 07:45:25 2020 -0400
Message: [crawl] Exclude GUS search result pages from crawl
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 17 06:21:18 2020 -0400
Message: [crawl] Add seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 14:51:29 2020 -0400
Message: [crawl] Add jan.bio to seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 11:23:25 2020 -0400
Message: Add index.bak to gitignore
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 10:57:49 2020 -0400
Message: [crawl] Create non-destructive crawl option
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 09:23:21 2020 -0400
Message: [serve] Improve documentation on content type queries
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 09:05:55 2020 -0400
Message: [serve] Add verbose mode
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 08:22:57 2020 -0400
Message: [serve] Update how num_results is displayed
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 08:12:22 2020 -0400
Message: [serve] Improve search result data type
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 08:00:35 2020 -0400
Message: [crawl] [serve] Add more statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 16 06:57:40 2020 -0400
Message: [crawl] Update seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 15 08:03:04 2020 -0400
Message: [crawl] Update seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 15 08:01:16 2020 -0400
Message: Update and reorder TODOs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri May 15 06:27:44 2020 -0400
Message: [crawl] [no-op] Add a line after backup operation
Just to visually set it off from the first crawl operation.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 14 15:40:54 2020 -0400
Message: Update statistics TODOs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 14 09:17:59 2020 -0400
Message: [crawl] Add new seed
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 14 08:49:56 2020 -0400
Message: [serve] Update statistics copy slightly
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 14 07:56:28 2020 -0400
Message: [serve] Implement paging
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu May 14 06:59:52 2020 -0400
Message: Update README ideas for more index/usage statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed May 13 10:20:16 2020 -0400
Message: [crawl] Add new spanish site to crawl seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed May 13 09:51:06 2020 -0400
Message: [crawl] Refactor manual exclusions and add fgaz' calculator
The calculator seems to generate links dynamically, so attempting to
crawl it will yield unending pages with links to more deeply-nested
mathematical operations.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 08:52:19 2020 -0400
Message: Add TODO for generating and sharing GUS usage statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 08:46:14 2020 -0400
Message: [serve] Add news feature
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 08:18:40 2020 -0400
Message: [serve] Add page to show all known hosts
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 07:56:30 2020 -0400
Message: [statistics] Add ability to compute and print stats easily
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 07:23:09 2020 -0400
Message: [statistics] Refactor statistics objects to pass around dicts
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 12 07:07:12 2020 -0400
Message: [serve] Add page headers
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 11 14:51:35 2020 -0400
Message: [serve] Update copy for current index statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 11 14:45:53 2020 -0400
Message: [serve] Stop hard-wrapping content
The Gemini spec was recently updated such that content creators are
now requested to NOT hard-wrap their content, so this commit updates
GUS to comply!
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 11 13:56:48 2020 -0400
Message: [serve] Report out current index statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon May 11 13:16:04 2020 -0400
Message: Refactor some common/library code into separate files
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 10 12:12:16 2020 -0400
Message: [serve] Remove TODO to add documentation for content_type
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 10 11:50:46 2020 -0400
Message: [crawl] Alphabetize and add a few more seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 10 10:39:05 2020 -0400
Message: [crawl] Backup old index before running crawl
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun May 10 10:38:47 2020 -0400
Message: [crawl] Add indexed_at field
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 17:34:52 2020 -0400
Message: [crawl] Compute and generate index statistics after each crawl
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 17:23:54 2020 -0400
Message: [serve] Update content_type search documentation
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 16:05:05 2020 -0400
Message: Add TODO to track Geminispace statistics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 14:07:02 2020 -0400
Message: [serve] Add documentation for content_types
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 13:35:28 2020 -0400
Message: [serve] Add note that paging isn't implemented yet
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 13:35:08 2020 -0400
Message: [serve] Put index generation date in footer
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 12:38:49 2020 -0400
Message: Add a couple TODOs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 11:54:45 2020 -0400
Message: [crawl] Add two new seeds
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 11:06:38 2020 -0400
Message: [crawl] Stop printing the sleep duration
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 11:00:01 2020 -0400
Message: [crawl] Improve error recovery
The crawler is starting to hit some indexing errors, presumably with
new content popping up in Geminispace that either is malformed on its
own, or perhaps is fine, but is exposing errors in GUS' crawling code.
As an immediate-term fix, this change commits documents to the index
more frequently, and recovers gracefully from indexing errors with
individual documents. This will slow down the indexing process, but A)
I think that's worth it for the resiliency gain, and B) in practice, it
might not actually slow things down much at all, since the extra
writing time will likely get swallowed up by the kindness-sleep in
between most requests to the same domain (it will still cause extra
waits in between requests to two different domains, and extra time
incurred opening the index for each document regardless of domain).
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat May 09 10:58:18 2020 -0400
Message: [crawl] Adjust link line regex to only match at beginning of line
The crawler was starting to run into errors on source code, which some
people are now hosting in Geminispace, and which sometimes has syntax
that includes `=>` in it. I suppose this could have happened in
non-code contexts as well, but this is the first time it seems to have
loudly broken the crawl.
This fixes it.
Also, it occurs to me that I think there is a "raw-text block" type of
construct in the Gemini spec now, so I should probably add a TODO to
refactor the extract_gemini_links function to exclude any links found
within such a block.
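A minimal sketch of a line-anchored link regex. The pattern and function name are illustrative and not taken from the repository, and the sketch does not yet skip links inside preformatted blocks.

```python
import re

# re.MULTILINE makes ^ match at the start of every line, so "=>" appearing
# in the middle of a line (e.g. in source code) is no longer picked up.
LINK_LINE = re.compile(r"^=>[ \t]*(\S+)", re.MULTILINE)


def extract_link_targets(text: str) -> list:
    """Return the URL part of every gemtext link line in text."""
    return LINK_LINE.findall(text)


sample = (
    "# Title\n"
    "=> gemini://example.org/ Home\n"
    "let f = (x) => x + 1\n"
    "=>\t/about.gmi About\n"
)
print(extract_link_targets(sample))
```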
Author: Natalie Pendragon <natpen@natpen.net>
Date: Tue May 05 08:27:48 2020 -0400
Message: [crawl] Respect robots.txt crawl_delays and add a kind default
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Apr 17 09:24:57 2020 -0400
Message: Add some TODOs
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Apr 16 18:40:13 2020 -0400
Message: [serve] Fix bug in displaying "input" results
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Apr 16 18:39:31 2020 -0400
Message: Update dependencies
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Apr 16 18:19:07 2020 -0400
Message: [crawl] fix crawl bug with robots.txt
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Apr 16 18:18:25 2020 -0400
Message: [serve] Update formatting
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Mar 14 22:50:06 2020 -0400
Message: Improve it all
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Mar 05 08:55:16 2020 -0500
Message: [serve] Add seed request tracking
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Mar 05 07:50:57 2020 -0500
Message: [serve] Update aesthetics
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Mar 04 08:08:48 2020 -0500
Message: Add search suggestions
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Mar 04 08:08:03 2020 -0500
Message: Update indexing and query parsing
Author: Natalie Pendragon <natpen@natpen.net>
Date: Wed Mar 04 08:06:25 2020 -0500
Message: Add TODO to track freshness of content
Author: Natalie Pendragon <natpen@natpen.net>
Date: Mon Mar 02 06:43:56 2020 -0500
Message: [crawl] Respect "indexer" robots.txt entries
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Mar 01 12:12:28 2020 -0500
Message: Add more feature ideas to the README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Mar 01 12:12:16 2020 -0500
Message: Index and serve mime types
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Feb 29 08:33:12 2020 -0500
Message: Improve README readability
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Feb 29 08:31:15 2020 -0500
Message: Add README todo to add paging
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Feb 29 08:27:39 2020 -0500
Message: [serve] Remove numbers from search result rows
These were causing visual confusion with the way that some gemini
clients (like bombadillo and av-98) assign numbers to links and print
them in-line, which ends up being right next to these result numbers.
I don't think the result numbers provided much extra value, even when
not causing visual confusion, so this commit simply removes them.
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Feb 29 08:13:22 2020 -0500
Message: Update README.md
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Feb 27 09:06:38 2020 -0500
Message: Update README
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Feb 27 08:45:55 2020 -0500
Message: Make GUS easier to run for others
This commit does the following:
1) adds a README with setup instructions
2) updates the dependency specification for gusmobile
to no longer point at a relative directory, which
is very likely unique to how I manage code directories
personally, and instead use a Git reference to the
forked version of gusmobile with the same changes.
For local hacking on the fork of Gusmobile, one should
clone that repository, update the pyproject.toml to
point to it on the local filesystem, and regenerate
their virtualenv.
I also considered simply copying out the relevant
code from the upstream gusmobile, but I have a goal
of maturing the hacks to it into more legit/robust
improvements that can eventually be contributed
back upstream :)
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Feb 23 09:30:52 2020 -0500
Message: Add some new seed sites
Author: Natalie Pendragon <natpen@natpen.net>
Date: Fri Feb 21 08:44:01 2020 -0500
Message: Respect robots.txt
Author: Natalie Pendragon <natpen@natpen.net>
Date: Thu Jan 30 08:47:38 2020 -0500
Message: Initial commit
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Feb 07 08:23:34 2021 -0800
Message: Add a few more url parsing test cases
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sun Feb 07 08:20:26 2021 -0800
Message: Update to Python 3.9 compatibility
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Feb 04 21:06:57 2021 +0100
Message: update python deps
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Feb 04 21:05:38 2021 +0100
Message: introduce systemd-unit for indexer
The indexer is launched by systemd when the crawler finishes.
When launched through the unit, the output to stdout is
redirected to systemd-journald. There's no need for additional
file output, thus it has been removed.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jan 30 07:15:18 2021 -0800
Message: Make README heading lines more consistent
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Jan 29 10:08:22 2021 +0100
Message: add systemd-units for automatic crawling
The template runs the crawler once a week on Saturday afternoon.
If other launch times are wanted, gus-crawl.timer needs to be
modified.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat Jan 30 07:05:46 2021 -0800
Message: Fix trailing whitespace and reformat long string
Author: René Wagner <rwagner@rw-net.de>
Date: Thu Jan 28 11:33:45 2021 +0100
Message: add "/robots.txt" route to views.py
It's a hard coded approach to serve a robots.txt to other crawlers.
No crawler may access /add-seed or /threads, and all relevant virtual agents
are additionally denied access to /search and /backlinks.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Feb 10 11:06:47 2021 +0100
Message: limit max_crawl_depth to 100 for normal crawl
There are capsules out there that kill the crawler due
to a recursion exceeding the limits of python.
Python limit seems to be around 1000, so the value
can be increased if needed, but I don't think we
miss anything with the current value.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
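A simplified sketch of a depth-limited recursive crawl. The `crawl` signature and the `get_links` callback are hypothetical; only the limit of 100 comes from the commit message.

```python
MAX_CRAWL_DEPTH = 100  # well below Python's default recursion limit of 1000


def crawl(url, get_links, depth=0, seen=None):
    """Recursively visit url and its links, stopping at MAX_CRAWL_DEPTH."""
    if seen is None:
        seen = set()
    if depth > MAX_CRAWL_DEPTH or url in seen:
        return seen
    seen.add(url)
    for link in get_links(url):
        crawl(link, get_links, depth + 1, seen)
    return seen


# A capsule that generates an ever deeper link on every page.
def endless(url):
    return [url + "x/"]


visited = crawl("gemini://example.org/", endless)
print(len(visited))
```

Without the depth check, the endless capsule above would make the recursion run until Python raises `RecursionError`.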
Author: René Wagner <rwagner@rw-net.de>
Date: Wed Feb 10 07:07:12 2021 +0100
Message: increase frequency to avoid rescanning within a single crawl
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Feb 12 08:53:20 2021 +0100
Message: add verbose search to robots.txt
This was missing in the first place.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Author: René Wagner <rwagner@rw-net.de>
Date: Fri Feb 12 08:05:34 2021 +0100
Message: correctly handle robots.txt
Honor the robots.txt entries of "indexer" and "gus" as well
as the default * section.
The robot_file_map.p must be deleted on a live instance
after this change has been applied to refetch all robots
files, as previously only empty files have been stored.
Signed-off-by: Natalie Pendragon <natpen@natpen.net>