💾 Archived View for d.moonfire.us › blog › 2018 › 06 › 26 › author-intrusion-0.9.0 captured on 2024-05-26 at 14:57:56. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-04-26)
-=-=-=-=-=-=-
After a few weeks of work, the current rewrite of Author Intrusion got to a stopping point. This has the barest minimum functionality to detect echo words but it has a long way to go. Depressingly long way, but I need to let it settle a little before I jump back into it.
I also want a little encouragement so I'm going to toot my own horn and show some progress.
I think this is the ninth attempt I've had to write Author Intrusion[1]. Each time, I've encountered various walls where my ideas didn't have the performance or couldn't conceptually move beyond the proof of concept. Over the last eight years, I've learned a lot about writing this and each time I hope “this is it”.
The current goal is to write a command-line interface (CLI) inspired by compilers (like GCC and TypeScript `tsc`) and Git. Basically, have AI have a core set of function but don't worry about re-implementing a text editor (which was, in all honesty, one of my more common mistakes for previous versions).
I spent a lot of time trying to get a solid, cross-platform GUI for this. That includes various Gtk# and Electron implementations before I decided to switch to CLIs.
For reference, this is some of my efforts five years ago:
The “mmm” was a search and replace on one of my larger commissions to test performance. In this case, it was a 100k commissioned novel.
There were a lot of attempts in there and I spent a lot of time trying to create an editor. While I was “fairly” successful, I think it made the project too big for one person to do. Each major iteration, I've been removing features trying to get to a core functionality while still honoring my core goals of helping me write.
After the attempt to write it in Typescript, I decided this version is going to use .NET Core. The folks at Microsoft have done an amazing job of writing something that is fast and capable while running on both Windows and Linux (one of my requirements). It also is my core language, so I'm not struggling with learning and the tools like I was with Typescript. My primary environment is Visual Studio 2017 with ReSharper[2], Rider[3], or Visual Studio Code[4]. While I use Atom[5], global configuration options pretty much means my Atom setup is specifically for writing, not coding.
2: http://www.jetbrains.com/resharper/
3: http://www.jetbrains.com/rider/
4: https://code.visualstudio.com/
However, after the Typescript implementation and later code, I realize that the approach `npm` uses to manage packages is perfect for what I'm looking for. My struggles with various incarnations of MfGames Writing[6] showed me that bitrot is a major problem while writing over years. The earlier incarnations of the build framework would evolve to handle new novels but then it would break older generation in the process. With `npm`, I can have a specific version in one project, then the library could continue to evolve while still providing the ability to stay at an older version for those older works.
6: https://gitlab.com/mfgames-writing
NuGet[7] has a number of C# libraries for writing a client that would give me the same thing. Various utilities, analyzers, and libraries can be packaged up as NuGet packages and then installed. If they evolve, the older version can remain behind and still work.
This version has the basics of this in, I can install packages and have various functionality available for processing.
This part is actually one of the neatest parts, I think. I'm using Autofac[8], a bit of reflection, and the NuGet libraries to install packages and then load assemblies from those packages without needing to pull them into a central location.
As soon as the plugins load, the system rebuilds the plugins and injects functions. This can be various plugins, new XSLT functions, or anything else.
The above screenshot is an example of the layout plugin. It tags files in the project as to their purpose. I made a mistake earlier on this when I started doing grammar checking on notes. In this case, a plugin can indicate something as “content” (e.g., the novel or story) or “lookup” (notes) or something else.
Right now, I'm only implementing my “standard” project layout that I've used for the last few projects. Eventually I'll write more but I'm trying to get end-to-end before fleshing out ideas as needed.
A layout plugin is responsible for gathering metadata about a file so decisions can be made. Eventually this will go into the YAML header for the file. This means I'll be able to identify files that have a specific point of view with something like:
--- title: Chapter 1 pov: Dylan swain: scene --- It was a bright and depressingly sunny day....
The idea for this is to be able to list all chapters of a given POV and then arrange them chronologically while listing the location. Or tag a file as being a scene or sequel[9] so they are interspersed correctly.
9: https://en.wikipedia.org/wiki/Scene_and_sequel
A structure plugin basically figures out the structure of a file, such as figuring out if it is broken into paragraphs, sentences, and words. I didn't want to hard-code this because sentence splitting is hard and expensive plus my fantasy novels[10] all have epigraphs. Those who like Scrivner may want to arrange it into scenes.
10: https://fedran.com/sand-and-blood/chapter-01/
Structure plugins are boring but critical.
An important part of the structure is not applying a structure to certain files. Lookup files don't need to know the individual words or paragraphs. To limit it, I'm using a `scope` variable which is an Xpath into the project.
<project> <file path="/chapters/chapter-01.md" class="content" /> <file path="/characters/dylan.md" class="lookup" /> </project>
This means, using a path of `/content[is-content()]` will select only the content files but not the lookup. Originally, I implemented this as a CSS-like library which I eventually realized I was going down a rabbit hole (I still broke apart the library for later if I need it).
Again, C# has the ability to have defined XSLT functions so I wrote `is-content()` (injected via Autofac) that does custom logic.
As the various structure plugins operate (defined by the `author-intrusion.aipy` file), it will extend the XML structure used for selections.
<project> <file path="/chapters/chapter-01.md" class="content"> <para start="0" length="10" /> <para start="11" length="21" /> </file> <file path="/characters/dylan.md" class="lookup" /> </project>
<project> <file path="/chapters/chapter-01.md" class="content"> <para start="0" length="10"> <token start="0" length="3" /> <token start="5" length="4" /> <token start="9" length="1" /> </para> <para start="11" length="21"> <token start="11" length="3" /> <token start="16" length="4" /> <token start="20" length="1" /> </para> </file> <file path="/characters/dylan.md" class="lookup" /> </project>
This is because of another previous mistake (yeah, I made a lot). Various implementations tried to normalize contents while writing. This meant it would correct double spaces after periods or adjust the text.
That didn't work.
I also couldn't break it down into a simple tree structure because English didn't fit well. So, this XML just goes into the original file to get the text.
The entire reason to break apart a document into a structure is for the analysis. An analysis plugin, such as echo detection, uses the XML structure to gather information.
All plugins are configured in the `author-intrusion.aipy` (project file in YAML format).
plugins: analysis: - plugin: EchoDetection key: echoes scope: content select: //token[length() > 3] compare: text() within: 20 warning: 2 error: 5
The `plugin` attribute is the class to use the plugin. The key is just a label for Atom's linter in case someone uses multiple echo detections in a file.
The second section is to figure out what is being detected. The `scope` uses `content` which is a shorthand for `/file[is-content()]` or `/file[has-class("content")]`. This breaks apart the search process. Using `/` for the path would do echo detection across chapters (and require more memory and would be slower) while the file-level ones makes it more efficient by only comparing a single file against itself.
For every scope, the `select` figures out what is going to be compared. In this example, for every file, we select every word over three characters long (`//token[length() > 3]`). If we had three chapters of a thousand words each, this is a different of a single list of three thousand items (`scope: /`) verses three lists of a thousand each (`scope: //file[is-content()]`), or three hundred paragraphs of ten words each (`scope: //file[is-content()]/para`).
The third section is the `compare`. This basically figures out what to compare. The `text()` means the raw text of the file. However, functions to handle case-insensitivity, stemming (base words so `I jumped over the jumper` would have two echoes), Soundex (to find similar-sounding words near each other), or whatever else I need.
Finally, the echo detection has the rules for what is a detection. It basically counts how many identically entries (as determined by `compare`) in each `select`ed tag inside the `scope` within `within` entries. If that number is equal to or greater than `error`, then it marks that `select`ed element as an error.
In the above example, the echo detection says “for every word in a file, look at the twenty surrounding words. If there are five or more, it's an error, otherwise if there is two more then it's a warning”.
This isn't even remotely polished at this point. The project is self contained in that building it will generate the correct data, it just isn't… pretty.
dotnet run -v:q --no-build --no-restore --project src/AuthorIntrusion.Cli -- file-list -p "Examples/Sand and Blood"
Eventually, it should be something like:
aicli file lists
There is also a lot of missing functionality, using it probably requires me to understand, though I'd like to think it is pretty simple. Then again, I wrote it, of course it's simple.
You may have noticed that the `author-intrusion.aipy` file is somewhat complicated. This is actually intentional. There are a lot of tools for writers. I'm looking for something very specifically to help with flaws I'm aware of in my fiction, not a generic “one size fits all” application. Because of that, I need it to work for me instead of trying to inflict
This desire to customize to the author is purely inspired by James White's Fast Trip[11]. If you have a chance, consider reading it. It talks about modifying the environment the way you need to work, not the other way around.
11: http://www.sectorgeneral.com/shortstories/fasttrip.html
However, flexibility comes at a price: simplicity. I considered trying to make it easy, but I'm looking for something that looks for overuse of adverbs (technically correct) or gerunds. I want to be able to make sure a character only speaks in past tense or doesn't use a pronoun to identify themselves. One of the earlier ones I'm going to get done is looking for present tense outside of a quote. These are pie in the sky items, but I think possible.
Later, if this gains traction (e.g., users besides me), someone may come up with a fancy GUI or configuration wizard to add common settings.
Author Intrusion is currently being managed via its Gitlab project[12]. I'm not sure if it would be worthwhile for anyone to consider joining, but if you want to watch it, this would be the place.
12: https://gitlab.com/author-intrusion/author-intrusion-cil
If you have questions, please don't hesitate to poke me on any social network I'm on[13]. I always love to bounce ideas or talk about future place. The more I do, the more I can make it useful for everyone, not just myself.
Categories:
Tags:
Below are various useful links within this site and to related sites (not all have been converted over to Gemini).
https://d.moonfire.us/blog/2018/06/26/author-intrusion-0.9.0/