💾 Archived View for d.moonfire.us › blog › 2017 › 05 › 01 › author-intrusion captured on 2024-09-29 at 00:53:56. Gemini links have been rewritten to link to archived content
After a rough start to the previous week, I got a chance to really focus on Author Intrusion[1] (AI). Despite a rather significant number of check-ins and a lot of coding, I wasn't quite able to get to a really good “show off” point. Instead, this week ended up being a black triangle[2] (significant progress but nothing visible).
2: http://rampantgames.com/blog/?p=7745
I haven't really talked about AI for a while. The entire idea started in 2009 or so when I realized I had some major blind spots in my writing. At the time, it was an overuse of gerunds that was hanging over me, but as my skill improves, the problem areas also shift around. My writing group[3] has been a great help in tracking these down, but I felt a lot of them were things that could be detected ahead of time; basically, something that would let me file off the rough edges before someone spent the effort to correct them.
To my surprise, the current offering of grammar checkers doesn't actually look for the same thing. They seem to review things sentence-by-sentence but I wanted something that looked at the entire document and looked for trouble areas. For example, using the same word repeatedly in a few paragraphs (echo words).
At the same time, I'm a programmer. I use a wonderful program called ReSharper[4] which has a lot of refactoring and analysis tools. There are times when I want to see every time a character shows up (name outside of dialog) or is talked about. I also rename characters (frequently) and suffer from the occasional search-and-replace bug.
4: https://www.jetbrains.com/resharper/
That all led me to want to create an IDE for writers, basically a Visual Studio and ReSharper that was geared toward authors.
One of my major influences is a short story by James White called Fast Trip[5]. I love that story because it comes down to “change the environment to succeed.” That meant a couple of things for AI: it had to be flexible enough to work the way the author wants, and it needed to use the paradigm that I was comfortable with.
5: http://www.sectorgeneral.com/shortstories/fasttrip.html
I've worked on it off and on for about eight years. I've tried a lot of iterations, gotten to a point, and then hit some conceptual problem that threw everything into disarray. Some of them were understandings on my part (you can't reformat text while writing), some were getting caught working on the wrong thing (I should reinvent the word processor), and there were a ton of other dead ends.
This means that it isn't done until it is stable. So if you are not interested in alpha software, this might not be what you are looking for at this point.
The current design goal is to create a separate, independent program that can be called by text editors. Inspired by OmniSharp[6], I figured it was better to create a program that did one thing well (analyzing and refactoring novels) and then create hooks that can talk to other editors like Emacs and Atom[7].
I'm also borrowing from my own experiences with Grunt[8] and NPM[9]. Some of the earlier implementations of AI were self-contained with all the plugins installed there. However, as I've found while working on my JavaScript publishing framework[10], versioning is very important. I can't break the analysis of novels I wrote ten years ago just because the code has improved to handle the novels of today. NPM has a great way of handling that with the `package.json` file that lists specific packages and versions, which can be installed and used. AI is done the same way: the system uses specific versions of packages to ensure that it always produces the same data. If you upgrade the underlying packages, then you can deal with the changes and check in the results knowing it won't change tomorrow.
10: https://gitlab.com/mfgames-writing-js/
I've written AI in a number of languages (JavaScript, TypeScript, C#, Python) along with different UIs. I've gotten lost in a lot of them. At the moment, my skills in C# are considerably more advanced than the others, but more importantly, the tools (ReSharper) make me a lot more effective. I don't want this to be a learning experience at this point because I have a lot of novels I need to write and I have to choose between making fun tools (AI) and finishing books.
I don't want this to be an argument of language. I like C# and I'm good at it, so I'm using it.
It isn't there yet. I'm working toward an end-to-end run, which means setting up the files and being able to run a program (`aicli`) and have it produce an output that shows echo words (the first analysis I want to write and my current blind spot). What does work requires a relatively specific setup (not that bad, everything is contained inside the repository[11]) and a bit of hand-holding.
11: https://gitlab.com/author-intrusion/author-intrusion-cil/
This is still at the “black triangle” point: the technical pieces of the framework are in place, but I'm still combining them to produce something that looks cool.
All of the examples are based on a subset of Sand and Blood used for testing. It can be found here[12].
Like Grunt, there is a single file that controls how AI works with a project. I took inspiration from Visual Studio Code[13] and Atom in that every file isn't explicitly added to the project. Instead, AI assumes that all files underneath the directory containing the `author-intrusion.yaml` file are part of the project. If you add a chapter, the YAML file won't change but the system will pick it up.
13: https://code.visualstudio.com/
I used YAML because it is easy to use, doesn't require a lot of noise, and pretty much handles the relatively simplistic data. Also, unlike JSON, YAML can let you copy a section from one place to another which I consider to be pretty useful.
I am trying to avoid changing a lot of files whenever you add a chapter; that's just noise in most cases. Related to that, AI will create a `.ai` directory for its internal caches, which shouldn't be checked into source control at all.
Eventually, `aicli` will search for `author-intrusion.yaml`. That way, like Grunt, it can be called anywhere inside the directory tree and it will “do the right thing” to produce consistent results. This is the same thing `git` does among other CLI implementations.
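The upward search described above can be sketched in a few lines. This is only an illustration in Python (AI itself is written in C#), and the function name `find_project_root` is made up for the example; only the `author-intrusion.yaml` file name comes from the project.

```python
from pathlib import Path
from typing import Optional

# File name taken from the post; the search logic itself is an assumption.
PROJECT_FILE = "author-intrusion.yaml"


def find_project_root(start: Path) -> Optional[Path]:
    """Walk from `start` toward the filesystem root looking for the project file.

    Returns the first directory that contains PROJECT_FILE, or None if no
    ancestor has one. This mirrors how `git` locates its `.git` directory.
    """
    current = start.resolve()
    for directory in (current, *current.parents):
        if (directory / PROJECT_FILE).is_file():
            return directory
    return None
```

Calling `find_project_root(Path.cwd())` from anywhere inside a project tree would return the same directory, which is what makes the results consistent no matter where the command is run.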
The basic `author-intrusion.yaml` file is pretty simple:
    file:
      - plugin: AddClassFromPath
        match: chapters/*.markdown
        class: chapter
    data:
      - plugin: AddClassFromData
        select: file.chapter
        class: pov-{pointOfView}
    layout:
      - plugin: SplitLines
      - plugin: SplitParagraphs
      - plugin: OpenNlpSplitTokens
        select: para
      - plugin: WordTokenClassifier
    analysis:
      - plugin: Echoes
        select: file.chapter token.word
        scope: token.word:within(10)
        threshold: { error: 5, warning: 2 }
In the example file from the link above, there is a lot more in the file either for documentation purposes or just to work out some concept or idea.
There are four sections: `file`, `data`, `layout`, and `analysis`. These are various operations that are performed on each file to classify (`file` and `data`), organize (`layout`), and analyze (`analysis`) the files.
Most of the processes are based on the idea of a “plugin”. A plugin is a discrete class (typically in its own NuGet[14] package) that is versioned and specific to the current project. They are identified by the `plugin` property inside the list. A plugin can be used more than once.
    analysis:
      - plugin: Echoes
        id: Words Within 10
        select: file.chapter token.word
        scope: token.word:within(10)
        threshold: { error: 5, warning: 2 }
      - plugin: Echoes
        id: Sounds Alike
        select: file.chapter token.word
        scope: token.word:within(5)
        compare: :root:attr(soundex)
        threshold: { error: 5, warning: 2 }
In each case, the `plugin` property of each plugin identifies which plugin to use. This is usually the base class but it means I'll have to have a registry to list which plugins do what. The reason for doing this is because if someone else writes a plugin (extends AI), they can push it up to NuGet (about that later) and everyone can use it. It doesn't require a release of the main code (assuming `AuthorIntrusion.Contracts` remains the same).
The rest of the properties are based on that plugin. There is a bit of interesting complexity in making this work (i.e., it's a hack, but I wrote it), but everything is type safe when it comes to coding.
In the above examples, `select` and `scope` are based on CSS selectors. I couldn't find a library that did it generically, so I wrote one that just creates an abstract syntax tree of CSS selectors which is used by this library. Like most of the plugins, I'll break it into a separate project once AI gets stable.
The advantage of using CSS, such as `file.chapter token.word` or `:root`, is that many developers understand how CSS selectors work. Except for the specifics (pseudo-classes, for example), the way selectors chain together and combine is well known enough that I don't have to force someone to learn something new.
I'm setting up the layout of a file (using the `layout` plugins) to look like HTML. Those plugins create the tag/element structure that CSS uses.
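To make the selector idea concrete, here is a minimal sketch of turning a selector string into a small syntax tree. It is written in Python for brevity and is not AI's actual C# library; the `Compound` class and `parse_selector` function are hypothetical names. It only handles what the examples in this post use: a tag, `.class` names, `:pseudo` or `:pseudo(arg)` parts, and spaces as descendant combinators.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Compound:
    """One space-separated step of a selector, e.g. `token.word:within(10)`."""
    tag: Optional[str] = None
    classes: List[str] = field(default_factory=list)
    pseudos: List[Tuple[str, Optional[str]]] = field(default_factory=list)


# A part is an optional "." or ":" prefix, a name, and an optional "(argument)".
_PART = re.compile(r"(?P<kind>[.:]?)(?P<name>[A-Za-z][\w-]*)(?:\((?P<arg>[^)]*)\))?")


def parse_selector(selector: str) -> List[Compound]:
    """Parse a descendant-only selector into a list of Compound steps.

    Limitation of this sketch: arguments may not contain spaces.
    """
    steps = []
    for chunk in selector.split():
        step = Compound()
        pos = 0
        while pos < len(chunk):
            m = _PART.match(chunk, pos)
            if m is None:
                raise ValueError(f"cannot parse {chunk!r} at offset {pos}")
            kind, name, arg = m.group("kind", "name", "arg")
            if kind == ".":
                if arg is not None:
                    raise ValueError(f"class {name!r} cannot take an argument")
                step.classes.append(name)
            elif kind == ":":
                step.pseudos.append((name, arg))
            else:
                # A bare name is the tag; it must come first and take no argument.
                if pos != 0 or arg is not None:
                    raise ValueError(f"unexpected {name!r} in {chunk!r}")
                step.tag = name
            pos = m.end()
        steps.append(step)
    return steps
```

For example, `parse_selector("file.chapter token.word:within(10)")` yields two steps: a `file` element with class `chapter`, followed by a descendant `token` element with class `word` and the pseudo-class `within` carrying the argument `"10"`.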
Right now, the following are implemented:
There are also scoped pseudo-classes. These are used to get elements based on their relationship to another element. In every case so far, the `scope` property is based on each element found in the `select` of the plugin.
These scoped variables are used so we can control what we are looking for. For example, the echoes plugin may look to see if the same word has been used somewhere within ten words of the current one:
    analysis:
      - plugin: Echoes
        select: token.word
        scope: token.word:within(10)
It can also be used to make sure the same word isn't used at the beginning of the surrounding paragraphs:
    analysis:
      - plugin: Echoes
        select: para token.word:first-child
        scope: para:within(1) token.word:first-child

There is a third category used for comparisons (used by the Echoes plugin). This is where I hack the CSS system, but these basically let me define operations.
This lets us analyze for something more than just the text. For example, words that sound the same:
    analysis:
      - plugin: Echoes
        select: token.word
        scope: token.word:within(10)
        compare: :root:attr(soundex) # :attr() isn't done yet
Or ones that have the same base word:
    analysis:
      - plugin: Echoes
        select: token.word
        scope: token.word:within(10)
        compare: :root:attr(stem)

A third example is looking for too many sentences that start with the same pattern of parts of speech. This is to find runs like “bob did this”, “bob did that”, “mary did something”.

    layout:
      - plugin: ClassifyElementRange # Not done yet
        select: sent token.word:first-child
        scope: sent:parent token.word:self-or-after(2)
        tag: sent-leading
    analysis:
      - plugin: Echoes
        select: sent-leading
        scope: sent-leading:within(5)
        compare: :root:attr(pos)
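The `compare` examples above all boil down to comparing a derived key instead of the raw text. As an illustration of the “sounds alike” case, here is a sketch of the classic American Soundex code in Python. Treat it as an assumption: the post does not say which Soundex variant `attr(soundex)` would use, and the `soundex` function here is not AI's code.

```python
# Consonant groups of classic American Soundex.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the four-character Soundex key for `word` ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(c, "")  # vowels (and anything unmapped) have no code
        if code and code != previous:
            result += code
        previous = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

With a key function like this, two words echo when `soundex(a) == soundex(b)`, so “there” and “their” would count as the same word even though the text differs.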
Now, how all those topics are used. The first set of plugins are simply based on the structure of the project. Every file has one automatic element, the `file` which is the `html` of the file. We can add classes and identifiers to that to limit how the rest of the code works.
The biggest one is AddClassFromPath. You can use that to add the `chapter` class to files in the `chapters` directory. That way, the later selectors can use `file.chapter` or `.chapter` to limit processing (don't need to do grammar on your notes).
    file:
      - plugin: AddClassFromPath
        match: chapters/*.markdown
        class: chapter
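A rough sketch of what a path-based classifier like this might do, in Python rather than AI's C#. The matching rules are an assumption: here the pattern is anchored at the project root and `*` does not cross directory boundaries, which the post does not actually specify. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Dict, Iterable, Set


def path_matches(path: str, pattern: str) -> bool:
    """Match a project-relative path against a glob, one path segment at a time."""
    path_parts = PurePosixPath(path).parts
    pattern_parts = PurePosixPath(pattern).parts
    return len(path_parts) == len(pattern_parts) and all(
        fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts)
    )


def add_class_from_path(
    paths: Iterable[str], match: str, css_class: str
) -> Dict[str, Set[str]]:
    """Return a mapping of path -> classes placed on that file's `file` element."""
    return {
        path: ({css_class} if path_matches(path, match) else set())
        for path in paths
    }
```

Given `chapters/chapter-01.markdown` and `notes/ideas.markdown`, only the first would end up with the `chapter` class, so a later `file.chapter` selector skips the notes.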
I write using Markdown with a YAML header. My chapters look like this:
    ---
    availability: public
    when: 1471/3/28 MTR
    duration: 25 gm
    date: 2012-02-18
    title: Rutejìmo
    locations:
      primary:
        - Shimusogo Valley
    characters:
      primary:
        - Shimusogo Rutejìmo
      secondary:
        - Shimusogo Hyonèku
      referenced:
        - Funikogo Ganósho
        - Shimusogo Gemènyo
        - Shimusogo Chimípu
        - Shimusogo Yutsupazéso
    concepts:
      referenced:
        - The Wait in the Valleys
    purpose:
      - Introduce Rutejìmo
      - Introduce Hyonèku
      - Introduce naming conventions
      - Introduce formality rules
      - Introduce the basic rules of politeness
    summary: >
      Rutejìmo was on top of the clan's shrine roof trying to sneak in and
      steal his grandfather's ashes. It was a teenage game, but also one to
      prove that he was capable of becoming an adult. He ended up falling off
      the roof. The shrine guard, Hyonèku, caught him before he hurt himself.
      After a few humiliating comments, he gave Rutejìmo a choice: tell the
      clan elder or tell his grandmother. Neither choice was good, but
      Rutejìmo decided to tell his grandmother.
    ---

    > When a child is waiting to become an adult, they are subtly encouraged to prove themselves ready for the rites of passage. In public, however, they are to remain patient and respectful.
    > --- Funikogo Ganóshyo, *The Wait in the Valleys*

    Rutejìmo's heart slammed against his ribs as he held himself still. The cool desert wind blew across his face, teasing his short, dark hair. In the night, his brown skin was lost to the shadows, but he would be exposed if anyone shone a lantern toward the top of the small building. Fortunately, the shrine house was at the southern end of the Shimusogo Valley, the clan's ancestral home, and very few of the clan went there except for meetings and prayers.
Given that, I would use a plugin to let me add identifiers and classes to files based on the YAML header (the bit between the `---` lines). For example, the following would add a `pov-ShimusogoRutejìmo` class to the file.
    data:
      - plugin: AddClassFromData # Not done
        select: file.chapter
        class: pov-{characters.primary}
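Since this plugin is not written yet, the following is only a guess at how a template like `pov-{characters.primary}` could be expanded. It is a Python sketch with hypothetical function names, and the `header` dictionary in the usage note is a hand-written stand-in for the parsed YAML header. Two assumptions are baked in: a list value produces one class per entry, and whitespace is stripped so the result is a usable class name.

```python
import re
from typing import Any, Dict, List


def lookup(data: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as `characters.primary` through nested dicts."""
    value: Any = data
    for key in dotted.split("."):
        value = value[key]
    return value


def expand_class_template(template: str, data: Dict[str, Any]) -> List[str]:
    """Expand the first `{dotted.path}` placeholder in `template`.

    Assumptions of this sketch: list values yield one class each, and
    whitespace inside a value is removed.
    """
    m = re.search(r"\{([^{}]+)\}", template)
    if m is None:
        return [template]
    value = lookup(data, m.group(1))
    values = value if isinstance(value, list) else [value]
    return [
        template[: m.start()] + re.sub(r"\s+", "", str(v)) + template[m.end():]
        for v in values
    ]
```

With `header = {"characters": {"primary": ["Shimusogo Rutejìmo"]}}`, calling `expand_class_template("pov-{characters.primary}", header)` gives a single class, `pov-ShimusogoRutejìmo`, matching the example above.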
There will be ways of manipulating it, but basically it lets you tag the chapter with “scene” or “sequel” if you follow the Techniques of the Selling Writer[15] or want to identify if a chapter is combat or talking. It doesn't matter how you want to tag it, you can use any element in the data to filter or make decisions later. This will also be used for the querying to let you say “what chapters are from Rutejìmo's point of view” or “what scenes happen at night”. Eventually, I'll tie it into my culture library so you can also say “show me the chapters in chronological order”.
The bulk of my effort in the last two weeks has been in the layout. This is what carves up the contents of the file into something that can be selected via the CSS system. The default is only to have the `file` element which contains everything.
    layout:
      - plugin: SplitLines
        tag: line # implied

The above splits everything into `<line>` elements. We need that for reporting line number errors. Below, we have a plugin that splits things into paragraphs based on Markdown rules (a blank line separates a paragraph). Using the above examples, this is what lets us use the `para .word:first-child` for selectors.

    layout:
      - plugin: SplitParagraphs
        tag: para # implied
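The two splits above can be sketched together: walk the lines, group runs of non-blank lines into paragraphs, and remember the line numbers so later errors can point somewhere. This is a Python illustration with hypothetical names (`Paragraph`, `split_paragraphs`), not the plugins themselves, and it ignores Markdown details such as the YAML header and blockquotes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    first_line: int  # 1-based line number where the paragraph starts
    last_line: int   # 1-based line number where the paragraph ends
    text: str


def split_paragraphs(text: str) -> List[Paragraph]:
    """Group consecutive non-blank lines into paragraphs, keeping line numbers."""
    paragraphs: List[Paragraph] = []
    buffer: List[str] = []
    start = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not buffer:
                start = number
            buffer.append(line)
        elif buffer:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(Paragraph(start, number - 1, "\n".join(buffer)))
            buffer = []
    if buffer:
        paragraphs.append(Paragraph(start, start + len(buffer) - 1, "\n".join(buffer)))
    return paragraphs
```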
It gets complicated when we start adding tokens. A token is a word or punctuation mark. I'm using OpenNLP[16] to break apart the words at the moment. This splits up the contents of the paragraph (because of the `select: para`) into tokens.
16: https://opennlp.apache.org/
    layout:
      - plugin: OpenNlpSplitTokens
        select: para
Once I have tokens, I can add the `word` class to the actual words.
    layout:
      - plugin: WordTokenClassifier
        class: word # implied

Eventually, there will be an `OpenNlpSplitSentences` plugin where you would split the paragraph into sentences and then split the sentences into tokens. That isn't done but we don't need it for the base. I will also eventually create a `ParagraphSentenceWord` plugin that does most of this and makes it easier to add it as a single line.
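To show the shape of these two steps without pulling in OpenNLP, here is a Python sketch that uses a simple regular expression as a stand-in tokenizer and then tags the word tokens. The regex is far cruder than a real NLP tokenizer, and `Token`, `split_tokens`, and `classify_words` are names invented for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Token:
    text: str
    classes: Set[str] = field(default_factory=set)


# A word (optionally with inner apostrophes) or a single punctuation character.
_TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def split_tokens(paragraph: str) -> List[Token]:
    """Stand-in for the tokenizer step: split a paragraph into tokens."""
    return [Token(m.group()) for m in _TOKEN.finditer(paragraph)]


def classify_words(tokens: List[Token], css_class: str = "word") -> List[Token]:
    """Add `css_class` to every token that starts with a letter or digit."""
    for token in tokens:
        if token.text[0].isalnum():
            token.classes.add(css_class)
    return tokens
```

After both steps, a selector such as `token.word` would pick out only the words and skip the punctuation tokens.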
The end result is that we have an abstract tree that represents the file. Eventually there may be a lot more such as dialog identification, blockquotes and epigraphs, and whatever else makes sense.
The main reason for the layout plugins is to make the selectors work for the analysis. It is also a complex bit of code that has to run in order unlike analysis which can be multi-threaded for a given file.
Finally, the last bit. I'm not done with anything else, but the analysis plugin is the bit that finds errors and warnings. It is the part that “does” something. In other words, the entire reason I'm writing this.

    analysis:
      - plugin: Echoes
        select: file.chapter token.word
        scope: token.word:within(10)
        threshold: { error: 5, warning: 2 }

I'm working on this next, but I hope to have `aicli check` produce an error like `gcc` or a compiler that gives a specific file and line number (via the `SplitLines` plugin) that lists the errors. It will be the same thing Atom uses to highlight problems.
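For reference, the common compiler convention is `file:line:column: severity: message`, which most editors can already parse. The sketch below formats a diagnostic that way in Python. The `Diagnostic` class is hypothetical and the actual `aicli check` output is not defined yet, so this only illustrates the convention and is not a specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    path: str      # project-relative file path
    line: int      # 1-based line number
    column: int    # 1-based column number
    severity: str  # "warning" or "error"
    message: str

    def format(self) -> str:
        """Render in the gcc-style layout that editors commonly parse."""
        return f"{self.path}:{self.line}:{self.column}: {self.severity}: {self.message}"
```

A diagnostic for line 12, column 5 of `chapters/chapter-01.markdown` would render as `chapters/chapter-01.markdown:12:5: warning: ...`, with whatever message the analysis plugin supplies.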
Since analysis plugins only add/remove errors, they don't change the structure. This means they can use all the CPUs using worker processes to make it as efficient as possible. I'm also going to eventually have some optimizations put in (most elements are hashed so they can be cached).
This last week, I've gotten the following:
The goal is to get a simple echo words analysis done, basically looking for duplicates within ten words.
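A minimal sketch of that echo check, in Python for brevity. It is not the Echoes plugin: the `find_echoes` function is hypothetical, and the way the thresholds are applied is an assumption (the count is the number of other matching words inside the window, and a count at or above a threshold triggers it). The `key` parameter plays the role of `compare`, defaulting to a case-insensitive text match.

```python
from typing import Callable, List, NamedTuple


class Echo(NamedTuple):
    index: int     # position of the word in the token list
    word: str
    count: int     # other matching words inside the window
    severity: str  # "warning" or "error"


def find_echoes(
    words: List[str],
    within: int = 10,
    warning: int = 2,
    error: int = 5,
    key: Callable[[str], str] = str.lower,
) -> List[Echo]:
    """Flag words that repeat too often within `within` words on either side."""
    keys = [key(w) for w in words]
    echoes: List[Echo] = []
    for i, current in enumerate(keys):
        lo = max(0, i - within)
        hi = min(len(keys), i + within + 1)
        count = sum(1 for j in range(lo, hi) if j != i and keys[j] == current)
        if count >= error:
            severity = "error"
        elif count >= warning:
            severity = "warning"
        else:
            continue
        echoes.append(Echo(i, words[i], count, severity))
    return echoes
```

This brute-force version rescans the window for every word, which is fine for a chapter-sized file; a real implementation could keep running counts instead.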
I really want to get this to the point it can be used. I think it will be beneficial even with the basics for finding problem spots with my personal projects and can benefit others. I also think that once I get the end-to-end, additional functionality will be easily added for other needs.
Development is currently on the drem-0.0.0[17] branch on GitLab. I'm adding issues for secondary items and tagging them with expected complexity. While I don't expect anyone to contribute, if someone does get inspired, the `simple` and `trivial` items are good starting points.
17: https://gitlab.com/author-intrusion/author-intrusion-cil
I plan on adding notes for contributing “soon”.
So far, everything is MIT licensed, but other licenses are already involved. OpenNLP is Apache licensed, so I'll need to figure out how that interacts with MIT. Once I split the packages into individual files, it will probably be easier, but for the time being, life is faster if I keep it as a single repository.