Archived View for tilde.pink > ~ssb22 > annogen.gmi captured on 2023-09-08 at 16:57:06. Gemini links have been rewritten to link to archived content
Annotator Generator
Annotator Generator is an examples-driven generator of fast text annotators. "Annotate" in this context means to add pronunciation or other information to each word, and/or to split text into words in a language that does not use spaces.
- You supply a corpus of pre-annotated texts for Annotator Generator to work out the rules and exceptions
- Annotator Generator creates table-driven code in C, Java, Javascript, Dart, or Python (compatible with both Python 2 and 3)
- The resulting program should be able to annotate any text that contains words or phrases similar to those found in the examples
- It can output the annotations alone or it can combine them with the original text using HTML Ruby markup or simple braces
- If anything is unclear (didn't happen in the examples, or there's not enough context to figure out which example should be applied) then the program will leave it unannotated so you can pass it to a backup annotation program if you have one.
- If you have no backup annotator then try setting the -y option, which makes Annotator Generator try harder to find context-independent rules with context-dependent exceptions, so as to annotate as much text as possible.
- Generated annotators can act as filters for Web Adjuster; options are also provided for generating Android apps, browser extensions, and clipboard annotators for Windows and Windows Mobile, or you could format the annotations on a Unix terminal
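As a rough illustration of what "pre-annotated" examples might look like, the sketch below wraps (word, annotation) pairs in the default markup strings documented under --mstart, --mmid and --mend, and then parses them back. The helper names (mark_up, parse_markup) and the two sample words are made up for illustration and are not part of annogen.py; a real corpus would be far larger.

```python
import re

# Default markup strings, as documented for --mstart, --mmid and --mend
MSTART = "<ruby><rb>"
MMID = "</rb><rt>"
MEND = "</rt></ruby>"

def mark_up(pairs):
    """Join (word, annotation) pairs into one marked-up example string."""
    return "".join(MSTART + word + MMID + annot + MEND for word, annot in pairs)

def parse_markup(text):
    """Recover the (word, annotation) pairs from a marked-up string."""
    pattern = (re.escape(MSTART) + "(.*?)" + re.escape(MMID)
               + "(.*?)" + re.escape(MEND))
    return re.findall(pattern, text, flags=re.DOTALL)

# Hypothetical two-word example (Chinese with pinyin), not from any real corpus
example_pairs = [("你好", "nǐ hǎo"), ("世界", "shìjiè")]
corpus_line = mark_up(example_pairs)
print(corpus_line)
```

If the corpus uses different delimiters, the three constants would need to match whatever is passed to --mstart, --mmid and --mend.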
Legal considerations
Annotator code will contain individual words and some phrases from the original corpus (and these can be read even by people who do *not* have the unannotated version); with regard to copyright law, I expect the annotator code will count as an "index" to the collection, the copyright of which exists separately to that of the original collection, but laws do vary by country and I am not a solicitor so please act judiciously.
Legally obtaining that original annotated corpus is up to you. If you are in the UK the government says non-commercial text mining is allowed (terms of use prohibiting *non-commercial* mining are unenforceable), provided you:
1. respect network stability (i.e. wait a long time between each download),
2. connect directly to the publisher (this law bypasses the *publisher's* terms of use, not those of third-party search engines like Google),
3. use the result *only* for mining, not for republishing the original text (so you can't publish your unprocessed crawl dumps either),
4. and still respect any prohibitions against sharing whatever mining tools you made for the site (as this law is only about text mining, not about the sharing of tools).
Laws outside the UK are different (and I'm not a lawyer) so check carefully. If, however, the website's terms don't actually prohibit writing an *unpublished* scraper for non-commercial mining purposes, perhaps you won't need a legal exception. You should still respect their bandwidth and do it slowly, both for moral reasons (it's the right thing to do) and pragmatic ones (you won't want their sysadmins and service providers taking action against you).
Download and Usage
Download annogen.py; you will need Python (compatible with both Python 2.7 and Python 3+) and you will need a command prompt.
annogen.py
Version 3.3599
Usage: annogen.py [options]
Options:
- -h, --help show this help message and exit
- --infile= Filename of a text file (or a compressed .gz, .bz2 or .xz file or URL) to read the input examples from. If this is not specified, standard input is used.
- --incode= Character encoding of the input file (default utf-8)
- --mstart= The string that starts a piece of text with annotation markup in the input examples; default <ruby><rb>
- --mmid= The string that occurs in the middle of a piece of markup in the input examples, with the word on its left and the added markup on its right (or the other way around if mreverse is set); default </rb><rt>
- --mend= The string that ends a piece of annotation markup in the input examples; default </rt></ruby>
- -r, --mreverse Specifies that the annotation markup is reversed, so the text before mmid is the annotation and the text after it is the base text
- --no-mreverse Cancels any earlier --mreverse option in Makefile variables etc
- --end-pri= Treat words that occur in the examples before this delimiter as having "high priority" for Yarowsky-like seed collocations (if these are in use). Normally the Yarowsky-like logic tries to identify a "default" annotation based on what is most common in the examples, with the exceptions indicated by collocations. If however a word is found in a high-priority section at the start, then the first annotation found there will be taken as the ideal "default" even if it's in a minority in the examples; everything else will be taken as an exception.
- -s, --spaces Set this if you are working with a language that uses whitespace in its non-markedup version (not fully tested). The default is to assume that there will not be any whitespace in the language, which is correct for Chinese and Japanese.
- --no-spaces Cancels any earlier --spaces option in Makefile variables etc
- -c, --capitalisation Don't try to normalise capitalisation in the input. Normally, to simplify the rules, the analyser will try to remove start-of-sentence capitals in annotations, so that the only remaining words with capital letters are the ones that are always capitalised such as names. (That's not perfect: some words might always be capitalised just because they never occur mid-sentence in the examples.) If this option is used, the analyser will instead try to "learn" how to predict the capitalisation of all words (including start of sentence words) from their contexts.
- --no-capitalisation Cancels any earlier --capitalisation option in Makefile variables etc
- -w, --annot-whitespace Don't try to normalise the use of whitespace and hyphenation in the example annotations. Normally the analyser will try to do this, to reduce the risk of missing possible rules due to minor typographical variations.
- --no-annot-whitespace Cancels any earlier --annot-whitespace option in Makefile variables etc
- --keep-whitespace= Comma-separated list of words (without annotation markup) for which whitespace and hyphenation should always be kept even without the --annot-whitespace option. Use when you know the variation is legitimate. This option expects words to be encoded using the system locale (UTF-8 if it cannot be detected).
- --suffix= Comma-separated list of annotations that can be considered optional suffixes for normalisation
- --suffix-minlen= Minimum length of word (in Unicode characters) to apply suffix normalisation
- --post-normalise= Filename of an optional Python module defining a dictionary called "table" mapping integers to integers for arbitrary single-character normalisation on the Unicode BMP. This can reduce the size of the annotator. It is applied in post-processing (does not affect rules generation itself). For example this can be used to merge the recognition of Full, Simplified and Variant forms of the same Chinese character in cases where this can be done without ambiguity, if it is acceptable for the generated annotator to recognise mixed-script words should they occur. If any word in the examples has a different annotation when normalised than not, the normalised version takes precedence.
- --glossfile= Filename of an optional text file (or compressed .gz, .bz2 or .xz file or URL) to read auxiliary "gloss" information. Each line of this should be of the form: word (tab) annotation (tab) gloss. Extra tabs in the gloss will be converted to newlines (useful if you want to quote multiple dictionaries). When the compiled annotator generates ruby markup, it will add the gloss string as a popup title whenever that word is used with that annotation (before any reannotator option is applied). The annotation field may be left blank to indicate that the gloss will appear for all other annotations of that word. The entries in glossfile do not affect the annotation process itself, so it's not necessary to completely debug glossfile's word segmentation etc.
- -C, --gloss-closure= If any Chinese, Japanese or Korean word is missing from glossfile, search its closure of variant characters also, using the Unihan variants file specified by this option
- --no-gloss-closure Cancels any earlier --gloss-closure option in Makefile variables etc
- -M, --glossmiss-omit Omit rules containing any word not mentioned in glossfile. Might be useful if you want to train on a text that uses proprietary terms and don't want to accidentally "leak" those terms (assuming they're not accidentally included in glossfile also). Words may also be listed in glossfile with an empty gloss field to indicate that no gloss is available but rules using this word needn't be omitted.
- --no-glossmiss-omit Cancels any earlier --glossmiss-omit option in Makefile variables etc
- --words-omit= File (or compressed .gz, .bz2 or .xz file or URL) containing words (one per line, without markup) to omit from the annotator. Use this to make an annotator smaller if, for example, you're working from a rules file that contains long lists of place names you don't need this particular annotator to recognise but you still want to keep them as rules for other annotators, but be careful because any word on such a list gets omitted even if it also has other meanings (some place names are also normal words).
- --manualrules= Filename of an optional text file (or compressed .gz, .bz2 or .xz file or URL) to read extra, manually-written rules. Each line of this should be a marked-up phrase (in the input format) which is to be unconditionally added as a rule. Use this sparingly, because these rules are not taken into account when generating the others and they will be applied regardless of context (although a manual rule might fail to activate if the annotator is part-way through processing a different rule); try checking messages from --diagnose-manual.
- --c-filename= Where to write the C, C#, Python, Javascript, Go or Dart program. Defaults to standard output, or annotator.c in the system temporary directory if standard output seems to be the terminal (the program might be large, especially if Yarowsky-like indicators are not used, so it's best not to use a server home directory where you might have limited quota).
- --c-compiler= The C compiler to run if generating C and standard output is not connected to a pipe. The default is to use the "cc" command which usually redirects to your "normal" compiler. You can add options (remembering to enclose this whole parameter in quotes if it contains spaces), but if the C program is large then adding optimisation options may make the compile take a long time. If standard output is connected to a pipe, then this option is ignored because the C code will simply be written to the pipe. You can also set this option to an empty string to skip compilation. Default: cc -o annotator
- --outcode= Character encoding to use in the generated parser (default utf-8, must be ASCII-compatible i.e. not utf-16)
- --rulesFile= Filename of a JSON file to hold the accumulated rules. Adding .gz, .bz2 or .xz for compression is acceptable. If this is set then either --write-rules or --read-rules must be specified.
- --write-rules Write rulesFile instead of generating a parser. You will then need to rerun with --read-rules later.
- --no-write-rules Cancels any earlier --write-rules option in Makefile variables etc
- --read-rules Read rulesFile from a previous run, and apply the output options to it. You should still specify the input formatting options (which should not change), and any glossfile or manualrules options (which may change), but no input is required.
- --no-read-rules Cancels any earlier --read-rules option in Makefile variables etc
- -E, --newlines-reset Have the annotator reset its state on every newline byte. By default newlines do not affect state such as whether a space is required before the next word, so that if the annotator is used with Web Adjuster's htmlText option (which defaults to using newline separators) the spacing should be handled sensibly when there is HTML markup in mid-sentence.
- --no-newlines-reset Cancels any earlier --newlines-reset option in Makefile variables etc
- -z, --compress Compress annotation strings in the C code. This compression is designed for fast on-the-fly decoding, so it saves only a limited amount of space (typically 10-20%) but might help if RAM is short.
- --no-compress Cancels any earlier --compress option in Makefile variables etc
- -Z, --zlib Compress the embedded data table using zlib (or pyzopfli if available), and include code to call zlib to decompress it on load. Useful if the runtime machine has the zlib library and you need to save disk space but not RAM (the decompressed table is stored separately in RAM, unlike --compress which, although giving less compression, at least works "in place"). Once --zlib is in use, specifying --compress too will typically give an additional disk space saving of less than 1% (and a runtime RAM saving that's greater but more than offset by zlib's extraction RAM). If generating a Javascript annotator with zlib, the decompression code is inlined so there's no runtime zlib dependency, but startup can be ~50% slower so this option is not recommended in situations where the annotator is frequently reloaded from source (unless you're running on Node.js in which case loading is faster due to the use of Node's "Buffer" class).
- --no-zlib Cancels any earlier --zlib option in Makefile variables etc
- -l, --library Instead of generating C code that reads and writes standard input/output, generate a C library suitable for loading into Python via ctypes. This can be used for example to preload a filter into Web Adjuster to cut process-startup delays.
- --no-library Cancels any earlier --library option in Makefile variables etc
- -W, --windows-clipboard Include C code to read the clipboard on Windows or Windows Mobile and to write an annotated HTML file and launch a browser, instead of using the default cross-platform command-line C wrapper. See the start of the generated C file for instructions on how to compile for Windows or Windows Mobile.
- --no-windows-clipboard Cancels any earlier --windows-clipboard option in Makefile variables etc
- --java= Instead of generating C code, generate Java, and place the *.java files in the directory specified by this option. The last part of the directory should be made up of the package name; a double slash (//) should separate the rest of the path from the package name, e.g. --java=/path/to/wherever//org/example/annotator and the main class will be called Annotator.
- --android= URL for an Android app to browse (--java must be set). If this is set, code is generated for an Android app which starts a browser with that URL as the start page, and annotates the text on every page it loads. Use file:///android_asset/index.html for local HTML files in the assets directory; a clipboard viewer is placed in clipboard.html, and the app will also be able to handle shared text. If certain environment variables are set, this option can also compile and sign the app using Android SDK command-line tools (otherwise it puts a message on stderr explaining what needs to be set)
- --android-template= File to use as a template for Android start HTML. This option implies --android=file:///android_asset/index.html and generates that index.html from the file specified (or from a built-in default if the special filename "blank" is used). The template file may include URL_BOX_GOES_HERE to show a URL entry box and related items (offline-clipboard link etc) in the page, in which case you can optionally define a Javascript function "annotUrlTrans" to pre-convert some URLs from shortcuts etc; also enables better zoom controls on Android 4+, a mode selector if you use --annotation-names, a selection scope control on recent-enough WebKit, and a visible version stamp (which, if the device is in "developer mode", you may double-tap on to show missing glosses). VERSION_GOES_HERE may also be included if you want to put it somewhere other than at the bottom of the page. If you do include URL_BOX_GOES_HERE you'll have an annotating Web browser app that allows the user to navigate to arbitrary URLs: as of 2020, this is acceptable on Google Play and Huawei AppGallery (non-China only from 2022), but not Amazon AppStore as they don't want "competition" to their Silk browser.
- -L, --pleco-hanping In the Android app, make popup definitions link to Pleco or Hanping if installed
- --no-pleco-hanping Cancels any earlier --pleco-hanping option in Makefile variables etc
- --bookmarks= Android bookmarks: comma-separated list of package names that share our bookmarks. If this is not specified, the browser will not be given a bookmarks function. If it is set to the same value as the package specified in --java, bookmarks are kept in just this Android app. If it is set to a comma-separated list of packages that have also been generated by annogen (presumably with different annotation types), and if each one has the same android:sharedUserId attribute in AndroidManifest.xml's "manifest" tag (you'll need to add this manually), and if the same certificate is used to sign all of them, then bookmarks can be shared across the set of browser apps. But beware the following two issues: (1) adding an android:sharedUserId attribute to an app that has already been released without one causes some devices to refuse the update with a "cannot install" message (details via adb logcat; affected users would need to uninstall and reinstall instead of update, and some of them may not notice the instruction to do so); (2) this has not been tested with Google's new "App Bundle" arrangement, and may be broken if the Bundle results in APKs being signed by a different key. In June 2019 Play Console started issuing warnings if you release an APK instead of a Bundle, even though the "size savings" they mention are under 1% for annogen-generated apps.
- -e, --epub When generating an Android browser, make it also respond to requests to open EPUB files. This results in an app that requests the "read external storage" permission on Android versions below 6, so if you have already released a version without EPUB support then devices running Android 5.x or below will not auto-update past this change until the user notices the update notification and approves the extra permission.
- --no-epub Cancels any earlier --epub option in Makefile variables etc
- --android-print When generating an Android browser, include code to provide a Print option (usually print to PDF) and a simple highlight-selection option. The Print option will require Android 4.4, but the app should still run without it on earlier versions of Android.
- --no-android-print Cancels any earlier --android-print option in Makefile variables etc
- --known-characters= When generating an Android browser, include an option to leave the most frequent characters unannotated as "known". This option should be set to the filename of a UTF-8 file of characters separated by newlines, assumed to be most frequent first, with characters on the same line being variants of each other (see --freq-count for one way to generate it). Words consisting entirely of characters found in the first N lines of this file (where N is settable by the user) will be unannotated until tapped on.
- --freq-count= Name of a file to write that is suitable for the known-characters option, taken from the input examples (which should be representative of typical use). Any post-normalise table provided will be used to determine which characters are equivalent.
- --android-audio= When generating an Android browser, include an option to convert the selection to audio using this URL as a prefix, e.g. https://example.org/speak.cgi?text= (use for languages not likely to be supported by the device itself). Optionally follow the URL with a space (quote carefully) and a maximum number of words to read in each user request. Setting a limit is recommended, or somebody somewhere will likely try "Select All" on a whole book or something and create load problems. You should set a limit server-side too of course.
- --extra-js= Extra Javascript to inject into sites to fix things in the Android browser app. The snippet will be run before each scan for new text to annotate. You may also specify a file to read: --extra-js=@file.js or --extra-js=@file1.js,file2.js (do not use // comments in these files, only /* ... */ because newlines will be replaced), and you can create variants of the files by adding search-replace strings: --extra-js=@file1.js:search:replace,file2.js
- --tts-js Make Android 5+ multilingual Text-To-Speech functions available to extra-js scripts (see TTSInfo code for details)
- --no-tts-js Cancels any earlier --tts-js option in Makefile variables etc
- --existing-ruby-js-fixes= Extra Javascript to run in the Android browser app whenever existing RUBY elements are encountered; the DOM node above these elements will be in the variable n, which your code can manipulate or replace to fix known problems with sites' existing ruby (such as common two-syllable words being split when they shouldn't be). Use with caution. You may also specify a file to read: --existing-ruby-js-fixes=@file.js
- --existing-ruby-lang-regex= Set the Android app or browser extension to remove existing ruby elements unless the document language matches this regular expression. If --sharp-multi is in use, you can separate multiple regexes with comma and any unset will always delete existing ruby. If this option is not set at all then existing ruby is always kept.
- --existing-ruby-shortcut-yarowsky Set the Android browser app to "shortcut" Yarowsky-like collocation decisions when adding glosses to existing ruby over 2 or more characters, so that words normally requiring context to be found are more likely to be found without context (this may be needed because adding glosses to existing ruby is done without regard to context)
- --extra-css= Extra CSS to inject into sites to fix things in the Android browser app. You may also specify a file to read: --extra-css=@file.css
- --app-name= User-visible name of the Android app
- --compile-only Assume the code has already been generated by a previous run, and just run the compiler
- --no-compile-only Cancels any earlier --compile-only option in Makefile variables etc
- -j, --javascript Instead of generating C code, generate JavaScript. This might be useful if you want to run an annotator on a device that has a JS interpreter but doesn't let you run your own binaries. The JS will be table-driven to make it load faster. See comments at the start for usage.
- --no-javascript Cancels any earlier --javascript option in Makefile variables etc
- -6, --js-6bit When generating a Javascript annotator, use a 6-bit format for many addresses to reduce escape codes in the data string by making more of it ASCII
- --no-js-6bit Cancels any earlier --js-6bit option in Makefile variables etc
- -8, --js-octal When generating a Javascript annotator, use octal instead of hexadecimal codes in the data string when doing so would save space. This does not comply with ECMAScript 5 and may give errors in its strict mode.
- --no-js-octal Cancels any earlier --js-octal option in Makefile variables etc
- -9, --ignore-ie8 When generating a Javascript annotator, do not make it backward-compatible with Microsoft Internet Explorer 8 and below. This may save a few bytes.
- --no-ignore-ie8 Cancels any earlier --ignore-ie8 option in Makefile variables etc
- -u, --js-utf8 When generating a Javascript annotator, assume the script can use UTF-8 encoding directly and not via escape sequences. In some browsers this might work only on UTF-8 websites, and/or if your annotation can be expressed without the use of Unicode combining characters.
- --no-js-utf8 Cancels any earlier --js-utf8 option in Makefile variables etc
- --browser-extension= Name of a Chrome or Firefox browser extension to generate. The extension will be placed in a directory of the same name (without spaces), which may optionally already exist and contain icons like 32.png and 48.png to be used.
- --browser-extension-description= Description field to use when generating browser extensions
- --manifest-v3 Use Manifest v3 instead of Manifest v2 when generating browser extensions (tested on Chrome only, and requires Chrome 88 or higher). This will be required for all Chrome Web Store uploads starting in June 2023.
- --dart Instead of generating C code, generate Dart. This might be useful if you want to run an annotator in a Flutter application.
- --no-dart Cancels any earlier --dart option in Makefile variables etc
- --dart-datafile= When generating Dart code, put annotator data into a separate file and open it using this pathname. Not compatible with Dart's "Web app" option, but might save space in a Flutter app (especially along with --zlib)
- -Y, --python Instead of generating C code, generate a Python module. Similar to the Javascript option, this is for when you can't run your own binaries, and it is table-driven for fast loading.
- --no-python Cancels any earlier --python option in Makefile variables etc
- --reannotator= Shell command through which to pipe each word of the original text to obtain new annotation for that word. This might be useful as a quick way of generating a new annotator (e.g. for a different topolect) while keeping the information about word separation and/or glosses from the previous annotator, but it is limited to commands that don't need to look beyond the boundaries of each word. If the command is prefixed by a # character, it will be given the word's existing annotation instead of its original text, and if prefixed by ## it will be given text#annotation. The command should treat each line of its input independently, and both its input and its output should be in the encoding specified by --outcode.
- -A, --reannotate-caps When using --reannotator, make sure to capitalise any word it returns that began with a capital on input
- --no-reannotate-caps Cancels any earlier --reannotate-caps option in Makefile variables etc
- --sharp-multi Assume annotation (or reannotator output) contains multiple alternatives separated by # (e.g. pinyin#Yale) and include code to select one by number at runtime (starting from 0). This is to save on total space when shipping multiple annotators that share the same word grouping and gloss data, differing only in the transcription of each word.
- --no-sharp-multi Cancels any earlier --sharp-multi option in Makefile variables etc
- --annotation-names= Comma-separated list of annotation types supplied to sharp-multi (e.g. Pinyin,Yale), if you want the Android app etc to be able to name them. You can also set just one annotation name here if you are not using sharp-multi.
- --annotation-map= Comma-separated list of annotation-number overrides for sharp-multi, e.g. 7=3 to take the 3rd item if a 7th is selected
- --annotation-postprocess= Extra code for post-processing specific annotNo selections after retrieving from a sharp-multi list (@file is allowed)
- -o, --allow-overlaps Normally, the analyser avoids generating rules that could overlap with each other in a way that would leave the program not knowing which one to apply. If a short rule would cause overlaps, the analyser will prefer to generate a longer rule that uses more context, and if even the entire phrase cannot be made into a rule without causing overlaps then the analyser will give up on trying to cover that phrase. This option allows the analyser to generate rules that could overlap, as long as none of the overlaps would cause actual problems in the example phrases. Thus more of the examples can be covered, at the expense of a higher risk of ambiguity problems when applying the rules to other texts. See also the -y option.
- --no-allow-overlaps Cancels any earlier --allow-overlaps option in Makefile variables etc
- -y, --ybytes= Look for candidate Yarowsky seed-collocations within this number of bytes of the end of a word. If this is set then overlaps and rule conflicts will be allowed when seed collocations can be used to distinguish between them, and the analysis is likely to be faster. Markup examples that are completely separate (e.g. sentences from different sources) must have at least this number of (non-whitespace) bytes between them.
- --ybytes-max= Extend the Yarowsky seed-collocation search to check over larger ranges up to this maximum. If this is set then several ranges will be checked in an attempt to determine the best one for each word, but see also ymax-threshold and ymax-limitwords.
- --ymax-threshold= Limits the length of word that receives the narrower-range Yarowsky search when ybytes-max is in use. For words longer than this, the search will go directly to ybytes-max. This is for languages where the likelihood of a word's annotation being influenced by its immediate neighbours more than its distant collocations increases for shorter words, and less is to be gained by comparing different ranges when processing longer words. Setting this to 0 means no limit, i.e. the full range will be explored on all Yarowsky checks.
- --ymax-limitwords= Comma-separated list of words (without annotation markup) for which the ybytes expansion loop should run at most two iterations. This may be useful to reduce compile times for very common ambiguous words that depend only on their immediate neighbours. Annogen may suggest words for this option if it finds they take inordinate time to process.
- --ybytes-step= The increment value for the loop between ybytes and ybytes-max
- -k, --warn-yarowsky Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples
- --no-warn-yarowsky Cancels any earlier --warn-yarowsky option in Makefile variables etc
- -K, --yarowsky-all Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)
- --no-yarowsky-all Cancels any earlier --yarowsky-all option in Makefile variables etc
- --yarowsky-multiword Check potential multiword rules for Yarowsky seed collocations also. Without this option (default), only single-word rules are checked.
- --no-yarowsky-multiword Cancels any earlier --yarowsky-multiword option in Makefile variables etc
- --yarowsky-thorough Recheck Yarowsky seed collocations when checking if any multiword rule would be needed to reproduce the examples. This could risk "overfitting" the example set.
- --no-yarowsky-thorough Cancels any earlier --yarowsky-thorough option in Makefile variables etc
- --yarowsky-half-thorough Like --yarowsky-thorough but check only what collocations occur within the proposed new rule (not around it), less likely to overfit
- --no-yarowsky-half-thorough Cancels any earlier --yarowsky-half-thorough option in Makefile variables etc
- --yarowsky-debug= Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.
- --normalise-debug= When --capitalisation is not in effect, report words that are usually capitalised but that have at most this number of lower-case exceptions (default 1) for investigation of possible typos in the corpus
- --normalise-cache= Optional file to use to cache the result of normalisation. Adding .gz, .bz2 or .xz for compression is acceptable.
- -1, --single-words Do not generate any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly.
- --no-single-words Cancels any earlier --single-words option in Makefile variables etc
- --max-words= Limits the number of words in a rule. 0 means no limit. --single-words is equivalent to --max-words=1. If you need to limit the search time, and are using -y, it should suffice to use --single-words for a quick annotator or --max-words=5 for a more thorough one (or try 3 if --yarowsky-half-thorough is in use).
- --multiword-end-avoid= Comma-separated list of words (without annotation markup) that should be avoided at the end of a multiword rule (e.g. sandhi likely to depend on the following word)
- -d, --diagnose= Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).
- --diagnose-limit= Maximum number of phrases to print diagnostics for (0 means unlimited). Default: 10
- -m, --diagnose-manual Check and diagnose potential failures of --manualrules
- --no-diagnose-manual Cancels any earlier --diagnose-manual option in Makefile variables etc
- -q, --diagnose-quick Ignore all phrases that do not contain the word specified by the --diagnose option, for getting a faster (but possibly less accurate) diagnostic. The generated annotator is not likely to be useful when this option is present.
- --no-diagnose-quick Cancels any earlier --diagnose-quick option in Makefile variables etc
- --priority-list= Instead of generating an annotator, use the input examples to generate a list of (non-annotated) words with priority numbers, a higher number meaning the word should have greater preferential treatment in ambiguities, and write it to this file (or compressed .gz, .bz2 or .xz file). If the file provided already exists, it will be updated, thus you can amend an existing usage-frequency list or similar (although the final numbers are priorities and might no longer match usage-frequency exactly). The purpose of this option is to help if you have an existing word-priority-based text segmenter and wish to update its data from the examples; this approach might not be as good as the Yarowsky-like one (especially when the same word has multiple readings to choose from), but when there are integration issues with existing code you might at least be able to improve its word-priority data.
- -t, --time-estimate Estimate time to completion. The code to do this is unreliable and is prone to underestimate. If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.
- --no-time-estimate Cancels any earlier --time-estimate option in Makefile variables etc
- -0, --single-core Use only one CPU core even when others are available on Unix
- --no-single-core Cancels any earlier --single-core option in Makefile variables etc
- -p, --status-prefix= Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running
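Many of the options above come in pairs, where a --no- form cancels an earlier setting of the same option (for example one inherited from a Makefile variable). The sketch below illustrates that "last one wins" behaviour using Python's standard argparse module. It is only an illustration under assumptions: build_parser, the "annogen-sketch" program name and the shared_flags and override lists are hypothetical, and this is not Annotator Generator's own option-handling code.

```python
import argparse

def build_parser():
    # Hypothetical parser covering two of the flags listed above;
    # this is NOT Annotator Generator's own option-handling code.
    parser = argparse.ArgumentParser(prog="annogen-sketch")
    for short, name in [("-k", "warn-yarowsky"), ("-t", "time-estimate")]:
        dest = name.replace("-", "_")
        # The positive form switches the setting on...
        parser.add_argument(short, "--" + name, dest=dest,
                            action="store_true")
        # ...and the --no- form writes False to the same destination,
        # so whichever form appears last on the command line wins.
        parser.add_argument("--no-" + name, dest=dest,
                            action="store_false")
        # Both forms share one default: the setting starts switched off.
        parser.set_defaults(**{dest: False})
    return parser

# Flags as they might arrive from a shared Makefile variable (assumed),
# followed by a per-target override appended afterwards.
shared_flags = ["-k", "-t"]
override = ["--no-warn-yarowsky"]

options = build_parser().parse_args(shared_flags + override)
print(options.warn_yarowsky, options.time_estimate)  # prints: False True
```

Because both forms write to the same destination, appending the --no- form after the shared flags is enough to switch a setting off for one run without editing the shared variable.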
License
Annotator Generator is free software licensed under the Apache License, Version 2.0 (this is also the license used by Web Adjuster). If you use it in a good project, I'd appreciate hearing about it.
Citation
If you need to cite a peer-reviewed paper:
Silas S. Brown. Web Annotation with Modified-Yarowsky and Other Algorithms. Overload 112 (December 2012) pp.4-7
Legal
All material © Silas S. Brown unless otherwise stated. Android is a trademark of Google LLC. Apache is a registered trademark of The Apache Software Foundation. Firefox is a registered trademark of The Mozilla Foundation. Google is a trademark of Google LLC. Google Play is a trademark of Google LLC. Huawei is a trademark of Huawei Technologies Co., Ltd registered in China and other countries. Java is a registered trademark of Oracle Corporation in the US and possibly other countries. Javascript is a trademark of Oracle Corporation in the US. Microsoft is a registered trademark of Microsoft Corp. Python is a trademark of the Python Software Foundation. Unicode is a registered trademark of Unicode, Inc. in the United States and other countries. Unix is a trademark of The Open Group. Windows is a registered trademark of Microsoft Corp. Any other trademarks I mentioned without realising are trademarks of their respective holders.