Annotator Generator

Annotator Generator is an examples-driven generator of fast text annotators. “Annotate” in this context means to add pronunciation or other information to each word, and/or to split text into words in a language that does not use spaces.
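To make “annotate” concrete, here is a toy sketch in Python. It is an illustration only and is not the algorithm Annotator Generator uses: the `LEXICON` table and the `annotate` function are hypothetical names invented for this example, and the greedy longest-match rule is a deliberately simple stand-in. It shows both senses of the word: splitting unspaced text into words, and attaching a pronunciation to each word.

```python
# Toy illustration only: NOT the algorithm Annotator Generator uses.
# LEXICON is a hypothetical hand-made table mapping words to pronunciations.
LEXICON = {
    "北京": "Běijīng",
    "大学": "dàxué",
    "北": "běi",
    "大": "dà",
}

def annotate(text, lexicon=LEXICON):
    """Split text by greedy longest match and pair each word with its annotation.

    Characters not covered by the lexicon are returned with annotation None.
    Assumes the lexicon is non-empty.
    """
    longest = max(len(word) for word in lexicon)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                result.append((candidate, lexicon[candidate]))
                i += length
                break
        else:
            # No lexicon entry starts here: pass the character through.
            result.append((text[i], None))
            i += 1
    return result

print(annotate("北京大学"))
```

A fixed lookup like this cannot choose between readings that depend on context, which is the kind of case an examples-driven generator is meant to learn from an annotated corpus.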

Legal considerations

Annotator code will contain individual words and some phrases from the original corpus, and these can be read even by people who do *not* have the unannotated version. With regard to copyright law, I expect the annotator code will count as an “index” to the collection, the copyright of which exists separately from that of the original collection. But laws do vary by country and I am not a solicitor, so please act judiciously.

Legally obtaining that original annotated corpus is up to you. If you are in the UK the government says non-commercial text mining is allowed (terms of use prohibiting *non-commercial* mining are unenforceable), provided you:

1. respect network stability (i.e. wait a long time between each download),

2. connect directly to the publisher (this law bypasses the *publisher’s* terms of use, not those of third-party search engines like Google),

3. use the result *only* for mining, not for republishing the original text (so you can’t publish your unprocessed crawl dumps either),

4. and still respect any prohibitions against sharing whatever mining tools you made for the site (as this law is only about text mining, not about the sharing of tools).

Laws outside the UK are different (and I’m not a lawyer) so check carefully. But if the website’s terms don’t actually prohibit writing an *unpublished* scraper for non-commercial mining purposes, perhaps you won’t need a legal exception. Either way, you should still respect their bandwidth and do it slowly, both for moral reasons (it’s the right thing to do) and pragmatic ones (you won’t want their sysadmins and service providers taking action against you).
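As a rough sketch of what “waiting a long time between each download” might look like in practice, the Python below pauses before every download except the first. Everything here is an assumption made for illustration: `fetch_politely` is a hypothetical helper, `fetch` is a function you would supply yourself (for example a wrapper around `urllib.request.urlopen`), and the 60-second default is an arbitrary placeholder, not a figure taken from any law or publisher policy.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=60.0, sleep=time.sleep):
    """Call fetch(url) for each URL in turn, pausing between downloads.

    `fetch` is a hypothetical caller-supplied function; `delay_seconds`
    defaults to an arbitrary placeholder value, so choose your own.
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every download except the first
        results.append(fetch(url))
    return results
```

Downloading strictly one page at a time, with no parallel connections, keeps the load on the publisher’s server predictable. Whether a given delay is long enough depends on the site.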

Download and Usage

Download annogen.py; you will need Python (annogen.py is compatible with both Python 2.7 and Python 3+) and a command prompt. (You can also use pip install annogen or pipx run annogen if you prefer, and there is history on GitHub.)

annogen.py

Version 3.383

Usage: annogen.py [options]

Options:

License

Annotator Generator is free software licensed under the Apache License, Version 2.0 (this is also the license used by Web Adjuster). If you use it in a good project, I’d appreciate hearing about it.

Citation

If you need to cite a peer-reviewed paper:

Silas S. Brown. Web Annotation with Modified-Yarowsky and Other Algorithms. Overload 112 (December 2012), pp. 4-7.

Legal

All material © Silas S. Brown unless otherwise stated. Android is a trademark of Google LLC. Apache is a registered trademark of The Apache Software Foundation. Firefox is a registered trademark of The Mozilla Foundation. Gecko is a registered trademark of Netscape Communications Corporation. GitHub is a trademark of GitHub Inc. Google is a trademark of Google LLC. Google Play is a trademark of Google LLC. Huawei is a trademark of Huawei Technologies Co., Ltd registered in China and other countries. Java is a registered trademark of Oracle Corporation in the US and possibly other countries. JavaScript is a trademark of Oracle Corporation in the US. Microsoft is a registered trademark of Microsoft Corp. Python is a trademark of the Python Software Foundation. Unicode is a registered trademark of Unicode, Inc. in the United States and other countries. Unix is a trademark of The Open Group. Windows is a registered trademark of Microsoft Corp. Any other trademarks I mentioned without realising are trademarks of their respective holders.