💾 Archived View for bleyble.com › users › quokka › software › textMunger captured on 2020-10-31 at 23:55:06. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Here is a Python script to take a plain-text (or partly formatted text) file and convert the content into different formats and layouts. Hopefully it will either entertain or educate someone out there.
The goal behind this was to take a largely unformatted, unwrapped bulk of text and spit out a 'magazine style' plain text format.
The goal was not to produce output that could be published as-is, i.e. without any kind of final proof-reading and further tinkering. It's there to do _most_ of the work but not _all_ of the work. Probably the most useful thing it does is take blocks of text and wrap them to the specified size. Rearranging text into columns and pages can also make long tracts a bit easier to read through. All of these treatments are optional.
A recent addition has been conversion to HTML, although this is a very simple conversion and the result will probably need going through afterwards to clean things up prior to publication.
I also separated out the different functions into standalone scripts, should you wish to use the 'wrap' or 'unwrap' modules for something else. The newer version will prevent breakage of dropped capitals across new pages, if dropped capitals and pagination are selected.
Disclaimer: (Expectations Management) I am not a programmer so don't expect wonders. Also, don't use this for anything vital.
The following zip file contains the updated Python script, accessory files, and some example input files:
Download textMunger2.zip (26 kB)
Here is the original version, from the glog post:
Download textMunger.zip (19 kB)
The script relies on the file extension of the input and output files to determine the direction of conversion.
Generally, you can get a result with this script by running at least:
python3 TextMunger.py --input <Path to the input file> --output <Path to the output file>
There are only two required flags: --input and --output
--input
The source file as either absolute or relative path.
--output
The destination file as either absolute or relative path. The file extension is used to inform the script of the output format. Use .gmi or .gemini for Gemini formatting or .htm or .html for HTML formatted output.
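As a rough illustration of choosing a format from the file extension, here is a minimal sketch. It is not the script's actual code: the function name, the format names, the case-insensitive matching, and the fallback to plain text for any other extension are all assumptions. Only the four extensions come from the description above.

```python
import os

# Hypothetical lookup table; only the extensions themselves are taken
# from the description of --output.
EXTENSION_FORMATS = {
    ".gmi": "gemini",
    ".gemini": "gemini",
    ".htm": "html",
    ".html": "html",
}


def guess_format(path, default="text"):
    """Return a format name based on the file extension of path."""
    extension = os.path.splitext(path)[1].lower()
    # Assumption: anything unrecognised is treated as plain text.
    return EXTENSION_FORMATS.get(extension, default)


print(guess_format("./examples/output.gmi"))
print(guess_format("./examples/output2.html"))
print(guess_format("./examples/output.txt"))
```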
Running:
python3 TextMunger.py --help
will display all the available argument flags. These are all optional and the defaults are set so as to try and make as few changes to the input as possible, apart from those strictly necessary.
Almost all the optional flags are for the case where you want to rearrange a bulky text file into something easier to read, e.g. by pagination etc. These are detailed below.
This is the main use of this script and where most of the additional formatting options might make sense.
To convert plain-text to Gemini (with default wrapping to 80 columns):
python3 TextMunger.py --input ./examples/input.txt --output ./examples/output.gmi
or, without wrapping any text and using the first two lines as headers (these flags are explained below):
python3 TextMunger.py --input ./examples/input.txt --output ./examples/output2.gmi --pagewidth 0 --titlesfromcontent
Links and other formatting cannot be inferred from plain text. However, textMunger will allow the first two lines of the input text to be used to create an H1 formatted title (from the first line) and an H2 formatted subtitle (from the second line, e.g. as a time stamp or byline for the document).
The text can be further munged into columns and pages. A masthead and dropped capitals can also be added, using the following additional formatting options:
--pagewidth
This is the maximum width, in character-columns, of the output. The default is 80 characters. Lines longer than this value will be wrapped to the nearest word, i.e. there's no hyphenation or broken words. To disable wrapping, set this value to zero.
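Word-boundary wrapping of this kind can be sketched with Python's standard textwrap module. This is only an illustration under assumptions, not how the script itself is written; the function name is made up. Note also that textwrap puts an over-long word on a line of its own, so that line ends up wider than the limit.

```python
import textwrap


def wrap_paragraph(paragraph, pagewidth=80):
    """Wrap one paragraph to pagewidth columns without splitting words."""
    if pagewidth == 0:
        # Zero disables wrapping, as described for --pagewidth.
        return [paragraph]
    return textwrap.wrap(
        paragraph,
        width=pagewidth,
        break_long_words=False,   # never cut a word in two
        break_on_hyphens=False,   # no breaking at hyphens either
    )


sample = "The quick brown fox jumps over the lazy dog and keeps on running"
for line in wrap_paragraph(sample, pagewidth=20):
    print(line)
```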
--align
Text which has been wrapped will then be adjusted to either left (default), right, or centred alignment. This applies to the whole document (excluding the first two header lines, if that option is selected, which are always left-aligned).
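The three alignments can be pictured with Python's built-in string methods. This is a sketch only: the function name and the option values "left", "right" and "centre" are assumptions, since the exact values accepted by --align are not listed here.

```python
def align_line(line, pagewidth=80, align="left"):
    """Pad an already wrapped line so it sits left, right or centred."""
    if align == "right":
        return line.rjust(pagewidth)
    if align == "centre":
        # Strip the trailing padding so no invisible spaces are written out.
        return line.center(pagewidth).rstrip()
    return line  # left alignment needs no padding


for mode in ("left", "right", "centre"):
    print("[" + align_line("some text", pagewidth=21, align=mode) + "]")
```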
--columns
Wrapped text can be split into a number of columns. The default is 1 column, i.e. the text is wrapped to --pagewidth only. If more than one column is requested then text will be wrapped tighter and rearranged into columns. The whole body of text is split into columns so long documents may look awkward unless the --paginate option is also activated.
It will not allow words to break across lines, so your maximum word length is limited by the width of a column, i.e. maximum word length = (pagewidth / columns) - columngaps, where a 'word' is any continuous sequence of non-whitespace characters.
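To make the formula concrete: with a page width of 80, two columns and a gap of 4 characters (the gap value is just an example, not a known default), the longest allowed word is 80 / 2 - 4 = 36 characters. A small sketch of the same arithmetic follows, with a helper that lists words which would not fit. Both function names are made up, and the use of integer division is an assumption.

```python
def max_word_length(pagewidth, columns, columngaps):
    """Longest word that fits: (pagewidth / columns) - columngaps."""
    # Assumption: fractional widths are rounded down.
    return pagewidth // columns - columngaps


def words_too_long(text, limit):
    """Return every whitespace-delimited 'word' longer than limit."""
    return [word for word in text.split() if len(word) > limit]


limit = max_word_length(pagewidth=80, columns=2, columngaps=4)
print(limit)
print(words_too_long("a short line with no problems", limit))
```

Long URLs are the usual culprits, since they count as a single 'word'.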
--columngaps
If the text body is split into more than one column then a gap is added between columns. The width of this gap, in characters, is specified with this option.
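One way to picture the column layout is sketched below: wrap the text to the narrower column width, cut the wrapped lines into equal chunks, then print the chunks side by side with the gap between them. This is a guess at the general approach rather than the script's own code. The function name and the default gap of 4 are assumptions.

```python
import textwrap


def to_columns(text, pagewidth=80, columns=2, columngaps=4):
    """Wrap text and lay the wrapped lines out in side-by-side columns."""
    column_width = pagewidth // columns - columngaps
    lines = textwrap.wrap(text, width=column_width, break_long_words=False)
    rows = -(-len(lines) // columns)  # ceiling division: lines per column
    # The first chunk becomes the left-hand column, the next one sits to its right.
    chunks = [lines[i * rows:(i + 1) * rows] for i in range(columns)]
    gap = " " * columngaps
    output = []
    for row in range(rows):
        cells = [chunk[row] if row < len(chunk) else "" for chunk in chunks]
        padded = [cell.ljust(column_width) for cell in cells]
        output.append(gap.join(padded).rstrip())
    return output


sample = " ".join(["word"] * 40)
for line in to_columns(sample, pagewidth=40, columns=2, columngaps=2):
    print(line)
```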
--indentby
The first line of paragraphs will be indented by this many characters. The default value is 0, to disable indentation. New paragraphs eligible for a dropped capital will not be indented.
--dropcap
Requires the source document to be appropriately formatted. If so, it will place a dropped capital (a large fancy character like in medieval manuscripts) within eligible paragraphs. An eligible paragraph is one in which the preceding line contains only two minus signs, e.g. '--'. The first letter of the paragraph is removed and a dropped capital inserted. Text for each dropped capital is sourced from a text file given by --dropcapsource.
--dropcapsource
Dropped capitals are drawn from this text file. The layout of this file is: the character to be replaced is on one line, and the lines immediately after it contain the 'ASCII Art' text, over multiple lines, which will comprise the dropped capital. Each dropped capital must occupy the same number of lines; the number is determined from the number of lines between the first lookup character and the second in the source file (e.g. the lines between 'A' and 'B' in the example file).
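A file laid out like this could be read along the following lines. This is a sketch under assumptions, not the script's own parser: the function name is made up, a lookup line is taken to be exactly one character long, and the art lines of the first glyph are assumed never to be that short.

```python
def parse_dropcaps(source_lines):
    """Build {character: [art lines]} from the lines of a drop-cap file."""
    lines = [line.rstrip("\n") for line in source_lines]
    # Assumption: only lookup lines are exactly one character long.
    lookups = [i for i, line in enumerate(lines) if len(line) == 1]
    if len(lookups) < 2:
        raise ValueError("need at least two lookup characters to find the height")
    # Height = number of lines between the first and second lookup character.
    height = lookups[1] - lookups[0] - 1
    glyphs = {}
    for start in range(lookups[0], len(lines), height + 1):
        block = lines[start:start + height + 1]
        if len(block) < height + 1:
            break  # ignore a truncated glyph at the end of the file
        glyphs[block[0]] = block[1:]
    return glyphs


sample = ["A", " # ", "# #", "B", "## ", "## "]
for character, art in parse_dropcaps(sample).items():
    print(character, art)
```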
--paginate
Flag to split text into 'pages'. Each page will be separated by a footer line with page numbers. Pagination can make long documents easier to read, especially if --columns has been set.
--pageheight
Number of lines in a 'page' when paginating.
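Pagination can be sketched as cutting the finished lines into fixed-size chunks and closing each chunk with a numbered footer. The function name, the footer style and the default page height of 60 are all assumptions, as is the choice to count only body lines (not the footer) towards the page height.

```python
def paginate(lines, pageheight=60, pagewidth=80):
    """Split lines into pages, each followed by a centred page-number footer."""
    pages = []
    starts = range(0, len(lines), pageheight)
    for number, start in enumerate(starts, start=1):
        body = lines[start:start + pageheight]
        # Hypothetical footer style; the script's real footer may differ.
        footer = "- {} -".format(number).center(pagewidth).rstrip()
        pages.append(body + [footer])
    return pages


for page in paginate(["line %d" % i for i in range(5)], pageheight=2, pagewidth=20):
    print("\n".join(page))
```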
--titlesfromcontent
If true, use the first two lines as the article title (line 1) and byline (line 2) for the output. The title will be prefixed with # for H1 and byline with ## for H2 styling.
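The effect of this flag on the first two lines can be shown in a few lines of Python. This is an illustrative sketch with a made-up function name; leaving inputs shorter than two lines untouched is an assumption.

```python
def titles_from_content(lines):
    """Turn line 1 into an H1 title and line 2 into an H2 byline."""
    if len(lines) < 2:
        # Assumption: too few lines means there is nothing to promote.
        return list(lines)
    title = "# " + lines[0].strip()
    byline = "## " + lines[1].strip()
    return [title, byline] + list(lines[2:])


for line in titles_from_content(["My Article", "2020-10-31", "Body text here."]):
    print(line)
```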
--addsplash
Output will be headed by a fancy splash title masthead, sourced from the --splashfile file.
--splashfile
Content from this file will be used as masthead for the output file. This content will not be formatted like the rest of the input so should already be in the style (e.g. width) that you want.
Example converting plain-text into Gemini formatted magazine layout (two columns, paginated, dropped capitals, and indented paragraphs):
python3 TextMunger.py --input ./examples/input.txt --output ./examples/fancyOutput.gmi --columns 2 --dropcap --indentby 2 --paginate --titlesfromcontent --addsplash
Conversion to HTML will add header and footer information (sourced from two text files) either side of your input file's content. If --titlesfromcontent is specified, then two headings will be derived from the first two lines of the input and formatted as H1 and H2 respectively. Gemini links and unordered lists will also convert.
To convert Gemini to HTML:
python3 TextMunger.py --input ./examples/input.gmi --output ./examples/output.html
To convert plain-text to HTML:
python3 TextMunger.py --input ./examples/input.txt --output ./examples/output2.html
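The handling of Gemini links and unordered lists mentioned above might look roughly like this. It is a sketch, not the script's converter: the function name is made up, wrapping everything else in paragraph tags is an assumption, and the header and footer files are left out. The line syntax itself ("=>" followed by a URL and an optional label, and "* " for list items) is standard gemtext.

```python
import html


def gemtext_to_html(lines):
    """Convert gemtext link lines and list items to simple HTML."""
    out = []
    in_list = False
    for line in lines:
        is_item = line.startswith("* ")
        # Open or close the <ul> wrapper when a run of list items starts or ends.
        if is_item and not in_list:
            out.append("<ul>")
            in_list = True
        elif not is_item and in_list:
            out.append("</ul>")
            in_list = False
        if is_item:
            out.append("<li>{}</li>".format(html.escape(line[2:].strip())))
        elif line.startswith("=>"):
            parts = line[2:].split(maxsplit=1)
            if parts:
                url = parts[0]
                # A link without a label shows its URL instead.
                label = parts[1].strip() if len(parts) > 1 else url
                out.append('<p><a href="{}">{}</a></p>'.format(
                    html.escape(url), html.escape(label)))
        elif line.strip():
            out.append("<p>{}</p>".format(html.escape(line)))
    if in_list:
        out.append("</ul>")
    return out


sample = ["=> gemini://example.org/ Example", "* one", "* two", "Text & more"]
print("\n".join(gemtext_to_html(sample)))
```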
To convert Gemini to plain-text ... you're probably better off just renaming the .gmi file as .txt!
However, something like this would work:
python3 TextMunger.py --input ./examples/input.gmi --output ./examples/output.txt --pagewidth 0
In this example we disable wrapping in order to preserve the structure of the preformatted sections of the input file. Running with wrapping enabled will also generate an error as some lines contain 'words' which are longer than the wrapping width (which defaults to 80 characters).
So I hope this is useful to someone ... I learned a lot by writing it! If there's interest, I can put this somewhere with source control for further development.
In the latest update, key modules were separated into their own files. Two of these, wrap.py and unwrap.py, can be run independently to wrap or unwrap text.
wrap.py accepts a reduced list of input arguments, but with the same effect as for textMunger.py:
Required:
Optional:
As for textMunger, this will wrap text to --wrapwidth columns. It will not allow a word to break across lines (so your maximum word length is also --wrapwidth).
unwrap.py needs only two arguments, for input and output. It will attempt to find each wrapped paragraph from the input and place it as a single line in the output. It works best if there is a blank line separating paragraphs but it should still work OK without them.
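The simple case of unwrapping, where paragraphs are separated by blank lines, can be sketched as follows. This is not the code in unwrap.py and the function name is made up. In particular it makes no attempt at the harder case of finding paragraph boundaries without blank lines.

```python
def unwrap(lines):
    """Join each blank-line-separated block of lines into one long line."""
    paragraphs = []
    current = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line ends the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


wrapped = ["First paragraph,", "wrapped over two lines.", "", "Second paragraph."]
for paragraph in unwrap(wrapped):
    print(paragraph)
```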
Required:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
~EOF~