Web Access Gateway bugs and problems

This is an old bug list about the old Web Access Gateway, which is no longer maintained, having been largely replaced by my stylesheets for low vision and Web Adjuster. This page is now for historical interest only.

1. About this file

The following is a list of some outstanding gateway bugs, in no particular order.  It is mostly in terse note form.  The numbering is subject to change.

2. SSL switching problems

Redirect to ssl version when TYPE https url - do it with a Location directive (if CAN_SWITCH_SSL is defined in platform.h)
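
A minimal sketch of the redirect, not the gateway's actual code. The function name, the host and script path arguments and the 302 status are assumptions; only the Location idea, CAN_SWITCH_SSL and the Au= parameter come from these notes.

```python
from urllib.parse import quote


def ssl_redirect(gateway_host, script_path, target_url, on_ssl, can_switch_ssl=True):
    """Return CGI headers that move the session to the SSL gateway, or None.

    can_switch_ssl stands in for the CAN_SWITCH_SSL compile-time switch.
    """
    if not can_switch_ssl or on_ssl:
        return None  # cannot switch, or already on SSL
    if not target_url.lower().startswith("https:"):
        return None  # only https targets need the SSL version
    # Percent-encode the whole target so it survives as one query value.
    location = "https://%s%s?Au=%s" % (gateway_host, script_path, quote(target_url, safe=""))
    return "Status: 302 Found\r\nLocation: %s\r\n\r\n" % location
```

Returning None lets the caller carry on with normal page generation.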

Also, getting images over non-SSL (in an SSL page) is a potential privacy compromise if an unauthorised person is snooping the net (& someone could compromise the _integrity_ of SSL pages by changing the character images) - document or fix

(but no big problem because the browser should warn anyway)

3. some URLs could potentially be mis-handled

protocol://user:pass@host:port - the user:pass bit might sometimes be incorrectly handled (might matter if someone encodes their links like that)
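
For reference, a short sketch of how the pieces are meant to separate, using Python's standard urllib.parse rather than the gateway's own parser. The example host and credentials are made up.

```python
from urllib.parse import urlsplit


def split_authority(url):
    """Return (user, password, host, port) from a URL; missing parts are None."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.hostname, parts.port
```

Any links the gateway re-writes would need to carry the user:pass part through unchanged.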

4. ampersands in linked URLs

e.g. on http://www.jython.org/cgi-bin/faqw.py?req=index (try following a link; it doesn’t work until you press OK) (should probably re-write them somewhere)
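
One reading of this note is that a raw & in a link needs writing as &amp; when the link is put back into HTML. A sketch of that re-write, assuming this is the cause; the function name and the example query string are hypothetical.

```python
from html import escape


def href_attribute(url):
    """Return an HREF attribute with &, <, > and quotes written as HTML entities."""
    return 'HREF="%s"' % escape(url, quote=True)
```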

5. Cyrillic stuff (mainly .tbl sortout)

The gateway does not recognise the ISO designators for Cyrillic, Esc - L and Esc - A.  This is because I don’t know which ISO designator goes with which code page.

fread /other/nobackup/*.count (and *.freqtbl) into an array of ints; find max, max-1 etc; top N in reverse order (to 4d?)
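
A sketch of the "top N" step in Python (the .py prototype route), assuming the counts have already been read into a list indexed by character code. The function name is made up.

```python
def top_n(counts, n):
    """Return the n highest (index, count) pairs, largest count first."""
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return [(i, counts[i]) for i in order[:n]]
```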

(tbl’s: try using .py prototype to get them into text 1st.  vice versa?)

M.T. sent these URLs http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html http://web.kyoto-inet.or.jp/people/tomoko-y/biwa/wnn/iso-2022.html

- Get a frequency table for Cyrillic
- Improve auto-detect code (maximise chars that fall within the highest-frequency range?)
- Rename “DOS Russian”
- add Cp866 - IBM
- Cp855
- KOI8-R - some errors in the table; see http://koi8.pp.ru/utf-8.koi8-r.htmlu - koi8.pp.ru/koi8-r_unicode.txt
- What I will try to do is get the mapping tables into a human-editable form.  Then if you like you can edit them.  But it may be some time before I can do that.
- charset= stuff (need alias table) (modify .tbl files?  or do it separately)
- [ CU Slavonic & East European Society ] ; [ [CU Yugoslav Society] (about 40 members on soc-cuyu) ]

/usr/share/i18n/charmaps could be useful

6. Update scripts need fixing

This file (gateway.bugs) is translated to HTML & updated by the website update script; it should be done by the gateway update script (to pageroot) like the help file is.

Maybe have “The latest version is N.N.N” at top of access.html (use htp.def?) and rsync it (or Makefile - rsync won’t work due to date stamp problems)

7. Non-standard colour settings get lost

If you put a non-standard colour in the URL and then select the “colours” button, it gets lost because it is not one of the options.  Maybe if none of the options match the current value, add a new one that does (quoting the HTML figure or something).
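
A sketch of the suggested fix, assuming the options are generated from a list of standard values. The function name and the example colours are hypothetical.

```python
from html import escape


def colour_options(current, standard):
    """Build OPTION tags, adding the current value if no standard option matches it."""
    values = list(standard)
    if current and current not in values:
        values.append(current)  # keep the user's non-standard colour selectable
    return "".join(
        '<OPTION VALUE="%s"%s>%s</OPTION>'
        % (escape(v, quote=True), " SELECTED" if v == current else "", escape(v))
        for v in values
    )
```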

8. Javascript status lines prevent hover background colour

Maybe add something to onMouseOver and onMouseOut

9. xhtml stuff

html2xhtml ok but script problem (do *after* proc) (ok for now...) (or just hack it - “write out the comments inside script, *maybe* w/out <!-- -->”) (lower pri: put <html> </html> in if not already there) (do we get the ?xml? thing, + this, into mytest itself?) Also it would be nice to upgrade the HTML spec to 4 (esp. tables) (lower pri: integrate it with the C++ HTML filter, & remove the code that’s made redundant by it)

10. Need to handle links (and ALTs) that say “here” or “click here”

(Apart from all those links that say “here” - if a blind person, or a mobile phone user, is trying to get a summary of a page by getting the computer to just output the links, they get “here, here, here”.  Not to worry - one of these days I’ll add an option to my web mediator to handle them.)

Meaningless ALT tags (“Click here!”) - http://www.fujitsu.co.jp/hypertext/hdd/drive/disk_e.html
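
A sketch of how such link text (or ALT text) could be spotted. The word list is an assumption and would need extending; what to substitute (e.g. the URL or surrounding text) is left open.

```python
import re

# Hypothetical list of link texts that say nothing out of context.
VAGUE_TEXTS = {"here", "click here", "this", "more", "link"}


def is_vague_link_text(text):
    """True if the text, ignoring case and punctuation, is a known vague phrase."""
    letters_only = re.sub(r"[^a-z ]", "", text.lower())
    return " ".join(letters_only.split()) in VAGUE_TEXTS
```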

11. Need something about DOCUMENTS IN CAPITALS

Gateway: Need to do something about entire sentences being in capitals (make them title case instead) (but leave acronyms etc alone)
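
A rough sketch, assuming acronyms are recognised from a supplied set (there is no such dictionary yet). Words with leading punctuation would need more care than this.

```python
def decapitalise(sentence, acronyms=frozenset()):
    """Convert an all-capitals sentence to title case, leaving listed acronyms alone."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters or not all(c.isupper() for c in letters):
        return sentence  # not entirely in capitals, so leave it
    words = []
    for word in sentence.split(" "):
        core = word.strip(".,;:!?()\"'")
        words.append(word if core in acronyms else word.capitalize())
    return " ".join(words)
```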

12. Accessing MIDI etc

Option to strip width & height from embed? <embed src="x.mid" width=2 height=0 autostart=true loop=true>
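
A regex sketch of that option, assuming the EMBED tag is available as a string. The gateway's real HTML filter works differently, so this only shows the intended effect.

```python
import re

_SIZE_ATTR = re.compile(r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I)
_EMBED_TAG = re.compile(r"<embed\b[^>]*>", re.I)


def strip_embed_size(html):
    """Remove WIDTH and HEIGHT attributes from EMBED tags only."""
    return _EMBED_TAG.sub(lambda m: _SIZE_ATTR.sub("", m.group(0)), html)
```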

13. Flash and strings

tracttext.cc, libz (zlib1g-dev) Chris Lightfoot (saved in MiscStuff) non-HTML and plug-in sort out; strings; swf thing (sep CGI? system command??); http://www.flashgallery.co.uk/ source

swf: As predicted the content was minimal, but I could extract the links to the other subpages which was sufficient to obtain the required information.

Without naming the guilty parties I’ve encountered a fair number of such sites on the University Societies webserver.....

Lois: If you just want to get the bare text out of a Word doc, the Unix ‘strings’ command can be useful. At least on the few that I’ve tried.

14. Collapsing newlines

natwest.com sortout (collapse newlines option) : <p><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> <p>If you have remained on this page ...
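
A sketch of a "collapse newlines" option, assuming it only needs to squash runs of BR tags. The function name and the keep parameter are made up.

```python
import re

_BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.I)


def collapse_breaks(html, keep=1):
    """Replace each run of two or more BR tags (and the space after them) with `keep` BRs."""
    return _BR_RUN.sub("<br>" * keep, html)
```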

15. Embedded stylesheets need URL redirection

Access gateway bug (embedded stylesheets need URL redirection!) (plus mention it in the presentation)

Feb  9 09:45:24 ssb22 /usr/sbin/imgserver: Error 404 on URL "head"

pc358.nmus.pwf.cam.ac.uk - - [09/Feb/2001:09:45:22 +0000] "GET /cgi-bin/access?Ac=A&Au=http://perch.tripod.co.jp/ HTTP/1.1" 200 11888 "http://ssb22.joh.cam.ac.uk/cgi-bin/access?Ac=A&Au=http://www.nsknet.or.jp/~m-saito/index.htm" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"

<ul style="line-height: 150%; list-style-image: url('images/headline01.gif'); margin-left: 35%">
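
A sketch of the redirection for url(...) references in a STYLE attribute, assuming it is enough to make them absolute against the original page's URL. Whether they should also go through the gateway is a separate question. The base URL in the usage note is only a guess at which page the style came from.

```python
import re
from urllib.parse import urljoin

_CSS_URL = re.compile(r"url\(\s*(['\"]?)([^'\")]+)\1\s*\)", re.I)


def absolutise_css_urls(css, base_url):
    """Rewrite relative url(...) references in CSS text against base_url."""
    def fix(match):
        quote_char = match.group(1)
        absolute = urljoin(base_url, match.group(2).strip())
        return "url(%s%s%s)" % (quote_char, absolute, quote_char)
    return _CSS_URL.sub(fix, css)
```

With base_url "http://perch.tripod.co.jp/", the image reference above would become http://perch.tripod.co.jp/images/headline01.gif instead of being resolved against the gateway.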

16. CSSA links “news/news”

cssa: can we pick up on these and get down to one word? news/news about/about activity/activity
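
A sketch of picking these up, assuming the doubled word is always of the form X/X. The function name is made up.

```python
def dedupe_link_text(text):
    """Reduce link text of the form "word/word" to a single "word"."""
    parts = text.split("/")
    if len(parts) == 2 and parts[0].strip().lower() == parts[1].strip().lower():
        return parts[0].strip()
    return text
```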

17. Japan2001 viewer.html is wrong

lynx -source http://www.embjapan.org.uk/viewer.html|grep japan2001  was wrong (emailed the webmaster Feb9)

18. Need to move to XML (when translators finished)

Might be message drift - check carefully (incl. help.htm)

19. ISO decoding bug

This hit the “invalid ISO designator?” thing but it was a space encoded in ISO-7 or something 27 44 65 32 27 40 66 Esc , A space Esc ( B
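
To see the sequence more clearly, here is a sketch that splits a byte stream into escape sequences and ordinary bytes. It assumes the usual ISO 2022 shape (ESC, any intermediate bytes 0x20-0x2F, then one final byte) and does not validate the final byte. It is not the gateway's decoder.

```python
def split_escapes(data):
    """Split bytes into ("esc", bytes after ESC) and ("byte", single byte) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            j = i + 1
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1  # skip intermediate bytes
            if j < len(data):
                j += 1  # take the final byte
            tokens.append(("esc", bytes(data[i + 1:j])))
            i = j
        else:
            tokens.append(("byte", bytes(data[i:i + 1])))
            i += 1
    return tokens
```

On the bytes quoted above this gives Esc , A, then a plain space, then Esc ( B, so the space sits between two designators.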

Also: “2000-11-22: NEEDATTENTION” (stuff that may break Shift-JIS & UTF-8)

20. Need a “preferred image style” option

Before “alternative base URL”: “Preferred image style” (default, Simplified, Traditional, Korean) (if MULTIPLE_STYLES_SUPPORTED) ENV_PREFERRED_STYLE needs documenting and adding to the UI For Traditional, Aeus=t NB a blank value is OK (default)

21. Using Chieko’s proxy

ALLOW_USER_PROXY_SETTING maybe ?

22. Replacing BODY tag interferes with scripts (and colours)

Bug: body onLoad doesn’t get executed on “enable scripts” (since body is replaced) See also NEEDATTENTION in access.c++ / “BODY” re colour override (mixing author’s and user’s)

23. asahi.com images without HEIGHT and WIDTH

asahi.com: Images without HEIGHT & WIDTH causes Netscape to load *all* images before displaying any of the page

24. Image server reliability

Check monash & japan2001 img server stats from time to time

imgserver Might have been an alarm clock - socket had been registered to listen, so OS accepted it, but waiting for it to get back to select() Blocking write etc?

watch the japan2001 imgserver

Your home directory was unavailable (due to a server upgrade), hence all the messages.  You invoke a cron job once every minute, so your home directory was probably inaccessible to Nexus for 142 minutes.

Your “cron” job on nexus ./isitup localhost || (pkill imgserver ; ulimit -n 1024; ./imgserver)

produced the following output:

Alarm Clock Terminated

(is it “isitup” that does this?) yes, 10sec timeout (Does it get stuck anywhere?  gdb???)
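
A sketch of what an "isitup"-style check needs to do, written in Python for illustration (the real isitup is not shown in these notes). Since the OS accepts connections on a listening socket even when the server is stuck, the check has to wait for an actual reply. The function name and the HEAD request are assumptions; the 10-second timeout is from the note above.

```python
import socket


def is_up(host, port, timeout=10.0):
    """True only if the server answers a request within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return len(conn.recv(1)) > 0  # any reply byte counts as alive
    except OSError:  # refused, reset, or timed out
        return False
```

Returning False on timeout, instead of being killed by an alarm, would also avoid the "Alarm Clock" output in the cron mail.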

Sometimes runs out of quota Compress the data file ? (zcat) (careful...) (or include portable decompression source..)

25. Image server speed

server.c++ HTTP/1.1 pipelining (o/p buf retry, watch max size [but could just drop connection when sent current lot], etc) (How many browsers/proxies/etc implement this anyway?) (IE *might*)

DONE added expires and last-modified (does make a difference!)

Ignore net/khttpd (buggy & kernel crash!)

Got “ab” - Apache HTTP server benchmarking tool

/usr/sbin/ab -k -t 60 -c 10
Image server: Requests per second: 249.74; Transfer rate: 65.45 kb/s received
Apache: Requests per second: 1045.35; Transfer rate: 3452.46 kb/s received

About 50 times the transfer rate !?   Get a profile !

/usr/sbin/ab -k -t 60 -c 10 http://ssb22.joh.cam.ac.uk:7080/t/6211.gif
/usr/sbin/ab -k -t 60 -c 10 http://ssb22.joh.cam.ac.uk/

From flevit:
Image server: Requests per second: 74.82; Transfer rate: 19.62 kb/s received
Apache: Requests per second: 53.47; Transfer rate: 176.63 kb/s received

Still 10 times higher transfer rate, but requests/sec not much higher (other thing could be a localhost thing)
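
The arithmetic behind these comparisons, as a small sketch using only the figures quoted above. Dividing transfer rate by requests per second gives the kb sent per request, which suggests the two URLs return very different amounts of data, so the transfer rates are not directly comparable.

```python
# (requests per second, kb/s received) copied from the "ab" output in these notes.
RUNS = {
    "localhost": {"imgserver": (249.74, 65.45), "apache": (1045.35, 3452.46)},
    "flevit": {"imgserver": (74.82, 19.62), "apache": (53.47, 176.63)},
}


def compare(run):
    """Return (request ratio, transfer ratio, imgserver kb/request, apache kb/request)."""
    img_rps, img_kbs = RUNS[run]["imgserver"]
    apa_rps, apa_kbs = RUNS[run]["apache"]
    return (apa_rps / img_rps, apa_kbs / img_kbs, img_kbs / img_rps, apa_kbs / apa_rps)
```

Requests per second is probably the fairer comparison: about 4 times in Apache's favour on localhost, and slightly in the image server's favour from flevit.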

26. Image server needs more GIFs

(add other gifs 1st; get through Cam proxy; transformations; remember decompress)

See “Unicode” section re getting them

Unicode imgs - they’re proportional!

zcat -f /var/log/syslog*|grep "Error 404"|sed -e "s/.*URL \"//" -e "s/\"//"|sort|uniq

Chinese stuff: COULD get it from TeX, if can find a way of auto-cropping the PostScript & cnvt to a bitmap format

27. Unicode stuff

Unicode has now gone beyond 16-bits (slides need update)

gateway & unicode (multiple “spellings” of accent-add etc) “Filesystem case-sensitivity (was Re: Picking up hermes mail)” on ucam.comp.linux

[but might be post-Unicode 3.0]

20000..2A719 : 42,778 : CJK Unified Ideographs, Extension B (These constitute all remaining unencoded ideographs from the Kangxi Dictionary, the Han Yu Da Zidian, a set of 6356 characters from Japan, 908 Hong Kong government characters, 169 characters from Korea, 29,794 characters from TCA in Taiwan, and 4050 characters from Vietnam.) : 00-Feb-02 Accepted : 00-Sep-25

etc

About the Online Code Charts

These charts are provided as a convenient online reference to the character contents of the Unicode Standard, Version 3.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. Proper Unicode support requires considerably more than providing glyphs for characters, and requires consulting the Unicode Standard and the Unicode Technical Reports.

You may freely use these code charts for personal or internal business uses only. You may not incorporate them into any product or publication, or otherwise distribute or archive them without express written permission from the Unicode Consortium.

The information on these pages may be updated from time to time. The Unicode Consortium is not liable for errors or omissions in these charts or the standard itself.

Blocks

The Unicode Standard divides its codespace into a number of blocks.

The chart index contains a table of most of the blocks; missing are blocks of unassigned characters, and blocks of characters with no visual representation such as the surrogate blocks and private use area. You can also go to a full character chart for each block (except for the Han ideographs and Hangul syllables).

Fonts

The fonts used in these charts were provided to the Unicode Consortium by a number of different font designers. Note that the glyphs in these charts are only representative; there can be wide variation in the glyphs used to represent any particular character, as discussed in the standard.

SOME mapping tables (Windows): http://oss.software.ibm.com/icu/charset/ Also ftp://ftp.unicode.org/Public/MAPPINGS/

You may embed references to the glyph images on the Unicode site in your own web pages. For example, to display a Euro sign (U+20AC) you can use the following HTML:

<IMG SRC="http://charts.unicode.org/Glyphs/20/U20AC.gif">

The subdirectory to use within the Glyphs/ directory is the first two hexadecimal digits of the Unicode code point. The set of glyphs available covers all of Unicode 3.0 with the exception of Han ideographs and Hangul syllables. However, you should only make occasional use of these glyphs. If there is too much web traffic the Unicode Consortium may be forced to discontinue this service.
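
The URL rule as a sketch. It assumes code points are written as four upper-case hex digits, as in the Euro example; code points below U+1000 are assumed to be zero-padded, which the text does not confirm.

```python
def glyph_url(code_point):
    """Return the charts.unicode.org glyph image URL for a 16-bit code point."""
    hex4 = "%04X" % code_point
    # Subdirectory is the first two hex digits of the code point.
    return "http://charts.unicode.org/Glyphs/%s/U%s.gif" % (hex4[:2], hex4)
```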

(see source of http://www.unicode.org/charts/web.html for codepoints)

http://charts.unicode.org/unihan/unihan.acgi$0x4E95 (generates links to cached images; not permanent; but URLs quite regular so hit the main page first and then get the cached images, only if haven’t already got the image) 3400-9FFF and F900-FAFF ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt

http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html xmbdfed (package installed) o  Export of XBM files from glyph bitmap editors. (well, can export to HEX, which can probably be converted) but mgk25’s fonts are wrong sizes http://czyborra.com/unifont/

28. “.gif” isn’t always a GIF (can leave gateway)

Oh dear, this leaves the gateway: http://www.askntl.com/adverts/adverts.asp?url=/telephone/great-value-calls/default.asp&image=/adverts/468by60/phone-bill.gif

(Can we start with a HEAD request if we’re using own code? What about the overhead of having to re-connect if no keep-alive? etc)

29. Extracting URLs from OPTION VALUE

kingston.com mirrors navigation:

<select name="site" size=1 onChange="javascript:formHandler()">
  <option selected value="">Worldwide sites
  <option value="http://www.kingston.com/sproot/"><font size="1" face="verdana, arial">Argentina</a></option>
  <option value="http://www.kingston.com/germany/"><font size="1" face="verdana, arial">Austria
  <option value="http://www.kingston.com.br/"><font size="1" face="verdana, arial">Brazil</a></option>
  <option value="http://www.kingston.com/sproot/"><font size="1" face="verdana, arial">Chile</a></option>
  <option value="http://www.kingston.com/denmark/"><font size="1" face="verdana, arial">Denmark
  <option value="http://www.kingston.com/europe/"><font size="1" face="verdana, arial">Europe
  <option value="http://www.kingston.com/finland/"><font size="1" face="verdana, arial">Finland
  <option value="http://www.kingston.fr/"><font size="1" face="verdana, arial">France</a>
  <option value="http://www.kingston.com/germany/"><font size="1" face="verdana, arial">Germany</a></option>
  <option value="http://www.kingston.com/ukroot/"><font size="1" face="verdana, arial">Ireland</a></option>
  <option value="http://www.kingston.com/israel/"><font size="1" face="verdana, arial">Israel</a></option>
  <option value="http://www.kingston.com/italy/"><font size="1" face="verdana, arial">Italy</a></option>
  <option value="http://www.kingston.co.jp/"><font size="1" face="verdana, arial">Japan</a></option>
  <option value="http://kingston.softbank.co.kr/"><font size="1" face="verdana, arial">Korea</a></option>
  <option value="http://www.kingston.com/sproot/"><font size="1" face="verdana, arial">Latin America</a></option>
  <option value="http://www.kingston.com/sproot/"><font size="1" face="verdana, arial">Mexico</a></option>
  <option value="http://www.kingston.com/nl/"><font size="1" face="verdana, arial">Netherlands</a></option>
  <option value="http://www.kingston.com/norway/"><font size="1" face="verdana, arial">Norway</a></option>
  <option value="http://www.kingston.com/spain/"><font size="1" face="verdana, arial">Spain</a></option>
  <option value="http://www.kingston.com/sweden/"><font size="1" face="verdana, arial">Sweden</a></option>
  <option value="http://www.kingston.com/germany/"><font size="1" face="verdana, arial">Switzerland</a></option>
  <option value="http://www.kingston.com/ukroot/"><font size="1" face="verdana, arial">United Kingdom</option>
  <option value="http://www.kingston.com/sproot/"><font size="1" face="verdana, arial">Uruguay</a></option>
</select>
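
A sketch of the extraction using Python's standard html.parser, which copes with the stray </a> and unclosed OPTION tags in markup like the above. The class and function names are made up; turning the results into ordinary gateway links is not shown.

```python
from html.parser import HTMLParser


class OptionURLs(HTMLParser):
    """Collect distinct http(s) URLs found in OPTION VALUE attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "option":
            return
        value = dict(attrs).get("value") or ""
        if value.startswith(("http://", "https://")) and value not in self.urls:
            self.urls.append(value)


def option_urls(html):
    parser = OptionURLs()
    parser.feed(html)
    parser.close()
    return parser.urls
```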

30. Stylesheets / line spacing etc

line spacing etc (stylesheets? gateway “spacing” button?? with text explaining it’s only CSS-aware browsers) P {word-spacing: 10px} P {letter-spacing: 5px} P {line-height: 12pt}

31. iMode and Access Gateway - Notes

“Access” oops: [Access Systems America]  - http://www.access-us-inc.com/ Provider of a microbrowser which is used in many I-Mode devices.

Showcase of Japanese Keitai Culture http://ssb22.joh.cam.ac.uk/cgi-bin/access?Ac=@&Aeck=PREF%3DID%3D4ce132f816f47144:TM%3D982953159:LM%3D982953159%2C.google.com&Au=http://nooper.co.jp/showcase/%3Fl%3Den

The HTTP User-Agent: header identifies an i-Mode browser with a string something like DoCoMo/1.0/F50i for the older 501 models, and something like DoCoMo/2.0/F502i/c10 for the newer models. The first part of the string says DoCoMo, indicating that it is an i-Mode client. The next part indicates the supported HTML version number. The third part indicates the device model number. The fourth part, only available on certain 502 models, indicates the current cache size. As with WAP devices, an i-Mode device can only accept a certain amount of data in one go. The number is in kilobytes, and the default size is 5KB. [Does this include the images??]

Screen size is very small, usually no more than 16 (English) characters by 6-8 lines.

<HTML>
<HEAD> <TITLE>Main MENU</TITLE> </HEAD>
<BODY>
<FONT COLOR=RED>Main MENU</FONT> <BR>
<IMG SRC=ad_small.gif ALIGN=RIGHT>
<A HREF=new.tcl ACCESSKEY="1">News</a> <BR>
<A HREF=addr.tcl ACCESSKEY="2">Directory</a>
</BODY>
</HTML>

The ACCESSKEY attribute of a hyperlink provides one-key access to select and follow the URL, from the phone’s numeric keypad. [but is the number included?] [Don’t have to include this - the recommended implementation does it by default]

The i-Mode phone terminals do not support HTTP Cookies at this time.

Note that on i-Mode phones, the password field in the HTTP authentication dialog box which pops up only supports entry of numeric passwords. Authorization: header is present. If not, it issues a WWW-Authenticate challenge WWW-Authenticate: Basic realm="ACS_iMode" return 401 "text/html; charset=shift_jis" "please login" (or "incorrect login")
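
A sketch of reading the User-Agent string as described above, assuming the parts are always separated by "/" and the cache size is written as "c" plus a number of kilobytes. The function name and the returned keys are made up.

```python
def parse_docomo_user_agent(user_agent):
    """Return HTML version, model and cache size (KB), or None if not an i-Mode client."""
    parts = user_agent.split("/")
    if len(parts) < 3 or parts[0] != "DoCoMo":
        return None
    info = {"html_version": parts[1], "model": parts[2], "cache_kb": 5}  # 5KB default
    if len(parts) > 3 and parts[3].startswith("c") and parts[3][1:].isdigit():
        info["cache_kb"] = int(parts[3][1:])
    return info
```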

For compressing HTML (& removing unwanted tags), see http://www.w3.org/TR/1998/NOTE-compactHTML-1998 (table in Appendix A of supported tags & attribs)

Images can be nightmarish (esp. large ones; transferred & scaled down)

Please ensure each page uses less than 5KB of data volume. (Depending on the tags being used, some pages cannot be displayed even though they contain less than 5KB of data.) We recommend a data volume per page of less than 2KB.

- The maximum length of a character string is 200 bytes after URL encoding.
- The maximum length of a URL that can be input directly is 100 bytes.
- The maximum length of a URL that can be added to the bookmark list is 100 bytes.
- The maximum length of the title of a page/bookmark is 24 bytes.

i-mode users are responding to banner ads and e-mail advertising to a far greater extent than standard Web users. I-mode - the “i” is for information

There is a basic data charge per packet, 0.3 YEN (approx. US-cent 0.3) per data packet transmitted of 128 byte. As an example, looking at the basic imode-Menu, the standard DoCoMo welcome screen or user interface, will set you back about 2.7 YEN (i.e. approx. US-cent 2.7). There are no connection time charges for imode. In addition there are other charges for using email and for premium subscription services.
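
The charge as a worked sketch, assuming a part-filled packet is billed as a whole packet (not stated above). Amounts are kept in tenths of a yen to avoid rounding trouble.

```python
def imode_cost_tenths_of_yen(n_bytes, packet_size=128, tenths_per_packet=3):
    """Cost of sending n_bytes at 0.3 yen per 128-byte packet, in tenths of a yen."""
    packets = -(-n_bytes // packet_size)  # round up to whole packets
    return packets * tenths_per_packet
```

On this reckoning the 2.7 yen menu is 9 packets (up to 1152 bytes), and a full 5KB page would be 40 packets, i.e. 12 yen.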

imode emails have to be shorter than 250 Kanji (double byte characters), or shorter than 500 Roman Characters (single byte characters) The default email address of imode users is 090xxxxxxxx@docomo.ne.jp, where “090xxxxxxxx” is the mobile telephone number.

For example &#63647; is an icon of a sun shining.

SJIS+imgs; Remove all images (ALT?); disable status line scripts; don’t add “end of web page”; don’t put [ ]; don’t show date stamp. Also: Don’t add TITLE= to any HR; don’t add META tags; compact space; &#146; to ‘; compress the options; MAYBE compress Au= in some other way as well (besides removing http://); remove things like <b> <i> etc that are not supported (don’t use colours instead - it will drive the size up)

geometry is really 16x7, but lynx margins take up 4 more lines
xterm -geometry 16x11 -e lynx -nocolor -nopause -noreverse -nounderline
(formatting can be bad)
Some phones have 20x8 (10 kanji)
xterm -geometry 20x12 -e lynx -nocolor -nopause -noreverse -nounderline
(formatting can be wrong, e.g. centre etc)

32. Stylesheets that say display:none

gateway.bugs: <DIV ID="incoming" STYLE="display:none"> (means don’t display; stripping the STYLE will cause it to do so.  + don’t strip content if JavaScript enabled.) (Do we really want to take out this text though?  But at least count it as a banner?  Option???)

33. lynx -trace

lynx -trace: outputs stuff to a file called Lynx.trace

34. Multipart form encoding

---------------------------827779986791670271271312593
-----------------------------827779986791670271271312593

in cgilib.c++ CGIEnvironment::tryDecodingMultipart() See all **** stuff esp. boundary

CONTENT_TYPE=multipart/form-data; boundary=---------------------------10617267281005157210847669114
CONTENT_LENGTH=4339
Input:
-----------------------------10617267281005157210847669114
Content-Disposition: form-data; name="iconid"

5
-----------------------------10617267281005157210847669114
Content-Disposition: form-data; name="message"

test
-----------------------------10617267281005157210847669114
Content-Disposition: form-data; name="A1attachment"; filename="codepoints.html"
Content-Type: text/html

include .....
-----------------------------10617267281005157210847669114--

Or: Content-Disposition: form-data; name="A1attachment"; filename="random_seed"

P....\n
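
A sketch of the decoding in Python, to make the boundary handling explicit. It is not a port of CGIEnvironment::tryDecodingMultipart(). It assumes CRLF line ends and keeps only the name from each Content-Disposition line. Each delimiter line in the body is the boundary from CONTENT_TYPE with two extra dashes in front, and the last one also has two dashes after it.

```python
import re

_NAME = re.compile(rb'[; ]name="([^"]*)"')


def parse_multipart(body, boundary):
    """Return {field name: raw content bytes} from a multipart/form-data body."""
    delimiter = b"--" + boundary
    fields = {}
    for part in body.split(delimiter)[1:]:
        if part.startswith(b"--"):
            break  # closing delimiter
        head, separator, content = part.partition(b"\r\n\r\n")
        if not separator:
            continue  # no blank line between headers and content
        if content.endswith(b"\r\n"):
            content = content[:-2]  # this CRLF belongs to the next delimiter
        for line in head.split(b"\r\n"):
            if line.lower().startswith(b"content-disposition:"):
                match = _NAME.search(line)
                if match:
                    fields[match.group(1).decode("latin-1")] = content
    return fields
```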

35. Cookies minor things

after don’t store remote session IDs, have “store remote session IDs even across servers” (default No)

Cookies: Need to default the domain (not to everything!) when setting (although this would increase the size of the URLs...)

Note: %26 (&) and %3D (=) seem to occur a lot in the cookie - better % compression system ? (%m for aMpersand and %q for eQuals ?  Somehow code all ASCII that must be %-escaped?) (watch we don’t send this fancy stuff to remote servers!) How did the cookies get so big anyway? The problem seems to be unique to Yahoo
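
A sketch of the %m / %q idea, assuming the cookie text is already %-escaped (so a literal % appears as %25 and cannot be confused with the new codes). The function names are made up. Expanding before anything goes to a remote server keeps the fancy codes internal.

```python
def compact_escapes(escaped):
    """Shorten %26 to %m and %3D to %q in already %-escaped text."""
    return escaped.replace("%26", "%m").replace("%3D", "%q").replace("%3d", "%q")


def expand_escapes(compacted):
    """Undo compact_escapes, giving standard %-escapes again."""
    return compacted.replace("%m", "%26").replace("%q", "%3D")
```

This saves one character per ampersand or equals sign, so it only helps a little with the 4K limit.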

Edit the cookies on the form?? Gateway cookies: Should be OK, because not getting *image* cookies. 4K URL limit is a worry! (when user not supporting cookies) (Temp: clear all cookies when reaches maximum size?  cut down?) (really carry cookies when no longer browsing their source domain? e.g. search engine cookies)

All FORMS: METHOD=”post” (careful; some browsers put warnings up)

36. border=0 to CSS ?

border=0 css ?  (can it be done?  Which browsers need border=0, do they all have CSS support, etc) check “s

37. Finish/document NONLOCAL_PASSWORD

(see localusr.c++) NB insecure etc (unless using SSL, and even then, watch Location box, cache, history, etc)

38. proper framesets

gateway proper framesets (gateway.bugs? fair amount of coding) (but jp etc)
If a certain var is present, instead of charset, URL box, date stamp, etc (or rewind once done), have “[Expand this frame]” (no BR)
Or just call the string [Options]
It links to the page with the var clear & target=_top
Put var in when doing a FRAMESET
Keep it when doing a link iff name is not _top (& it’s already present)
(may still fail if new NAMEs for new windows, but not to worry - can still get an “expand this frame”)

39. Plugins

Plugin: file.swf [Enable plugins] [Extract text] (& links; use swf code if necessary) <P>(plugin: jsb.mid [download] [activate plugins] [hide plugins])</P>

Old notes (might no longer need them) -

<EMBED src="$FILENAME$" width=$WIDTH$ height=$HEIGHT$ type="application/x-Sibelius-Score" alt="$FILENAME$" codebase="http://www.sibelius.com/cgi/plugin.pl" pluginspage="http://www.sibelius.com/cgi/plugin.pl">

codebase and pluginspage should now be substituted Netscape: Takes codebase and goes ?application/whatever, ignores pluginspage, changes msg to “click here after installing”.  How do you get (or prevent) the adverts window?

40. More encodings trouble (dating from 1999)

Chinese table problem Chinese table in Japanese?

Have commented out the pinchATable("Cp33722","IBM eucJP/5050",f,0); - REALLY returns max bytes =3 (in EUC)
Need better decompilation
Cp964 (AIX TW) really is 4 bytes max; need to sort out & comment back in

Need to sort out //if(neverBelow127) throw(new IOException("Didn't expect neverBelow127 to be true here"));

TEContainer.h: Implement void setAutoDetectMimeCharsetIfCharsetIsAppropriateToLanguage(const char* mimeDesignator) {};

LOG the charsets (and the detected results) as people use web pages?

Need to find official list, really

Some of these may be MIME charsets: iso-8859-1 Shift_JIS big5 gb2312 euc-kr euc-jp windows-1250 windows-1251 windows-1253 iso-8859-9 utf-8 x-mac-roman x-mac-ce ks_c_5601-1987 ? x-gb2312-11 x-euc-tw x-cns11643-1 x-x-big5 ...
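
A sketch of the alias table idea: map a MIME charset= value onto the name of one of the gateway's tables. The table names are taken from the OPTION values in these notes; the alias entries themselves and the function name are assumptions and far from complete.

```python
import re

# Hypothetical alias table: lower-case MIME charset name -> gateway table name.
CHARSET_ALIASES = {
    "iso-8859-1": "8859_1",
    "iso-8859-2": "8859_2",
    "iso-8859-9": "8859_9",
    "windows-1252": "Cp1252",
    "x-mac-roman": "MacRoman",
}

_CHARSET = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.I)


def table_for_content_type(content_type):
    """Return the table name for a Content-Type header value, or None if unknown."""
    match = _CHARSET.search(content_type)
    if not match:
        return None
    return CHARSET_ALIASES.get(match.group(1).lower())
```

Keeping the aliases separate from the .tbl files means one table can have several charset names.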

HZ-GB-2312

o iso-2022-jp     (see Section 3.1.3)
o iso-2022-jp-2   (see Section 3.1.3)
o iso-2022-kr     (see Section 3.1.4)
o iso-2022-cn     (see Section 3.1.5)
o iso-2022-cn-ext (see Section 3.1.5)
o iso-8859-1

ISO- ? -[0-9]?

- UCS-2                 0x6F22 0x5B57
- UCS-4                 0x00006F22 0x00005B57

UCS-2: FEFF, also escape sequences (Level 3 = supports all characters)
  UCS-2 Level 1           <ESC> % / @          0x1B252F40      162
  UCS-2 Level 2           <ESC> % / C          0x1B252F43      174
  UCS-2 Level 3           <ESC> % / E          0x1B252F45      176
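
The same values reproduced in Python as a check. This uses the big-endian UTF-16 and UTF-32 codecs, which give the same bytes as UCS-2 and UCS-4 for these two characters (U+6F22 and U+5B57).

```python
text = "\u6f22\u5b57"

ucs2_bytes = text.encode("utf-16-be")  # two bytes per character here
ucs4_bytes = text.encode("utf-32-be")  # four bytes per character

# The escape sequences are ESC followed by the listed ASCII characters.
ucs2_level1 = b"\x1b%/@"
ucs2_level3 = b"\x1b%/E"
```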

JIS X 0221-1995 == ISO 10646-1:1993 (based on Unicode 1.1)

See ftp://ftp.tiu.ac.jp/jis/ re JIS X 0213-199X etc Also get ISO sequences for all the other encodings (+ MIME charset etc)

MISSING: <OPTION VALUE="Cp1125">Ukraine:  IBM PC</OPTION>

EXTRA: pinchATable(new CharToByteCp856());

Cp33722 and Cp942 need Yen substitution

<OPTION VALUE="Cp037">Misc:  CP 037</OPTION>
<OPTION VALUE="Cp437">Misc:  DOS 437</OPTION>
<OPTION VALUE="Cp850">Misc:  DOS Latin-1</OPTION>
<OPTION VALUE="Cp500">Misc:  EBCDIC 500V1</OPTION>
<OPTION VALUE="Cp1046">Misc:  IBM EBCDIC</OPTION>
<OPTION VALUE="Cp285">Misc:  IBM UK</OPTION>
<OPTION VALUE="8859_1">Misc:  ISO 8859-1</OPTION>
<OPTION VALUE="8859_2">Misc:  ISO 8859-2</OPTION>
<OPTION VALUE="8859_3">Misc:  ISO 8859-3</OPTION>
<OPTION VALUE="8859_4">Misc:  ISO 8859-4</OPTION>
<OPTION VALUE="8859_9">Misc:  ISO 8859-9</OPTION>
<OPTION VALUE="MacDingbat">Misc:  Macintosh Dingbat</OPTION>
<OPTION VALUE="MacRoman">Misc:  Macintosh Roman</OPTION>
<OPTION VALUE="MacSymbol">Misc:  Macintosh Symbol</OPTION>
<OPTION VALUE="Cp1252">Misc:  Windows Latin-1</OPTION>

// Need to add more encodings

// See ftp://unicode.org/pub/MappingTables/

Cp936 is GB2312 with some corrections

charset=HZ-GB-2312 (NB One class may have several charsets)

/* Two-byte ISO codes: JIS X 0208-1990: Esc & @ before JIS X 0208-1983 @ = JIS C 6226-1978 DONE A = GB2312 DONE B = JIS X 0208-1983 DONE C = KSC5601 D = JIS X 0212-1990 E = ISO-IR-165:1992 (Lunde: “ISO-IR-165:1992 can be considered a superset of GB 2312-80, GB 6345.1-86, and GB 8565.2-88” [and more - ssb]) DONE G-M = planes for CNS11643 (1-7)
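
The list above as a lookup table, in a Python sketch rather than the gateway's C++. It assumes the final byte alone identifies the set once the sequence is known to start with Esc $, and it ignores which of G0-G3 is being designated. The function name is made up.

```python
# Final byte of a two-byte designation -> character set, as listed in the notes.
TWO_BYTE_SETS = {
    "@": "JIS C 6226-1978",
    "A": "GB2312",
    "B": "JIS X 0208-1983",
    "C": "KSC5601",
    "D": "JIS X 0212-1990",
    "E": "ISO-IR-165:1992",
}
for plane, final in enumerate("GHIJKLM", start=1):
    TWO_BYTE_SETS[final] = "CNS11643 plane %d" % plane


def two_byte_set(after_esc):
    """Name the set for the bytes following ESC, e.g. b"$B" or b"$(D"; None if not two-byte."""
    if len(after_esc) < 2 or after_esc[:1] != b"$":
        return None
    return TWO_BYTE_SETS.get(after_esc[-1:].decode("ascii", "replace"))
```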

41. Tables/Unicode stuff (old)

Problem: &brvbar;Y&sup1;M&not;+=&macr;&ordf;+)W&iexcl;A&acute;s&sup1;M&ordf;e&acute;=&curren;s&curren;t&iexcl;A+&ordf;&sup1;M&cedil;g&ordm;&middot;&uml;s-y (is that GB2312?) - access needs to decode whole table!

Look at http://www.fontlab.com/download.htm (check legal permissions)

http://www.unicode.org/img/CJKtr8/UFA2D.gif

CJK images + readings: Use all the data on http://charts.unicode.org/unihan/unihan.acgi$0x4E95 and other stuff (takes a long time to get it) See the HTML & reverse-engineer?

Complete text file ftp://ftp.unicode.org/Public/2.0-Update/UNIHAN.TXT (but still need the different imgs)

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt has the “LATIN-CAPITAL-LETTER-D” etc text

Blocks index http://charts.unicode.org/Unicode.charts/normal/Unicode.html

42. More mapping tables todo (old)

See mapping tables checkConverter - expand to USE the other mappings

void setAutoDetectMimeCharsetIfCharsetIsAppropriateToLanguage(const char* mimeDesignator) {}; has not been implemented (te-container.h) Eg. WINDOWS-1251 (may not be exactly what’s stored) (And there’s a “not yet implemented anyway” in literals.h)

Unihan database also has definitions

Norway needs a different table from Denmark Also Finnish & Sweden

“Decoding prior to native decoding” thing: Problem if HTML sequences are *MEANT*! Decode only if chars in this range are always in HTML?

Have a “Language for messages” button?

NB Greek&Russian may be in most CJK, also Jp in C and K, also C in CJK; also UTF-8 etc (especially Chinese)

JIS no ‘escape’ thing? (ie. take $B as Esc $B etc)

Stuff to check: // NEEDATTENTION Check the following! if(!shiftOutRequired) isoEncodingInUse=-1; // So resets immediate ones at end of line

// *** Need to sort out misc folder! // Sort JIS folder

43. From later-on.txt

ssb22:/tmp/brian$ export RSYNC_RSH=ssh

ssb22:/tmp/brian$ rsync -v silas@brian.accu.org:access/platform.h . (needs password)

Packages rsync sftp

<EMBED SRC="jsb.mid" HIDDEN=true AUTOSTART=true>

Remove (or don’t) HIDDEN and AUTOSTART -> Allow background music to start automatically

Images: Give the button text as images before the button! Also need to write SELECT as RADIO

problemExtentions[] could be more elegant / less storage etc

Spam trap: Would it be better with sleep?

<HTML lang="fr">
<EM lang="ja">some Japanese</EM>
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...

<ABBR title="Idaho">ID</ABBR> <ACRONYM title="World Wide Web">WWW</ACRONYM> (have an acronyms dictionary?)

Black  = #000000    Green  = #008000
Silver = #C0C0C0    Lime   = #00FF00
Gray   = #808080    Olive  = #808000
White  = #FFFFFF    Yellow = #FFFF00
Maroon = #800000    Navy   = #000080
Red    = #FF0000    Blue   = #0000FF
Purple = #800080    Teal   = #008080
Fuchsia= #FF00FF    Aqua   = #00FFFF

In the near future, browsers will display grouped lists with expanding and collapsing levels of detail.   To group items, use the OPTGROUP element (with the SELECT element). For example:

<FORM action="http://somesite.com/prog/someprog" method="post">
<P><SELECT name="ComOS">
<OPTGROUP label="Comm Servers">
<OPTGROUP label="PortMaster 3">
  <OPTION label="3.7.1" value="pm3_3.7.1">PortMaster 3 with ComOS 3.7.1
  <OPTION label="3.7" value="pm3_3.7">PortMaster 3 with ComOS 3.7
  <OPTION label="3.5" value="pm3_3.5">PortMaster 3 with ComOS 3.5
</OPTGROUP>
<OPTGROUP label="PortMaster 2">
  <OPTION label="3.7" value="pm2_3.7">PortMaster 2 with ComOS 3.7
  <OPTION label="3.5" value="pm2_3.5">PortMaster 2 with ComOS 3.5
</OPTGROUP>
</OPTGROUP>
<OPTGROUP label="Routers">
<OPTGROUP label="IRX">
  <OPTION label="3.7R" value="IRX_3.7R">IRX with ComOS 3.7R
  <OPTION label="3.5R" value="IRX_3.5R">IRX with ComOS 3.5R
</OPTGROUP>
</OPTGROUP>
</SELECT>
</FORM>

The new FIELDSET element groups form controls while the LEGEND element labels each group. For example,

<FORM action="http://somesite.com/adduser" method="post">
  <FIELDSET>
    <LEGEND>Personal information</LEGEND>
    <LABEL for="firstname">First name:</LABEL>
    <INPUT type="text" id="firstname" tabindex="1">
    <LABEL for="lastname">Last name:</LABEL>
    <INPUT type="text" id="lastname" tabindex="2">
    ...more personal information...
  </FIELDSET>
  <FIELDSET>
    <LEGEND>Medical History</LEGEND>
    ...medical history information...
  </FIELDSET>
</FORM>

Give each frame a title

IFRAME as well as FRAME

Provide alternative text for all image submit buttons

<INPUT TYPE="image" SRC="bobbylogo.gif" ALT="The bobby logo" WIDTH=200 HEIGHT=200>

Option for button with show URL

Cache-Control: no-cache
Pragma: no-cache
Expires: 0

<META HTTP-EQUIV="Window-target" CONTENT="_top">

Options: Content-language: en-GB Window-target: _top

<META HTTP-EQUIV="Set-Cookie" CONTENT="cookievalue=xxx;expires=Friday, 31-Dec-99 23:59:59 GMT; path=/">

Can we put <PRE> around text/plain ?
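A minimal sketch of what wrapping text/plain in <PRE> might involve, written in Python for brevity rather than in the gateway's own language. The function name plain_text_to_html is hypothetical; the point is only that the body has to be escaped before it is wrapped, otherwise any "<" or "&" in the text would be parsed as markup.

```python
import html


def plain_text_to_html(body: str) -> str:
    """Wrap a text/plain body in <PRE>, escaping HTML metacharacters first."""
    # quote=False: only &, < and > need escaping outside attribute values.
    return "<PRE>" + html.escape(body, quote=False) + "</PRE>"
```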

export LS_COLORS=''
alias ls="ls --color=auto"
export PS1="\h:\W\\$ "
(caps W for last part of dir only)

44. Grep NEEDATTENTION etc

COMPILER_USES_LSB_MSB_INTS: Perhaps create alternative versions of data files for other compilers, add to installation instructions (plus a test).  Haven’t tested that it produces the same output!

Sort L_NO_FREQTBL out

45. Korean etc

Japanese frequency table!

arrows consisting of dashes and greater-than signs --> etc

Korean: What about ISO-2022-KR and EUC-KR? And ISO646?  QP?  How do they relate to KS-C-5601?

Unreproducible bug report - Korean pages looking like Japanese - suspecting wrong language selection

46. Link to home page should go through gateway

The link to the gateway’s home page should probably go through the gateway (but what if the installation is not working?)  Also help.htm links (and it needs more processing).

47. Misc (old) (some may have been fixed)

favicon.ico redirect to loc of original page ???

gateway: If <FORM> and </FORM> do not match in “banner”, DO NOT MOVE IT!!! (e.g. http://access.adobe.com/simple_form.html)
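One way the "do not move it" check could be sketched (Python for brevity; forms_balanced and the regular expression are illustrative assumptions, not the gateway's code): count FORM open and close tags in the candidate banner fragment and only treat it as movable when they pair up.

```python
import re

# Matches <FORM ...> and </FORM> in any letter case; group 1 is "/" for a close tag.
FORM_TAG = re.compile(r"<\s*(/?)\s*form\b[^>]*>", re.IGNORECASE)


def forms_balanced(fragment: str) -> bool:
    """Return True if every <FORM> in the fragment has a matching </FORM>."""
    depth = 0
    for match in FORM_TAG.finditer(fragment):
        if match.group(1):          # closing tag
            depth -= 1
            if depth < 0:           # </FORM> with no open form before it
                return False
        else:                       # opening tag
            depth += 1
    return depth == 0
```

A banner for which forms_balanced() returns False would then be left where it is.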

HTML4 forms can have “disabled” controls - option to remove them?

48. Background hover colour needs help text

Document AecL (background hover colour) & link into options (NB say “(read help)”).  It uses CSS.  In some browsers (e.g. some versions of Konqueror), you have to also select “Don’t add status line code to links” (under the Options button) for this to work.

49. Colours needs an “other” button

- with larger selection of colours?  (how organised?  rows?) or some sort of selector?

50. Spacer removal doesn’t remove <p>&nbsp;

<p>&nbsp; lots of times is left intact
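A rough sketch of how such paragraphs could be dropped (Python for brevity; the pattern and the name strip_empty_paragraphs are assumptions for illustration, not the gateway's spacer-removal code). It removes a <p> whose only content is &nbsp; or whitespace, provided it is followed by </p>, by another <p>, or by the end of the input, so a paragraph that merely starts with &nbsp; is kept.

```python
import re

# A <p ...> containing only &nbsp;/whitespace, ended by </p>, the next <p>, or end of input.
EMPTY_PARA = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;)*(?:</p>\s*|(?=<p\b)|$)",
    re.IGNORECASE,
)


def strip_empty_paragraphs(markup: str) -> str:
    """Remove paragraphs that contain nothing but &nbsp; or whitespace."""
    return EMPTY_PARA.sub("", markup)
```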

51. Inline help errors

Error: Failed to find help text for option AeI

Error: Failed to find help text for option Aefn~ssb22/mytest

52. Doesn’t work well with online email providers

email providers PRE, NOBR (zh chars).  Also TEXTAREA

53. more options compression

“=on” in the checkbox options can just be “=” in the links, nothing in the cookies, and “value=1” in hidden form options
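The link part of this idea could look roughly like the following (a Python sketch; the function names and the option names optA, optB and optC are made up). It assumes options travel as an "&"-separated name=value string and that no option ever has a genuinely empty value, since an empty value is what stands in for "on".

```python
def compress_option_query(query: str) -> str:
    """Shorten 'name=on' pairs to 'name=' in an '&'-separated option string."""
    parts = []
    for pair in query.split("&"):
        name, sep, value = pair.partition("=")
        if sep and value == "on":
            pair = name + "="
        parts.append(pair)
    return "&".join(parts)


def expand_option_query(query: str) -> str:
    """Inverse of compress_option_query: turn 'name=' back into 'name=on'."""
    parts = []
    for pair in query.split("&"):
        name, sep, value = pair.partition("=")
        if sep and value == "":
            pair = name + "=on"
        parts.append(pair)
    return "&".join(parts)
```

For example, compress_option_query("optA=on&optB=5&optC=on") gives "optA=&optB=5&optC=", and expand_option_query() restores the original string.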

54. banner split bug

gateway bug: http://dmoz.org/cgi-bin/add.cgi?where=Computers/Multimedia/Music_and_Audio/Software/Composition/Fractal_and_Generative cuts the banner in the middle of a FORM, resulting in the SUBMIT button being invisible in Netscape

55. PC HK/TW symbols

sometimes detects PC HK/TW rather than Big5 - no great problem (a few symbols don’t display, e.g. cdot (u+2022) sometimes rendered as u+2027 and image not available). Might want some kind of detection bias but it won’t be easy (really want a fuzzy logic system of some sort)

56. Segfault on bogus HTML

gateway sig-11 faults in strlen in HttpHeader::readHttpEquivs() when document HEAD has the following bogus tag:

<meta http-equiv=Content-typecontent="text/html; charset=utf-8">

(i.e. if there is a missing space before ‘content’)
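The defensive approach can be sketched as follows (Python for brevity; meta_content and the attribute pattern are illustrative assumptions, and the actual cause inside HttpHeader::readHttpEquivs() has not been confirmed here). The idea is to treat a missing content attribute as "no value" and make the caller check for that, instead of measuring a string that may not exist.

```python
import re
from typing import Optional

# name=value, where value is double-quoted, single-quoted, or runs up to whitespace or ">".
ATTR = re.compile(r"""([A-Za-z-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]*))""")


def meta_content(tag: str, wanted: str) -> Optional[str]:
    """Return the content of a <meta http-equiv=wanted> tag, or None if absent."""
    attrs = {}
    for m in ATTR.finditer(tag):
        # Exactly one of the three value groups is set for each match.
        value = next((g for g in m.group(2, 3, 4) if g is not None), "")
        attrs[m.group(1).lower()] = value
    if attrs.get("http-equiv", "").lower() != wanted.lower():
        return None
    return attrs.get("content")  # None when the attribute is missing
```

With the bogus tag above, the http-equiv value swallows the run-together text, so the function returns None rather than failing.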

Legal

All material © Silas S. Brown unless otherwise stated. Apache is a registered trademark of The Apache Software Foundation. Javascript is a trademark of Oracle Corporation in the US. Mozilla is a registered trademark of The Mozilla Foundation. PostScript is a registered trademark of Adobe Systems Inc. TeX is a trademark of the American Mathematical Society. Unicode is a registered trademark of Unicode, Inc. in the United States and other countries. Unix is a trademark of The Open Group. Windows is a registered trademark of Microsoft Corp. Any other trademarks I mentioned without realising are trademarks of their respective holders.