Starting with version 3.0, GED2HTML includes some explicit support
for alternate character sets, including international characters.
The GEDCOM 5.5 specification requires the use of the ANSEL character set,
but in my experience many GEDCOMs are not actually encoded using this
character set. More common are ASCII, ISO-Latin-1 (ISO-8859-1),
and IBM-PC encodings based on various DOS code pages,
which are in fact explicitly disallowed by the GEDCOM 5.5 specification!
Internally, GED2HTML "supports" the ISO-Latin-1 character set.
What this really means is that, aside from possibly converting
ASCII characters with codes 127 or less from lower case to upper case,
the processing performed by GED2HTML simply treats characters in
the GEDCOM as 8-bit values to be passed through from input to output.
According to my reading of the
HTML specification, a compliant "user agent" (e.g. a browser)
should accept HTML files encoded using this character set.
Netscape, for example, will read and display such characters properly.
If a browser "barfs" on characters with codes 128 and above, then it
is not an HTML-compliant browser.
A number of people have complained that GED2HTML versions prior to
version 3.0 "didn't support international characters".
What this generally meant was that their GEDCOMs were encoded in
some IBM-PC character set, and when the 8-bit codes from their
GEDCOMs were passed through to the HTML output and interpreted by
an HTML-compliant browser, the characters that were displayed by the
browser did not appear the same as the ones they originally entered using
their genealogy program.
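
The mismatch can be reproduced with a minimal Python sketch. It assumes the GEDCOM was written using DOS code page 437, which is only one of the possible IBM-PC code pages, and the name used is a made-up example rather than data from any real GEDCOM.

```python
# Illustration only: code page 437 is an assumed stand-in for
# "some IBM-PC character set", and the name is a hypothetical example.
name = "M\u00fcller"  # "Mueller" spelled with a u-umlaut

# The 8-bit codes a DOS genealogy program would write to the GEDCOM.
dos_bytes = name.encode("cp437")

# What an HTML-compliant browser sees when those codes are passed
# through unchanged and interpreted as ISO-Latin-1.
as_seen_by_browser = dos_bytes.decode("latin-1")

print(dos_bytes)                   # the raw 8-bit codes
print(as_seen_by_browser == name)  # False: the u-umlaut did not survive
```

In code page 437 the u-umlaut is stored as code 129, which ISO-Latin-1 treats as a control character, so the browser has nothing sensible to display for it.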
To help with this problem, GED2HTML versions 3.0 and later apply the
following procedure, which is intended to make more GEDCOMs produce output
that matches the user's expectations:
- When a GEDCOM is read, GED2HTML uses the information in the CHAR
field of the HEAD record to attempt to determine the character
set in which the GEDCOM is encoded.
Currently, GED2HTML recognizes the tokens
"ASCII", "ANSI", "ANSEL", "IBM WINDOWS", "MS-DOS",
and "IBMPC". The default is ANSI (ISO-Latin-1), in case the
character set cannot otherwise be determined.
- As the GEDCOM is read in, GED2HTML attempts to translate it, from the
character set in which it is encoded, to the "equivalent"
encoding in ISO-Latin-1. It is not always possible to do an
exact job of this, because, for example, there are ANSEL
sequences for which there is no ISO-Latin-1 equivalent.
Anyway, it does the best it can, and I am open to suggestions
for improvement.
- Once the data has been translated internally into ISO-Latin-1,
no further change (other than possible lower/upper case conversions
to characters with codes less than 128) is made to the data,
before it is eventually emitted as HTML output.
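
The procedure above can be sketched roughly as follows. This is not GED2HTML's actual code: the function names and the mapping from CHAR tokens to Python codecs are assumptions made for illustration (in particular, which code page each IBM-PC token really denotes is a guess), and ANSEL is left out because Python ships no ANSEL codec.

```python
import re

# Hypothetical mapping from CHAR tokens to Python codec names. The tokens
# come from the text above; the codec chosen for each one is an assumption.
CODECS = {
    "ASCII": "ascii",
    "ANSI": "latin-1",
    "IBM WINDOWS": "cp1252",  # assumption
    "MS-DOS": "cp437",        # assumption
    "IBMPC": "cp437",         # assumption
    # "ANSEL" is omitted: the Python standard library has no ANSEL codec,
    # so in this sketch an ANSEL file falls back to the default.
}
DEFAULT_TOKEN = "ANSI"

# Simplification: the first "1 CHAR" line anywhere in the file is used,
# rather than parsing the HEAD record properly.
CHAR_LINE = re.compile(
    rb"^[ \t]*1[ \t]+CHAR[ \t]+([^\r\n]+?)[ \t]*\r?$", re.MULTILINE
)


def detect_charset(gedcom_bytes):
    """Return the recognized CHAR token, or the default if none is found."""
    match = CHAR_LINE.search(gedcom_bytes)
    if match is None:
        return DEFAULT_TOKEN
    token = match.group(1).decode("ascii", errors="replace").upper()
    return token if token in CODECS else DEFAULT_TOKEN


def to_latin1(gedcom_bytes, charset=None):
    """Translate the GEDCOM to ISO-Latin-1, replacing what cannot be mapped."""
    token = (charset or detect_charset(gedcom_bytes)).upper()
    codec = CODECS.get(token, CODECS[DEFAULT_TOKEN])
    text = gedcom_bytes.decode(codec, errors="replace")
    # Characters with no ISO-Latin-1 equivalent become "?".
    return text.encode("latin-1", errors="replace")


sample = (b"0 HEAD\r\n1 CHAR IBMPC\r\n0 @I1@ INDI\r\n"
          b"1 NAME Hans /M\x81ller/\r\n0 TRLR\r\n")
print(detect_charset(sample))
print(to_latin1(sample))
```

The optional charset argument plays the role of an explicit override: when it is given, whatever the CHAR line says is ignored.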
If you find that GED2HTML is assuming the wrong character set for
your GEDCOM, you should override what the GEDCOM says by setting the
CHARACTER_SET output interpreter variable to the appropriate string.
NOTE: Brother's Keeper is known to lie about the character set
it has used to encode the GEDCOM, using
1 CHAR IBMPC
when it really means
1 CHAR ANSI
This is a typical circumstance in which you would need to override the
choice of character set.
This would be done, e.g., by putting
-D CHARACTER_SET=ANSI
in the "Additional Options"
field of the dialog box under Windows 3.1 and Windows 95, or
on the command line under Unix.
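
To illustrate what is at stake in this situation, here is a small Python sketch. It assumes that IBMPC means DOS code page 437, which may not match the table GED2HTML actually uses, and the sample bytes are hypothetical.

```python
# Hypothetical data: code 0xF6 is o-umlaut in ISO-Latin-1 (ANSI).
actual_bytes = b"Sch\xf6n"

# Trusting a header that claims IBMPC (code page 437 is assumed here):
# the byte is read as a DOS character and translated to ISO-Latin-1.
trusted = actual_bytes.decode("cp437").encode("latin-1", errors="replace")

# Overriding with ANSI: the bytes are already ISO-Latin-1 and are kept as-is.
overridden = actual_bytes.decode("latin-1").encode("latin-1")

print(trusted)     # the o-umlaut has turned into a division sign (0xF7)
print(overridden)  # unchanged
```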
One problem I have had with the above scheme is figuring out a reasonable
way of doing lower/upper case conversions on characters with codes 128
and above. A worse problem is obtaining the proper collating sequence
for sorting names into alphabetical order, because the proper ordering
can depend on the particular (human) language being used.
It now appears to me that the so-called "locale" support is maturing
under many operating systems, so starting in GED2HTML version 3.5,
I am relying on this support to perform the proper lower/upper case
conversions and comparison operations. If you find that conversions
are not being done properly, you might be able to modify the default
behavior by explicitly setting the LOCALE output interpreter variable.
See the documentation on variables and the output interpreter
for more details.
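
As a rough illustration of what relying on locale support looks like, the following Python sketch sorts names using the collating sequence of a chosen locale. It uses Python's wrapper around the C library's locale functions rather than anything from GED2HTML, and sort_names and its locale_name parameter are hypothetical names. Which locale names are available depends on the operating system.

```python
import locale


def sort_names(names, locale_name=""):
    """Sort names using the collating sequence of the given locale.

    An empty locale_name selects the user's default locale.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=locale.strxfrm)
    finally:
        # Restore the previous setting even if the locale is unavailable.
        locale.setlocale(locale.LC_COLLATE, saved)


# The "C" locale exists everywhere and sorts by character code.
print(sort_names(["Stark", "Adams", "Miller"], "C"))
```

Passing a language-specific locale name instead of "C" makes the ordering of accented letters follow that language's conventions, provided the operating system has the locale installed.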
If all else fails, the best advice I can offer is to turn off lower/upper
case conversion in surnames by setting the option variable
UPPER_CASE_SURNAMES
to 0.
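
Why case conversion of 8-bit characters is awkward can be seen in a short Python sketch. The two helper functions are hypothetical and only mimic the two strategies: converting ASCII letters alone, and converting with knowledge of ISO-Latin-1.

```python
def upper_ascii_only(latin1_bytes):
    """Convert only ASCII letters; codes 128 and above are left untouched."""
    return latin1_bytes.upper()  # bytes.upper() affects ASCII a-z only


def upper_latin1(latin1_bytes):
    """Convert with knowledge of ISO-Latin-1, replacing what does not fit."""
    text = latin1_bytes.decode("latin-1").upper()
    return text.encode("latin-1", errors="replace")


surname = b"M\xfcller"  # hypothetical surname containing a u-umlaut
print(upper_ascii_only(surname))   # mixed case: the u-umlaut stays lower case
print(upper_latin1(surname))       # the u-umlaut becomes its capital form
print(upper_latin1(b"Stra\xdfe"))  # sharp s has no single capital letter
```

The last line shows one of the cases that makes a general solution hard: the upper-case form of a character may be longer than the original, or may not exist in ISO-Latin-1 at all.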
Some people seem to feel compelled to put HTML "entity codes" (e.g. &ouml;)
in their GEDCOMs. My opinion on this is that these codes are HTML-specific,
and have no business being in a GEDCOM. If I were trying to be really nice
to non-compliant HTML browsers, I would translate ISO-Latin-1 characters
in the range (128-255) to their HTML "entity codes" when creating the
HTML files. I might still implement this, but it is not a high priority,
because in my opinion (supported by the HTML spec cited above) a browser
that cannot display these codes is broken.
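
For reference, the translation described above could look something like the following Python sketch. It is not part of GED2HTML; latin1_to_entities is a hypothetical helper that uses the entity table from Python's standard library and falls back to numeric references for codes that have no named entity.

```python
from html.entities import codepoint2name


def latin1_to_entities(text):
    """Replace characters with codes 128-255 by HTML entity codes."""
    out = []
    for ch in text:
        code = ord(ch)
        if 128 <= code <= 255:
            name = codepoint2name.get(code)
            # Codes 128-159 have no named entity, so use a numeric one.
            out.append(f"&{name};" if name else f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)


print(latin1_to_entities("Sch\u00f6n"))
```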
Copyright © 1995-2000 Eugene W. Stark. All rights reserved.