HTML Import Add-on reads HTML documents in various formats, extracts the genealogical information they contain, and converts that information into GEDCOM form so that it can be displayed in GEDCOM Viewer and, ultimately, saved to a file. HTML Import Add-on can currently process the following kinds of HTML documents:
Uncustomized output produced by various versions of GED2HTML.
Individual data pages generated by the RootsWeb WorldConnect server.
Data pages generated by GEDitCOM using the standard Web page style.
Data pages generated by Legacy 4.0.
To use HTML Import Add-on, first make sure that you
have loaded the add-on, as described
here.
If the add-on has been loaded successfully, then its presence
will be indicated in the About
screen.
Also, there will be an Import HTML
item under
the Tools
menu.
Assuming HTML Import Add-on has been successfully loaded,
selecting this menu item will open a dialog asking you for the
URL of the document that you wish to load.
The easiest way to enter this URL is to first use a normal Web
browser to display the document you want to load, then select
the URL of this document and paste it into the dialog box.
Make sure that you are looking at a page that shows information
about a particular individual, because
HTML Import Add-on currently does not, for the most part,
understand other kinds of pages, such as index or search pages.
For a demonstration, you can try the following page generated
by GED2HTML:
http://www.starkeffect.com/gedcomviewer/doc/import_demo/d0003/g0000033.html#I828
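The demo URL above has two parts: the address of the page and a fragment (the text after the # sign) that points at one individual's entry on that page. The following Python sketch separates the two using only the standard library. The function name is hypothetical, and how the add-on itself treats the fragment is not documented here; this only illustrates what a URL for an individual's page looks like.

```python
from urllib.parse import urlsplit

def split_individual_url(url):
    """Return (page_url, fragment) for a URL such as the demo link."""
    parts = urlsplit(url)
    # Rebuild the URL without its fragment to get the page address alone.
    page_url = parts._replace(fragment="").geturl()
    return page_url, parts.fragment

DEMO = ("http://www.starkeffect.com/gedcomviewer/doc/"
        "import_demo/d0003/g0000033.html#I828")
page, anchor = split_individual_url(DEMO)
print(page)
print(anchor)
```

When you copy a URL from the browser, include the fragment if one is shown, so that the full address is pasted into the dialog.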
If GEDCOM Viewer complains that the current GEDCOM is
read-only, then use the New
item from the File
menu to clear the existing GEDCOM before attempting to import.
If the import was successful, the GEDCOM Viewer index
panels will be updated to include the names of individuals extracted
from the HTML document.
You can then use the viewer to browse the imported information
as if you had loaded it from a GEDCOM file.
If the import was not successful, you will see an error dialog
indicating that the format of the document was not understood.
If this happens for a page apparently produced by
GED2HTML,
it could mean that the format was customized by the person who
created the page. In this case, it will probably not be possible
to import that document.
If the import is unsuccessful for a
RootsWeb WorldConnect
data page, it might mean that RootsWeb has changed the format of the
page. In that case, I would like to hear about it
(please send the exact URL that failed) so that I can issue an update.
Each time you import an HTML document,
HTML Import Add-on will attempt to cause the imported
document to display in your Web browser (in case you weren't already
viewing it). If you find this behavior inconvenient, it can be
disabled by using the options editor
to set the BROWSER_TRACKS_IMPORT
option variable
to false.
Once you have successfully imported at least one HTML document,
or if you have loaded a GEDCOM that was previously created by
importing HTML documents, then when you are viewing the information
about a particular GEDCOM record and you see an
Import
button displayed in the GEDCOM Viewer
button bar, it means that there is a link to a page containing
information about that record which has not yet been imported.
In that case, if you click on the Import
button,
the import will be attempted.
If successful, the display will be updated to incorporate any
new information obtained from the imported document.
Besides using the Import
button to manually
select documents to import, it is also possible to use the
Auto-import HTML
item under the Tools
menu to automatically follow links and import HTML documents linked
from the existing GEDCOM.
For this operation to be useful, you must already have imported
some HTML documents (or be viewing a GEDCOM created by previously
importing HTML documents), and that GEDCOM must contain some links
to documents that have not yet been imported.
When you select the Auto-import HTML
item, you will
be asked to confirm the operation. Confirmation is required because
an auto-import on a GEDCOM that contains links to
remote documents can generate a large number of requests
to the remote Web server in a short period of time.
Note that HTML Import Add-on currently refuses
to auto-import remote documents when running under a demo license,
though auto-import of local files is allowed.
A progress display shows how many links have yet to be
followed. When all the links have been exhausted,
GEDCOM Viewer will refresh the display to reflect the
information imported.
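The auto-import behavior described above can be pictured as a queue of links that is worked through until it is empty. The Python sketch below is an illustration only, not the add-on's actual code: fetch_links is a hypothetical callback that imports one document and returns the links found in it, and the page names are made up.

```python
from collections import deque

def auto_import(start_links, fetch_links, already_imported=()):
    """Follow links breadth-first, importing each document at most once.

    fetch_links(url) is a hypothetical callback: it imports one document
    and returns the URLs that document links to.
    """
    seen = set(already_imported)
    pending = deque()
    for url in start_links:
        if url not in seen:
            seen.add(url)
            pending.append(url)
    order = []
    while pending:
        url = pending.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:       # skip pages already imported or queued
                seen.add(link)
                pending.append(link)
        print(f"{len(pending)} link(s) still to follow")
    return order

# Made-up link structure standing in for a set of linked pages.
PAGES = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "c.html"],
    "c.html": ["d.html"],
    "d.html": [],
}
order = auto_import(["b.html", "c.html"],
                    lambda url: PAGES.get(url, []),
                    already_imported=["a.html"])
```

Because each newly imported page can add more links to the queue, the number of requests can grow quickly, which is why the operation asks for confirmation first.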
When importing information from a collection of linked HTML documents, information about the same individual may appear on more than one page. In most cases, HTML Import Add-on is able to recognize that the entries refer to the same individual and merge the information. If the main data page for a particular individual has not yet been imported, you may see multiple references to that individual. Often the multiple references will disappear as more pages are imported, but in some cases redundant information will persist. For example, it may appear that the individual has married the same spouse twice or has multiple occurrences of the same birth information. HTML Import Add-on does the best it can to remove such information, but it is not always possible for it to do so.
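As a rough illustration of the merging idea, the Python sketch below combines records that share an ID and drops facts that are exact repeats. The add-on's real merging logic is not described in this document; the function name, the record layout, and the sample IDs and dates are all made up for the example.

```python
def merge_records(records):
    """Merge (record_id, facts) pairs that share an ID.

    Exact duplicate facts are kept only once; facts that differ in any
    way are all retained, which is how redundant entries can persist.
    """
    merged = {}
    for rec_id, facts in records:
        bucket = merged.setdefault(rec_id, [])
        for fact in facts:
            if fact not in bucket:
                bucket.append(fact)
    return merged

# Two hypothetical pages mentioning the same individual.
pages = [
    ("X1", [("BIRT", "1 JAN 1900")]),
    ("X1", [("BIRT", "1 JAN 1900"), ("DEAT", "2 FEB 1970")]),
    ("X2", [("BIRT", "ABT 1902")]),
]
merged = merge_records(pages)
```

In this sketch a birth recorded as "1 JAN 1900" on one page and "JAN 1900" on another would not be recognized as the same fact, which mirrors the kind of redundancy that can remain after an import.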
Once you are finished importing HTML documents, you can save the
resulting GEDCOM using the Save as
option from the
File
menu. In theory, you should be able to load
the resulting GEDCOM file into another genealogy program.
However, HTML Import Add-on
uses rather
long and strangely formatted cross-reference IDs for the GEDCOM
records. This is so each individual imported by
HTML Import Add-on
obtains a unique ID that will
be the same across multiple sessions. Some genealogy programs
will not accept these long IDs. If that is the case, then you
can use the Relabel GEDCOM
item under the
Tools
menu to relabel the GEDCOM before saving it.
Note that once you have relabeled a GEDCOM, if you later read
in that GEDCOM and attempt to import further data, it will
no longer be possible for HTML Import Add-on to correctly
identify situations in which information about the same individual
is obtained from multiple documents. So in that case you will
likely end up with redundant information in your GEDCOM.
For that reason, I suggest that if you have to use the
Relabel GEDCOM
tool, then you should also save a
copy of the original GEDCOM, before relabeling, in case you need to import further
data into it.
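To see why relabeling loses the link between sessions, here is a small Python sketch of what a relabeling step might do: every cross-reference ID is replaced by a short sequential one. This is an assumption for illustration, not the actual Relabel GEDCOM implementation; the "X" prefix, the function name, and the sample IDs are hypothetical, and the pattern ignores special cases such as @ signs inside ordinary text.

```python
import re

# A cross-reference ID is written between @ signs, e.g. @some-id@.
XREF = re.compile(r"@([^@\s]+)@")

def relabel(gedcom_lines):
    """Replace every cross-reference ID with a short sequential one."""
    mapping = {}

    def short(match):
        old = match.group(1)
        if old not in mapping:
            mapping[old] = f"X{len(mapping) + 1}"
        return f"@{mapping[old]}@"

    new_lines = [XREF.sub(short, line) for line in gedcom_lines]
    return new_lines, mapping

# Made-up records with long IDs standing in for the add-on's own.
lines = [
    "0 @long-id-for-person-a@ INDI",
    "1 FAMS @long-id-for-family-f@",
    "0 @long-id-for-family-f@ FAM",
    "1 HUSB @long-id-for-person-a@",
]
new_lines, mapping = relabel(lines)
print("\n".join(new_lines))
```

The short IDs depend only on the order in which records are encountered, so they carry nothing that would let a later import match an individual to a previously imported page.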
When importing HTML pages, you might find in some cases that
HTML Import Add-on does not manage to identify the surnames
of individuals from their data pages. In this case, the individuals'
names will be shown, but they will be displayed and sorted as though
they had an empty surname.
Generally this does not occur with pages created by
GED2HTML.
However, it will occur when importing HTML pages created by
GEDitCOM and it sometimes
occurs when importing HTML pages from
RootsWeb WorldConnect.
At the moment, there isn't a workaround for
RootsWeb WorldConnect.
However, for GEDitCOM pages,
you can work around this problem by first importing the
index.html
index document at the root of the
file tree (i.e., from within the same folder that
contains the Records
subfolder).
HTML Import Add-on understands the format of this
index file, and it contains sufficient information to identify
the surname of each individual.
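In GEDCOM, the surname portion of a name is marked by enclosing it in slashes, as in "John /Smith/". The Python sketch below shows how a name without those slashes ends up with an empty surname and so sorts ahead of the others. The function name and sample names are made up, and the sort shown is only an assumption about how a viewer might order names.

```python
import re

def split_gedcom_name(value):
    """Return (given, surname) from a GEDCOM NAME value."""
    match = re.search(r"/([^/]*)/", value)
    if match is None:
        # No slashes: the surname could not be identified.
        return value.strip(), ""
    given = value[:match.start()] + value[match.end():]
    return " ".join(given.split()), match.group(1).strip()

names = ["Mary /Jones/", "John Smith", "Ann /Adams/"]
# Sort by (surname, given); an empty surname sorts first.
by_surname = sorted(names, key=lambda n: split_gedcom_name(n)[::-1])
print(by_surname)
```

Here "John Smith" is listed first because no surname was marked, even though a reader can see that the surname is Smith.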