HTML Import Add-on

HTML Import Add-on reads HTML documents in various formats, extracts the genealogical information they contain, and converts that information into GEDCOM form so that it can be displayed in Gedcom Viewer or, ultimately, saved in a file. HTML Import Add-on can currently process the following kinds of HTML documents:

Uncustomized output produced by various versions of GED2HTML.
Individual data pages generated by the RootsWeb WorldConnect server.
Data pages generated by GEDitCOM using the standard Web page style.
Data pages generated by Legacy 4.0.

Basic Usage

To use HTML Import Add-on, first make sure that you have loaded the add-on, as described here. If the add-on has been loaded successfully, its presence will be indicated in the About screen, and there will be an Import HTML item under the Tools menu. Selecting this menu item opens a dialog asking for the URL of the document that you wish to load. The easiest way to enter this URL is to display the document in a normal Web browser, copy its URL, and paste it into the dialog box. Make sure that you are looking at a page that shows information about a particular individual, because HTML Import Add-on currently does not, for the most part, understand other kinds of pages, such as index or search pages. For a demonstration, you can try the following page generated by GED2HTML:

http://www.starkeffect.com/gedcomviewer/doc/import_demo/d0003/g0000033.html#I828

If Gedcom Viewer complains that the current GEDCOM is read-only, use the New item from the File menu to clear the existing GEDCOM before attempting to import. If the import was successful, the Gedcom Viewer index panels will be updated to include the names of individuals extracted from the HTML document. You can then use the viewer to browse the imported information as if you had loaded it from a GEDCOM file. If the import was not successful, you will see an error dialog indicating that the format of the document was not understood. If this happens for a page apparently produced by GED2HTML, it could mean that the format was customized by the person who created the page. In this case, it will probably not be possible to import that document. If the import is unsuccessful for a RootsWeb WorldConnect data page, it might mean that RootsWeb has changed the format of the page. In that case, I would like to hear about it (please send the exact URL that failed) so that I can issue an update.

Each time you import an HTML document, HTML Import Add-on will attempt to display the imported document in your Web browser (in case you weren't already viewing it). If you find this behavior inconvenient, you can disable it by using the options editor to set the BROWSER_TRACKS_IMPORT option variable to false.

Once you have successfully imported at least one HTML document, or if you have loaded a GEDCOM that was previously created by importing HTML documents, you may see an Import button displayed in the Gedcom Viewer button bar while viewing the information about a particular GEDCOM record. This means that there is a link to a page containing information about that record which has not yet been imported. If you click the Import button, the import will be attempted. If it is successful, the display will be updated to incorporate any new information obtained from the imported document.

Besides using the Import button to manually select documents to import, it is also possible to use the Auto-import HTML item under the Tools menu to automatically follow links and import HTML documents linked from the existing GEDCOM. For this operation to be useful, you must already have imported some HTML documents (or be viewing a GEDCOM created by previously importing HTML documents), and that GEDCOM must contain some links to documents that have not yet been imported. When you select the Auto-import HTML item, you will be asked to confirm the operation. This is because once you initiate an auto-import on a GEDCOM that contains links to remote documents, a large number of requests to the remote Web server may be generated in a short period of time. Note that HTML Import Add-on currently refuses to auto-import remote documents when running under a demo license, though auto-import of local files is allowed. A progress display shows how many links have yet to be followed. When all the links have been exhausted, Gedcom Viewer will refresh the display to reflect the information imported.
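
The link-following behavior described above can be pictured as a simple worklist loop. The following Python sketch is only an illustration under stated assumptions, not the add-on's actual implementation: fetch_links is a hypothetical stand-in for importing one page and returning the URLs it links to, and the pages dictionary is made-up sample data.

```python
from collections import deque


def auto_import(start_urls, fetch_links, already_imported=()):
    """Follow links until none remain; return URLs in the order imported.

    fetch_links(url) is a hypothetical callback that imports one page and
    returns the URLs of the pages it links to.
    """
    imported = set(already_imported)
    # Drop duplicates and anything imported earlier, keeping the given order.
    pending = deque(dict.fromkeys(u for u in start_urls if u not in imported))
    queued = set(pending)
    order = []
    while pending:
        url = pending.popleft()
        imported.add(url)
        order.append(url)
        for link in fetch_links(url):
            # Queue each link once, and only if it has not been imported.
            if link not in imported and link not in queued:
                queued.add(link)
                pending.append(link)
        print(f"{len(pending)} links left to follow")
    return order


# Made-up link structure standing in for a small set of data pages.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "c.html"],
    "c.html": [],
}
result = auto_import(["b.html", "c.html"], lambda url: pages.get(url, []),
                     already_imported=["a.html"])
```

In this sketch each page is requested at most once, so the number of requests is bounded by the number of distinct linked pages. For a large linked collection that number can still be considerable, which is the reason for the confirmation step.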

When importing information from a collection of linked HTML documents, information about the same individual may appear on more than one page. HTML Import Add-on is in most cases able to recognize that these pages refer to the same individual and merge the information. Sometimes, if the main data page for a particular individual has not yet been imported, you may see multiple references to that individual. Often the multiple references will disappear as more pages are imported, but in some cases redundant information will persist. For example, it may appear that the individual has married the same spouse twice or has multiple occurrences of the same birth information. HTML Import Add-on does the best it can to remove such information, but it is not always possible for it to do so.
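
One way to picture the merging described above is a table of records keyed by individual ID, where facts from each newly imported page are added only if an identical fact is not already present. The Python sketch below is an illustration under assumptions, not the add-on's real algorithm: the record layout, the merge_page helper, and the sample IDs and dates are all hypothetical.

```python
def merge_page(records, page_facts):
    """Merge facts from one imported page into records, skipping exact duplicates.

    records maps an individual ID to a list of (tag, value) facts.
    page_facts has the same shape and comes from a single imported page.
    """
    for indi_id, facts in page_facts.items():
        known = records.setdefault(indi_id, [])
        for fact in facts:
            if fact not in known:
                known.append(fact)
    return records


records = {}
# First page: the individual's own data page (made-up sample data).
merge_page(records, {"I828": [("BIRT", "1 JAN 1850"), ("MARR", "I829")]})
# Second page: repeats the marriage with identical wording,
# and repeats the birth with different wording.
merge_page(records, {"I828": [("MARR", "I829"), ("BIRT", "1850")]})
```

After both calls, the marriage appears once because the two pages stated it identically, while the birth appears twice because the wording differed. This is the kind of redundant information that can persist after importing.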

Once you are finished importing HTML documents, you can save the resulting GEDCOM using the Save as item from the File menu. In theory, you should be able to load the resulting GEDCOM file into another genealogy program. However, HTML Import Add-on uses rather long and strangely formatted cross-reference IDs for the GEDCOM records. This is so that each individual imported by HTML Import Add-on obtains a unique ID that will be the same across multiple sessions. Some genealogy programs will not accept these long IDs. If that is the case, you can use the Relabel GEDCOM item under the Tools menu to relabel the GEDCOM before saving it. Note that once you have relabeled a GEDCOM, if you later read in that GEDCOM and attempt to import further data, HTML Import Add-on will no longer be able to correctly identify situations in which information about the same individual is obtained from multiple documents, so you will likely end up with redundant information in your GEDCOM. For that reason, I suggest that if you have to use the Relabel GEDCOM tool, you also save a copy of the original GEDCOM, before relabeling, in case you need to import further data into it.
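
The trade-off between session-stable IDs and relabeling can be sketched as follows. The actual ID format used by HTML Import Add-on is not described here, so this Python example assumes a hypothetical scheme in which the ID is derived from a hash of the page URL. The stable_id and relabel functions are illustrative names only.

```python
import hashlib


def stable_id(url):
    """Derive the same ID from the same URL in every session (hypothetical scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "@I" + digest + "@"


def relabel(ids):
    """Map long IDs to short sequential ones such as @I1@, @I2@."""
    return {old: f"@I{n}@" for n, old in enumerate(ids, start=1)}


url = "http://www.starkeffect.com/gedcomviewer/doc/import_demo/d0003/g0000033.html#I828"
long_id = stable_id(url)       # identical every time this URL is imported
short_ids = relabel([long_id])  # short, but no longer derivable from the URL
```

Under this assumed scheme, importing the same URL in a later session yields the same long ID again, so the record can be matched. Once the GEDCOM has been relabeled, the short ID stored in the file no longer corresponds to anything that can be recomputed from the URL, which is why further imports into a relabeled GEDCOM tend to produce duplicates.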

Notes

When importing HTML pages, you might find in some cases that HTML Import Add-on does not manage to identify the surnames of individuals from their data pages. In this case, the individuals' names will be shown, but they will be displayed and sorted as though they had an empty surname. Generally this does not occur with pages created by GED2HTML. However, it will occur when importing HTML pages created by GEDitCOM, and it sometimes occurs when importing HTML pages from RootsWeb WorldConnect. At the moment, there isn't a workaround for RootsWeb WorldConnect. For GEDitCOM pages, however, you can work around this problem by first importing the index.html index document at the root of the file tree (i.e. from within the same folder that contains the Records subfolder). HTML Import Add-on understands the format of this index file, and it contains sufficient information to identify the surname of each individual.