See also Scope & Limitations of WTW Markup (Sample analysis categories rather than critical editions, pitfalls & strengths of analytical markup, etc.)
The WTW Project currently enriches its texts using SGML (Standard Generalized Markup Language), according to the TEI-Lite version of the guidelines prepared by the TEI (Text Encoding Initiative). And, as noted on our project home page, we also attempt to follow the Level 4 (Basic Content Analysis) recommendations endorsed by the Digital Library Federation. But for ease of encoding we subdivide our Basic Content Analysis into (1) Structural and (2) Basic Content encoding. We also perform (3) extensive analytical encoding:
NB: See below for a summary of our Attribute Values
STRUCTURE
When considered appropriate, WTW makes sparing use of the following structural elements (besides <text> and <body>):
<front>: used for prefaces, tables of contents;
<back>: used for afterwords, appendices, endnotes, apparatus (when included);
<titlepage>: including verso if present, divided by <pb N="verso">;
<list>: used with <item> to reflect tables of contents, errata, subscription lists, "other titles by the same author," cast lists, etc.;
<div1, etc.>: used with N= attribute to record sequence;
<head>;
<argument>;
<epigraph>;
<opener>;
<dateline>;
<salute>;
<signed>;
<closer>;
<trailer>;
<q>: used only for quotations that are set off typographically (i.e., not used for inline quotations, or for direct speech in prose fiction);
<q>: used for letters quoted in text as follows: q/text/body/div1 type=letter, including "opener," "dateline," "salute," "signed," "closer" as appropriate;
<p>;
<lg>: used within "div" for all verse of more than one line--even without stanzas--to assist retrieval;
<l>: includes use of the REND attribute to record indentation;
<milestone>: used with UNIT="typography" N="****" to represent divisions within poems so marked;
<pb>: the page break is placed at the beginning of the page;
<figure>: also used to encode frontispieces, within a separate div/p.
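As a minimal sketch of how several of these structural elements might combine, the fragment below shows a letter quoted in running text (following the q/text/body/div1 type=letter pattern) and a short poem with an indented line and a typographic division. The wording of the letter and poem, the N value, and the REND value "indent" are invented for illustration and are not taken from a WTW text.

```sgml
<!-- Hypothetical sample: all text content and attribute values are illustrative only. -->
<pb n="12">
<p>She enclosed the following letter:</p>
<q>
 <text>
  <body>
   <div1 type="letter">
    <opener>
     <dateline>Cairo, 3 March</dateline>
     <salute>My dear Sister,</salute>
    </opener>
    <p>We arrived safely last evening.</p>
    <closer>
     <salute>Ever yours,</salute>
     <signed>A. B.</signed>
    </closer>
   </div1>
  </body>
 </text>
</q>

<div1 type="poem" n="2">
 <head>Lines Written at Sea</head>
 <lg>
  <l>First line of the verse,</l>
  <l rend="indent">second line, indented in the source.</l>
 </lg>
 <!-- A row of asterisks in the source marks a division within the poem. -->
 <milestone unit="typography" n="****">
 <lg>
  <l>First line after the division,</l>
  <l rend="indent">and its indented companion.</l>
 </lg>
</div1>
```

Note that <pb> and <milestone> are empty elements, so in SGML they take no end tag, and the <pb> appears before the content of the page it opens.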
NB:
*Regarding <note>: the WTW project does not routinely reproduce notes (although this policy is being re-examined).
BASIC CONTENT
When considered appropriate, WTW makes sparing use of the following basic content elements:
<foreign lang=xxx>: using 3-character language abbreviations. If appropriate, this tag also includes REND=ital;
<title>;
<emph>:
(a) used for words that are emphasized linguistically or rhetorically, rather than only typographically;
(b) easiest to spot in dialog;
<hi>:
(a) used for ambiguous and/or typographically emphasized text that is not "foreign," "title," "emph";
(b) often used in texts with multiple instances of italics;
(c) used--instead of <q>--for inline quotations, but only when italicized;
<sic>: used to indicate typographic errors, with the CORR attribute to note corrections;
<reg>: used in preference to <orig>, <corr>, etc., to regularize unusual forms of names in text, together with the ORIG attribute to indicate form in source text;
<add>;
<delete>;
<unclear>;
<sp>: used to encode speeches, with speakers identified within <speaker> elements.
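The fragment below is a rough sketch of how a few of these basic content elements might look in practice. The sentences are invented, the language code "fra" is only an assumed example of a 3-character abbreviation, and the misprint "jonrney" is made up to illustrate <sic>.

```sgml
<!-- Hypothetical sample: sentences, language code, and misprint are illustrative only. -->
<p>She called it a <foreign lang="fra" rend="ital">diligence</foreign>, and I
<emph>did</emph> find the account in <title>The Times</title> amusing.</p>

<!-- reg holds the regularized name; ORIG records the form found in the source text. -->
<!-- sic holds the misprint as printed; CORR records the correction. -->
<p>We reached <reg orig="Kilemanjaro">Kilimanjaro</reg> after a
<sic corr="journey">jonrney</sic> of nine days.</p>

<sp>
 <speaker>Guide</speaker>
 <p>The river is too high to cross.</p>
</sp>
```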
NB:
*Regarding <name>: WTW has discontinued <name> tagging. Nor do we currently encode dates or times.
ADVANCED CONTENT (ANALYTICAL CATEGORIES)
WTW uses the SGML elements "InterpGrp" and "Interp" to allow for a limited amount of content analysis according to the following four categories (three of the categories have several sub-categories, as indicated in Section II below):
Ethnicity
Gender Marking
Transportation
Women's Occupations
IMPORTANT NOTE: Level of specificity
To ensure an adequate amount of granularity, texts are encoded at the paragraph level, but only with regard to the first significant word (later references to the same sub-category in the same paragraph are not encoded). For example, only the first reference to a "carriage" in a paragraph would be tagged (as "trans-road"). However, a reference in the same paragraph to a different form of transportation--as in "boat"--would also be tagged (as "trans-water").
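To make the first-reference rule concrete, here is a minimal sketch of a paragraph tagged under it. It assumes that the analytical values are attached inline with a <seg> element whose ANA attribute points to the relevant sub-category; that inline mechanism and the sample sentence are assumptions for illustration, not a record of WTW practice.

```sgml
<!-- Hypothetical sample: the seg/ANA mechanism is an assumption. -->
<!-- Only the first "carriage" is tagged; the second mention in the same paragraph is left bare. -->
<!-- "boat" belongs to a different sub-category, so it is tagged as well. -->
<p>We left the town by <seg ana="trans-road">carriage</seg> at dawn; the
carriage was old and slow, and at the river we exchanged it for a
<seg ana="trans-water">boat</seg>.</p>
```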
To assist the encoder and user, this section on analytical encoding is divided into three parts:
I. Outline of the SGML Analytical Structure
The TEI Guidelines suggest that analytical encoding be declared in a special BACK portion of the electronic text. Analytical values using the <interp> element may be arranged in groups using the <interpGrp> element. Here is a template showing how one of our analytical groups is recorded in SGML:
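The following is a minimal sketch of such a group rather than the project's exact template. It assumes that the transportation sub-category names listed later in this document serve as ID values on <interp>, and that the group sits in its own division of <back>; the TYPE and VALUE strings are illustrative.

```sgml
<!-- Hypothetical sketch: ID, TYPE, and VALUE strings are assumptions. -->
<back>
 <div1 type="analysis">
  <interpGrp type="transportation">
   <!-- interp is an empty element, so in SGML it takes no end tag. -->
   <interp id="trans-air" value="air">
   <interp id="trans-rail" value="rail">
   <interp id="trans-road" value="road">
   <interp id="trans-water" value="water">
   <interp id="trans-other" value="other">
  </interpGrp>
 </div1>
</back>
```

Grouping the values this way lets an inline element refer to a single sub-category by its ID while still allowing retrieval across the whole category through the enclosing group.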
ETHNICITY <eth>
The Ethnicity category has no sub-categories, but it is our most complex category.
There are several points to remember in assigning and using this tag:
Marked versus Unmarked:
The ethnicity tag is normally used only for marked references to members of ethnic groups, or discussions of ethnic concepts and distinctions--and NOT in cases where a passing mention occurs.
Keyword versus Content:
The marked nature of the reference may be indicated in different ways:
Sometimes the ethnicity reference is clearly indicated. This may involve obvious keywords such as: native, Semitic, civilized, barbaric, English, German, Hindu, coolie, etc.
(However, not all occurrences of these keywords are tagged. A reference to a "French" diplomat will not be tagged if it appears from the context that the reference to French nationality is not important).
Sometimes the ethnicity reference is somewhat less clearly indicated. That is, although a marked reference to, or discussion of, ethnic distinctions is present, obvious keywords are missing--and detailed content analysis is required in order to identify these passages.
(This may occur when the writer is making conceptual references to ethnic distinctions, or expressing a point of view about ethnic differences, or defining ethnic identities, or making contrastive comments).
Granularity or Level of Specificity:
Ethnicity tagging occurs on the paragraph level. But only the first significant word or phrase is tagged. We use the following paragraph from M. F. Sheldon's "Customs among the Natives of East Africa" (1892) to illustrate this point:
"The higher up towards Kilemanjaro I went, the more intelligent I found the natives, and not only were they more intelligent, but they had a finer physique; their lineaments lost their negroid characteristics, they became more or less Egyptic; their heads were long, with features more or less regular; their colour--never black--but like the Malay type, a deep sepia; however, the customs were approximately the same with all the tribes. They are all polygamists, and this from necessity more than from the licentiousness which taints the East."
In this paragraph, WTW has attached the ethnicity tag to the word "natives," but not the words "negroid," "Egyptic," or "Malay"; the latter are references to ethnicity, but occur in the same paragraph.
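As a rough sketch, the tagged paragraph might look like the fragment below. It assumes the ethnicity value is attached with a <seg> element whose ANA attribute points to "eth"; that mechanism is an assumption, and the quotation is abridged with ellipses.

```sgml
<!-- Hypothetical sketch: the seg/ANA mechanism is an assumption; quotation abridged. -->
<!-- Only "natives" carries the tag; "negroid", "Egyptic", and "Malay" are left untagged. -->
<p>The higher up towards Kilemanjaro I went, the more intelligent I found the
<seg ana="eth">natives</seg>, and not only were they more intelligent, but they
had a finer physique; their lineaments lost their negroid characteristics, they
became more or less Egyptic; ... but like the Malay type, a deep sepia; ...</p>
```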
GENDER MARKING
(Used only for overt or marked references to gender)
Male <gndr-male>
Used only when there is an obvious instance of male dominance or privilege, or the lack thereof.
Female <gndr-fem>
Used only when a woman is explicitly transgressing gender lines, or when there is an explicit comparison between men and women (i.e. NOT in every case when a woman is mentioned.)
Other <gndr-other>
Used in cases where a different gender system is encountered or when a person's gender is ambiguous.
TRANSPORTATION
Air <trans-air>
Includes: aircraft, balloon
Rail <trans-rail>
Includes: streetcar, train
Road <trans-road>
Includes: carriage, horseback
Water <trans-water>
Includes: boat, canoe, ship
Other <trans-other>
WOMEN'S OCCUPATIONS
Used to identify local populations (i.e. not the travellers themselves)
Agriculture <occ-agri>
Includes (for subsistence and for sale): crop production, dairy work;
Arts <occ-arts>
Includes: journalism, music, painting, poetry, writing (artistic and professional);
Business <occ-busi>
Includes: importing, informal economy, large corporation, legal work, market vending, selling weaving, shopkeeping, small business;
Crafts <occ-crafts>
Includes (for subsistence and for sale): embroidery, weaving;