Part C - Social Aspects

Language

Itemize some language-related variations across the world
Introduce the need for internationalization




Aside from Universal Grammar, languages show considerable differences in detail.  Most aspects of language are local and culturally specific.  Communications across different languages and cultures rest on conventions and adopted international standards.  Comprehensive internationalization and localization involve the development of techniques that support local usage in specific cultures.

Globalization has accelerated the need for degrees of uniformity and common protocols.  Universal standards are part of these protocols.  They facilitate communication and make products available across different cultures.  Localization initiatives have addressed aspects of language such as culturally specific rendering of text and objects. 


Variations

One common misconception about languages is that every country has its own primary language.  A country may have one language or several languages and a language may be spoken in one country or several countries. 

For example,

  • Canada - English, French
  • Belgium - Dutch, French
  • Switzerland - Italian, French, and German

A common mistake is the use of flags to indicate which language to select.  This is a poor choice due to the lack of one-to-one relationships between countries and languages throughout the world. 

Moreover, regional differences exist within languages: the spelling and usage of English in the United States differ from those in Canada or in the United Kingdom.

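To accommodate these variations, we specify a locale by both language and country.  For example,

  • EN.US - American English
  • FR.CA - Canadian French

In practice, such tags are commonly written in the BCP 47 form, with a lowercase language code and an uppercase country code: en-US, fr-CA.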
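A minimal sketch of splitting a language-and-country tag into its two parts.  It accepts the dotted notation used here as well as the hyphen and underscore separators seen elsewhere; the function name parse_locale_tag is hypothetical and not part of any library.

```python
def parse_locale_tag(tag):
    """Split a tag such as 'EN.US', 'en-US' or 'en_US' into (language, country).

    Hypothetical helper for illustration only; real projects usually rely
    on an established locale library instead.
    """
    # Normalise the three separators to a single one before splitting.
    for separator in ("_", "."):
        tag = tag.replace(separator, "-")
    parts = tag.split("-")
    language = parts[0].lower()
    # The country part is optional: 'fr' alone is a valid language tag.
    country = parts[1].upper() if len(parts) > 1 and parts[1] else None
    return language, country


print(parse_locale_tag("EN.US"))   # ('en', 'US')
print(parse_locale_tag("fr-CA"))   # ('fr', 'CA')
print(parse_locale_tag("de"))      # ('de', None)
```

Keeping the two parts separate lets an application fall back from a regional variant to the base language when no regional resources exist.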

Rendering Text

Many of the problems of rendering language text can be identified by studying Japanese.  Nevertheless, this just scratches the surface.  Japanese text is still simpler than text in countries with multiple dialects and scripts.

Japanese

Japanese text is a mixture of three scripts, which we can combine into a single sentence.  This results in an extremely large number of characters to render. 

Japanese scripts:

  • Kanji - the complete written language with 50,000 characters
  • Kana - symbols representing sounds, broken into two groups
    • Hiragana - native Japanese sounds
    • Katakana - foreign words other than Chinese and Korean
  • Romaji - letters from the Roman alphabet, used for untranslated foreign words

Direction

There are three directionalities in written languages:

  • left to right - European languages, Thai
  • left to right / vertical - Chinese, Japanese, Korean
  • bidirectional - Hebrew, Arabic

Bidirectional

In bidirectional languages, we write and read text right-to-left, but embedded numbers and text in other languages left-to-right.  We read the number 123-4567 embedded within text:

  • left-to-right if it is a phone number
  • right-to-left (4567 minus 123) if it is a subtraction
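Unicode assigns every character a bidirectional class, which rendering software uses to decide the direction of each run of text.  The sketch below, which assumes only Python's standard unicodedata module, shows the classes involved and picks a base direction from the first strongly directional character.  The function names are hypothetical, and this is a simplification, not an implementation of the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def direction_runs(text):
    """Group consecutive characters that share a bidirectional class."""
    runs = []
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if runs and runs[-1][0] == bidi_class:
            runs[-1][1] += char
        else:
            runs.append([bidi_class, char])
    return [tuple(run) for run in runs]


def base_direction(text, default="ltr"):
    """Return 'rtl' or 'ltr' according to the first strong character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):   # Hebrew-like and Arabic letters
            return "rtl"
        if bidi_class == "L":           # left-to-right letters
            return "ltr"
    return default                      # digits and punctuation only


hebrew_then_digits = "\u05d0\u05d1 12"  # two Hebrew letters, a space, "12"
print(direction_runs(hebrew_then_digits))
print(base_direction(hebrew_then_digits))   # rtl
print(base_direction("Call 123-4567"))      # ltr
```

The digits carry the class EN (European number), which is why they keep their left-to-right order even inside right-to-left text.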

Translation

Accurate translation is critical to international communication.  For accurate translations, we need

  • good translators
  • good glossaries that define each word in the text so that the translators fully understand the intended meaning

Rendering Objects

When rendering objects, we pay attention to local conventions such as

  • sorting rules
  • date and time formats
  • address formats
  • telephone number formats
  • currency and numbering formats
  • paper sizes

Sorting Rules

Different countries have different sorting rules.  Complete collation tables

  • handle accents in the language
  • account for letter sequences (traditional Spanish collation treats "ch" as a single letter that follows "c", so "cho" comes after "co")
  • prefer lower case before upper case (opposite to ASCII)

ISO/IEC 14651:2007 provides guidelines on international collation.  This standard specifies a method for comparing character sequences and provides a common template.
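As an illustration of a collation rule that plain character codes cannot express, the sketch below sorts words by the traditional Spanish alphabet, in which the digraphs "ch" and "ll" count as letters of their own.  The alphabet table and the function name spanish_key are assumptions made for this example; accented vowels are not handled, and production code would use a full collation library instead.

```python
# Traditional Spanish alphabet, with the digraphs "ch" and "ll" as letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def spanish_key(word):
    """Build a sort key: a list of alphabet positions, digraphs first."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown characters sort after every known letter.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key


words = ["cho", "co", "dato", "chico", "cu"]
print(sorted(words))                    # ['chico', 'cho', 'co', 'cu', 'dato']
print(sorted(words, key=spanish_key))   # ['co', 'cu', 'chico', 'cho', 'dato']
```

The first result is ordering by character code; the second places every word beginning with "ch" after all other words beginning with "c".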

Date and Time Formats

Date and time formats differ throughout the world.  Some are big-endian (the largest unit, the year, comes first), others are little-endian (the smallest unit, the day, comes first), and still others are middle-endian (the month comes first, followed by the day and then the year).

Date and Time Formats (source: Gadren Wikipedia 2007 CC-BY-SA)

Map showing which calendar date format each country primarily uses.  The legend distinguishes:

   dd/mm/yyyy
   dd/mm/yyyy and yyyy/mm/dd
   yyyy/mm/dd
   mm/dd/yyyy
   mm/dd/yyyy and dd/mm/yyyy
   mm/dd/yyyy, dd/mm/yyyy, and yyyy/mm/dd

ISO 8601

ISO 8601 is an international standard for the exchange of date and time information.  It was first published in 1988 and specifies a big-endian format for international representations of date and time.  One of the possible formats is YYYY-MM-DDThh:mm:ss.ff, where the letter T separates the date from the time and ff is a decimal fraction of a second.  Zero padding is mandatory wherever necessary.  The hyphens separate the date elements, while the colons separate the time elements.  Both separators may be dropped to yield the basic format, but are retained in the extended format for readability.  The least significant elements may be dropped.  Other date formats are week dates (YYYY-Www-D) and ordinal dates (YYYY-DDD).

This standard requires every date to be consecutive.  The reference point for the standard is the date of the Metre Convention (1875-05-20) in the Gregorian calendar.  This calendar, also known as the Western calendar or the Christian calendar, is the internationally accepted civil calendar.  The standard is compatible with this calendar all the way back to its introduction on 1582-10-15.  However, the standard is incompatible with the Julian calendar due to the consecutive date requirement.  The Julian calendar assumed that the time between vernal equinoxes was 365.25 days, which is why it was eventually replaced by the Gregorian calendar: the actual time is about 11 minutes shorter.
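The 11-minute figure can be checked with a little arithmetic.  The sketch below assumes a mean tropical year of approximately 365.2422 days (a rounded value) and compares it with the average year length of each calendar; the Gregorian average follows from its rule of 97 leap years in every 400 years.  The constant and function names are illustrative.

```python
JULIAN_YEAR = 365.25             # one leap year in every four
GREGORIAN_YEAR = 365 + 97 / 400  # 97 leap years in every 400 -> 365.2425
TROPICAL_YEAR = 365.2422         # approximate, assumed for this sketch


def drift_minutes_per_year(calendar_year):
    """Minutes by which the calendar year exceeds the tropical year."""
    return (calendar_year - TROPICAL_YEAR) * 24 * 60


def years_per_day_of_drift(calendar_year):
    """Years needed for the calendar to drift by one whole day."""
    return 1 / (calendar_year - TROPICAL_YEAR)


print(round(drift_minutes_per_year(JULIAN_YEAR), 1))     # about 11.2 minutes
print(round(years_per_day_of_drift(JULIAN_YEAR)))        # about 128 years
print(round(drift_minutes_per_year(GREGORIAN_YEAR), 1))  # about 0.4 minutes
```

Under these assumptions the Julian calendar gains roughly one day every 128 years, whereas the Gregorian calendar takes on the order of three thousand years to gain a day.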

This standard uses a 24-hour clock.  The standard assumes local time unless specified otherwise.  The reference is Coordinated Universal Time (UTC), which is based upon International Atomic Time (TAI).  A time in UTC is marked with the suffix Z; a time in any other zone is written as the date and time followed by its offset from UTC, for example 2007-04-05T14:30:00-05:00.  You can find the offset(s) for any country here.
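Python's standard datetime module can produce and parse several of these representations.  The sketch below is one way to do so; the sample date and the five-hour offset are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# An arbitrary moment, five hours behind UTC.
offset = timezone(timedelta(hours=-5))
moment = datetime(2007, 4, 5, 14, 30, 0, tzinfo=offset)

# Extended format: separators retained, offset from UTC appended.
extended = moment.isoformat()
print(extended)        # 2007-04-05T14:30:00-05:00

# Basic format: the same information with the separators dropped.
basic = moment.strftime("%Y%m%dT%H%M%S%z")
print(basic)           # 20070405T143000-0500

# The same instant expressed in UTC.
in_utc = moment.astimezone(timezone.utc).isoformat()
print(in_utc)          # 2007-04-05T19:30:00+00:00

# Week date (year, week number, weekday) and ordinal date (year, day of year).
iso_year, iso_week, iso_weekday = moment.date().isocalendar()
week_date = f"{iso_year}-W{iso_week:02d}-{iso_weekday}"
ordinal_date = f"{moment.year}-{moment.timetuple().tm_yday:03d}"
print(week_date)       # 2007-W14-4
print(ordinal_date)    # 2007-095

# Parsing the extended format back into a datetime object.
parsed = datetime.fromisoformat(extended)
print(parsed == moment)   # True
```

Because the format is big-endian and zero-padded, sorting such strings as plain text also sorts them chronologically, provided they share the same offset.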

Atomic Time

In 1967, the International System of Units (SI) defined the second as the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom.  You can find a brief, lay-person's description of a caesium atomic clock here.  The international measure of a second is based upon 300 atomic clocks worldwide.  In 1997, the International Bureau of Weights and Measures (BIPM) declared that the definition of the SI second refers to a caesium atom at rest at a temperature of absolute zero.

Address Formats

Addresses have different formats throughout the world.

Country        Format
Argentina      Name of Addressee
               Name of Street, Number
               Postal Code, Name of City
Australia      Name of Addressee
               Subunit-Number, Name of Street
               Name of City, Postal Code
Brazil         Name of Addressee
               Name of Street, Number
               Name of City, State
               Postal Code
China          Country, Postal Code
               Province, City, District, Building Name, Number
               Name of Addressee
Hungary        Name of Addressee
               Name of City, State
               Name of Street, Number
               Postal Code
South Korea    Postal Code, City
               Ward, Neighbourhood
               Building Name, Floor, Room Number
               Name of Addressee
Singapore      Name of Addressee
               Number and Name of Street
               Name of City and Postal Code

For a more complete description of address formats, see here
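One common way to support several address formats is to store a line-by-line template per country and fill it from named fields.  The sketch below encodes three of the formats listed above; the dictionary layout, the field names, and the sample data are all hypothetical.

```python
# One template per country: each entry is one printed line of the address.
ADDRESS_TEMPLATES = {
    "Argentina": ["{name}", "{street}, {number}", "{postal_code}, {city}"],
    "Brazil": ["{name}", "{street}, {number}", "{city}, {state}",
               "{postal_code}"],
    "Singapore": ["{name}", "{number} {street}", "{city} {postal_code}"],
}


def format_address(country, **fields):
    """Fill the country's template; a missing field raises KeyError."""
    template = ADDRESS_TEMPLATES[country]
    return "\n".join(line.format(**fields) for line in template)


# Made-up sample data, used only to show the different line orders.
sample = dict(name="Ana Perez", street="Calle Falsa", number="123",
              postal_code="C1000", city="Buenos Aires", state="BA")

print(format_address("Argentina", **sample))
print()
print(format_address("Brazil", **sample))
```

Keeping the templates in data rather than in code means that supporting another country requires a new table entry, not a new function.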

Telephone Number Formats

The structure of telephone numbers differs throughout the world.  There are different numbers of digits, different separators, and different groupings.

Country      Example
Australia    649-800-445-768
Austria      1234 56 78 90
Belgium      12-345- 67 89
Denmark      12 34 56 78
Germany      (123) 4 56 78 90
Italy        123-456 78 90

For a more complete description of telephone number formats, see here
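The groupings above can be captured as patterns in which each "#" stands for one digit.  The sketch below fills such a pattern from a string of digits; the pattern notation, the table, and the function name format_phone are assumptions made for this example, and the patterns simply mirror the examples listed above.

```python
# "#" marks a digit position; every other character is copied as-is.
PHONE_PATTERNS = {
    "Denmark": "## ## ## ##",
    "Germany": "(###) # ## ## ##",
    "Italy": "###-### ## ##",
}


def format_phone(number, pattern):
    """Lay the digits of number out according to pattern."""
    digits = [char for char in number if char.isdigit()]
    if len(digits) != pattern.count("#"):
        raise ValueError("number does not match the pattern length")
    remaining = iter(digits)
    return "".join(next(remaining) if char == "#" else char
                   for char in pattern)


print(format_phone("12345678", PHONE_PATTERNS["Denmark"]))    # 12 34 56 78
print(format_phone("1234567890", PHONE_PATTERNS["Germany"]))  # (123) 4 56 78 90
print(format_phone("1234567890", PHONE_PATTERNS["Italy"]))    # 123-456 78 90
```

Storing the number as bare digits and applying the pattern only when displaying it keeps the stored data independent of any one country's convention.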

Currency and Numbering Formats

Currency and numbering formats differ throughout the world. 

Currency Symbols

List of circulating currencies

Currency Symbols (source: user:16@r Wikipedia 2006 CC-BY-SA)

Location of + and -

Some countries write the minus sign as a prefix (-5); others write it as a suffix (5-).

Digit Group Separators

The digit group separator depends upon the decimal separator.  If the decimal separator is a dot, the digit group separator is a comma or a space.  If the decimal separator is a comma or Momayyez, the digit group separator is a dot or a space.  A thin space is recommended by the ISO 31-0 standard and the International Bureau of Weights and Measures to facilitate reading.

Decimal Separators (source: Gadren Wikipedia 2007 CC-BY-SA)

Map of decimal separators by country.  The legend distinguishes:

   dot
   comma
   Momayyez (forward slash similar to comma)
   unknown
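The separator and sign conventions can be combined in one small formatting routine.  The sketch below starts from Python's built-in grouping (comma groups, dot decimal) and swaps in the requested separators; the function name format_number and its parameters are hypothetical.

```python
def format_number(value, decimal_sep=".", group_sep=",", places=2,
                  trailing_minus=False):
    """Format value with the given separators and minus-sign position."""
    # Built-in formatting always yields comma groups and a dot decimal.
    text = f"{abs(value):,.{places}f}"
    # translate() replaces both characters in one pass, so swapping
    # dot and comma cannot clobber itself.
    text = text.translate(str.maketrans({",": group_sep, ".": decimal_sep}))
    if value < 0:
        text = text + "-" if trailing_minus else "-" + text
    return text


print(format_number(1234567.891))                       # 1,234,567.89
print(format_number(1234567.891, ",", "."))             # 1.234.567,89
print(format_number(1234567.891, ",", "\u2009"))        # thin-space groups
print(format_number(-1234.5, trailing_minus=True))      # 1,234.50-
```

The third call uses U+2009, the thin space mentioned above, as the group separator.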

Billion

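Different countries have different definitions of one billion.  The meaning depends upon whether the long or short scale is being used: on the short scale one billion is a thousand million (10^9), while on the long scale it is a million million (10^12).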
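The two scales can be written out as small lookup tables, which makes the factor between the two meanings of "billion" explicit.  The table names below are illustrative.

```python
# Values of the same number names on each scale.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}
LONG_SCALE = {"million": 10**6, "milliard": 10**9, "billion": 10**12,
              "trillion": 10**18}

# The same word differs by a factor of one thousand between the scales.
factor = LONG_SCALE["billion"] // SHORT_SCALE["billion"]
print(factor)   # 1000

# A short-scale billion corresponds to a long-scale milliard.
print(SHORT_SCALE["billion"] == LONG_SCALE["milliard"])   # True
```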

Paper Sizes

The ISO 216 standard describes the metric paper sizes used throughout most of the world.  The relation to the sizes used in Canada and the United States is shown in the table below.

United States / Canada    Inches     ISO 216    Metric Units    Inches
A (letter)                8.5x11     A4         210x297mm       8.27x11.69in
Legal                     8.5x14
B (ledger)                11x17      A3         297x420mm       11.69x16.54in
Business Card             2x3.5      A8         52x74mm         2.05x2.91in
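The ISO 216 A series follows a simple rule: A0 measures 841x1189mm (an area of one square metre with sides in the ratio 1 to the square root of 2), and each following size is obtained by halving the longer side and rounding down to a whole millimetre.  The sketch below applies that rule; the function names are hypothetical.  Under this rule A8 comes out as 52x74mm.

```python
def a_series_size(n):
    """Return (width, height) in millimetres of ISO 216 size A<n>."""
    width, height = 841, 1189          # A0
    for _ in range(n):
        # Halve the longer side (rounding down); the old width
        # becomes the new height.
        width, height = height // 2, width
    return width, height


def mm_to_inches(millimetres):
    """Convert millimetres to inches, rounded to two decimals."""
    return round(millimetres / 25.4, 2)


for n in (3, 4, 8):
    width, height = a_series_size(n)
    print(f"A{n}: {width}x{height}mm = "
          f"{mm_to_inches(width)}x{mm_to_inches(height)}in")
# A3: 297x420mm = 11.69x16.54in
# A4: 210x297mm = 8.27x11.69in
# A8: 52x74mm = 2.05x2.91in
```

Because every size has the same proportions, a page can be scaled from one A size to another without cropping or changing its layout.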

Internationalization and Localization

The numeronym for internationalization is I18n: the 18 letters between the initial i and the final n are replaced by their count.  The numeronym for localization is L10n.  I18n focuses on the adaptation of products for use throughout the globe.

This involves support for

  • multiple languages
  • multiple character sets
  • differing formats for
    • numbers
    • dates
    • currency
  • printing on different paper sizes

Internationalization is part of the process of ongoing globalization.  This process involves worldwide economic, political, technological, and social integration and entails making the necessary technical, managerial, personnel, marketing and other enterprise decisions to support localization. 

Localization refers to the addition of special features to allow products to be used in specific locales.  It provides local:

  • language support
  • currency support
  • cultural support
  • symbols
  • ordering
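In code, the division of labour often looks like this: internationalization moves every user-visible string out of the program into a catalog keyed by locale, and localization supplies the catalog entries for each locale.  The sketch below is a minimal illustration with a hypothetical in-memory catalog and hypothetical function names; real projects typically use an established mechanism for message catalogs.

```python
# Hypothetical message catalog: locale -> message key -> text.
MESSAGES = {
    "en": {"greeting": "Hello", "colour_label": "Color"},
    "en-CA": {"colour_label": "Colour"},   # only the regional differences
    "fr": {"greeting": "Bonjour", "colour_label": "Couleur"},
}


def fallback_chain(locale, default="en"):
    """Locales to try in order: exact tag, base language, default."""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-")[0])
    if default not in chain:
        chain.append(default)
    return chain


def translate(key, locale):
    """Look the key up along the fallback chain; echo the key if absent."""
    for candidate in fallback_chain(locale):
        text = MESSAGES.get(candidate, {}).get(key)
        if text is not None:
            return text
    return key


print(translate("colour_label", "en-CA"))   # Colour
print(translate("greeting", "en-CA"))       # Hello
print(translate("greeting", "fr-CA"))       # Bonjour
print(translate("greeting", "ja-JP"))       # Hello
```

With this arrangement, adding a locale means adding catalog entries only; the program itself does not change.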

Why I18n and L10n?

The need for I18n and L10n is acute.  Only 8-10% of the world's population uses the English language as its primary language.  Even in the United States, large parts of the population use other languages:

  • Miami - 78%
  • Los Angeles - 45%
  • San Francisco - 42%
  • New York City - 25% of subway riders speak no English whatsoever

US-based Palm Computing has a 68% share of the Latin American market.

Personal computer suppliers are distributed throughout the world

  • US - 38.8%
  • Europe - 25%
  • Asia - 12%

I18n and L10n provide the means for tapping global markets.  I18n products are no longer designed for the English-speaking market alone, but are first and foremost designed for the international market.


Exercises

  • A 16-minute video on internationalization and localization in Drupal 6