Plain Text

Plain text and rich text

According to The Unicode Standard:

  • "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes.

  • In contrast, styled text, also known as rich text, is any text representation containing plain text plus added information such as a language identifier, font size, color, hypertext links, and so on.

SGML, RTF, HTML, XML, and TeX are examples of rich text fully represented as plain text streams, interspersing plain text data with sequences of characters that represent the additional data structures." [1]

According to other definitions, however, files that contain markup or other meta-data are generally considered plain text, so long as the markup is also in directly human-readable form (as in HTML, XML, and so on). Thus, representations such as SGML, RTF, HTML, XML, wiki markup, and TeX, as well as nearly all programming language source code files, are considered plain text. The particular content is irrelevant to whether a file is plain text. For example, an SVG file can express drawings or even bitmapped graphics, but is still plain text.

The use of plain text rather than binary files enables files to survive much better "in the wild", in part by making them largely immune to computer architecture incompatibilities. For example, all the problems of endianness can be avoided (with encodings such as UCS-2 rather than UTF-8, endianness matters, but uniformly for every character, rather than for potentially unknown subsets of it).
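The point about endianness can be made concrete with a minimal sketch. Python has no codec named UCS-2, so this example uses UTF-16 as a stand-in (an assumption of this sketch; the two coincide for characters in the Basic Multilingual Plane). It shows that the UTF-8 bytes for a character do not depend on byte order, while the 16-bit encoding has two possible byte layouts.

```python
text = "A"

# UTF-8 encodes this character as a single byte, so byte order cannot matter.
utf8_bytes = text.encode("utf-8")

# A 16-bit encoding stores each code unit in two bytes, and the order of
# those two bytes depends on the chosen endianness.
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")

print(utf8_bytes)     # b'A'
print(little_endian)  # b'A\x00'
print(big_endian)     # b'\x00A'
```

The byte order is swapped in the same way for every character, which is why a reader only needs to learn it once per file.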

The purpose of using plain text today is primarily independence from programs that require their own special encoding, formatting, or file format. Plain text files can be opened, read, and edited with ubiquitous text editors and utilities.

A command-line interface allows people to give commands in plain text and get a response, also typically in plain text.

Many other computer programs are also capable of processing or creating plain text, such as countless programs in DOS, Windows, classic Mac OS, and Unix and its kin; as well as web browsers (a few browsers such as Lynx and the Line Mode Browser produce only plain text for display) and other e-text readers.

Plain text files are almost universal in programming; a source code file containing instructions in a programming language is almost always a plain text file. Plain text is also commonly used for configuration files, which are read for saved settings at the startup of a program.

Plain text is used for much e-mail.

A comment, a ".txt" file, or a TXT record generally contains only plain text (without formatting) intended for humans to read.

The best format for storing knowledge persistently is plain text, rather than some binary format.[2]

Character encodings

Main article: Character encoding

Before the early 1960s, computers were mainly used for number-crunching rather than for text, and memory was extremely expensive. Computers often allocated only 6 bits for each character, permitting only 64 characters. Assigning codes for A-Z, a-z, and 0-9 would leave only 2 codes for everything else: nowhere near enough. Most computers opted not to support lower-case letters. Thus, early text projects such as Roberto Busa's Index Thomisticus, the Brown Corpus, and others had to resort to conventions such as keying an asterisk before letters actually intended to be upper-case.

Fred Brooks of IBM argued strongly for going to 8-bit bytes, because someday people might want to process text, and won. Although IBM used EBCDIC, most text from then on came to be encoded in ASCII, using values from 0 to 31 (and 127) for non-printing control characters, and values from 32 to 126 for graphic characters such as letters, digits, and punctuation. Most machines stored characters in 8 bits rather than 7, ignoring the remaining bit or using it as a checksum.
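As an illustration of how the spare eighth bit could carry a check value, here is a minimal sketch that assumes the simplest scheme, an even parity bit. The function name `with_even_parity` is hypothetical, and the text above does not say which checking scheme any particular machine used.

```python
def with_even_parity(code: int) -> int:
    """Return a 7-bit ASCII code with bit 7 set so the count of 1 bits is even."""
    if not 0 <= code < 128:
        raise ValueError("ASCII codes fit in 7 bits (0 to 127)")
    ones = bin(code).count("1")
    parity_bit = ones % 2  # 1 only when the 7-bit code has an odd number of 1s
    return code | (parity_bit << 7)


# 'A' is 0b1000001 (two 1 bits), so the parity bit stays 0.
print(hex(with_even_parity(ord("A"))))  # 0x41
# 'C' is 0b1000011 (three 1 bits), so the parity bit is set.
print(hex(with_even_parity(ord("C"))))  # 0xc3
```

With such a scheme, the eighth bit is no longer free to select extra characters.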

The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign ("$") was not as useful in England, and the accented characters used in Spanish, French, German, Portuguese, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed, often reassigning control characters, or using values in the range from 128 to 255. Using values of 128 and above conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out.

These additional characters were encoded differently in different countries, making texts impossible to decode without figuring out the originator's rules. For instance, a browser might display ¬A rather than À if it tried to interpret one character set as another. The International Organization for Standardization (ISO) eventually developed several code pages under ISO 8859, to accommodate various languages. The first of these (ISO 8859-1) is also known as "Latin-1", and covers the needs of most (not all) European languages that use Latin-based characters (there was not quite enough room to cover them all). ISO 2022 then provided conventions for "switching" between different character sets in mid-file. Many other organisations developed variations on these, and for many years Windows and Macintosh computers used incompatible variations.
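The effect of interpreting one character set as another can be reproduced in a few lines. This is a minimal sketch using a character and a pair of encodings chosen for illustration (UTF-8 bytes misread as Latin-1), not the specific case mentioned above.

```python
original = "é"

# UTF-8 stores this character as two bytes: 0xC3 0xA9.
raw = original.encode("utf-8")

# A reader that wrongly assumes Latin-1 turns each byte into its own character.
misread = raw.decode("latin-1")
print(misread)  # Ã©

# Knowing the originator's rules, the damage can be undone.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)  # é
```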

The text-encoding situation became more and more complex, leading to efforts by ISO and by the Unicode Consortium to develop a single, unified character encoding that could cover all currently known languages. After some conflict, these efforts were unified. Unicode currently allows for 1,114,112 code values, and assigns codes covering nearly all modern text writing systems, as well as many historical ones, and many non-linguistic characters such as printer's dingbats, mathematical symbols, etc.
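The figure of 1,114,112 code values can be checked with a short calculation. The sketch below relies on a fact not stated in the text above: the Unicode code space is organised as 17 planes of 65,536 code points each, running from U+0000 to U+10FFFF.

```python
import sys

PLANES = 17                       # planes 0 through 16
CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536 code points in each plane

total = PLANES * CODE_POINTS_PER_PLANE
print(f"{total:,}")  # 1,114,112

# Python exposes the highest code point it supports; counting from zero,
# that value plus one gives the same total.
print(hex(sys.maxunicode))  # 0x10ffff
```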

Text is considered plain text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data.

Perhaps the most common way of explicitly stating the specific encoding of plain text is with a MIME type. For email and HTTP, the default MIME type is "text/plain", meaning plain text without markup. Another MIME type often used in both email and HTTP is "text/html; charset=UTF-8", meaning plain text represented using the UTF-8 character encoding with HTML markup. Another common MIME type is "application/json", meaning plain text represented using the UTF-8 character encoding with JSON markup.

When a document is received without any explicit indication of the character encoding, some applications use charset detection to attempt to guess what encoding was used.
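Real charset detectors use statistical models, but the basic idea can be shown with a deliberately naive sketch. The function `guess_encoding` is hypothetical and makes strong assumptions: it only looks for a byte order mark, then tries strict UTF-8, and otherwise falls back to Latin-1, which accepts any byte sequence and may therefore be wrong.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Naively guess an encoding name for undeclared text data."""
    # A byte order mark at the start is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Arbitrary non-ASCII bytes rarely happen to be valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1 never fails to decode, so this is only a last resort.
        return "latin-1"


print(guess_encoding("é".encode("utf-8")))    # utf-8
print(guess_encoding("é".encode("latin-1")))  # latin-1
```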

Control codes

Main article: C0 and C1 control codes

ASCII reserves the first 32 codes (numbers 0–31 decimal) for control characters known as the "C0 set": codes originally intended not to represent printable information, but rather to control devices (such as printers) that make use of ASCII, or to provide meta-information about data streams such as those stored on magnetic tape. They include common characters like the newline and the tab character.
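The C0 range can be inspected directly. This minimal sketch uses Python's standard `unicodedata` module, whose general category "Cc" marks control characters, and treats the line feed as the newline character (an assumption, since platforms differ on line endings).

```python
import unicodedata

# The C0 set: codes 0 through 31.
c0 = [chr(code) for code in range(32)]

# Every one of them is classified as a control character ("Cc").
all_controls = all(unicodedata.category(ch) == "Cc" for ch in c0)
print(all_controls)  # True

# Two C0 codes that appear in almost every text file.
tab_code = ord("\t")
newline_code = ord("\n")
print(tab_code, newline_code)  # 9 10
```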

In 8-bit character sets such as Latin-1 and the other ISO 8859 sets, the first 32 characters of the "upper half" (128 to 159) are also control codes, known as the "C1 set". They are rarely used directly; when they turn up in documents which are ostensibly in an ISO 8859 encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding, such as Windows-1252 or Mac OS Roman, which uses those codes to provide additional graphic characters instead.
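The difference is easy to observe for a single byte. This minimal sketch picks byte 0x93 as an illustrative example and decodes it with Python's `latin-1` and `cp1252` codecs, the latter being Python's name for Windows-1252.

```python
import unicodedata

byte = b"\x93"

# In Latin-1 this position belongs to the C1 set: an invisible control code.
as_latin1 = byte.decode("latin-1")
print(unicodedata.category(as_latin1))  # Cc

# Windows-1252 reuses the same position for a graphic character.
as_cp1252 = byte.decode("cp1252")
print(unicodedata.name(as_cp1252))  # LEFT DOUBLE QUOTATION MARK
```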

Main article: Unicode control characters

Unicode defines additional control characters, including bi-directional text direction override characters (used to explicitly mark right-to-left writing inside left-to-right writing and the other way around) and variation selectors to select alternate forms of CJK ideographs, emoji and other characters.
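Two such characters can be looked up by code point. This minimal sketch uses U+202E and U+FE0F as illustrative examples, along with U+2764 as a base character; these choices are assumptions of the sketch rather than cases named above.

```python
import unicodedata

RLO = "\u202e"   # a bi-directional override character
VS16 = "\ufe0f"  # a variation selector, commonly used to request emoji style

for ch in (RLO, VS16):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# A variation selector follows the character it modifies, so the pair is
# still two code points even though it is displayed as one symbol.
heart_emoji = "\u2764" + VS16
print(len(heart_emoji))  # 2
```

Neither character has a visible form of its own; each only changes how neighbouring text is laid out or drawn.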
