Data Conversion

Data conversion is the translation of computer data from one format to another, usually for the purpose of application interoperability or to make new features available. At the simplest level, data conversion can be exemplified by the conversion of a text file from one character encoding to another. More complex conversions include those between office file formats and between image and audio file formats.
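The character-encoding case can be illustrated with a minimal Python sketch. The function name `convert_encoding` and the sample string are assumptions made for this illustration; only the standard `bytes.decode` and `str.encode` methods are used.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from the source character encoding into the target one."""
    text = data.decode(source)   # bytes -> abstract Unicode text
    return text.encode(target)   # text -> bytes in the target encoding


# "é" is one byte (0xE9) in Latin-1 but two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = convert_encoding(latin1_bytes, "latin-1", "utf-8")
print(utf8_bytes)  # b'caf\xc3\xa9'
```

The conversion raises an error if the target encoding cannot represent a character from the source, which is a small instance of the information-loss problem discussed later in the article.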

Information basics

Before any data conversion is carried out, the user or application programmer should keep a few basics of computing and information theory in mind. These include:

* Information can easily be discarded using the computer, but adding information takes effort.
* The computer can be used to add information only in a rule-based fashion; most additions of information that users want can be done only with human judgement.
* Upsampling the data or converting to a more feature-rich format does not add information; it merely makes room for that addition, which usually a human must do.

For example, a truecolor image can easily be converted to grayscale or black and white, while the opposite conversion is a painstaking process. Converting a Unix text file to a Microsoft (DOS/Windows) text file involves adding information, namely a CR (hexadecimal 0D) byte before each LF (0A) byte, but that addition is easily done with a computer, since it is rule-based. The addition of color information to a grayscale image, by contrast, cannot be done programmatically, since only a human knows which colors are needed for each section of the picture; there are no rules that can be used to automate that process.
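The two examples above can be sketched in Python. This is an illustration only: the function names are made up for this sketch, and the grayscale weights are the common ITU-R BT.601 luma coefficients, chosen here as an assumption since the text does not name a particular formula.

```python
def unix_to_dos(data: bytes) -> bytes:
    """Rule-based addition: put a CR (0x0D) before every LF (0x0A)."""
    normalized = data.replace(b"\r\n", b"\n")  # avoid doubling existing CRs
    return normalized.replace(b"\n", b"\r\n")


def rgb_to_gray(r: int, g: int, b: int) -> int:
    """Discard color: collapse an RGB pixel to one luminance value."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


print(unix_to_dos(b"one\ntwo\n"))   # b'one\r\ntwo\r\n'

# Pure red and a mid-dark gray collapse to the same value, so the
# original color cannot be recovered from the grayscale result alone.
print(rgb_to_gray(255, 0, 0))       # 76
print(rgb_to_gray(76, 76, 76))      # 76
```

The line-ending rule is fully reversible, whereas the grayscale mapping sends many different colors to the same value, which is why the reverse direction has no rule to follow.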

Pivotal conversion

Data conversion can be performed directly from one format to another, but many applications that convert between multiple formats use a pivotal encoding: every source format is first converted to the pivot, and the pivot is then converted to the target format.

Office applications, when employed to convert between office file formats, use their internal, default file format as a pivot. For example, a word processor may convert an RTF file to a WordPerfect file by converting the RTF to OpenDocument and then that to WordPerfect format.
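The pivot idea can be sketched with three toy formats. Everything here is an assumption made for illustration: the format names, the `READERS` and `WRITERS` tables, and the choice of a list of rows as the pivot. The naive comma and tab splitting ignores quoting and would not be adequate for real CSV data.

```python
import json

# Each format needs one reader (format -> pivot) and one writer (pivot -> format).
# The pivot in this sketch is a list of rows, each row a list of strings.
READERS = {
    "csv": lambda s: [line.split(",") for line in s.splitlines()],
    "tsv": lambda s: [line.split("\t") for line in s.splitlines()],
    "json": json.loads,
}
WRITERS = {
    "csv": lambda rows: "\n".join(",".join(row) for row in rows),
    "tsv": lambda rows: "\n".join("\t".join(row) for row in rows),
    "json": json.dumps,
}


def convert(text: str, source: str, target: str) -> str:
    """Convert by way of the pivot: source -> pivot -> target."""
    pivot = READERS[source](text)
    return WRITERS[target](pivot)


print(convert("a,b\n1,2", "csv", "json"))  # [["a", "b"], ["1", "2"]]
```

With a pivot, supporting N formats takes N readers and N writers, rather than a separate direct converter for each of the N × (N − 1) ordered pairs of formats.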

Lossy and inexact data conversion

For any conversion to be carried out without loss of information, the target format must support the same features and data constructs present in the source file. Conversion of a word processing document to a plain text file necessarily involves loss of information, because the plain text format does not support word processing constructs such as marking a word as boldface.
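A small Python sketch can make the loss concrete. It assumes a toy markup in which boldface is written with `<b>` tags; the function name `to_plain_text` is hypothetical and the regular expression only handles simple tags without attributes.

```python
import re


def to_plain_text(marked_up: str) -> str:
    """Drop simple tags such as <b> and </b>, keeping only the characters."""
    return re.sub(r"</?[a-zA-Z]+>", "", marked_up)


with_bold = "This is <b>important</b>."
without_bold = "This is important."

# Two different source documents produce the same plain text, so the
# boldface cannot be reconstructed from the converted file.
print(to_plain_text(with_bold))     # This is important.
print(to_plain_text(without_bold))  # This is important.
```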

Data conversion can also suffer from inexactitude, the result of converting between formats that are conceptually different. As an example, converting from PDF to an editable word processor format is difficult, because PDF records the textual information like engraving on stone, with each character given a fixed position and line breaks hard-coded, whereas word processor formats accommodate text reflow. PDF has no notion of a word-space character: the space between two letters and the space between two words differ only in size. Therefore, a title set with ample letter-spacing for effect will usually end up with spurious spaces in the word processor file; for example, INTRODUCTION set with a letter-spacing of 1 pt may come out as I N T R O D U C T I O N.
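The letter-spacing problem can be sketched as follows. This is not how any particular PDF converter works; it is a simplified model in which each glyph is a hypothetical `(char, x, width)` tuple and a converter guesses word breaks by comparing the gap between glyphs with a threshold.

```python
def glyphs_to_text(glyphs, gap_threshold):
    """Rebuild text from positioned glyphs, sorted by x.

    A space is emitted whenever the gap between two consecutive
    glyphs is larger than gap_threshold.
    """
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if prev_end is not None and x - prev_end > gap_threshold:
            out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


# A letter-spaced title: every glyph is 6 units wide with a 1-unit gap.
title = [(c, i * 7, 6) for i, c in enumerate("INTRO")]

print(glyphs_to_text(title, gap_threshold=0.5))  # I N T R O
print(glyphs_to_text(title, gap_threshold=2))    # INTRO
```

The same glyph positions yield different text depending on the threshold, and no single threshold is right for every document, which is why the result is inexact rather than simply lossy.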

Open vs. secret specifications

Successful data conversion requires thorough knowledge of the workings of both source and target formats. Where the specification of a format is unknown, reverse engineering is needed to carry out conversion. Reverse engineering can achieve a close approximation of the original specification, but errors and missing features can still result. The binary formats of Microsoft Office documents (DOC, XLS, PPT and the rest) were for many years undocumented, and anyone who sought interoperability with those formats needed to reverse-engineer them. Such efforts have been fairly successful, so that most Microsoft Word files open without any ill effect in the competing OpenOffice.org Writer, but the few that do not, usually very complex ones that use more obscure features of the DOC file format, serve to show the limits of reverse engineering.

Image Scanning and OCR

Another kind of data conversion is document scanning or image scanning. It is the action or process of converting text and graphic paper documents, photographic film, photographic paper or other files to digital images.

The images can be converted into editable text by using optical character recognition (OCR) technology. The accurate recognition of typewritten text is now considered largely a solved problem, but recognition of hand printing, cursive handwriting, and even printed or typewritten text in some other scripts (especially those with a very large number of characters) is still the subject of active research.
