Character set and encoding converter

0. Contents

This is the documentation of charconv-1.1.2.7.
   1. Purpose
   2. PHP and character sets
   3. Supported character sets and encodings
   4. Copying
   5. Examples
   6. Requirements
   7. See also
   8. Downloading

1. Purpose

Converts a text stream from one character set and encoding to another.
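
The usage text in section 3 says the conversion goes via Unicode. As a rough illustration of that idea only, here is a minimal Python sketch that uses Python's built-in codecs as a stand-in; the function name convert is hypothetical, and none of this is charconv's own C++ code.

```python
def convert(data: bytes, incharset: str, outcharset: str) -> bytes:
    """Convert bytes from one charset to another by pivoting through Unicode.

    Illustration only: relies on Python's standard codecs, not on charconv.
    """
    # Step 1: decode the input bytes into Unicode code points.
    text = data.decode(incharset)
    # Step 2: encode those code points in the target charset.
    return text.encode(outcharset)


# Latin-1 byte 0xC4 is the letter A with diaeresis; UTF-8 needs two bytes for it.
sample = convert(b"\xc4iti", "latin-1", "utf-8")
print(sample)
```

With a pivot like this, each supported character set only needs a mapping to and from Unicode, rather than one table for every pair of character sets.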

2. PHP and character sets

Did you come here looking for character set conversions using PHP?
I'm planning to write a howto on that some day. Meanwhile, you can try my japcharset.php, an extempore character set converter for use in PHP scripts that handle multinational text. It depends on the htmlrecode program, which you can get here too. You can also try PHP's recode extension, although it has never worked for me.

3. Supported character sets and encodings

Usage: charconv [-h] <incharset> <outcharset>

Reads stdin, outputs stdout. Does incharset->outcharset conversion via unicode.

-h = Input is html (THIS BUGS)
Available character sets/encodings:
- unihtml (&#number; codes)
- utf8linux (with vt100 escape codes)
- utf7mod (imap modified)
- koi8r          - jis-x-0201      - shift_jis       - big5
- iso-8859-1      - iso-8859-2      - iso-8859-3      - iso-8859-4
- iso-8859-5      - iso-8859-6      - iso-8859-7      - iso-8859-8
- iso-8859-9      - iso-8859-10     - iso-8859-13     - iso-8859-14
- iso-8859-15     - cp437           - cp737           - cp775
- cp850           - cp852           - cp855           - cp857
- cp860           - cp861           - cp862           - cp863
- cp864           - cp865           - cp866           - cp869
- cp874           - cp1250          - cp1252          - cp1254
- cp1256          - cp1258          - cp1251          - cp1253
- cp1255          - cp1257          - cp856           - cp1006
- cp424           - roman           - romanian        - iso-2022-jp
- utf8            - utf7            - euc-jp    

Typos in the character set names are tolerated to some degree,
and some general aliases such as latin* and iso* are recognized.
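
How charconv matches misspelled names is not described here. As one possible approach, the sketch below resolves a name through an alias table and then falls back to fuzzy matching with Python's difflib. The names KNOWN, ALIASES and resolve_charset, the shortened charset list, the alias entries and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib
from typing import Optional

# Shortened, illustrative list of canonical names (not charconv's full table).
KNOWN = [
    "unihtml", "utf8", "utf7", "koi8-r", "iso-8859-1", "iso-8859-2",
    "iso-8859-15", "cp1251", "shift_jis", "iso-2022-jp", "euc-jp",
]

# Hypothetical alias table; charconv's real aliases may differ.
ALIASES = {
    "latin1": "iso-8859-1",
    "latin2": "iso-8859-2",
    "sjis": "shift_jis",
}


def resolve_charset(name: str) -> Optional[str]:
    """Return the canonical charset name, or None if nothing is close enough."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    if key in KNOWN:
        return key
    # Fuzzy fallback: accept the single best candidate above the cutoff.
    matches = difflib.get_close_matches(key, KNOWN, n=1, cutoff=0.8)
    return matches[0] if matches else None


for given in ("unihmtl", "koi8r", "latin1"):
    print(given, "->", resolve_charset(given))
```

A real tool would probably print a warning whenever the fuzzy fallback is used, so that a wrong guess does not go unnoticed.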

4. Copying

charconv was written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the terms of the GNU General Public License (GPL).

If you want to make your own converter or just study how something works, you might still want to download this program. The package contains plain TXT files describing the character sets, and there are .cc files for each different encoding.

5. Examples

oktober:~/src/charconv$ echo 'Äiti tykkää oliiviöljystä'|charconv latin1 utf7
+AMQ-iti tykk+AOQA5A oliivi+APY-ljyst+AOQ

oktober:~/src/charconv$ echo '+AMQ-iti tykk+AOQA5A oliivi+APY-ljyst+AOQ'|charconv utf7 unihtml
&Auml;iti tykk&auml;&auml; oliivi&ouml;ljyst&auml;

oktober:~/src/charconv$ echo 'pikachu' | sed -f /WWW/src/kr2k.sed | charconv sjis utf8
恓恋恔悅

oktober:~/src/charconv$ echo -e '\33$B$P$+\33(B' | charconv iso-2022-jp unihmtl
Charconv: Warning: Assuming 'unihmtl' means 'unihtml'
&#12400;&#12363;

oktober:~/src/charconv$ echo 'Ōčķė’ķäč’' | charconv cp1251 koi8r
Charconv: Warning: Assuming 'koi8r' means 'koi8-r'
ęÉĪĢŃĪÄÉŃ
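
The UTF-7 and ISO-2022-JP examples above can be cross-checked without charconv. The sketch below uses Python's standard codecs, which are independent implementations, and approximates the unihtml output with the xmlcharrefreplace error handler. The helper name to_numeric_html is hypothetical, and charconv's actual output may differ in detail (it printed named entities for the Finnish text above, for instance).

```python
def to_numeric_html(text: str) -> str:
    """Render every non-ASCII character as a &#number; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# UTF-7 text produced by the first example, decoded back to Unicode.
utf7_bytes = b"+AMQ-iti tykk+AOQA5A oliivi+APY-ljyst+AOQ"
finnish = utf7_bytes.decode("utf-7")

# ISO-2022-JP bytes from the echo -e example (\33 is ESC, 0x1b).
jis_bytes = b"\x1b$B$P$+\x1b(B"
japanese = jis_bytes.decode("iso2022_jp")

print(to_numeric_html(finnish))
print(to_numeric_html(japanese))
```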

6. Requirements

charconv is written in C++ and uses the Standard Template Library.
The hashes the program uses have been heavily optimized for both size and speed, with size being the top priority. As a result, compilation takes a lot of memory and time.
GNU make is required.
With g++ version 3.0.1, charconv compiles without warnings, apart from some signed/unsigned mismatches.
Some parts of the makefiles were generated with a PHP script (included in the archive). If you want to regenerate them, you also need PHP 4.

7. See also

GNU Recode: This recoding library converts files between various coded character sets and surface encodings. When this cannot be achieved exactly, it may get rid of the offending characters or fall back on approximations. The library recognises or produces more than 300 different character sets and is able to convert files between almost any pair. Most RFC 1345 character sets, and all `libiconv' character sets, are supported. The `recode' program is a handy front-end to the library.
I have made an online version of it available for converting short amounts of data between encodings.

If you are converting HTML pages, use htmlrecode instead. It handles them (and changes the character set) losslessly.

8. Downloading

The official home page of charconv is at http://oktober.stc.cx/source/charconv.html.
Check there for new versions.

Generated from progdesc.php (last updated: Mon, 2 Sep 2002 04:24:50 +0300)
with docmaker.php (last updated: Tue, 13 Aug 2002 14:17:29 +0300)
at Mon, 2 Sep 2002 04:24:56 +0300