Character encodings in Haskell

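Strings fetched from SQL Server over an ODBC connection come out garbled, so let's look into character encodings.
There is a tool called kcode that prints a string in various character encodings, so let's use it to look at the byte values.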

$ echo 'あ'|kcode
euc-jp        : あ                 ()
======================================================================
cp932         : 82A0               "\x82\xa0"
euc-jp        : A4A2               "\xa4\xa2"
iso-2022-jp   : 1B244224221B2842   "\x1b\x24\x42\x24\x22\x1b\x28\x42"
ucs-2be       : 3042               "\x30\x42"
utf8          : E38182             "\xe3\x81\x82"
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3042
 
                                                          using Encode
$ echo 'あ'|nkf --jis|hexdump -C
00000000  1b 24 42 24 22 1b 28 42  0a                       |.$B$".(B.|
00000009
$ echo 'あ'|nkf --sjis|hexdump -C
00000000  82 a0 0a                                          |...|
00000003
$ echo 'あ'|nkf --utf8|hexdump -C
00000000  e3 81 82 0a                                       |....|
00000004
$ echo 'あ'|nkf --utf16|hexdump -C
00000000  30 42 00 0a                                       |0B..|
00000004
$ echo 'あ'|nkf --euc|hexdump -C
00000000  a4 a2 0a                                          |...|
00000003
$ echo 'い'|kcode
euc-jp        : い                 ()
======================================================================
cp932         : 82A2               "\x82\xa2"
euc-jp        : A4A4               "\xa4\xa4"
iso-2022-jp   : 1B244224241B2842   "\x1b\x24\x42\x24\x24\x1b\x28\x42"
ucs-2be       : 3044               "\x30\x44"
utf8          : E38184             "\xe3\x81\x84"
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3044

                                                          using Encode
$ echo '0'|kcode
euc-jp        : 0                  (0)
======================================================================
cp932         : 30                 "\x30"
euc-jp        : 30                 "\x30"
iso-2022-jp   : 30                 "\x30"
ucs-2be       : 0030               "\x00\x30"
utf8          : 30                 "\x30"
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0030

                                                          using Encode
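
The relationship between the ucs-2be and utf8 rows above can be reproduced by hand. The following is a minimal sketch using only base; the names utf8Bytes, ucs2beBytes and hex are hypothetical helpers made up for this illustration, not part of kcode or of any library.

```haskell
module Main where

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- | Encode one character as UTF-8 bytes (hand-written for illustration).
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18)), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- continuation byte: 10xxxxxx carrying the low 6 bits of x
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- | Encode one character as UCS-2 big-endian (two bytes).
-- Assumption: the character is in the BMP (<= U+FFFF); anything above
-- that is not representable in UCS-2 and is not handled here.
ucs2beBytes :: Char -> [Word8]
ucs2beBytes c = [fromIntegral (n `shiftR` 8), fromIntegral (n .&. 0xFF)]
  where
    n = ord c

-- | Render bytes as two-digit lowercase hex, concatenated.
hex :: [Word8] -> String
hex = concatMap (\b -> pad (showHex b ""))
  where
    pad s = replicate (2 - length s) '0' ++ s

main :: IO ()
main = do
  putStrLn (hex (utf8Bytes 'あ'))    -- e38182
  putStrLn (hex (ucs2beBytes 'あ'))  -- 3042
  putStrLn (hex (ucs2beBytes '0'))   -- 0030
```

The three-byte branch is the one that applies to あ (U+3042): the 16 bits of the code point are split 4/6/6 and placed behind the prefixes 1110, 10 and 10.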
-- UTF-8 environment
> :m + Numeric
> :m + Data.Char

> readHex "3042"                                  -- => [(12354,"")]
> (fst.head) (readHex "3042")                     -- => 12354
> putStrLn  [chr $ (fst.head) (readHex "3042")]   -- => あ

> let hexChr str = chr $ (fst.head) (readHex str)
> putStrLn $ map hexChr ["3042", "3044", "30"]    -- => あい0
> map ((\x->showHex x "").ord) "あい0"            -- => ["3042","3044","30"]
> "あい0"                                         -- => "\12354\12356\&0"

> :m + Codec.Binary.UTF8.String
> encode "あい0" -- => [227,129,130,227,129,132,48]
> map (\x->showHex x "") $ encode "あい0" -- => ["e3","81","82","e3","81","84","30"]
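
The hexChr one-liner above relies on (fst.head) and chr, both of which throw an exception on bad input. As a hedged sketch, here is a total variant using only base; hexChrSafe and hexCodePoints are hypothetical names introduced for this example.

```haskell
module Main where

import Data.Char (chr, ord)
import Numeric (readHex, showHex)

-- | Parse a hexadecimal code point such as "3042" into a Char.
-- Returns Nothing for malformed input, trailing garbage, or values
-- above U+10FFFF, instead of throwing like (fst . head) or chr would.
hexChrSafe :: String -> Maybe Char
hexChrSafe s = case readHex s :: [(Integer, String)] of
  [(n, "")] | n <= 0x10FFFF -> Just (chr (fromInteger n))
  _                         -> Nothing

-- | Inverse direction: the code point of each character, as hex.
hexCodePoints :: String -> [String]
hexCodePoints = map (\c -> showHex (ord c) "")

main :: IO ()
main = do
  print (hexChrSafe "3042" == Just 'あ')  -- True
  print (hexChrSafe "xyz")                -- Nothing
  print (hexCodePoints "あい0")           -- ["3042","3044","30"]
```

Parsing into Integer rather than Int avoids overflow on very long hex strings before the range check runs.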

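[Added 2014-09-05]
A string literal written in UTF-8 is converted to UCS-4 (Unicode code points) when the source is compiled.
The source file is written in UTF-8, but printing the character codes of the string shows the UCS-4 values, not the UTF-8 bytes.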

$ cat utf8.hs | nkf -g
UTF-8
$ cat utf8.hs
module Main where

import Numeric
import Data.Char
import Data.Word
import Codec.Binary.UTF8.String

str :: String
str = "あいうえお0123"

utf8string :: String
utf8string = encodeString str

utf8word8 :: [Word8]
utf8word8 = encode str

main :: IO ()
main = do
    print    (map  ((\x->showHex x "").ord) str)
    -- http://www.unicode.org/charts/nameslist/c_3040.html
    --     あ    い     う      え    お       0     1    2    3
    -- > ["3042","3044","3046","3048","304a","30","31","32","33"]
    print    (map  ((\x->showHex x "").ord) utf8string)
    -- http://ash.jp/code/unitbl21.htm
    --     あ            い             う              え            お             0     1    2    3
    -- > ["e3","81","82","e3","81","84","e3","81","86","e3","81","88","e3","81","8a","30","31","32","33"]
    print    (map  (\x->showHex x "") utf8word8)
    -- > ["e3","81","82","e3","81","84","e3","81","86","e3","81","88","e3","81","8a","30","31","32","33"]

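Because GHC decodes the source from UTF-8 into UCS-4 at compile time, a source file containing byte sequences that are not valid UTF-8 (for example, the same file converted to Shift_JIS) fails to compile.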

$ cat utf8.hs|nkf -s > sjis.hs

$ ghc -Wall sjis.hs
[1 of 1] Compiling Main             ( sjis.hs, sjis.o )

sjis.hs:9:8:
    lexical error in string/character literal (UTF-8 decoding error)
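
To see why the Shift_JIS bytes trip the UTF-8 decoder, here is a simplified structural check using only base. It is a sketch for illustration, not GHC's actual lexer; looksLikeUtf8 is a hypothetical name, and the check is deliberately incomplete (see the comment).

```haskell
module Main where

import Data.Word (Word8)

-- | Simplified structural check: every lead byte must be followed by the
-- right number of continuation bytes (10xxxxxx).
-- Assumption: this is weaker than a full decoder. Some overlong forms
-- and surrogate code points are not rejected here.
looksLikeUtf8 :: [Word8] -> Bool
looksLikeUtf8 [] = True
looksLikeUtf8 (b : rest)
  | b < 0x80              = looksLikeUtf8 rest  -- ASCII
  | b >= 0xC2 && b < 0xE0 = follow 1            -- 2-byte lead
  | b >= 0xE0 && b < 0xF0 = follow 2            -- 3-byte lead
  | b >= 0xF0 && b < 0xF5 = follow 3            -- 4-byte lead
  | otherwise             = False               -- stray continuation or invalid lead
  where
    follow n =
      let (cs, rest') = splitAt n rest
      in length cs == n && all isCont cs && looksLikeUtf8 rest'
    isCont c = c >= 0x80 && c < 0xC0

main :: IO ()
main = do
  print (looksLikeUtf8 [0xE3, 0x81, 0x82])  -- "あ" in UTF-8: True
  print (looksLikeUtf8 [0x82, 0xA0])        -- "あ" in cp932: False
```

The cp932 encoding of あ starts with 0x82, which in UTF-8 can only appear as a continuation byte, so decoding fails at the first character of the literal.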