mIRC's lack of UTF-8 support has been an issue for quite some time. The author promised to 'look at it', but in the meantime, chatting in UTF-8 is not possible. This is problematic for any language that uses more than the occasional accented letter.
So I decided to make a temporary fix myself. The result is a flexible conversion mechanism between UTF-8 and the ANSI codepages. The user sees and types regular ANSI characters, but all data which is sent to and received from the IRC server is UTF-8 encoded. You are still limited to one ANSI codepage though: making mIRC support real Unicode is not possible without an mIRC rewrite.
The script performs a real UTF-8 encoding/decoding, so unlike a simple 'find and replace' approach, characters which do not fit into the current codepage are indicated as such.
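The difference from a 'find and replace' approach can be illustrated outside mIRC. The following Python sketch is not the script itself; the function names and the plain '?' marker are assumptions made for illustration (the script marks such characters in red inside mIRC). It decodes incoming UTF-8 properly and re-encodes it into a single codepage, so anything that does not fit is flagged instead of silently turning into garbage:

```python
# Illustration only: a hypothetical Python equivalent of the conversion idea,
# not the actual utf-8.mrc logic.

def utf8_to_codepage(data: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-8 from the server and re-encode it into one ANSI codepage.

    Characters that do not exist in the codepage become '?'.
    """
    text = data.decode("utf-8")
    return text.encode(codepage, errors="replace")


def codepage_to_utf8(data: bytes, codepage: str = "cp1252") -> bytes:
    """Convert text typed in the local ANSI codepage to UTF-8 for sending."""
    return data.decode(codepage).encode("utf-8")
```

With the default codepage 1252, "Zoë" passes through unchanged, while a Cyrillic letter comes out as '?'. Passing "cp1251" instead makes the Cyrillic letter fit and the "ë" fall out, which is the 'one ANSI codepage' limitation mentioned above.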
I included conversion tables for all of the Windows ANSI codepages:
- 1250 (ANSI - Central Europe)
- 1251 (ANSI - Cyrillic)
- 1252 (ANSI - Western Europe / Latin I)
- 1253 (ANSI - Greek)
- 1254 (ANSI - Turkish)
- 1255 (ANSI - Hebrew)
- 1256 (ANSI - Arabic)
- 1257 (ANSI - Baltic)
- 1258 (ANSI/OEM - Viet Nam)
There is also a little utility (with source) for generating conversion tables for more codepages.
For instructions on how to use it, check the top of the utf-8.mrc file. You can download the script here (19 KB).
Important: This script is provided as-is without any guarantees. Use it if you like it, but don't bug me if you can't get it to work. If you find bugs, feel free to report them, but try to give a little more information than just 'it doesn't work'.

UTF-8 petition
You might be interested in my UTF-8 petition: http://petitiononline.com/utf8/ - Not that I think Khaled reads it, but I might post it on the mIRC boards again soon despite the reactions it's getting.
And check this page from time
And check this page from time to time, too, if you're interested in trying out clients that already support UTF-8:
http://freedesktop.org/Software/IrcClients
Unfortunately
If I could use another client, I would, but I have some very custom scripts in mIRC which I cannot port easily.
Hey Steven, for some reason,
Hey Steven,
for some reason, mIRC with your script loaded colors 'old' German umlauts (non-unicode) red. Stripping color codes by the menu option doesn't help. Any ideas?
flashing
Hi, if you set channel windows to "show at desktop" and enable flashing while using the UTF-8 script, the flashing of the taskbar buttons for the channels doesn't work!
Intentional
Yes, that's to make you aware the person is not using UTF-8. The same happens to characters outside your ANSI codepage: a red question mark.
Display issues
Nice script, but unfortunately the output to mIRC windows is missing some information: Mode prefixes (@, +) and nick colors ($cnick(nick).color) are ignored when the appropriate settings in mIRC are used.
fok
I don't use those
..so if you want them in, make a patch and send it to me.
Script compatibility
Have you tested it for compatibility with such popular scripts as Sysreset?
Nope
Given the crappy scripting language that mIRC has, I doubt it will work. Overriding user supplied input is a one-shot thing. If more than one script tries it, I believe mIRC freaks out, or sends data twice.
This has been like this for ages. In fact, I was involved in a project to try and build a cleaner layer on top of mIRC several years ago, but it sort of died out. Did work though.
In any case, I don't use any "popular scripts", so I'm not going to test them either.
Transparent operation
I don't have mIRC (or Windows, for that matter), so I can't test it. However, I'd like to know if it is possible to configure the script so that all sent text is Latin-1, but that received text character set is autodetected (trying UTF-8 and if that doesn't work, then assuming Latin-1). And they should be the same color, not red (user doesn't need to know).
Maybe allow per-channel configuration of the character coding sent, and of the 8 bit character set to assume if received data is not UTF-8.
No
No, as this would encourage Latin-1 usage rather than UTF-8.
Wrong philosophy
Without the freedom to choose which encoding to talk, we can never switch to UTF-8, as most people won't install this script at all. Latin-1 works pretty damn well together with UTF-8 autodetection. Irssi has had this autodetection for months (and longer as separately installable scripts) and it works fine & actually allows some people (me included) among Irssi users to talk UTF-8 (otherwise nobody would have UTF-8 support at all and those people talking it would instantly get kicked off the channel).
If you want to push UTF-8, do it in the easy way (for the users), with compatibility.
In fact, I'd go even further. I say that Latin-1 should be the default (for now). This makes transparent transition even easier. People considering whether to install the script or not will consider whether it makes their text look incorrect for some and will refuse to install if there is too much work involved.
Don't worry about people not actually talking UTF-8... They will, once enough people have this transparent support installed. The superiority of the character encoding is enough reason to get people "converted" at that phase. This is the very reason why we even want to push UTF-8, right?
Latin-1
Encouraging Latin-1 means all the other legacy encodings are at a disadvantage. Latin-1 is not special in any way. In fact, when many people think they are using Latin-1, they're in fact using Windows-1252, the Microsoft-specific version.
In an environment like XML or HTTP, where encodings can be specified, I don't care what people use. But on IRC, where there is no encoding identification mechanism at all, UTF-8 is the only way. There is no such thing as a smooth transition when it comes to encodings: you either use one method, or the other.
Who cares about 8 bit ...
You do realize that the majority of Finnish people will NEVER switch to UTF-8 with that approach? Most of them are using some Windows charset (cp1252, I think) and the rest are using Latin-1 or Latin-9, all of which are somewhat compatible with each other (the umlauts we use have the same codes). I've been talking UTF-8 for a few years on all channels I can and on all /msgs etc. Still, the only channels where I can do that without getting kicked or scaring away all the other users are #irssi.fi and English channels (where non-ASCII is only rarely used) - because Irssi happens to be the only client that can do autodetection (AFAIK).
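The "somewhat compatible" part can be checked with a few lines of Python. This is only a sketch using Python's built-in codecs; the helper names are made up for illustration. It shows which byte each charset uses for a given character, or None if the charset does not have that character at all:

```python
from typing import Dict, Optional

# The three 8 bit charsets mentioned above, by their Python codec names.
CHARSETS = ("latin-1", "iso-8859-15", "cp1252")


def encode_or_none(char: str, charset: str) -> Optional[bytes]:
    """Return the byte(s) for char in charset, or None if it is not representable."""
    try:
        return char.encode(charset)
    except UnicodeEncodeError:
        return None


def byte_table(chars: str) -> Dict[str, Dict[str, Optional[bytes]]]:
    """Map each character to its encoding in every charset in CHARSETS."""
    return {c: {cs: encode_or_none(c, cs) for cs in CHARSETS} for c in chars}
```

For "ä", "ö" and "å" all three charsets give the same single byte, which is why Finnish text survives the mix. The euro sign is where they part ways: Latin-1 does not have it, Latin-9 (ISO 8859-15) puts it at 0xA4 and cp1252 at 0x80.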
Latin-1 vs. other 8 bit ones
Latin-1 was just an example. Users can configure which 8 bit charset to use, just like they currently do. Maybe Latin-1/9 and CP125x have some advantage, being the program defaults for the majority, but that's the case even today. Of course the goal is to get rid of all of them and I personally couldn't care less about the different 8 bit character sets.
Autodetection
There is no way to signal character set in the protocol, and it would not be wise to include it in the content either (like having a charset identifier in the beginning of every PRIVMSG). However, the difference between UTF-8 and 8 bit charsets is easy to detect and this method produces very few false positives.
As you most certainly know, every UTF-8 sequence (character) begins with an octet (byte) that contains bits 0xxxxxxx or 11xxxxxx. If the beginning octet contains any other bit combination, the string is not UTF-8. If the beginning octet was the former, the next octet will also be a beginning octet. If it was the latter, the next one will contain bits 10xxxxxx and the ones after that either are 10xxxxxx too or the sequence ends; the number of leading 1 bits in the beginning octet tells how many octets the sequence has in total.
Any deviation from these rules at any part of the string means that the string is invalid UTF-8 and thus it uses some 8 bit charset. Of course there is a small probability that an 8 bit string gets detected as UTF-8. While this is quite unlikely and practically never happens when writing western languages, it is still possible. But then, 8 bit charsets have all those compatibility problems of their own anyway, so that's not a big deal. 7 bit content will always get detected as UTF-8, so it is interpreted as ASCII, which is not a problem as nearly all 8 bit charsets are ASCII-based anyway.
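To make the rules concrete, here is a minimal Python sketch of such a detector. It is not Irssi's actual code; the function names and the Latin-1 default are assumptions for illustration. The first function checks only the bit patterns described above; the second one uses Python's own strict UTF-8 decoder (which also rejects a few things the simple check lets through, such as overlong sequences) and falls back to an 8 bit charset when decoding fails:

```python
from typing import Tuple


def looks_like_utf8(data: bytes) -> bool:
    """Structural check: beginning octets followed by the right number of 10xxxxxx octets."""
    remaining = 0  # continuation octets still expected in the current sequence
    for byte in data:
        if remaining:
            if byte & 0b1100_0000 != 0b1000_0000:
                return False  # expected 10xxxxxx
            remaining -= 1
        elif byte & 0b1000_0000 == 0:
            continue  # 0xxxxxxx: plain ASCII
        elif byte & 0b1110_0000 == 0b1100_0000:
            remaining = 1  # 110xxxxx: two-octet sequence
        elif byte & 0b1111_0000 == 0b1110_0000:
            remaining = 2  # 1110xxxx: three-octet sequence
        elif byte & 0b1111_1000 == 0b1111_0000:
            remaining = 3  # 11110xxx: four-octet sequence
        else:
            return False  # stray 10xxxxxx or an invalid beginning octet
    return remaining == 0  # a sequence cut off at the end is invalid too


def decode_autodetect(data: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Try UTF-8 first; if the data is not valid UTF-8, assume the 8 bit fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback
```

"päivää" sent as Latin-1 fails the check at the first "ä" (its octet looks like the beginning of a three-octet sequence, but a plain "i" follows), so it is decoded as Latin-1. The same word sent as UTF-8 passes. A false positive needs something like the Latin-1 text "Ã¤", whose two octets happen to form a valid UTF-8 "ä".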
Typical UTF-8 discussion
-(something containing non-ASCII)
-Hey, your machine is broken, please fix it
-No, this is UTF-8 and it's a technically superior charset that will resolve all character set incompatibilities forever. It supports practically every character you may ever need.
-But it displays as awful garbage! Fix your client. Everybody here uses (insert charset).
-Not if you install the latest version of Irssi, which makes them work perfectly.
Now we have a few options:
a) the user has Irssi
-Yeah, as if I'd use unstable software or bother to install anything. I'm happy with my Irssi (insert ancient version) without any plugins.
-It's so little work that you might just as well install it and have the characters of everybody always displaying correctly, no matter if they use UTF-8 or (insert the 8 bit charset of the channel).
-But I don't want my text to display incorrectly for others.
-That's the best part. The Irssi default setting is to keep outputting the same charset you currently do. It will not talk UTF-8 by default. Your text will still be (insert 8 bit charset) afterwards, but the only difference is that you'll be able to see the text of others properly, with no extra work.
(at this point the user either agrees to install it and everybody is happy, and might later even get another UTF-8 zealot once the user begins to like UTF-8, or we continue the discussion)
-Nah, it's too much work. Besides, I'm using a shell machine and thus I can't upgrade the Irssi.
b) the user has xchat (or ChatZilla)
-I'm using xchat and I'm not going to switch.
-But you can use UTF-8 in xchat as well.
-I don't want my umlauts to be broken.
-Well, it's sad that xchat doesn't have autodetection, but this is the future. Sooner or later, UTF-8 will replace this and I'm only trying to make it the sooner.
-Bullshit. I'll keep using (insert charset) forever. It works for everybody, so why bother. It will only create compatibility problems. I am not going to be talking in Cyrillic, you know...
-There are incompatibilities even with characters that are used often, such as the euro sign. Of course it also doesn't have proper mathematical notation or anything.
-Who needs those anyway?
-Or how about people and channels where languages need to be mixed?
-Well, let them use whatever. Not my problem.
c) the user has mIRC
-I'm on Windows
-You could use xchat
-Why would I? I like my mIRC.
-Well, there is a plugin for mIRC too. It will let you use UTF-8. See (insert the URL of this website).
-Why should I bother installing that just because your text is broken? The text of everybody else displays correctly, so it must be your client that is broken.
-(insert the usual stuff about how UTF-8 is superior and how there eventually will be a transition anyway)
-With this script I would see the text of everybody else marked in red and everybody else would see my text incorrectly? You must be out of your mind.
From there it advances by the other user calling me a rebel, arguing with me with no further result and eventually ignoring me, me getting kickbanned, etc.
Notice how nobody using those inflexible clients got any UTF-8 support enabled, and thus they still won't tolerate UTF-8. In fact, they'll hate it even more every time they see someone talking it.
The Irssi user on the other hand got UTF-8 working transparently. He can keep ircing as usual, but from time to time he'll notice how other people on the channel tell some newcomer that his umlauts are broken, at which point the Irssi user publicly points out that they are not broken but just UTF-8 and that they display correctly with a recent version of Irssi. This might lead to some other people upgrading their Irssis.
The Irssi user gets happier and happier about his UTF-8 support every time that happens and finally he might become an active UTF-8 zealot (i.e. one talking UTF-8 even if the majority doesn't) too ;)
Conclusion: by forcing people to choose you never get UTF-8 supported widely. By slipping them a piece of software that doesn't get in their way, you manage to make them UTF-8 supporters. Once there are enough supporters, the majority will also be talking UTF-8 (as there are clear advantages, having a larger selection of characters). At some point programs can be modified so that they start emitting warnings (such as text painted red) about someone using non-UTF-8, etc. Also, new versions of programs may start emitting UTF-8 by default. This will eventually get the rest converted.
I believe this applies to most other western people (all those who are somewhat happy with simple 8 bit charsets) too, and not just us Finns.
Why
Why do I need to know that the person is not using UTF-8?
How can I turn the red color off?
Edit the script
Because whatever non-ASCII characters you type won't be readable by that person.
You'll have to edit the script if you don't want that.
I already turned
I already turned utf-encoding off since I wanted to type latin characters, but I couldn't find where the red color comes from.
That's your problem.
Which part of "provided as is" don't you understand?
Thanks Steven
Script works a treat. That shut my Mac friends up *rolls eyes*
A conservative version of the script
I modified this script to _not_ recode outgoing stuff and to _not_ color invalid incoming UTF-8, so as to better get the more conservative mIRC users off my back. The modified script is available at http://mjr.iki.fi/software/mirc-utf8-conservative.zip
channel mode prefixes
For all of you who want to see the channel mode prefixes in front of the nick (e.g. <%Certus> instead of <Certus>), add the following to your remote section:
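; $1 = channel, $2 = nick; returns the nick with its mode prefix (@, %, +)
alias -l lookupnickprefix {
  var %i = 1
  ; walk the channel's nick list until the nick is found
  while (%i <= $nick($1,0)) {
    if ($nick($1, %i) == $2) {
      return $nick($1, %i).pnick
    }
    inc %i
  }
}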
then search for this line:
on ^*:text:*:#:{ echo $color(normal text) -lmt $chan < $+ $nick $+ > $utf8decode($1-) | haltdef }
and replace $nick with $lookupnickprefix($chan,$nick)
voila, it should work.
greetings,
Certus
Flashing feature
Yes, I noticed it too and I miss the feature very much... Hopefully some kind of patch will appear, otherwise I'm going to unload the UTF-8 script for mIRC for a little while.
I wonder
I wonder if there are any plans to update this further (like the autodetection suggested, yet rejected before) or if there are other solutions for UTF-8 on mIRC? (considering Khaled isn't going to do this; his last post on the weblog and mIRC's forum is 07/2004)
mIRC 6.17 has UTF-8 support
Subject says pretty much everything – 6.17 is out and it has support. Not sure how perfect (I'm a Linux user), but it has.
Obviously there is no need
Obviously there is no need for a script to decode or encode now, but I miss the red letters. Partly because I can no longer see who has Unicode or not, but also because it looked a bit funny.
Could someone make a modified script (or a new one from scratch) that doesn't do anything but make the non-UTF-8 letters red?
I'd understand if no-one bothers of course, but it'd be nice.
Now all this is useless, as
Now all this is useless, as the newest version renders UTF-8 perfectly now ... just wanted to tell that again, coz the Linux man who already posted that didn't test it, but I did. So bye bye UTF-8 script ... we had a short but funny time =D
my new version of mirc does
my new version of mirc does not render it properly