Source of the original article:
http://www.joelonsoftware/articles/Unicode.html

Original English text (each paragraph is followed by its Chinese translation):

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

每个软件开发人员绝对、肯定地必须了解Unicode和字符集(没有借口!)

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

有没有想过那个神秘的内容类型标签?就是你应该在HTML中加入但你不知道它应该是什么?

Did you ever get an email from your friends in Bulgaria with the subject line “??? ??? ??? ???”?

你有没有收到过保加利亚朋友的邮件,主题栏是“??? ??? ??? ???”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

我沮丧地发现,有多少软件开发人员并没有完全跟上字符集、编码、Unicode等神秘世界的速度。几年前,FogBUGZ的一个测试版测试者想知道它是否能处理日文邮件。日语吗?他们有日文邮件吗?我不知道。当我仔细观察我们用来解析MIME电子邮件消息的商业ActiveX控件时,我们发现它在字符集上做了完全错误的事情,所以我们实际上不得不编写英雄代码来撤销它所做的错误转换并重新正确地进行。当我研究另一个商业库时,它也有一个完全破碎的字符代码实现。我与该程序包的开发者通信,他认为他们“对此无能为力”。和许多程序员一样,他只是希望这一切能以某种方式烟消云散。

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

但它不会。当我发现流行的web开发工具PHP几乎完全忽略了字符编码问题,轻率地使用8位字符,这使得开发优秀的国际web应用程序几乎不可能,我想,这就够了。

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

所以我要宣布一件事:如果你是一个在2003年工作的程序员,你不知道基本的字符,字符集,编码和Unicode,而我抓住了你,我要惩罚你,让你在潜艇里剥洋葱6个月。我发誓我会的。

And one more thing:

IT’S NOT THAT HARD.(这并不难)

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

在本文中,我将准确地告诉您每个工作的程序员都应该知道的东西。所有关于“纯文本= ascii =字符是8位”的东西不仅是错误的,而且是无可救药的错误,如果你仍然这样编程,你不比一个不相信细菌的医生好多少。在读完这篇文章之前,请不要再写一行代码。
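To see why "plain text = ascii = characters are 8 bits" does not hold, here is a minimal Python sketch. The sample strings and the helper names `byte_lengths` and `fits_in_ascii` are made up for illustration and do not come from the article; UTF-8 and UTF-16 are used only as two common encodings to compare against.

```python
def byte_lengths(text: str) -> dict:
    """Return the character count and the encoded size of a string."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


def fits_in_ascii(text: str) -> bool:
    """True only if every character has an ASCII code (0 to 127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # "na\u00efve" is "naive" written with a precomposed i-diaeresis.
    for sample in ["plain", "na\u00efve", "\u65e5\u672c\u8a9e"]:
        print(repr(sample), byte_lengths(sample), fits_in_ascii(sample))
```

Only the first sample fits in ASCII. For the other two, the number of bytes differs from the number of characters and depends on the encoding chosen, so a byte count alone says nothing reliable about how many characters a piece of text holds.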

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

在开始之前,我应该提醒您,如果您是了解国际化的少数人之一,您会发现我的整个讨论有点过于简单。我只是想在这里设置一个最小的标准,这样每个人都能理解发生了什么,并且能够编写代码,希望能够处理除不包含重音单词的英语子集以外的任何语言的文本。我需要提醒你的是,字符处理只是创造能够在国际上运行的软件的一小部分,但我一次只能写一件事,所以今天我只写字符集。

A Historical Perspective(历史角度)

The easiest way to understand this stuff is to go chronologically.

最简单的方法是按时间顺序来理解。

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

您可能认为我将在这里讨论非常古老的字符集,如EBCDIC。好吧,我不会的。EBCDIC与你的生活无关。我们没必要回到那么久远的过去。

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

And all was good, assuming you were an English speaker.
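The ASCII facts above are easy to check in a few lines of Python. The sketch below also imitates the high-bit trick attributed to WordStar. The helpers `mark_word_ends` and `strip_high_bit` are hypothetical illustrations of the idea, not WordStar's actual file format. The code also classifies 127 (DEL) as a control character, a detail the paragraph glosses over.

```python
HIGH_BIT = 0x80  # the "spare" eighth bit of an 8-bit byte


def describe(code: int) -> str:
    """Classify a numeric code the way the paragraph above does."""
    if not 0 <= code <= 127:
        return "outside ASCII (needs the eighth bit)"
    if code < 32 or code == 127:
        return "control character"
    return "printable: " + repr(chr(code))


def mark_word_ends(text: str) -> bytes:
    """Set the high bit on the last letter of each word (illustrative only)."""
    data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    out = bytearray(data)
    for i, byte in enumerate(data):
        is_last_letter = chr(byte).isalpha() and (
            i + 1 == len(data) or not chr(data[i + 1]).isalpha()
        )
        if is_last_letter:
            out[i] |= HIGH_BIT
    return bytes(out)


def strip_high_bit(data: bytes) -> str:
    """Recover the plain ASCII text by clearing the high bit of every byte."""
    return bytes(b & 0x7F for b in data).decode("ascii")


if __name__ == "__main__":
    print(ord(" "), ord("A"))        # 32 65
    print(describe(7), describe(12)) # bell and form feed are control characters
    marked = mark_word_ends("hi you")
    print(marked, strip_high_bit(marked))
```

Once the eighth bit carries a flag like this, no byte value above 127 is left over for accented or non-Latin letters, which is why the trick locks the format into English text.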

在Unix刚被发明出来,K&R还在编写C编程语言的时候,一切都非常简单。EBCDIC即将出局。唯一重要的字符是古老的无重音英文字母,我们有一种编码,叫做ASCII,可以用32到127之间的数字来表示每个字符。空格是32,字母A是65,等等。这可以方便地以7位存储。大多数计算机在那些日子里使用8位字节,所以不仅可以存储每一个可能的ASCII字符,但你有一个整体
