Convert.FromBase64String returns unicode sometimes, or UTF-8

Sometimes the byte array b64 is UTF-8, and other times is UTF-16. I keep reading online that C# strings are always UTF-16, but that is not the case for me here. Why is this happening, and how do I fix it? I have a simple method for converting a base64 string to a normal string:

    public static string FromBase64(this string input)
    {
        String corrected = new string(input.ToCharArray());
        byte[] b64 = Convert.FromBase64String(corrected);
        if (b64[1] == 0)
        {
            return System.Text.Encoding.Unicode.GetString(b64);
        }
        else
        {
            return System.Text.Encoding.UTF8.GetString(b64);
        }
    }

The same thing is happening to my base 64 encoder:

    public static string ToBase64(this string input)
    {
        String b64 = Convert.ToBase64String(input.GetBytes());
        return b64;
    }

    public static byte[] GetBytes(this string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

Example: On my computer, "cABhAHMAcwB3AG8AcgBkADEA" decodes to:

'p','\0','a','\0','s','\0','s','\0','w','\0','o','\0','r','\0','d','\0','1','\0'

But on my coworker's computer it is:

'p','a','s','s','w','o','r','d','1'

Edit:

I know that the string I create comes from a textbox, and that the file I am saving it to is always going to be UTF-8, so everything points to the Convert method causing my encoding switch.

Update:

After digging in further, it appears that my coworker had a very important line commented out in his version of the code: the one that saves the value read from the file to the hashtable. The default value I was using is a UTF-8 base64 value, so I am going to correct the default to a UTF-16 value; then I can clean up the code by removing any UTF-8 references.

Also, I had been naively using the UTF-8 base64 encoding I had retrieved from a website, not realizing what I was getting myself into. The funny part is that I would never have found this out if my coworker hadn't commented out the line that saves the values from the file.

Final version of the code:

    public static string FromBase64(this string input)
    {
        byte[] b64 = Convert.FromBase64String(input);
        return System.Text.Encoding.Unicode.GetString(b64);
    }

    public static string ToBase64(this string input)
    {
        String b64 = Convert.ToBase64String(input.GetBytes());
        return b64;
    }

    public static byte[] GetBytes(this string str)
    {
        return System.Text.Encoding.Unicode.GetBytes(str);
    }

Best answer

First of all, I want to debunk the title of the question:

Convert.FromBase64String() returns Unicode sometimes, or UTF-8

That is not the case. Given the same input (valid base64-encoded text), Convert.FromBase64String() always returns the same output.
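To make that concrete, here is a minimal sketch written as C# top-level statements (which assumes C# 9 or later). The second base64 string is not from the question; it is assumed here to be what a UTF-8 encoder would produce for the same text. Each input always decodes to the same bytes, and the two inputs simply contain different bytes.

```csharp
using System;
using System.Text;

// The string from the question (built from UTF-16LE bytes) and an
// assumed UTF-8 counterpart for the same text, "password1".
string utf16Input = "cABhAHMAcwB3AG8AcgBkADEA";
string utf8Input = "cGFzc3dvcmQx";

// FromBase64String only turns base64 text back into raw bytes.
// It knows nothing about text encodings.
byte[] utf16Bytes = Convert.FromBase64String(utf16Input);
byte[] utf8Bytes = Convert.FromBase64String(utf8Input);

Console.WriteLine(utf16Bytes.Length); // two bytes per character
Console.WriteLine(utf8Bytes.Length);  // one byte per character

// Interpreting the bytes as text is a separate step, and the caller
// chooses the encoding.
string fromUtf16 = Encoding.Unicode.GetString(utf16Bytes);
string fromUtf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(fromUtf16 == fromUtf8);
```

Seeing different byte layouts on two machines therefore suggests the two machines were fed different base64 inputs, not that the method behaved differently.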

Moving on, you cannot determine definitively, just by examining the payload, the encoding used for a string. You attempt to do this with

if (b64[1] == 0) // encoding must be UTF-16

That inference is not valid. The overwhelming majority of UTF-16 code units fail that test. No matter how you write such a test, it is doomed to fail, because there exist byte arrays that are well-defined strings under more than one encoding. It is possible, for instance, to construct byte arrays that are valid when considered as either UTF-8 or UTF-16.
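A small sketch of both failure modes, again as C# top-level statements (C# 9 or later assumed). The byte values are chosen purely for illustration.

```csharp
using System;
using System.Text;

// 1) The same two bytes are valid text in both encodings.
byte[] ambiguous = { 0x70, 0x61 };
string asUtf8 = Encoding.UTF8.GetString(ambiguous);     // two characters: "pa"
string asUtf16 = Encoding.Unicode.GetString(ambiguous); // one code unit: U+6170

// 2) Genuine UTF-16 data whose second byte is not zero.
//    U+20AC (the euro sign) is stored little-endian as AC 20.
byte[] euro = Encoding.Unicode.GetBytes("\u20AC");
bool passesZeroByteTest = euro[1] == 0;

Console.WriteLine(asUtf8.Length);
Console.WriteLine(asUtf16.Length);
Console.WriteLine(passesZeroByteTest);
```

The zero-byte check only happens to work for characters below U+0100, which is why it appeared reliable with plain ASCII passwords.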

So, you have to know a priori whether the payload is encoded as UTF-16, UTF-8 or indeed some other encoding.

The solution will be to keep track of the original encoding, before the base64 encoding. Pass that information along with the base64 encoded payload. Then when you decode, you can determine which Encoding to use to decode back to a string.
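One way this could look is sketched below. The "name:base64" format and the Pack and Unpack helpers are hypothetical, invented for illustration; any scheme that carries the encoding name alongside the payload would do. It is written as C# top-level statements (C# 9 or later assumed).

```csharp
using System;
using System.Text;

// Hypothetical wire format: "<encoding name>:<base64 payload>".
// A colon is safe as a separator because it is not in the base64 alphabet.
string Pack(string text, Encoding encoding) =>
    encoding.WebName + ":" + Convert.ToBase64String(encoding.GetBytes(text));

string Unpack(string tagged)
{
    int colon = tagged.IndexOf(':');
    if (colon < 0) throw new FormatException("Missing encoding tag.");

    // Look up the encoding that the sender recorded, then decode with it.
    Encoding encoding = Encoding.GetEncoding(tagged.Substring(0, colon));
    byte[] bytes = Convert.FromBase64String(tagged.Substring(colon + 1));
    return encoding.GetString(bytes);
}

string packedUtf8 = Pack("password1", Encoding.UTF8);
string packedUtf16 = Pack("password1", Encoding.Unicode);

Console.WriteLine(packedUtf8);
Console.WriteLine(packedUtf16);
Console.WriteLine(Unpack(packedUtf8) == Unpack(packedUtf16));
```

The decoder never has to guess: it reads the tag and uses exactly the encoding the sender used.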

It looks very much to me as though your strings all come from UTF-16 .NET strings. In that case you will never have UTF-8 strings, and you should always decode with UTF-16. That is, use Encoding.Unicode.GetString().

Also, the GetBytes method in your code is poor. It should be:

    public static byte[] GetBytes(this string str)
    {
        return Encoding.Unicode.GetBytes(str);
    }

Another oddity:

    String corrected = new string(input.ToCharArray());

This is a no-op.

Finally, it is quite likely that your text will be more compact when encoded as UTF-8. So perhaps you should consider doing that before applying the base64 encoding.
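As a rough illustration of the size difference, the following sketch encodes the example text from the question both ways. It uses C# top-level statements (C# 9 or later assumed), and the variable names are made up for the example.

```csharp
using System;
using System.Text;

string text = "password1";

// Encode the same text to bytes with each encoding, then to base64.
string viaUtf16 = Convert.ToBase64String(Encoding.Unicode.GetBytes(text));
string viaUtf8 = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

Console.WriteLine($"UTF-16: {viaUtf16.Length} chars, UTF-8: {viaUtf8.Length} chars");

// Decoding must use the same encoding that was used for encoding.
string roundTrip = Encoding.UTF8.GetString(Convert.FromBase64String(viaUtf8));
Console.WriteLine(roundTrip == text);
```

The saving applies to mostly-ASCII text. Characters that need three bytes in UTF-8, such as most CJK characters, take only two in UTF-16, so for such text the comparison can go the other way.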

Regarding your update, what you state is incorrect. This code:

    string str = Encoding.Unicode.GetString(
        Convert.FromBase64String("cABhAHMAcwB3AG8AcgBkADEA"));

assigns "password1" to str wherever it is run.
