Why does java.net.URLEncoder give different results for the same string?

Updated: 2024-10-25 00:26:27

Problem description

On the webapp server, when I encode "médicaux_Jérôme.txt" using java.net.URLEncoder, I get the following string:

me%CC%81dicaux_Je%CC%81ro%CC%82me.txt

On my backend server, encoding the same string gives the following:

m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt

Can someone help me understand why the same input produces different output? Also, how can I get standardized output every time I decode the same string?

Solution

If you don't specify an encoding, the outcome depends on the platform's default encoding.

See the java.net.URLEncoder javadocs:

encode(String s)

Deprecated.

The resulting string may vary depending on the platform's default encoding. Instead, use the encode(String,String) method to specify the encoding.

So, use the suggested method and specify the encoding:

String urlEncodedString = URLEncoder.encode(stringToBeUrlEncoded, "UTF-8");
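As a minimal sketch of that call in context (the class name `ExplicitCharsetEncoding` and the helper `encodeUtf8` are illustrative names, not part of the original answer), the two-argument overload declares a checked `UnsupportedEncodingException`, which the helper below wraps:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ExplicitCharsetEncoding {
    // Encodes with an explicit charset so the result does not depend on the platform default.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The file name written with precomposed characters (é = U+00E9, ô = U+00F4).
        System.out.println(encodeUtf8("m\u00E9dicaux_J\u00E9r\u00F4me.txt"));
        // prints m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```

On Java 10 and later there is also an overload that takes a `Charset`, such as `URLEncoder.encode(value, StandardCharsets.UTF_8)`, which avoids the checked exception.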

As for the different representations of the same string, even when "UTF-8" is specified:

The two URL-encoded strings in your question, although encoded differently, decode to text that renders identically, so there is nothing inherently wrong with either. You can verify this by pasting both into a URL decoding tool: they display as the same file name.
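The same check can be sketched in Java instead of an online tool. The class and method names here are hypothetical, and the comparison assumes that "the same" means canonical equivalence, which is tested by normalizing both decoded strings to NFC first:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

class DecodeComparison {
    // Decodes a percent-encoded string as UTF-8.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when both strings are equal after NFC normalization.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String first = decodeUtf8("me%CC%81dicaux_Je%CC%81ro%CC%82me.txt");
        String second = decodeUtf8("m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt");
        System.out.println(first.equals(second));            // false: the code points differ
        System.out.println(canonicallyEqual(first, second)); // true: same text once normalized
    }
}
```

A plain `equals` reports a difference because it compares code points, while the normalized comparison confirms that both strings stand for the same file name.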

This happens because, as this case shows, the same visible text can be URL-encoded in more than one way, especially when it contains accented characters: an accent can be stored either as part of a single precomposed character or as a separate combining mark, which is precisely what happens here.

Specifically, the first string encodes é as e + ´ (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), resulting in e%CC%81. The second encodes é directly as %C3%A9 (LATIN SMALL LETTER E WITH ACUTE; two % escapes because the character takes two bytes in UTF-8).
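To make the byte-level picture concrete, here is a small sketch (the `AccentBytes` class and `utf8Hex` helper are made up for illustration) that prints the UTF-8 bytes of the characters involved in percent-escape style:

```java
import java.nio.charset.StandardCharsets;

class AccentBytes {
    // Formats the UTF-8 bytes of a string as percent escapes, e.g. "%C3%A9".
    static String utf8Hex(String value) {
        StringBuilder sb = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00E9")); // %C3%A9  (precomposed é)
        System.out.println(utf8Hex("\u0301")); // %CC%81  (combining acute accent)
        System.out.println(utf8Hex("\u00F4")); // %C3%B4  (precomposed ô)
        System.out.println(utf8Hex("\u0302")); // %CC%82  (combining circumflex accent)
    }
}
```

The combining marks follow an unescaped base letter, which is why the first string contains e%CC%81 and o%CC%82 while the second contains %C3%A9 and %C3%B4.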

Again, there is no problem with either representation. Both are Unicode normalization forms: the first is decomposed (NFD) and the second is precomposed (NFC). Mac OS X is known to favor the combining-accent form; in the end, it is a matter of preference of whatever produced the string. In your case, the two environments probably differ (for example, different JREs or platforms) or, if the file name was user-generated, the user may have used a different OS (or tool) that produced that form.
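One possible way to get the same output regardless of which form arrives is to normalize the string before encoding it. The sketch below is not from the original answer: it assumes NFC is the form you want, uses the standard `java.text.Normalizer`, and the class and method names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

class NormalizedEncoding {
    // Normalizes to NFC (precomposed characters), then URL-encodes as UTF-8.
    static String encodeNormalized(String value) {
        String nfc = Normalizer.normalize(value, Normalizer.Form.NFC);
        try {
            return URLEncoder.encode(nfc, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decomposed = "me\u0301dicaux_Je\u0301ro\u0302me.txt";  // combining accents
        String precomposed = "m\u00E9dicaux_J\u00E9r\u00F4me.txt";    // single code points
        System.out.println(encodeNormalized(decomposed));   // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
        System.out.println(encodeNormalized(precomposed));  // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```

Choosing `Normalizer.Form.NFD` instead would make both inputs produce the combining-accent output; what matters is that every component applies the same form.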