How to clean up a bunch of URLs without having to write multiple str_replace() statements?


The following are a few example URLs that need to be cleaned up:

http://example.com//path/
Should become: http://example.com/path/

http://example.com/path/?&
Should become: http://example.com/path/

http://example.com/path/?&param=one
Should become: http://example.com/path/?param=one

http://example.com///?&
Should become: http://example.com/

http://example.com/path/subpath///?param=one&
Should become: http://example.com/path/subpath/?param=one

Without having to write str_replace() multiple times, is there a way to clean up the url?

Best answer


Sahil's method is needlessly convoluted: it uses 6 replacement elements, escapes characters in the pattern unnecessarily, and relies on bounded quantifiers that only cover a limited number of unwanted duplicated characters. In fact, it fails to correct this simple URL: http://example.com//path1

Implement this much shorter, faster, cleaner, more readable method in your project instead:

Code (Demo):

    $urls = array(
        "http://example.com//path/",
        "http://example.com/path/?&",
        "http://example.com/path/?&param=one",
        "http://example.com///?&",
        "http://example.com/path/subpath///?param=one&"
    );

    $urls = preg_replace(
        ['/(?<!:)\/{2,}/', '/\?&/', '/[?&]$/'],
        ['/', '?', ''],
        $urls
    );

    var_export($urls);

Output:

    array (
      0 => 'http://example.com/path/',
      1 => 'http://example.com/path/',
      2 => 'http://example.com/path/?param=one',
      3 => 'http://example.com/',
      4 => 'http://example.com/path/subpath/?param=one',
    )

Pattern explanations:

/(?<!:)\/{2,}/ Match 2 or more slashes not preceded by a colon; replace with single slash.

/\?&/ Match a question mark followed by an ampersand; replace with question mark.

/[?&]$/ Match the last character if a question mark or ampersand; remove.
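
For reuse, the three patterns can be wrapped in a small function. This is only a sketch: the name clean_url() is hypothetical and not part of the original answer, and it assumes one URL string per call.

```php
<?php
// Hypothetical wrapper around the three patterns explained above.
function clean_url(string $url): string
{
    return preg_replace(
        [
            '/(?<!:)\/{2,}/', // 2 or more slashes not preceded by a colon
            '/\?&/',          // a question mark followed by an ampersand
            '/[?&]$/',        // a trailing question mark or ampersand
        ],
        ['/', '?', ''],
        $url
    );
}

echo clean_url('http://example.com//path1'), PHP_EOL;   // http://example.com/path1
echo clean_url('http://example.com/path/?&'), PHP_EOL;  // http://example.com/path/
```

Note that each pattern is applied once, in order, so an input with repeated ampersands such as http://example.com/path/?&& would still be left with a trailing question mark. The parse_url() approach further down handles that case.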

Also, here is my take on the url parsing approach: (Demo)

Code:

    $urls = array(
        "http://example.com//path//to///dir////4/ok",
        "http://example.com/path/?&&",
        "http://example.com/path/?&param=one",
        "http://www.example.com///?&",
        "http://example.com/path/subpath///?param=one&"
    );

    $clean_urls = [];  // initialize the result array before pushing into it
    foreach ($urls as $url) {
        $a = parse_url($url);
        $clean_urls[] =
            // no problems expected from these elements
            "{$a["scheme"]}://{$a["host"]}" .
            // reduce multiple consecutive slashes to single slash
            preg_replace('~/+~', '/', $a["path"]) .
            // handle querystring
            (isset($a["query"]) && trim($a["query"], '&') != '' ? '?' . trim($a["query"], '&') : '');
    }

    var_export($clean_urls);

Output:

    array (
      0 => 'http://example.com/path/to/dir/4/ok',
      1 => 'http://example.com/path/',
      2 => 'http://example.com/path/?param=one',
      3 => 'http://www.example.com/',
      4 => 'http://example.com/path/subpath/?param=one',
    )

Explanation of url component handling:

The preg_replace() pattern on the path element will match 1 or more slashes and replace them with a single slash. This can also be achieved using ~/+(?=/)~ or ~(?<=/)/+~ and an empty replacement string, but the lookarounds are at least 2.5x slower than the no-look pattern.
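
To see that the three patterns mentioned above give the same result, here is a quick check on the path from the first sample URL. The variable names are only for illustration, and no timing is measured here.

```php
<?php
$path = '//path//to///dir////4/ok';

// Plain pattern: every run of slashes is replaced with a single slash.
$plain = preg_replace('~/+~', '/', $path);

// Lookahead: remove slashes that are followed by another slash.
$lookahead = preg_replace('~/+(?=/)~', '', $path);

// Lookbehind: remove slashes that are preceded by another slash.
$lookbehind = preg_replace('~(?<=/)/+~', '', $path);

echo $plain, PHP_EOL;                                       // /path/to/dir/4/ok
var_dump($plain === $lookahead && $plain === $lookbehind);  // bool(true)
```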

The query handling line uses an inline conditional that first checks whether the query element exists and, if so, whether it is still non-empty after trimming ampersands from both ends.

If both checks pass, the query is trimmed of any leading and trailing ampersands and prepended with a question mark.

If either check fails, an empty string is appended to the string to be pushed into $clean_urls.
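
The same handling can be sketched as a standalone function. The name clean_url_parsed() and the two guards (returning the input unchanged when parse_url() cannot find a scheme and host, and treating a missing path as an empty string) are assumptions added here, not part of the original answer. As in the loop above, only scheme, host, path and query are rebuilt, so a port, credentials or fragment would be dropped.

```php
<?php
// Hypothetical helper based on the parse_url() approach described above.
function clean_url_parsed(string $url): string
{
    $parts = parse_url($url);

    // Assumption: leave anything without a scheme and host untouched.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;
    }

    // Reduce consecutive slashes in the path; a missing path becomes ''.
    $path = preg_replace('~/+~', '/', $parts['path'] ?? '');

    // Trim ampersands from both ends of the query; a missing query becomes ''.
    $query = trim($parts['query'] ?? '', '&');

    return "{$parts['scheme']}://{$parts['host']}" . $path
        . ($query !== '' ? '?' . $query : '');
}

echo clean_url_parsed('http://example.com/path/?&&'), PHP_EOL;  // http://example.com/path/
echo clean_url_parsed('http://example.com'), PHP_EOL;           // http://example.com
```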
