Chinese Word Segmentation in PHP: Algorithm and Code

Updated: 2024-10-27 02:29:33

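A simple Chinese word segmentation algorithm is bigram segmentation, which splits each run of Chinese text into overlapping two-character tokens. The PHP code: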

$str = '苏杭,人间的天堂paradise!';
//$str = iconv('GB2312','UTF-8',$str);
$result = spStr($str);
print_r($result);

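/**
 * Bigram segmentation for Chinese text (UTF-8 version).
 * Assumes every non-ASCII character is 3 bytes long in UTF-8.
 */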
function spStr($str)
{
    $cstr = array();

    // Punctuation and whitespace to strip: ASCII first, then full-width and Chinese marks.
    $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "\t", "\n", "'", "<", ">", "\r", "\r\n", '$', "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", ":", ")", "(", ".", "。", ",", "!", ";", "“", "”", "‘", "’", "[", "]", "、", "—", " ", "《", "》", "-", "…", "【", "】");

    $str = str_replace($search, " ", $str);
    preg_match_all("/[a-zA-Z]+/", $str, $estr);
    preg_match_all("/[0-9]+/", $str, $nstr);

    $str = preg_replace("/[0-9a-zA-Z]+/", " ", $str);
    $str = preg_replace("/\s{2,}/", " ", $str);

    $str = explode(" ", trim($str));

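    foreach ($str as $s) {
        $l = strlen($s); // length in bytes, not characters

        for ($i = 0; $i < $l; $i += 3) {
            // current character (3 bytes); the $s{$i} offset syntax was removed in PHP 8
            $ns1 = substr($s, $i, 3);
            if (isset($s[$i + 3])) {
                // next character (3 bytes)
                $ns2 = substr($s, $i + 3, 3);
                // emit the pair only if the next 3 bytes are all non-ASCII
                if (preg_match("/[\x80-\xff]{3}/", $ns2)) $cstr[] = $ns1.$ns2;
            } elseif ($i == 0) {
                // a segment made of a single character is kept as its own token
                $cstr[] = $ns1;
            }
        }
    }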

    $estr = isset($estr[0])?$estr[0]:array();
    $nstr = isset($nstr[0])?$nstr[0]:array();

    return array_merge($nstr,$estr,$cstr);
}

After bigram segmentation, "苏杭,人间的天堂paradise" becomes:

Array ( [0] => paradise [1] => 苏杭 [2] => 人间 [3] => 间的 [4] => 的天 [5] => 天堂 ) 
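
The byte-offset loop in spStr() relies on every non-ASCII character being exactly 3 bytes long. As a minimal sketch of the same bigram idea working on characters rather than bytes, the following uses PCRE's `u` modifier. The function name `bigramsUtf8` is hypothetical and not part of the original code. Unlike spStr(), it treats every non-Han character as a separator instead of using an explicit punctuation list.

```php
<?php
// Hypothetical helper (not from the original article): character-based bigrams.
function bigramsUtf8(string $text): array
{
    $tokens = [];

    // ASCII digit runs and letter runs are kept whole, numbers first, as in spStr().
    preg_match_all('/[0-9]+/', $text, $m);
    $tokens = array_merge($tokens, $m[0]);
    preg_match_all('/[a-zA-Z]+/', $text, $m);
    $tokens = array_merge($tokens, $m[0]);

    // Each run of consecutive Han characters is one segment.
    preg_match_all('/\p{Han}+/u', $text, $m);
    foreach ($m[0] as $segment) {
        // Split the segment into characters, not bytes.
        $chars = preg_split('//u', $segment, -1, PREG_SPLIT_NO_EMPTY);
        $n = count($chars);
        if ($n === 1) {
            // A lone character is kept as its own token.
            $tokens[] = $chars[0];
            continue;
        }
        // Overlapping pairs: characters (0,1), (1,2), (2,3), ...
        for ($i = 0; $i < $n - 1; $i++) {
            $tokens[] = $chars[$i] . $chars[$i + 1];
        }
    }

    return $tokens;
}

print_r(bigramsUtf8('苏杭,人间的天堂paradise!'));
// Tokens, in order: paradise, 苏杭, 人间, 间的, 的天, 天堂
```

Because the split is done per character, this variant does not depend on the byte length of each character, at the cost of requiring PCRE built with Unicode support.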
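Next, the tokens above are converted to single-byte (ASCII) strings. md5, base64, or sha1 could be used, but their output is long and takes up too much storage space. Instead, each Chinese character can be represented by its GB2312 row-cell code (区位码): two decimal digits for the row and two for the cell. PHP code for converting Chinese characters to row-cell codes: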

$code = array();
foreach ($result as $s) {
    // gbCode() works on GB2312 bytes, so convert from UTF-8 first
    $s = iconv('UTF-8','GB2312',$s);
    $code[] = gbCode($s);
}
$code = implode(" ", $code);
echo $code;

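function gbCode($str) {
    // ASCII tokens (English words, numbers) are returned unchanged
    if (!preg_match("/^[\x80-\xff]{2,}$/",$str)) return $str;

    $return = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i += 2) {
        // each GB2312 byte is 0xA0 (160) plus the row or cell number
        $return .= sprintf("%02d%02d", ord($str[$i])-160, ord($str[$i+1])-160);
    }

    return $return;
}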

The converted result is:

paradise 43532628 40432868 28682136 21364476 44764435
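
To make the arithmetic concrete: 苏 is the GB2312 byte pair 0xCB 0xD5, and 0xCB - 160 = 43, 0xD5 - 160 = 53, which gives 4353. The sketch below reverses that step. `gbDecode` is a hypothetical helper, not part of the original code. It assumes its input is either an ASCII token or a string of 4-digit codes, so a purely numeric token whose length is a multiple of 4 (for example a year) cannot be told apart from an encoded character. The same ambiguity exists in the stored codes themselves.

```php
<?php
// Hypothetical inverse of gbCode(): row-cell digits back to GB2312 bytes.
function gbDecode(string $code): string
{
    // Anything that is not a sequence of 4-digit groups is returned unchanged.
    if (!preg_match('/^(\d{4})+$/', $code)) {
        return $code;
    }

    $bytes = '';
    foreach (str_split($code, 2) as $part) {
        // Row or cell number + 160 (0xA0) gives the GB2312 byte.
        $bytes .= chr((int) $part + 160);
    }

    return $bytes;
}

$gb = gbDecode('43532628');
echo bin2hex($gb), "\n"; // cbd5babc

// Converting back to UTF-8 needs the iconv extension, as in the code above.
if (function_exists('iconv')) {
    echo iconv('GB2312', 'UTF-8', $gb), "\n";
}
```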

Finally, store the result in the database: insert it into the full-text index table, in the column covered by the FULLTEXT index.
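
As a rough sketch of how the stored codes might be queried, assuming a MySQL table with a FULLTEXT index: the table name `article_index`, its columns, and the helper `buildAgainst` are all hypothetical and do not come from the original article. A search string would go through the same segmentation and encoding steps, and the resulting tokens would be combined into a boolean-mode expression in which every token is required.

```php
<?php
// Hypothetical helper: turn already-encoded tokens into a boolean-mode expression.
function buildAgainst(array $codes): string
{
    $terms = [];
    foreach ($codes as $c) {
        // Keep only letters and digits so no boolean-mode operator can slip in.
        $c = preg_replace('/[^0-9A-Za-z]/', '', $c);
        if ($c !== '') {
            $terms[] = '+' . $c; // '+' marks the token as required
        }
    }

    return implode(' ', $terms);
}

// Assumed schema, for illustration only:
//   CREATE TABLE article_index (
//       article_id INT PRIMARY KEY,
//       tokens     TEXT,
//       FULLTEXT KEY ft_tokens (tokens)
//   );
// Statements intended for use with prepared statements (e.g. PDO):
$insertSql = 'INSERT INTO article_index (article_id, tokens) VALUES (?, ?)';
$searchSql = 'SELECT article_id FROM article_index'
           . ' WHERE MATCH (tokens) AGAINST (? IN BOOLEAN MODE)';

echo buildAgainst(['paradise', '43532628']), "\n"; // +paradise +43532628
```

Whether short tokens are indexed at all depends on the server's minimum token length settings, so those should be checked before relying on single-character codes.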



Published: 2023-06-14 07:42:00