MySQL括号字符串计数

编程入门行业动态更新时间:2024-10-09 22:21:40

MySQL<a href=https://www.elefans.com/category/jswz/34/1767470.html style= 括号字符串计数"/>

MySQL括号字符串计数

问题提出：

解决方案：

1. 使用正则表达式

2. 不使用正则表达式

3. 字典表 + 自定义函数

4. 字典表 + 递归查询

总结：

参考：

问题提出：

有一张表 t1 存储用户评论内容，如下所示（只列出相关列）：

现在想得出每种评论字数的个数，每个字符包括标点、空格、表情符号都算一个字，但每对中括号连同其中的内容只算一个字。对于上面的数据行，结果为：

解决方案：

1. 使用正则表达式

第一感觉这是使用正则表达式的场景。只要将每对中括号连同其中的内容替换为单个字符，再用char_length函数求长度即可。查询语句如下：

select char_length(regexp_replace(Content,'\\[.*?\\]', 'A')) r,count(*) from t1 group by char_length(regexp_replace(Content,'\\[.*?\\]', 'A')) order by r;

\\[ 和 \\] 用于将中括号转义为普通字符。正则表达式中，“.”表示表示匹配除换行符 \n 之外的任何单字符，“*”表示零次或多次。所以 “.*” 连在一起就表示任意字符出现零次或多次。没有“?”表示贪婪模式。比如a.*b，它将会匹配最长的以a开始，以b结束的字符串。如果用它来搜索aabab的话，它会匹配整个字符串aabab。这被称为贪婪匹配。又比如模式src=`.*`，它将会匹配以 src=` 开始，以`结束的最长的字符串。用它来搜索 <img src=``test.jpg` width=`60px` height=`80px`/> 时，将会返回 src=``test.jpg` width=`60px` height=`80px`

“?”跟在“*”后边用时，表示懒惰模式，也称非贪婪模式，就是匹配尽可能少的字符。这就意味着匹配任意数量的重复，但是在能使整个匹配成功的前提下使用最少的重复。a.*?b匹配最短的，以a开始，以b结束的字符串。如果把它应用于aabab的话，它会匹配aab（第一到第三个字符）和ab（第四到第五个字符）。又比如模式 src=`.*?`，它将会匹配 src=` 开始，以 ` 结束的尽可能短的字符串，且开始和结束中间可以没有字符，因为*表示零到多个。用它来搜索 <img src=``test.jpg` width=`60px` height=`80px`/> 时，将会返回 src=``。

2. 不使用正则表达式

MySQL 5.6版本中还没有提供正则表达式功能，无捷径可循，只能用常规SQL解决。查询语句如下：

select f,count(*) from (select commentid,sum(case when l1=0 then char_length(s)when locate('[',s) = 0 then char_length(s)when substring(s,1,1)='[' then 1 else char_length(substring(s,1,locate('[',s)-1)) + 1 end) ffrom (select commentid,content,substring_index(substring_index(content,']',id),']',-1) s,l1,id from (select commentid,content,char_length(content)-char_length(replace(content,']','')) l1 from (select commentid,case when locate(']',content) and substring(content,-1,1)<>']' then concat(content,']') else content end content from t1) t) t1,nums where id<=(case when l1=0 then 1 else l1 end)) t group by commentid) t group by f order by f;

没有正则表达式的加持，实现起来比较麻烦，但整个思路还是很清晰。总的想法是，首先对评论字符串以“]”为分隔符转多行，然后针对不同情况对每行求字符长度，之后按每条评论ID分组求和，得到符合规则的每条评论的长度，最后按评论长度分组进行二次聚合，得到每种长度的个数。

下面我们一层层分析。

9-11行中的子查询为每个带有“]”符号，并且最后一个字符不是“]”的评论尾部拼接一个“]”字符。这是针对类似ID为44132703的这种中括号出现在评论字符串中间的情况，只有这样才能用统一方法进行转多行的操作。数字辅助表nums是只有一列ID的1、2、3......数列，关联它用笛卡尔积由原表的一行制造出多行。

8-11行中的子查询，得出每条评论中成对中括号的个数（l1列），0表示评论字符串中没有成对的中括号，结果如下：

7-12行中的子查询，结果为使用以“]”为分隔符转的多行：

2-13行中的子查询，针对不同情况对每行求字符长度。l1=0 时直接求长度，如“舞姿优美”、“[礼物b，永远支持你 [礼物b，，”；否则，字符串中没有出现“[”的，也直接求长度，如“ 赞赞赞赞赞”；否则，“[”是第一个字符的，表示是中括号中的字符串，按规则其长度为1，如“[满分'”、“[握手'”、“[手套”；否则，取“[”前面字符串的长度加1，如“谢谢友友的支持和鼓励[握手'”。之后按每条评论ID分组求和，得到符合规则的每条评论的长度，结果如下：

最外层查询按评论长度分组进行二次聚合，得到每种长度的个数。

3. 字典表 + 自定义函数

-- 创建字典表
create table dict as
select distinct concat(case when locate('[',s) = 1 then s else substring(s,locate('[',s)) end,']') s from (select commentid,content,substring_index(substring_index(content,']',id),']',-1) s,l1,id from (select commentid,content,char_length(content)-char_length(replace(content,']','')) l1 from (select commentid,case when locate(']',content) and substring(content,-1,1)<>']' then concat(content,']') else content end content from t1) t) t1,nums where id<=(case when l1=0 then 1 else l1 end)) t where l1>0 and instr(s,'[')>0;-- 创建替换函数
delimiter //create function translate(ps varchar(6000)) returns varchar(6000)
begindeclare rs varchar(6000);declare done int default 0;declare cs varchar(200);declare c cursor for select s from dict;declare continue handler for not found set done=1;set rs=ps;if instr(ps,'[')>0 and instr(ps,']')>0 thenopen c;while done=0 do fetch c into cs;set rs=replace(rs,cs,'A');end while;close c;end if;return rs;
end;
//delimiter ;-- 查询
select char_length(translate(content)) f,count(*) from t1 group by char_length(translate(content))order by f;

4. 字典表 + 递归查询

-- 创建字典表（同3）-- 递归查询
with recursive cte (commentid,content, cnt) as 
( select distinct commentid,content, 1 as cnt from t1 left join dict on instr(content,s) > 0 union all select commentid,replace(content,s,'A'), cnt + 1 from cte left join dict on instr(content,s) > 0 where content is not null
) 
select char_length(content) f,count(*) from (select distinct t1mentid,t1.content from cte t1,(select commentid,max(cnt) rn from cte where content is not null group by commentid) t2 where t1mentid=t2mentid and t1t=t2.rn and t1.content is not null) t group by char_length(content) order by f;