简单函数单调区间怎么画图像性,画圈的不太理解,第二种情况怎么写?求详解

为什么写这篇文章最近在回顾复盘自己的工作经历,发现窗口函数工作中经常用,但自始至终没有系统总结过,特别是写成文章记录下来。而自己的经历是,文字记录会反过来去加深人对一个概念的理解,虽然常常不能知到便做到,但有热情的时候,还是要趁机将所思所想写下来。一、什么是窗口函数窗口函数,也叫OLAP函数(Online Anallytical Processing,联机分析处理),可以对数据库数据进行实时分析处理。语法大致如下<窗口函数>over(partition by 分组字段 order by 排序字段)
--分组和排序字段不是必须项,视问题情况而定主要分为3类,分别是聚合窗口函数、排序窗口函数、偏移窗口函数聚合窗口函数是avg、sum、count、max、min等;排序窗口函数是rank、dense_rank、row_number;偏移窗口函数是lag、lead聚合和排序窗口函数应用比较多,需要着重掌握,偏移窗口函数虽然应用场景不多,但建议还是至少明白是做什么的,什么场景下能用。二、窗口函数有什么用最主要的作用是在不减少原表行数的情况下进行分组排序等计算,主要应用场景有组内比较问题组内topN问题累计求和问题连续登录、活跃N天问题连续出现N次问题三、窗口函数怎么用窗口函数怎么用,要结合实际场景,接下来就以上面的应用场景为例,用数据挨个实现一下。注意:为了尽量说清楚计算原理,做了分步拆解,完整代码都在每个问题的最后一个小点呈现,或者是3,或者是2。3.1 组内比较问题通常情况是组内的某个指标和指标平均值来做比较,既然是平均数,也就很好理解,用到是聚合窗口函数。比如,找出某个班级内某科分数大于该科平均分的学生;找出某产品线活跃用户数比产品组内平均活跃用户数多的日期等题目:找出某产品线活跃用户数比产品组内平均活跃用户数多的日期下面就用以下数据来实现一下,字段意思分别为统计日期、产品id、用户id,这些字段来自用户行为日志表(user_log),事件名称等就不放出来了1、审题可知【某产品线活跃用户数】,需要先根据日期和产品线分组,计算某天某产品线的活跃用户数select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(distinct user_id) as active_user_num
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id2、关键步骤,【某产品线的平均活跃用户数】,审题可知,需要用avg窗口函数求产品线组内该段时间的平均活跃用户数,这样可以不减少原表行数,便于最后将每天日活和平均日活做对比select *
,round(avg(active_user_num)over(partition by product_id),2) as avg_active_user_num
--切记要用产品分组
from
(
select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(distinct user_id) as active_user_num
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id
) as t13、【某产品线活跃用户数比产品组内平均活跃用户数多的日期】,审题知,只要过滤出active_user_num≥avg_active_user_num行记录便可达成目的--完整代码
select stat_date
,product_id
,active_user_num
,avg_active_user_num
from
(
select *
,round(avg(active_user_num)over(partition by product_id),2) as avg_active_user_num
from
(
select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(distinct user_id) as active_user_num
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id
) as t1
) as t2
where active_user_num>=avg_active_user_num
--关键条件
order by product_id,stat_date3.2 组内topN问题题目:产品组内活跃用户数top3的日期是哪几天1、日期、产品线分组聚合计算日活的步骤和上面一样2、组内排序,涉及排序窗口函数,这里有3个,分别是rank、dense_rank、row_number,顺便解释一下他们的区别 rank:值相等时会重复,会产生空位,比如某列有13,12,12,11这4个数,用rank排序该列时,排序结果为1、2、2、4dense_rank:值相等时会重复,不会产生空位,还是上面的例子,用dense_rank排序该列时,排序结果为1、2、2、3row_number:值相等时不会重复,不会产生空位,还是上面的例子,用row_number排序该列时,排序结果为1、2、3、4造点数据实现一下,更直观create table ta
(student varchar(255),
score int);
insert into ta values('A',13),
('A',12),
('A',12),
('A',11);
select *
,rank()over(partition by student order by score desc) as rk1
,dense_rank()over(partition by student order by score desc) as rk2
,row_number()over(partition by student order by score desc) as rk3
from ta具体选用哪一个,根据问题的需求来定3、根据题目意思,这里选用dense_rank更符合一点,组内日活排序后,选取rank前3的即可select *
from
(
select *
,dense_rank()over(partition by product_id order by active_user_num desc) as dau_rank
--组内排序
from
(
select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(distinct user_id) as active_user_num
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id
) as t1
) as t2
where dau_rank<=3
--关键条件
order by product_id,dau_rank3.3 累计求和问题比如,求2023年各个月的网站访问次数和累计访问次数。这时候就需要用到 sum 窗口函数。还是用上述的 user_log 表实际演示一下。题目:求2023年4月各个产品每天的访问次数和累计访问次数1、审题可知,2023年4月“每天”,“各个产品”,访问次数,group by 产品线id,日期,然后count(user_id)select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(user_id) as active_cnt
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id2、每天的累计访问次数,可知,需要按日期正序排序后,每天的累计访问次数都需要用今天的访问次数+昨天的累计访问次数。值得注意的是,累计求和,一定需要 order by,如果不用 order by,求出的是组内这段时间的总访问次数,就是最后一列,sum_active_cntselect *
,sum(active_cnt)over(partition by product_id order by stat_date) as acc_active_cnt
--有排序,得出的是累计求和,无排序,得出的是组内总和
,sum(active_cnt)over(partition by product_id) as sum_active_cnt
from
(
select concat(substr(dt,1,4),'-',substr(dt,5,2),'-',substr(dt,7,2)) as stat_date
,product_id
,count(user_id) as active_cnt
from user_log
where dt between '20230404' and '20230430'
and product_id is not null
group by stat_date,product_id
) as t1计算结果如上,其他任何类似的累计求和问题,都可以这样计算,sum(累计求和指标列)over(partition by分组列(不必需)order by 排序列(必须,通常都是日期列))3.4 连续登录、活跃N天问题经常会遇到产品或者运营会想知道,连续登录N天的用户是哪些。这次用一个登录明细表 login_user 来演示一下,一行数据代表,谁(who)在什么时间(when)什么地方(where)登录了哪个产品线(what)题目:找出连续登录次数大于3或者在3天以上的用户这种问题有两种解法,一种是排序窗口函数,一种是偏移窗口函数排序窗口函数1、这种问题的第一步首先都是去重,一个用户一天可能登录很多次,只保留一条即可,用 group by 去重select
substr(event_time,1,10) as stat_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by stat_date,user_id2、去重后,用排序窗口函数,用户分组,然后排序他的登录日期,记得是正序select user_id
,login_date
,row_number()over(partition by user_id order by login_date) as login_date_rk
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) AS t13、用登录日期减去日期排序的序号,如果是连续登录的话,多行之间的这个差应该是相等的select *
,date_sub(login_date,login_date_rk) as datedff
from (
select user_id
,login_date
,row_number()over(partition by user_id order by login_date) as login_date_rk
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) AS t1
) as t24、用户id和差值日期分组,count(1)即可,结果就为连续登录的天数,取出大于3的即可select user_id
--
,datedff
,min(login_date) as start_login_date
,max(login_date) as end_login_date
,count(1) as login_days
from (
select *
,date_sub(login_date,login_date_rk) as datedff
from (
select user_id
,login_date
,row_number()over(partition by user_id order by login_date) as login_date_rk
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) AS t1
) as t2
) as t3
group by user_id,datedff
having count(1)>=3
order by login_days desc偏移窗口函数还有一种解法是用偏移窗口函数,只不过稍微复杂一点1、还是去重,一个用户一天只保留一条select
substr(event_time,1,10) as login_date
,user_id
fromlogin_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id2、用户分组,登录日期倒序排序后,根据偏移窗口函数,找出每个日期后两行数据对应的日期select user_id
,login_date
,lead(login_date,2)over(partition by user_id order by login_date desc) as next3_date
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) as t13、根据2的结果,再计算登录日期减去2天,也就是2天前的日期是多少select *
,date_sub(login_date,2) as datedff3
from (
select user_id
,login_date
,lead(login_date,2)over(partition by user_id order by login_date desc) as next3_date
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) as t1
) as t24、2的结果和3的结果相等时,便是连续登录3天及以上的记录。很好理解,日期倒序排序,如果是连续登录,则后两行对应的日期,肯定是和日期减去2天后的日期是相等的,否则,就不是连续登录select *
from (
select *
,date_sub(login_date,2) as datedff3
from (
select user_id
,login_date
,lead(login_date,2)over(partition by user_id order by login_date desc) as next3_date
from (
select
substr(event_time,1,10) as login_date
,user_id
from login_user
where dt between '20230401' and '20230430'
and event_type like '%login%'
and is_success=1
and user_id is not null
group by login_date,user_id
) as t1
) as t2
) as t3
where next3_date=datedff3如果只是找出连续3天登录的用户,则只取 user_id 字段去重即可,login_date 可看作 end_date,next3_date 可看作 start_date虽然这个方法也可以算连续登录问题,但显然它的弊端是,需要根据连续登录天数的需求不断的去增加lead()函数的个数,不灵活。不如上面的方法简单3.5 连续出现N次问题这个问题出现的频率不是很高。但也举个例子来实操一下,就拿我喜欢的金州勇士为例,勇士和湖人的西部半决赛正在进行中。有这样一张球员得分表 basketball_scorecreate table basketball_score
(team varchar(255),
number int,
player_name varchar(255),
score int,
get_score_time varchar(255)
);
insert into basketball_score values
('勇士',30,'库里',3,'2023-05-11 11:01:14'),
('勇士',30,'库里',2,'2023-05-11 11:02:17'),
('湖人',23,'詹姆斯',1,'2023-05-11 11:06:14'),
('勇士',30,'库里',3,'2023-05-11 11:07:12'),
('湖人',11,'戴维斯',1,'2023-05-11 11:08:12'),
('勇士',30,'库里',3,'2023-05-11 11:08:45'),
('勇士',11,'汤普森',3,'2023-05-11 11:09:11'),
('勇士',22,'维金斯',3,'2023-05-11 11:09:45'),
('湖人',1,'拉塞尔',1,'2023-05-11 11:10:27'),
('湖人',23,'詹姆斯',1,'2023-05-11 11:11:12'),
('湖人',23,'詹姆斯',1,'2023-05-11 11:14:12');题目是:找出连续4次(及以上)为球队得分的球员1、首先肯定是需要按球队分组2、我们很容易能想到,如果按照得分时间排序后,能将得分球员的名字也组成单独的列,如果一行数据中,4列的球员名字都相同,该球员就是连续4次为球队得分的球员3、根据这个思路,我们用偏移窗口函数lead(),分别取后一行,2行,3行的球员名字值,分别组成单独的列select team
,player_name
,lead(player_name,1)over(partition by team order by get_score_time) as player_name1
-- 球队分组,得分时间排升序后,后面1行和2行的球员姓名字段分别是什么值,组成单独的列
,lead(player_name,2)over(partition by team order by get_score_time) as player_name2
,lead(player_name,3)over(partition by team order by get_score_time) as player_name3
from basketball_score4、如果球员=球员1=球员2=球员3,则是我们需要找的球员select
distinct team
,player_name
-- distinct语法规定对单字段、多字段去重,必须放在第一个查询字段前
-- 如果对表中多列字段进行去重,去重的过程就是将多字段作为整体去重
from
(
select team
,player_name
,lead(player_name,1)over(partition by team order by get_score_time) as player_name1
,lead(player_name,2)over(partition by team order by get_score_time) as player_name2
,lead(player_name,3)over(partition by team order by get_score_time) as player_name3
from basketball_score
) as t1
where t1.player_name=t1.player_name1
and t1.player_name=t1.player_name2
and t1.player_name=t1.player_name3总结:这种连续出现问题,都是将连续出现的值放到单独的列,连续出现几次,就新增几列,同一行值相等,则满足需求好了,上述就是我能想到的所有应用窗口函数的例子,它不难,只要理解了它的原理,根据实际业务需求灵活选用即可。最后,写作不易,如果觉得对你有用,请点个大大的喜欢或者赞同,也许对我以后换工作也有帮助,感谢感谢

我要回帖

更多关于 函数单调区间怎么画图像 的文章

 

随机推荐