Understanding Radio Resource Management

Wireless
[Image: AP diagram from the video, with a central AP labeled 6 surrounded by six APs labeled 1 or 11.]

Get an in-depth look at Juniper Mist AI RRM

Radio Resource Management (RRM) is a key tool for large multi-site organizations to efficiently manage their radio frequency (RF) spectrum. This video provides a technical deep dive into the problems Juniper Mist™ AI RRM solves by taking a user-experience approach.

You’ll learn

  • How Juniper Mist uses RRM to manage the RF spectrum and maximize the Wi-Fi end-user experience

  • Why picking static values and letting the system run isn't feasible or scalable

Who is this for?

Network Professionals

Transcript

0:00 Radio Resource Management, or RRM, is a key tool for large multi-site organizations to efficiently manage their RF spectrum. Legacy controller-based implementations build their channel plan on how the APs hear each other, usually late at night; decisions on channel and power are then made and implemented.

0:21 The frustration we hear from our large customers is that these systems focus solely on channel reuse, don't take changing conditions during the day into account, and then overreact for no clear reason.

0:36 About two years ago we completely redesigned our RRM. Instead of just following the how-the-APs-hear-each-other vector, we wanted to take the user experience into account. We already had the capacity SLE (Service Level Expectation), which is an actual measurement of every user minute: whether each user had enough usable RF capacity available, taking into account client count, client usage (a.k.a. bandwidth hogs), and Wi-Fi and non-Wi-Fi interference.

1:08 So we implemented a reinforcement-learning-based feedback model. We monitor the capacity SLE to see whether a channel change and/or power change actually made things better for the users, or whether it had no impact. We trained the system on these types of changes and validated them with the capacity SLE to make sure there was a measurable improvement. This auto-tuning continues on an ongoing basis.
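The video shows no code, but as a rough illustration of this feedback idea, here is a minimal sketch of how a candidate change could be scored against the capacity SLE. It is not Mist's implementation; the function names, the settle time, and the improvement threshold are all assumptions made for the example.

```python
# Illustrative sketch of an SLE-driven feedback loop (not Mist's actual code).
# Assumption: capacity_sle() returns the fraction of "good" user minutes (0.0-1.0)
# for a site, and apply_change()/revert_change() push a channel/power change.
import random
import time

def capacity_sle(site_id: str) -> float:
    """Placeholder: a real system would query the SLE pipeline here."""
    return random.uniform(0.7, 1.0)

def evaluate_change(site_id: str, apply_change, revert_change,
                    settle_minutes: int = 30, min_gain: float = 0.02) -> bool:
    """Apply a candidate channel/power change, wait, and keep it only if the
    capacity SLE shows a measurable improvement for real users."""
    before = capacity_sle(site_id)
    apply_change()
    time.sleep(settle_minutes * 60)      # let clients and the SLE settle
    after = capacity_sle(site_id)

    if after - before >= min_gain:
        return True                      # reinforce this type of change
    revert_change()                      # no measurable benefit: undo it
    return False
```

The key point the sketch tries to capture is that the reward signal is the user-facing capacity SLE, not a raw RF metric.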

1:36 Rather than setting 50 or more different thresholds based on the raw metrics available from some vendors' controller-based systems, we know from experience that there is no perfect value that works across all environments. Each environment is different, and probably not even consistent during the course of a single day. Picking static values and letting the system just run isn't feasible and won't scale. If the capacity SLE is showing 90 percent, there isn't much to gain by making changes.

2:08 The client usage classifier tracks excess bandwidth hogging by certain clients. If we see a two-sigma deviation in bandwidth usage among clients, the higher-usage clients get flagged in the client usage classifier. If the bandwidth usage is pretty much ubiquitous across all clients, the client count classifier is where that would be counted.
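As a back-of-the-envelope illustration of the two-sigma rule described above, the sketch below flags clients whose usage sits more than two standard deviations above the mean. The client names and byte counts are invented; the real classifier works on Mist's own telemetry.

```python
# Sketch of flagging bandwidth hogs at two standard deviations above the mean.
from statistics import mean, stdev

usage_bytes = {
    "laptop-01": 0.8e9, "laptop-02": 0.9e9, "laptop-03": 1.0e9, "laptop-04": 1.1e9,
    "laptop-05": 1.2e9, "phone-01": 0.9e9,  "phone-02": 1.0e9,  "phone-03": 1.1e9,
    "phone-04": 1.0e9,  "phone-05": 1.0e9,  "phone-06": 1.0e9,
    "tv-lobby": 9.5e9,  # one client pulling far more than the rest
}

vals = list(usage_bytes.values())
threshold = mean(vals) + 2 * stdev(vals)
hogs = [client for client, b in usage_bytes.items() if b > threshold]

if hogs:
    print("Flag in client-usage classifier:", hogs)
else:
    print("Usage is roughly uniform; attribute load to the client-count classifier")
```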

2:34 These two events would not cause a channel change, but they would be visible in Marvis. If the capacity SLE is taking a hit based not on client usage but on Wi-Fi or non-Wi-Fi interference, then your end-user experience is taking a hit. Our system is agile and dynamic: rather than just setting min/max ranges and being purely focused on channel reuse, we let the system learn and adapt based on what the end users are experiencing.

3:06 This is the underlying architecture for Mist AI-driven RRM. Let's take a look at the available configuration options. You can choose the power range and your list of channels; these are the only things exposed, because everything else is auto-baselined, so you don't need to set a bunch of thresholds on each of your different sites. The system will self-learn per site based on the capacity SLE.
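To make the "only two knobs" point concrete, here is a sketch of what setting the power range and channel list might look like through an RF template pushed over the Mist REST API. The endpoint path and field names are approximations based on common Mist API conventions, not confirmed by the video; verify them against the Mist API reference before use.

```python
# Sketch: configure the two exposed knobs (power range, allowed channels) in an
# RF template. Endpoint and field names are assumptions; check the API docs.
import requests

MIST_API = "https://api.mist.com/api/v1"
ORG_ID = "your-org-id"       # placeholder
TOKEN = "your-api-token"     # placeholder

rf_template = {
    "name": "warehouse-default",
    "band_5": {
        "power_min": 8,                                  # dBm floor RRM may assign
        "power_max": 17,                                 # dBm ceiling RRM may assign
        "channels": [36, 40, 44, 48, 149, 153, 157, 161],
    },
    # Everything else (thresholds, baselines) is left for RRM to auto-baseline.
}

resp = requests.post(
    f"{MIST_API}/orgs/{ORG_ID}/rftemplates",
    headers={"Authorization": f"Token {TOKEN}"},
    json=rf_template,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```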

3:32 Mist has implemented RRM as a two-tier model. The first tier is global optimization, which runs once a day: it collects data throughout the day on an ongoing basis and builds a long-term trend baseline, and then every day around two or three a.m. local time it makes changes if those changes are warranted. The second tier is event-driven RRM, or as we call it internally, local RRM. It is triggered by the capacity SLE and acts immediately upon any deviation from baseline. Both tiers run in parallel.

4:04 Conventional systems aren't able to leverage the compute available in the cloud to constantly crunch the long-term trend data, nor the ability to cross-pollinate information from all your different sites, different client types, and different RF environments. An example would be buildings around an airport, where we have seen radar hits triggering DFS events: the cloud learns the geolocation and the specific frequencies of these events and then cross-pollinates that learning to other sites that may also be close to that airport.

4:36 Existing systems have no memory and no concept of long-term trend data; they just make changes once a day. Here you can see events happening throughout the day: all of the events with a description are event-driven, and the scheduled ones are the optimizations that happen at night.

4:54 Some systems try to implement a pseudo-local, event-type RRM, usually interference-based, but the problem we run into over time is drift. Because there's no learning going on, eventually you'll need to manually rebalance the system, clear the drift, and start all over again. The reason for this is that there is no memory of what happened and no compute capacity to understand context and learn from it.

5:21 Mist RRM might also try to make a similar channel change, but first we go back and look at the last 30 days. Even though these three available channels look great now, we know one has had multiple issues in the past, so we move that one to the bottom of the pecking order. This makes our global RRM less disruptive than any legacy implementation.

5:45 Using DFS as an example: clients don't respond well to DFS hits. They might not scan certain channels, and they might make poor AP choices. In our implementation we reorder the channels in a pecking order based on what we've seen in that environment over time, so certain channels are automatically prioritized. You might see channels that appear to be a good choice based on current channel and spectrum utilization, but we know there is a high risk of DFS hits based on what we've learned over time, so those channels are de-prioritized. This is truly a self-driving system, and it's not solely focused on channel reuse.
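Here is a minimal sketch of the pecking-order idea: rank candidate channels by how clean they look right now, then push channels with a history of DFS hits toward the bottom. The weighting, utilization numbers, and DFS counts are invented for illustration and are not Mist's actual scoring.

```python
# Sketch of a DFS-aware channel pecking order (illustrative only).
from dataclasses import dataclass

@dataclass
class Candidate:
    channel: int
    utilization: float      # current channel/spectrum utilization, 0.0-1.0
    dfs_hits_30d: int       # radar-triggered DFS events seen in the last 30 days

def pecking_order(candidates: list[Candidate], dfs_penalty: float = 0.25) -> list[int]:
    """Lower score is better: current utilization plus a penalty per recent DFS hit."""
    scored = sorted(candidates,
                    key=lambda c: c.utilization + dfs_penalty * c.dfs_hits_30d)
    return [c.channel for c in scored]

candidates = [
    Candidate(36, utilization=0.20, dfs_hits_30d=0),
    Candidate(52, utilization=0.05, dfs_hits_30d=4),   # looks great now, risky history
    Candidate(149, utilization=0.15, dfs_hits_30d=0),
]
print(pecking_order(candidates))   # channel 52 drops to the bottom despite low utilization
```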

6:29 Stepping back: legacy RRM systems lack the tools to measure whether things actually got better for your users. With Mist, the capacity SLE is exactly that measurement, one that you've never had before.

6:42 If the capacity SLE takes a hit due to Wi-Fi or non-Wi-Fi interference and RRM is not able to make any changes, then you know there's something in your environment you need to take a look at. Or, if RRM is making changes and things are not getting better, then you have some other issue that needs to be addressed. But at least you know. Being able to quantify that the system is getting better is critical, especially once you start deploying many more devices.

7:15 Today's requirements may not warrant this level of sophistication, but once you start putting a lot of IoT devices and other unsophisticated RF devices on the network, our system will learn to accommodate them. To see the channel distribution, you can take a look at this graph. This is from our office, and it's not a perfect RF environment. The graph shows you what the channel distribution looks like, but when you have hundreds of thousands of APs and thousands of sites, you need automations that baseline and monitor using metrics that you trust.

7:50 What we've done is add this top-level metric into RRM. So instead of polling all of your APs and manually inspecting channel assignments, you can simply use our API to pull a single metric. We have a distribution score, a density score, an average number of co-channel neighbors, and an average number of neighbors. So if you have a standard deployment policy that an installer did not follow, you will immediately see that the site isn't in compliance based on these values. You can pull this from the API and create a post-deployment report, so if any of these metrics are deviating, you will know exactly where to focus.
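A post-deployment report of the kind described here might look like the sketch below: pull the site-level RRM metrics for each site and compare them to an agreed policy. The endpoint path and JSON field names are assumptions for illustration; confirm them in the Mist API documentation before relying on this.

```python
# Sketch of a post-deployment compliance check using site-level RRM metrics.
# Endpoint path and field names are assumed, not confirmed by the video.
import requests

MIST_API = "https://api.mist.com/api/v1"
TOKEN = "your-api-token"                    # placeholder
SITE_IDS = ["site-uuid-1", "site-uuid-2"]   # placeholders

MAX_AVG_COCHANNEL_NEIGHBORS = 1.0           # example deployment-policy threshold

def site_rrm_metrics(site_id: str) -> dict:
    resp = requests.get(
        f"{MIST_API}/sites/{site_id}/rrm/current-channel-planning",  # assumed path
        headers={"Authorization": f"Token {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for site_id in SITE_IDS:
    metrics = site_rrm_metrics(site_id)
    avg_cochannel = metrics.get("avg_cochannel_neighbors", 0)   # assumed field name
    status = "OK" if avg_cochannel <= MAX_AVG_COCHANNEL_NEIGHBORS else "OUT OF POLICY"
    print(f"{site_id}: avg co-channel neighbors = {avg_cochannel} -> {status}")
```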

8:30 These SLEs and metrics are available on an ongoing basis. Compare this with existing vendors, where you would have to pull raw metrics and create your own formula to decide whether you need to take any action. We don't want you to pull raw data; we just want you to use site-level metrics. But if you do want to maintain your own reports, we have already done the dedupe and aggregation for you.

8:57 From a deep troubleshooting perspective, "why is this AP on a particular channel?" is a common question when chasing an RF issue you suspect is due to Wi-Fi interference. Each Mist AP has a dedicated radio that scans all the channels all the time and continually maintains a score for each channel it scans. This is the data that RRM uses to score the channels. So whenever RRM gets a trigger from the capacity SLE to make a change, it uses this AP and site score to determine the channel to assign.
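To illustrate what a per-channel score derived from the scanning radio's measurements could look like, here is a toy scoring function. The metric names, weights, and scan values are invented; Mist's real scoring model is not described in the video.

```python
# Sketch of a per-channel score built from dedicated-radio scan data (illustrative).
def channel_score(utilization: float, non_wifi_interference: float,
                  cochannel_neighbors: int) -> float:
    """Higher is better. Penalize busy air time, non-Wi-Fi energy, and co-channel APs."""
    return 1.0 - (0.5 * utilization
                  + 0.3 * non_wifi_interference
                  + 0.2 * min(cochannel_neighbors, 5) / 5)

scan = {
    36:  {"utilization": 0.35, "non_wifi_interference": 0.05, "cochannel_neighbors": 2},
    44:  {"utilization": 0.10, "non_wifi_interference": 0.02, "cochannel_neighbors": 0},
    149: {"utilization": 0.20, "non_wifi_interference": 0.30, "cochannel_neighbors": 1},
}

best = max(scan, key=lambda ch: channel_score(**scan[ch]))
print("Best candidate channel right now:", best)   # 44 in this made-up example
```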

9:28 If an AP is on a channel that doesn't seem optimal, you can look right here and then at the capacity SLE to see whether the decision-making makes sense. If the SLE doesn't show a user-impacting hit, that explains why the AP hasn't changed channel yet: it will defer to the global plan and make the change at night. If there were user impact, the system would have made the change right away.

9:53 In short, we have a self-driving, reinforcement-learning-based RRM implementation. At the same time, we're also providing you with visibility into the decision-making process so you can validate the decisions made by RRM. You also have the ability to pull information at scale via our APIs and maintain baseline and trend data for all your sites. This is valuable if you're asked to deploy a bunch of new devices and the question comes up: do we have the capacity to support this? With the baseline and trend information, you can make informed decisions without having to pull all kinds of raw data and make a guess.

10:31 Typically you want to make adjustments in 2 to 3 dBm increments so you have enough wiggle room. Unlike Cisco and Meraki, we will go up and down in increments of one, so there's more granularity, but as best practices suggest, we always give it a range of plus or minus 3 dBm from a median value, typically the target used by your site-survey predictive design.
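As a small worked example of that guidance, the sketch below derives the 1 dBm steps RRM would be allowed to pick from, given a predictive-design target and a plus-or-minus 3 dBm margin. The function name and numbers are just for illustration.

```python
# Sketch: derive the allowed transmit-power steps from a design target.
def power_range(design_target_dbm: int, margin_db: int = 3) -> list[int]:
    """Return the 1 dBm steps between (target - margin) and (target + margin)."""
    return list(range(design_target_dbm - margin_db, design_target_dbm + margin_db + 1))

print(power_range(14))   # [11, 12, 13, 14, 15, 16, 17]
```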

10:57 We had one customer ask us why their coverage SLE was 99 percent when they had excellent coverage in their warehouse, which was full of Wi-Fi-driven robots. In the past, when there was a robot problem, the client team would inevitably blame the infrastructure team; the infrastructure team would request detailed logs from the client team, and most of the time that led to no action. When Mist was installed and we saw the 99 percent coverage SLE, we looked at the affected clients, and it always seemed to be the same robot. When they asked the client team about it, the answer was, "yeah, that robot has always been a little quirky." When they took the robot apart, they found a damaged antenna cable. This was eye-opening to the customer, and their quote to us was, "you guys solved the needle-in-the-haystack problem."

11:46 The coverage SLE is a powerful tool. At another customer, a driver update was pushed to some of their older laptops. They have over a hundred thousand employees, so they did a slow rollout, but they started getting Wi-Fi complaints almost right away. Their laptops come with 5 GHz and 2.4 GHz profiles already installed, because each of their sites is a little different in its capabilities. What happened is that this update caused laptops to choose 2.4 GHz when they normally would have chosen 5 GHz, so the SLEs immediately showed a significant deviation from baseline that correlated with those specific device types and the sites that were having the problem.

12:31 They immediately stopped the push because the correlation was obvious. This customer told us that in the past they would have asked a user to reproduce the problem so they could collect the telemetry they needed to diagnose it. Now they realize that Mist already has the telemetry needed to tell them they have a growing problem and what that problem is, saving them a ton of time. That is the power of Mist AI RRM.
