Monitoring CPU performance of Lyft’s Android applications

Mobile performance at Lyft

Pavlo Stavytskyi

Published in Lyft Engineering

Apr 13, 2022

🇺🇦 #StandWithUkraine

Android applications such as Lyft’s apps are developed by a large number of contributors. This means that the codebase grows and changes very quickly. Features are constantly being added or improved, and all these modifications can potentially impact the performance of the application. Thus, it is important to understand how the application consumes CPU resources and to see the dynamics of such metrics across product releases.

What we are trying to achieve

The main idea of the CPU analysis is to track how much load the app puts on the device’s CPU while it’s running. Based on this data, it is possible to obtain average values across product releases, compare them and act accordingly.

The primary metric that we are targeting is the relative average CPU usage. This value shows a percentage of how much load the application has put on the CPU between 2 points in time.

How the data is collected

To compute the above value, we must gather many intermediary metrics. There are 2 primary sources for collecting CPU metrics on Android:

Reading Linux system files. Since Android is based on the Linux kernel, it keeps the system directory structure from the kernel.
Using the native system API. This can be called either directly with the C language or using JVM wrappers provided by the Android SDK.

Note: The latest Android versions introduced permission limitations for accessing some Linux system files because of security reasons. Therefore, not all CPU monitoring approaches that are available on Linux work on Android.

CPU specifications

Gathering CPU hardware information is important for multiple reasons. Knowing CPU details such as architecture, its supported ABI list and model can help to categorize data and behaviors (or misbehaviors) across a wide range of devices, and to identify patterns and trends. All of this can help to debug a number of performance-related issues more precisely.

However, details such as the number of cores and clock speed are even more important as they are used in calculations of primary CPU usage metrics.

Number of cores

Number of cores (numCores) is pretty self-explanatory as it shows the number of physical CPU cores. Getting this value programmatically requires using the native sysconf(_SC_NPROCESSORS_CONF) call, either directly or through a JVM wrapper.

Getting the number of CPU cores programmatically
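A minimal Java sketch of the JVM-wrapper route, assuming the android.system.Os.sysconf wrapper from the Android SDK. The CpuCores class and its count method are hypothetical names used for illustration; the sysconf function is passed in as a parameter so the logic can also be exercised off-device.

```java
import java.util.function.IntToLongFunction;

/** Hypothetical helper; the name and shape are illustrative only. */
final class CpuCores {
    private CpuCores() {}

    /**
     * Reads the number of configured CPU cores through a sysconf-style function.
     *
     * On Android this would be called as:
     *   CpuCores.count(android.system.Os::sysconf,
     *                  android.system.OsConstants._SC_NPROCESSORS_CONF);
     */
    static int count(IntToLongFunction sysconf, int scNprocessorsConf) {
        long cores = sysconf.applyAsLong(scNprocessorsConf);
        if (cores < 1) {
            // Defensive check: a value below 1 is not a usable core count.
            throw new IllegalStateException("Could not read core count: " + cores);
        }
        return (int) cores;
    }
}
```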

Clock speed

Clock speed (clockSpeedHz) is the number of clock ticks per second, measured in Hertz. The value is static and does not change over time. It is not very useful as a standalone metric, but it is crucial for calculations described below.

Usually, for the majority of ARM CPUs, this value is 100Hz. However, for x86_64 architectures it is 1000Hz.

To obtain this value, we must use the native sysconf(_SC_CLK_TCK) call, again either directly or through a JVM wrapper:

Getting CPU clock speed (Hz) programmatically
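A comparable Java sketch for the clock speed, again assuming the android.system.Os.sysconf wrapper. ClockSpeed and hz are hypothetical names, and the sysconf function is injected for the same reason as before.

```java
import java.util.function.IntToLongFunction;

/** Hypothetical helper; the name and shape are illustrative only. */
final class ClockSpeed {
    private ClockSpeed() {}

    /**
     * Reads the number of clock ticks per second through a sysconf-style function.
     *
     * On Android this would be called as:
     *   ClockSpeed.hz(android.system.Os::sysconf,
     *                 android.system.OsConstants._SC_CLK_TCK);
     */
    static long hz(IntToLongFunction sysconf, int scClkTck) {
        long ticksPerSecond = sysconf.applyAsLong(scClkTck);
        if (ticksPerSecond < 1) {
            // Later formulas divide by this value, so reject anything unusable.
            throw new IllegalStateException("Could not read clock speed: " + ticksPerSecond);
        }
        return ticksPerSecond;
    }
}
```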

CPU performance metrics

It is possible to divide all CPU metrics into two categories:

Frequency-based metrics: Real-time data based on CPU frequency values.
Time-based metrics: Based on time values provided by the system. They are helpful when calculating the average CPU usage percentage over time.

Frequency-based metrics tend to have wide ranges of values over short periods of time and are thus better suited for real-time monitoring. Since we’re aggregating analytics over time, time-based metrics would be a better fit as they allow for computing average percentages over any time interval during an app session with only a couple of data points. Thus, it would significantly reduce the amount of data we need to collect.

The rest of this blog post will focus on collecting time-based metrics.

CPU time

CPU time (cpuTimeSec) is the time CPU spent doing work for a given application process, and is measured in seconds.

Usually, the CPU is not doing the work for the application during 100% of its lifetime. The graph below shows that CPU time can be represented as many individual time segments on a timeline. The goal is to find the sum of all these segments.

There is one Linux system file present on Android that is especially useful for identifying CPU times for individual apps. This file is located at /proc/[pid]/stat, where [pid] is the ID of an application process. To get [pid] programmatically, the android.os.Process.myPid() function may be used.

Alternatively, it is possible to use the /proc/self/stat path, which refers to the stat file of the calling process and so serves the same purpose.

Reading this file yields something like this:

13382 (package.example) R 733 733 0 0 -1 4194624 36257 0 29 0 212 24 0 0 20 0 29 0 109604942 15316271104 33934 18446744073709551615 400721997824 400722001824 549509190800 0 0 0 4612 1 1073775864 0 0 0 17 0 0 0 1 0 0 400722001920 400722003192 401529204736 549509195107 549509195206 549509195206 549509197790 0

As it turns out, this is just a string with a bunch of space-separated values. Official Linux documentation helps to shed light on the meaning of these numbers. Below are the values useful for calculating CPU time metrics:

(14) utime — the amount of time that this process has been scheduled in user mode, measured in clock ticks.
(15) stime — the amount of time that this process has been scheduled in kernel mode, measured in clock ticks.
(16) cutime — the amount of time that this process’ waited-for children have been scheduled in user mode, measured in clock ticks.
(17) cstime — the amount of time that this process’ waited-for children have been scheduled in kernel mode, measured in clock ticks.

The numbers in parentheses above are 1-based field positions in the /proc/[pid]/stat file. All four values are measured in clock ticks, which means they must be converted to seconds. This is where the CPU clock speed comes in handy: to convert clock ticks to seconds, the former must be divided by the clock speed as shown below.

seconds = clockTicks / clockSpeedHz

Therefore, to calculate CPU time, all 4 numbers above must be added and then converted to seconds as shown below.

cpuTimeSec = (utime + stime + cutime + cstime) / clockSpeedHz
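The parsing and conversion steps above could be sketched in Java as follows. CpuTimes and its methods are hypothetical names, not part of any SDK. One assumption worth calling out: the process name in field 2 is wrapped in parentheses and may contain spaces, so this sketch skips past the last closing parenthesis instead of splitting the whole line on spaces.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical holder for the four CPU time fields of /proc/[pid]/stat. */
final class CpuTimes {
    final long utime;   // field 14, clock ticks
    final long stime;   // field 15, clock ticks
    final long cutime;  // field 16, clock ticks
    final long cstime;  // field 17, clock ticks

    CpuTimes(long utime, long stime, long cutime, long cstime) {
        this.utime = utime;
        this.stime = stime;
        this.cutime = cutime;
        this.cstime = cstime;
    }

    /** Parses one line in the /proc/[pid]/stat format. */
    static CpuTimes parse(String statLine) {
        // Skip field 1 (pid) and field 2 (name in parentheses, may contain spaces).
        int nameEnd = statLine.lastIndexOf(')');
        if (nameEnd < 0) {
            throw new IllegalArgumentException("Unexpected stat format");
        }
        String[] rest = statLine.substring(nameEnd + 1).trim().split("\\s+");
        // rest[0] is field 3, so the 1-based field n lives at rest[n - 3].
        if (rest.length < 15) {
            throw new IllegalArgumentException("Too few fields in stat line");
        }
        return new CpuTimes(
                Long.parseLong(rest[14 - 3]),
                Long.parseLong(rest[15 - 3]),
                Long.parseLong(rest[16 - 3]),
                Long.parseLong(rest[17 - 3]));
    }

    /** Reads and parses the stat file of the calling process. */
    static CpuTimes readSelf() throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/self/stat"))) {
            String line = reader.readLine();
            if (line == null) {
                throw new IOException("/proc/self/stat is empty");
            }
            return parse(line);
        }
    }

    /** cpuTimeSec = (utime + stime + cutime + cstime) / clockSpeedHz */
    double cpuTimeSec(long clockSpeedHz) {
        return (utime + stime + cutime + cstime) / (double) clockSpeedHz;
    }
}
```

For the sample line shown earlier, the four fields are 212, 24, 0 and 0 ticks, so with a clock speed of 100Hz the CPU time comes out at about 2.36 seconds.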

Uptime

Uptime (uptimeSec) is the time since the device booted, measured in seconds.

To get this value programmatically, the SystemClock API from the Android SDK can be used. The value returned by SystemClock.elapsedRealtime() is measured in milliseconds, so we must convert it to seconds.

Getting uptime metric programmatically
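A minimal Java sketch of the conversion. Uptime and uptimeSec are hypothetical names; on a device, the argument would be the value returned by android.os.SystemClock.elapsedRealtime().

```java
/** Hypothetical helper; the name and shape are illustrative only. */
final class Uptime {
    private Uptime() {}

    /**
     * Converts a millisecond uptime reading to seconds.
     *
     * On Android this would be called as:
     *   Uptime.uptimeSec(android.os.SystemClock.elapsedRealtime());
     */
    static double uptimeSec(long elapsedRealtimeMillis) {
        // Divide by a double so that fractions of a second are preserved.
        return elapsedRealtimeMillis / 1000.0;
    }
}
```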

Process time

Total process time (processTimeSec) is the time since the application launched, measured in seconds.

In order to calculate this value, we need to use the /proc/[pid]/stat system file again. This time we need another value:

(22) starttime — the time the process started after system boot, measured in clock ticks.

Therefore, the formula below allows for calculating the process time metric (don’t forget to convert clock ticks to seconds using the CPU clock speed).

processTimeSec = uptimeSec - (starttime / clockSpeedHz)
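The formula translates directly into a small Java function. ProcessTime and its parameter names are hypothetical; the inputs are assumed to have been obtained as described above.

```java
/** Hypothetical helper; the name and shape are illustrative only. */
final class ProcessTime {
    private ProcessTime() {}

    /**
     * processTimeSec = uptimeSec - (starttime / clockSpeedHz)
     *
     * @param uptimeSec      time since boot, in seconds
     * @param starttimeTicks field 22 of /proc/[pid]/stat, in clock ticks
     * @param clockSpeedHz   clock ticks per second
     */
    static double processTimeSec(double uptimeSec, long starttimeTicks, long clockSpeedHz) {
        // Convert the process start offset from clock ticks to seconds first.
        double startSec = starttimeTicks / (double) clockSpeedHz;
        return uptimeSec - startSec;
    }
}
```

For instance, the starttime of 109604942 ticks from the sample line corresponds to about 1096049.42 seconds after boot at 100Hz. With an assumed uptime of 1096109.42 seconds, the process would have been alive for roughly 60 seconds.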

Here’s the visual timeline representation:

**Process time** metric on a timeline graph

Average CPU usage

Average CPU usage (avgUsagePercent) is one of the main metrics that we want to calculate. It shows how much load our app has put on the CPU since it launched, and we measure it in percent.

It’s important to emphasize that this metric (as well as all those above) shows the CPU usage for a specific application, not the total CPU load on the device.

In order to get this value, we need to use the CPU time and process time metrics that we retrieved previously. The idea is to find the ratio between the CPU time and the total application (process) time. Refer to the timeline graph below for a visualization.

**Average CPU usage** metric on a timeline graph

On a single core, the CPU time cannot exceed the process time, so dividing the former by the latter yields a value between 0 and 1. The percentage can then be calculated by multiplying the result by 100 as shown in the formula below.

avgUsagePercent' = 100 * (cpuTimeSec / processTimeSec)

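When doing these calculations in a real app, one might notice that the percentage can sometimes exceed 100. This happens because CPU time is summed over all threads of the process, and threads running in parallel on several cores can accumulate more than one second of CPU time per second of process time. In order to fix this, the value must be divided by the number of cores as shown below.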

avgUsagePercent = avgUsagePercent' / numCores
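Both steps can be combined into one Java function. This is a sketch with hypothetical names (AverageCpuUsage, avgUsagePercent), and the input validation is an added assumption rather than something the formulas require.

```java
/** Hypothetical helper; the name and shape are illustrative only. */
final class AverageCpuUsage {
    private AverageCpuUsage() {}

    /** Average CPU usage since launch, in percent, normalized by core count. */
    static double avgUsagePercent(double cpuTimeSec, double processTimeSec, int numCores) {
        if (processTimeSec <= 0 || numCores < 1) {
            throw new IllegalArgumentException("processTimeSec and numCores must be positive");
        }
        // avgUsagePercent' = 100 * (cpuTimeSec / processTimeSec)
        double unnormalized = 100.0 * (cpuTimeSec / processTimeSec);
        // avgUsagePercent = avgUsagePercent' / numCores
        return unnormalized / numCores;
    }
}
```

With illustrative numbers, 90 seconds of CPU time over 60 seconds of process time gives 150 percent before normalization, which becomes 18.75 percent on an 8-core device.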

However, there is a problem. It is great to know the average CPU usage, but this metric is too coarse: it averages over the entire timeline of the application (from launch until now), so short periods of heavy load are smoothed out. To be able to perform a more sophisticated CPU performance analysis, the next metric comes to the rescue.

Relative average CPU usage

Relative average CPU usage (relAvgUsagePercent) is the primary metric we need for our CPU performance analysis. It is conceptually the same as the previous one but allows for more precise analysis by calculating CPU usage values between two points in time.

Let’s say we want to measure CPU usage every minute. Each measurement then makes it possible to calculate the average CPU usage percentage for that last minute only (or for any other selected time period).

In order to calculate this metric, we need to use the CPU time and process time values. However, this time it is required to get snapshots with these values twice: at the beginning and at the end of the time interval being measured.

Let’s imagine that 1 minute ago we measured cpuTimeSec1 and processTimeSec1.

**First** metrics snapshot on a timeline graph

Now, we can measure the cpuTimeSec2 and processTimeSec2 values.

**Second** metrics snapshot on a timeline graph

Next, we need the values that cover the period of time between those 2 points (1 minute ago and now). To do this, we need to find delta values both for the CPU time and for the process time as shown below.

cpuTimeDeltaSec = cpuTimeSec2 - cpuTimeSec1
processTimeDeltaSec = processTimeSec2 - processTimeSec1

Metrics of a given timespan on a timeline graph

Now, we can use the same formula that we used for calculating the average CPU usage metric:

relAvgUsagePercent' = 100 * (cpuTimeDeltaSec / processTimeDeltaSec)

As with the previous metric, we must divide the result by the number of cores to get the correct percentage:

relAvgUsagePercent = relAvgUsagePercent' / numCores
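The whole calculation, from two snapshots to the final percentage, could be sketched in Java like this. CpuSnapshot and relAvgUsagePercent are hypothetical names, and the guard against a non-positive time delta is an added assumption.

```java
/** Hypothetical snapshot of the two time-based metrics at one point in time. */
final class CpuSnapshot {
    final double cpuTimeSec;
    final double processTimeSec;

    CpuSnapshot(double cpuTimeSec, double processTimeSec) {
        this.cpuTimeSec = cpuTimeSec;
        this.processTimeSec = processTimeSec;
    }

    /** Relative average CPU usage between two snapshots, in percent. */
    static double relAvgUsagePercent(CpuSnapshot first, CpuSnapshot second, int numCores) {
        double cpuTimeDeltaSec = second.cpuTimeSec - first.cpuTimeSec;
        double processTimeDeltaSec = second.processTimeSec - first.processTimeSec;
        if (processTimeDeltaSec <= 0 || numCores < 1) {
            throw new IllegalArgumentException("Snapshots must be ordered and numCores positive");
        }
        // relAvgUsagePercent' = 100 * (cpuTimeDeltaSec / processTimeDeltaSec)
        double unnormalized = 100.0 * (cpuTimeDeltaSec / processTimeDeltaSec);
        // relAvgUsagePercent = relAvgUsagePercent' / numCores
        return unnormalized / numCores;
    }
}
```

As an illustration with made-up values, snapshots of (3.0, 60.0) and (15.0, 120.0) seconds give deltas of 12 and 60 seconds, which is 20 percent before normalization and about 2.5 percent on an 8-core device.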

That’s it. We found the target metric that identifies the average CPU usage between selected points in time!

Testing produced values

In order to test the correctness of values, it is possible to run the measurement in a loop and take relAvgUsagePercent snapshots once or twice a second. This way the numbers can be compared with the Android Studio CPU profiler in real-time, and from our experience, they are pretty close.

Conclusion

We considered a number of CPU-related metrics that might be useful in CPU performance monitoring. The most important value is the relative average CPU usage which helps to determine how much load the application put on the CPU between two points in time.

Additionally, we have discovered different ways to retrieve intermediate system data either using Linux system files or Android native API wrappers. This data is crucial when calculating the primary CPU usage metric.

Knowing how many CPU resources our app requires is very important. It helps to identify and mitigate various performance-related regressions and to evaluate how much each product release impacts the application’s overall performance.

Acknowledgments

Special thanks to Michael Rebello for the editorial support.

Lyft is hiring! If you’re interested in mobile performance, check out our careers page.