Optimizing Air Quality Alerts

Key Insights & Actionable Recommendations

  • Optimizing the ranking of air quality recommendations sent to users is critical to maximizing the value of the product for users.
  • Air quality patterns differ substantially for homes and offices for all key metrics. This user segmentation is important for ranking recommendations.
  • About 1/5 of users are not labeled as a home or office. Using temporal trends in air quality, I can apply machine learning to accurately identify which users are homes or offices.
  • Air quality recommendations can be ranked by comparing a specific user’s air quality metrics to the norms for their user segment (home or office).
  • The product team agrees that this data-driven approach to prioritizing user alerts is a valuable improvement and is recommending that it be put into production.
  • Moving forward, segmenting users further (e.g., by location) could provide additional improvements in recommendation ranking.

Opportunity

AWAIR builds indoor air quality monitors that send metrics and personalized recommendations for improving air quality to your smartphone. They asked me to mine their data for actionable insights. Given a relatively open-ended request, my first step was to decide how I, as a data scientist, could bring the most value to the company. I decided to focus my efforts on optimizing the ranking of the recommendations that are sent to users. Because the time users spend engaging with the app is relatively limited, there is a small window of opportunity to present the user with information that will maximize the value of the product for them. Thus, it is important that the first recommendation sent has high value for the user. I leveraged the dataset to identify which recommendations should have the highest priority for a given user.

The Data

The dataset contains five metrics: temperature, humidity, CO2, chemicals, and dust. We have one sample every 10 seconds from about 12,000 users, which works out to roughly 6.5 billion rows of data.

Are there segments of users with distinct air quality patterns?

The first question is whether I can identify unique segments or clusters of users that have distinct air quality patterns. If so, I may need to take that information into account when optimizing the air quality alerts. A key conclusion from my analyses was that there are two very different kinds of users: users who have their device in a home and users who have their device in an office. Although we only have these labels for 4/5 of users, they prove to be a highly informative subset.

The differences between homes and offices are most apparent in temporal trends. The plot below shows how CO2 levels change over the course of the day. Each line is a different day of the week. For homes, CO2 peaks in the evenings and falls when people are out during the day. For offices, CO2 peaks during the day while people are working. CO2 is much lower on the weekends, particularly on Sunday. These effects occur because humans exhale CO2 when they breathe. When there are fewer people at home or in the office, CO2 levels drop.

There are also some subtler effects worth noting. For homes, the CO2 curve shifts later in the day on weekends, perhaps because people wake up later on the weekends. CO2 levels are also slightly higher on weekends in homes, probably because people are more likely to be home during the day on the weekend. For offices, you can see carry-over effects from the previous day during the early predawn hours. Predawn Saturday looks just like predawn Friday because both are preceded by a work day. Predawn Monday looks just like predawn Sunday because both are preceded by a weekend day.
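For those curious how the temporal trends behind these plots can be computed, here is a minimal sketch. It assumes the raw readings live in a pandas DataFrame called readings with hypothetical columns user_id, timestamp, co2, and a segment (home/office) label; in practice you would run this on an aggregated sample rather than all 6.5 billion rows.

```python
import pandas as pd

# Hypothetical layout: one row per reading, with the home/office label joined in.
readings["timestamp"] = pd.to_datetime(readings["timestamp"])
readings["weekday"] = readings["timestamp"].dt.day_name()
readings["hour"] = readings["timestamp"].dt.hour

# Average CO2 for each segment, day of week, and hour of day:
# one panel per segment, one line per weekday in the plots above.
co2_trend = (
    readings
    .groupby(["segment", "weekday", "hour"])["co2"]
    .mean()
    .reset_index()
)
```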

How much certainty do I have in these estimates? If you click the right arrow, you’ll see a plot of CO2 that also includes the margin of error (a 95% confidence interval from 1,000 bootstrap samples). You can see that there is some uncertainty in the data, particularly for offices, because they have a smaller sample size. Nonetheless, the key trends that I have identified fall well outside of the margin of error. Thus, I am confident that these trends are real and did not occur by chance alone.
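As a sketch of how such a margin of error can be obtained, here is a percentile bootstrap over users. The per_user_means input (one average per user for a given segment, weekday, and hour) is a hypothetical intermediate, not something from AWAIR's actual pipeline.

```python
import numpy as np

def bootstrap_ci(per_user_means, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for a mean, resampling users."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_user_means)
    boot_means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    half = (100 - ci) / 2
    return np.percentile(boot_means, [half, 100 - half])

# e.g. the band around mean home CO2 at 8 pm on Sundays:
# low, high = bootstrap_ci(per_user_means)
```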

In fact, I see differences between homes and offices for the other key metrics as well. In the next plot, you can click through the remaining metrics by clicking the right arrow. For homes, chemicals peak in the evening. In offices, chemical levels are lower overall and peak during the day and early evening — except on the weekend, where they remain low all day. There is quite a bit of uncertainty in estimates of chemical levels, especially for offices. Nonetheless, I have strong confidence that average chemical levels differ for homes and offices. Finally, dust levels are lower overall for offices, but there is less systematic change in dust levels over the course of the day or between weekdays and weekends.  Although there is some uncertainty in these estimates of dust levels, I have strong confidence that dust levels are higher in homes on average.

Can we easily identify whether a device is in a home or an office?

It is clear that there are some pretty striking differences in the temporal trends seen for homes and offices. Unfortunately, these labels are only available for 4/5 of users because the company has had some difficulty in getting users to enter this information. Therefore, in order to take full advantage of this user segmentation, it would be helpful if we could easily infer whether a device is in a home or an office based on these temporal trends. To that end, I implemented a machine learning model that took temporal trends for CO2, chemicals, and dust as features and used the subset of users with labels as training data.
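As a rough sketch of the feature construction (the exact pipeline is in the GitHub repo), each user can be summarized by one feature per metric, weekday, and hour, built from the same readings DataFrame assumed above. The metric column names here are hypothetical.

```python
# One feature per (metric, weekday, hour), e.g. "co2_Sunday_14" is a user's
# average CO2 at 2 pm on Sundays.
metrics = ["co2", "chem", "dust"]  # hypothetical column names for the three metrics

long = readings.melt(
    id_vars=["user_id", "segment", "weekday", "hour"],
    value_vars=metrics,
    var_name="metric",
    value_name="value",
)
long["feature"] = long["metric"] + "_" + long["weekday"] + "_" + long["hour"].astype(str)

X = (
    long.groupby(["user_id", "feature"])["value"].mean()
    .unstack("feature")          # one row per user, one column per feature
)
y = readings.groupby("user_id")["segment"].first().reindex(X.index)  # "home"/"office"/NaN

labeled = y.notna()              # the ~4/5 of users with labels become training data
```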

Details on the machine learning model

model_chart.png

For all the data science nerds among you, here are some more details on the machine learning model. The chart below provides more technical details (you can also check out the code on GitHub). I used a random forest model, which has several advantages in this application. It is computationally cost-effective, scales easily, and can be implemented quickly by the engineering team. It can model non-linear effects and interactions between features. Another important consideration is that the model is relatively interpretable. It is not a black box: we can peek under the hood and see what is driving the model. For instance, we can see which features are most important in driving the model’s decisions.
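Here is a minimal scikit-learn sketch of the model itself, using the hypothetical X, y, and labeled objects from the feature sketch above. The hyperparameters shown are illustrative, not the tuned values from the actual project, and missing feature cells are assumed to have been imputed beforehand.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out 20% of the labeled users for final validation, as described below.
X_train, X_test, y_train, y_test = train_test_split(
    X[labeled], y[labeled], test_size=0.2, stratify=y[labeled], random_state=0
)

model = RandomForestClassifier(
    n_estimators=500,            # illustrative; tune in practice
    class_weight="balanced",     # homes outnumber offices roughly 9 to 1
    random_state=0,
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```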

How well does the model perform?

ROC.png

The first question I want to ask is whether the model performed better than we would expect from random guessing. The next plot shows the performance of the model relative to random chance. The green line shows the model’s performance. The closer the line gets to the top left corner of the plot, the better the model is doing. The grey cloud shows the range of performance I could expect by random chance. (I estimated this with a permutation test, where I randomly shuffled the “home” and “office” labels 1,000 times and applied the model to see how it performs on random data.) As you can see, the green line is way outside the range of performance I would expect from chance alone. Thus, I can be confident that the model is performing better than chance.
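Here is a sketch of how this comparison can be made. There are several variants of the permutation test; the lightweight version below shuffles the held-out labels and re-scores the same predictions, which is close to, but not necessarily identical to, the procedure used for the plot.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

probs = model.predict_proba(X_test)[:, 1]     # probability of the positive class
pos = model.classes_[1]

# The real ROC curve (the green line).
fpr, tpr, _ = roc_curve(y_test, probs, pos_label=pos)

# The null distribution (the grey cloud): shuffle the labels 1,000 times.
rng = np.random.default_rng(0)
null_aucs = [
    roc_auc_score(rng.permutation(y_test.to_numpy()) == pos, probs)
    for _ in range(1000)
]
print("model AUC:", roc_auc_score(y_test == pos, probs))
print("chance AUC (95% range):", np.percentile(null_aucs, [2.5, 97.5]))
```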

confusion.png

Another way to assess the performance of the model is to see how many homes and offices it correctly identified. The table below shows these counts. The overall counts are small because they come only from the 20% of the labeled data that was held out for final validation. Further, the labeled dataset itself covers only about 4/5 of all users, because about 1/5 of users are missing home/office labels.

We can compare the new model to a very simple baseline in which we treat all users as homes, since about 90% of users are homes. This is implicitly what the company has been doing so far, so it makes sense to compare the new model against it. Compared to that baseline, we lose 20 homes that are now incorrectly classified as offices, but we gain 49 offices that are now correctly identified. Overall, this is an improvement on the status quo. How big of an improvement? Relative to the assume-everyone-is-a-home baseline, the model delivers a 3% boost in correct labelings.
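The comparison itself is just a few lines, assuming the held-out objects from the sketches above and labels stored literally as "home" and "office":

```python
preds = model.predict(X_test)

model_correct = (preds == y_test).sum()
baseline_correct = (y_test == "home").sum()   # naive rule: call every user a home

print("model accuracy:   ", model_correct / len(y_test))
print("baseline accuracy:", baseline_correct / len(y_test))
print("net gain:", model_correct - baseline_correct, "additional correct labels")
```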

What metrics drive the model?

Let’s now look under the hood and consider what is driving the model. The next plot shows the relative importance of each feature included in the model. The model is driven most heavily by CO2 on Sunday and during the predawn hours on Monday. This outcome makes sense given the temporal trends we saw earlier. Although there are differences between homes and offices in all of the metrics, differences in CO2 levels on Sunday and predawn Monday provide the most reliable means of differentiating whether a device is in a home or an office. (Note that because the features are correlated, importance values can be suppressed, so a low value does not mean that a feature carries no information.)
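Extracting these importances from the fitted random forest is straightforward; when features are correlated, permutation importance (sklearn.inspection.permutation_importance) is a useful cross-check on the impurity-based values shown here.

```python
import pandas as pd

# Impurity-based importances from the fitted random forest, highest first.
importance = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance.head(10))   # e.g. Sunday and predawn-Monday CO2 features rank highest here
```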

feature_importance_legend.png

Which metric has the highest priority for the user?

The next question to consider is which air quality metric has the highest priority for a given user. This would be the metric that, if addressed, would bring the greatest improvement in air quality for the user. One way to identify that metric is to find the metric where the user has unusually poor air quality, relative to other users. In order to do that, I need to look at how air quality metrics are statistically distributed for our two user segments (homes and offices).

The plot below shows how two of the metrics are distributed for homes (left) and offices (right). CO2 is on the y-axis and chemicals are on the x-axis. Darker regions of the plot are regions where more users are clustered. A lot of users are at the center of their user segment. That is, they have average levels of CO2 and average levels of chemicals. A good number of users, however, deviate from those norms, with CO2 levels or chemical levels that are well outside of the average for their segment. I can use this multidimensional representation of the data to identify which metric is most abnormal for a given user, relative to other users in their segment.

Let’s walk through a specific example. If you click on the right arrow, you’ll see a grey dot appear on the plot for homes. This is our example user. Note that this user has totally normal levels of chemicals for a home — no cause for concern there. However, their CO2 levels are abnormally high relative to other homes. Therefore, the first alert we send them when they open the app should tell them about their CO2 levels and offer suggestions on how to reduce CO2. This is the air quality metric where we can move the needle the most, so a CO2 alert will bring more value to the user than an alert about any other air quality metric.

I showed you earlier that air quality patterns are very different for homes and offices. This difference is also apparent in the current plot:  the distributions are different for homes and offices. Since I am identifying high priority metrics by comparing users to other users, it is important that I'm not comparing apples and oranges. I want to compare users to other similar users. For instance, what is unusually high CO2 for an office is not necessarily unusually high CO2 for a home. Therefore, I can improve the optimization by taking user segmentation into account when prioritizing the air quality alerts.
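One simple way to operationalize this segment-relative comparison is sketched below: z-score each user's levels against the mean and spread of their own segment, then alert on the metric with the largest positive deviation. The user_metrics table (one row per user with average co2, chem, and dust plus an observed or predicted segment) is a hypothetical intermediate, and the real ranking logic may weight metrics differently.

```python
metrics = ["co2", "chem", "dust"]

# Mean and standard deviation of each metric within each segment (home / office).
segment_stats = user_metrics.groupby("segment")[metrics].agg(["mean", "std"])

def first_alert(row):
    """Metric where this user deviates most above their own segment's norm."""
    stats = segment_stats.loc[row["segment"]]
    z = {m: (row[m] - stats[(m, "mean")]) / stats[(m, "std")] for m in metrics}
    return max(z, key=z.get)

user_metrics["first_alert"] = user_metrics.apply(first_alert, axis=1)
```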

This optimization approach provides a data-driven method for selecting which recommendations to send to users. Critically, the protocol is easy to implement and is scalable. The product team agrees that this data-driven approach to prioritizing user alerts is a valuable improvement and is recommending that it be put into production.