Since early April 2020, the Delphi group has conducted a major survey to track COVID-19 across the United States. With the support of Facebook Data for Good, we have been able to recruit tens of thousands of active Facebook users every day to take our voluntary survey. Concurrently, a University of Maryland team has conducted a parallel international effort covering over 100 countries. Every day, we aggregate our survey results to produce estimates of symptoms for counties and states across the United States, making these estimates available through our COVIDcast map and our public API. This augments the numerous other data sources available in our COVIDcast API to form a more complete picture of the pandemic.
In previous posts, we demonstrated how self-reported symptoms on our survey can correlate strongly with COVID cases and how symptom survey data could improve forecasts of COVID cases. Our data is also powering the COVID-19 Symptom Data Challenge, which asked participants to use our symptom survey data to enable earlier detection of outbreaks and help public health authorities and the general public respond to the pandemic. (The first submissions were due on October 6th, and we look forward to sharing the results.)
Over the summer, we began the process of revising our survey to address new topics that had arisen as key problems for public health:
What groups and occupations have been most affected by COVID-19?
Who is being tested for COVID-19, and what are their test results?
Do people who want COVID-19 tests have access to tests?
Are people with symptoms seeking medical care? If so, what kind of care?
Is the public complying with public health recommendations, such as the recommendation to wear masks in public to reduce the spread of COVID-19?
Are people participating in activities that may spread COVID-19, such as attending public events, working indoors with others, or going to school in person?
Some of these topics were partially addressed by our previous survey, but we knew that as the pandemic changes and public health priorities adapt, our survey must change as well. We began a consultation process: in discussions with researchers using our data, public health officials, and team members at Carnegie Mellon and at Facebook, we established the most important topics to address in the new survey and began to draft questions. After several rounds of revisions, we finalized a new version of the survey that includes new questions on testing, mask use, activities, and demographics.
(Meanwhile, questions about the public’s beliefs, behaviors, and norms surrounding COVID-19 are being addressed by a separate survey conducted by researchers at MIT, also in partnership with Facebook to recruit respondents.)
The full text of the survey can be found on our documentation site—this version is Wave 4.
Between the new survey’s deployment on September 8, 2020, and October 7th, we collected 1,220,000 valid responses from respondents across the United States. Each response includes questions about symptoms, mask wearing, testing, and the other important topics described above, along with demographic details about the respondent. These demographics include age, gender, race, occupation, and education, allowing us to understand how different groups have been affected and which groups are currently most vulnerable to COVID-19. (As we described before, Facebook does not receive any individual survey responses; we at Carnegie Mellon collect this data and prepare aggregate data, which does not identify any individual respondent, for public release.)
Let’s explore just a few of the questions on the survey to see what they can tell us. All of the plots below were made with our covidcast R package, using data we make publicly available in the COVIDcast API; the full code accompanies each example.
First, a simple question: What percentage of respondents say that they wear a mask most or all of the time when they’re in public? For comparison, we’ll map this next to a map of the percentage of respondents who personally know someone in their local community who is sick (with a fever and at least one other symptom, such as cough or difficulty breathing). This percentage correlates very well with COVID case rates as reported by state agencies.
library(covidcast)
library(ggplot2)
library(grid)
library(gridExtra)
options(covidcast.auth = Sys.getenv("API_KEY")) # for more on API keys, see: https://cmu-delphi.github.io/delphi-epidata/api/api_keys.html
# Date range covered by this post: Wave 4 deployment through October 7
start_day <- "2020-09-08"
end_day <- "2020-10-07"
# Format end_day as "October 7, 2020" for plot subtitles
vec <- strsplit(format(as.Date(end_day), "%B %e %Y"), split = "\\s+")[[1]]
date_formatted <- paste(paste(vec[1], vec[2]), vec[3], sep = ", ")
plot_label <- labs(caption = "Data from Delphi COVIDcast, covidcast.cmu.edu")
date_label <- labs(subtitle = date_formatted)
# Shared caption for the side-by-side maps
grid_label <- textGrob("Data from Delphi COVIDcast, covidcast.cmu.edu",
                       hjust = 1, x = 1, gp = gpar(fontsize = 9))
mask <- covidcast_signal("fb-survey", "smoothed_wwearing_mask",
                         start_day = end_day, end_day = end_day,
                         geo_type = "state")
cmnty_cli <- covidcast_signal("fb-survey", "smoothed_whh_cmnty_cli",
                              start_day = end_day, end_day = end_day,
                              geo_type = "state")
g1 <- plot(mask, range = c(55, 100),
           title = "% wearing masks in public most or all the time",
           choro_col = c("#d9f0c2", "#bfe6b5", "#1f589f")) +
  date_label
g2 <- plot(cmnty_cli, range = c(5, 40),
           title = "% who know someone who is sick") +
  date_label
grid.arrange(g1, g2, nrow = 1, bottom = grid_label)
We see that in New England, which had numerous COVID cases during the first wave early in 2020, most respondents report wearing masks in public. The lowest rates of mask wearing are in the central United States, such as in North and South Dakota, where case rates are currently increasing rapidly. We can also directly compare the rates of mask-wearing with officially reported case counts, as compiled by USAFacts:
library(dplyr)   # inner_join() and the %>% pipe
library(plotly)  # interactive scatterplot
case_rates <- covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
                               start_day = end_day, end_day = end_day,
                               geo_type = "state")
joined_data <- mask %>%
  inner_join(case_rates, by = "geo_value", suffix = c(".mask", ".case"))
outlying_states <- c("wy", "id", "wi", "mt", "sd", "nd", "vt", "dc", "me",
                     "ut", "ri", "mn")
# Least-squares line of mask use against case rate, drawn as a visual guide
mask_fit <- lm(value.mask ~ value.case, data = joined_data)
best_fit_data <- data.frame(value.case = range(joined_data$value.case))
joined_data %>%
  plot_ly(
    type = "scatter",
    mode = "markers") %>%
  add_trace(
    name = "",
    x = best_fit_data$value.case,
    y = predict(mask_fit, newdata = best_fit_data),
    mode = "lines",
    hoverinfo = "skip",
    line = list(color = "rgb(102, 102, 102)", width = 2, dash = "dash")) %>%
  add_trace(
    name = "",
    x = ~value.case,
    y = ~value.mask,
    text = ~abbr_to_name(toupper(geo_value)),
    marker = list(color = "rgb(187, 0, 0)"),
    hovertemplate = paste0("<b>%{text}</b><br>",
                           "Mask use: %{y:.0f}%<br>",
                           "Cases per 100,000: %{x:.1f}"),
    showlegend = FALSE) %>%
  layout(xaxis = list(title = "New cases per 100,000 people (7-day average)",
                      showline = TRUE, mirror = "ticks", zeroline = FALSE),
         yaxis = list(title = "% wearing masks most/all the time in public",
                      showline = TRUE, mirror = "ticks", zeroline = FALSE),
         title = "Current COVID case rates and mask usage, by state",
         # The date is drawn as an annotation below, since layout() has no
         # top-level subtitle attribute
         annotations = list(
           list(x = 0.5, y = 1,
                text = date_formatted,
                showarrow = FALSE, xref = "paper", yref = "paper",
                xanchor = "center", yanchor = "auto",
                xshift = 0, yshift = 0),
           list(x = 1, y = 0,
                text = "Data from Delphi COVIDcast, covidcast.cmu.edu",
                showarrow = FALSE, xref = "paper", yref = "paper",
                xanchor = "right", yanchor = "auto",
                xshift = 0, yshift = 0)),
         dragmode = FALSE) %>%
  config(modeBarButtonsToRemove = c("zoom2d", "pan2d", "select2d", "lasso2d",
                                    "zoomIn2d", "zoomOut2d", "resetScale2d",
                                    "autoScale2d", "toggleSpikelines",
                                    "hoverCompareCartesian",
                                    "hoverClosestCartesian"))
The relationship is striking. (Hover over or click each point to see which state it is.) Of course, correlation is not causation, and there are many differences between these states beyond their use of masks. For example, people in more rural states may spend less time close to others, and hence feel less need to wear a mask; also, many states with high mask usage had major outbreaks earlier in the pandemic. Nonetheless, this data could be very useful to epidemiological researchers studying the public reaction to the pandemic and its spread.
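One way to put a number on the relationship in the scatterplot is a correlation coefficient. The sketch below is illustrative only: `mask_case_correlation` is a hypothetical helper, not part of the covidcast package, and the small data frame holds made-up values rather than survey estimates. It assumes a data frame with the `value.mask` and `value.case` columns produced by the join above.

```r
# Hypothetical helper (not part of covidcast): correlation between mask use
# and case rates, given a data frame with value.mask and value.case columns
# like the joined_data frame built above.
mask_case_correlation <- function(df, method = "spearman") {
  # Drop states where either estimate is missing before correlating.
  complete <- df[!is.na(df$value.mask) & !is.na(df$value.case), ]
  cor(complete$value.mask, complete$value.case, method = method)
}

# Made-up toy values for illustration only; these are NOT survey results.
toy <- data.frame(
  geo_value  = c("aa", "bb", "cc", "dd", "ee"),
  value.mask = c(95, 90, 85, 75, NA),
  value.case = c(5, 10, 20, 40, 30)
)

rho <- mask_case_correlation(toy)
print(rho)
```

With the real data, `mask_case_correlation(joined_data)` would apply the same calculation to the join above. A rank correlation is used here so that a handful of states with very high case rates do not dominate the result; like the scatterplot itself, it describes association only.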
Mask-wearing surveys have been done before—for example, the New York Times commissioned a large survey during July 2020 and produced detailed maps—but because our survey runs continuously, we will be able to track how the prevalence of mask-wearing changes over time. Our partner survey effort at the University of Maryland has also been tracking mask-wearing worldwide. This up-to-date global data may help public health officials target public awareness campaigns, and also aid researchers trying to understand where the virus is most likely to spread and what measures are most effective in preventing it.
Access to COVID testing is important for many reasons: readily available tests make it possible for cases to be detected sooner, before the virus can spread, and help contact tracers do their work of tracking down new cases. Shortages of tests can impede contact tracing and can also be harmful in settings like nursing homes, where many residents are particularly vulnerable to COVID-19 and staff must be tested regularly to ensure residents aren’t exposed.
Testing data is published by state health authorities and aggregated by organizations such as the COVID Tracking Project; our own COVIDcast map also includes results from antigen tests produced by Quidel, a major test product manufacturer. But publicly reported data can be inconsistent between states, may not include all tests, and can be subject to delays or testing backlogs. A survey allows us to ask people directly whether they have been tested, bypassing these problems.
For example, we can ask: What percentage of people in each state say they were tested in the last 14 days?
tested <- covidcast_signal("fb-survey", "smoothed_wtested_14d",
                           start_day = end_day, end_day = end_day,
                           geo_type = "state")
plot(tested, range = c(6, 20),
     title = "% tested in the last 14 days") +
  plot_label + date_label
We can see strong variation across the country. But it’s worth interpreting this result more carefully. In North Dakota, nearly a fifth of adults surveyed (18%) reported being tested in the last 14 days. Applied to North Dakota’s adult population, that is several times more people than the state’s public testing data says were tested during that period. However, North Dakota’s published testing figures only include PCR tests, not COVID antigen tests, meaning the public figures leave out an important part of testing. News reporting suggests North Dakota makes use of Abbott’s ID NOW antigen test, but there is little public data on how many of these tests are conducted per day.
At the same time, we can’t conclude that the true figure is definitely 18%. Our survey is subject to sampling biases, both because it samples from Facebook’s active user population (which does not include all adults in the United States) and because it is voluntary, and some people may be more likely to volunteer than others. It’s plausible these biases could affect rural states more strongly than others. As we've explained before, we try to adjust for these biases using demographic and other data, but without reliable ground-truth testing data, it’s impossible to tell if these adjustments successfully remove all bias.
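As a rough illustration of what adjusting for sampling bias means, the sketch below computes a weighted percentage, in which each respondent carries a weight meant to reflect how many people they represent. This is a generic survey-weighting sketch and not Delphi's actual procedure; the function name, the weights, and the responses are all hypothetical.

```r
# Generic weighted percentage: NOT the actual Delphi weighting procedure.
# responses: logical vector, TRUE if the respondent reported being tested.
# weights:   positive numbers, larger for under-represented respondents.
weighted_percent <- function(responses, weights) {
  stopifnot(length(responses) == length(weights), all(weights > 0))
  100 * sum(weights * responses) / sum(weights)
}

# Made-up example: four respondents, one of whom stands in for a group
# that is under-represented in the sample and so receives a larger weight.
responses <- c(TRUE, FALSE, FALSE, TRUE)
weights   <- c(1, 1, 1, 3)

unweighted <- 100 * mean(responses)
weighted   <- weighted_percent(responses, weights)
print(c(unweighted = unweighted, weighted = weighted))
```

The weighted and unweighted figures differ whenever the up-weighted respondents answer differently from the rest, which is why the quality of the weights matters and why ground-truth data would be needed to check them.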
Antigen tests, and new rapid tests, will only become more important in the coming months, making it important for epidemiologists to have information about their use. While our survey is not a substitute for reliable public reporting of test results and may be subject to biases, it does suggest that testing volume varies strongly across the United States and illustrates how important accurate test reporting is.
But testing volume is only one part of the issue. We can also ask what fraction of test results are positive. Early in the pandemic, when there were testing shortages, this test positivity rate could indicate areas where tests were only available to people already experiencing symptoms. Tests are more widely available now, but test positivity can still help us understand where the virus is most active and where people are being tested. When survey respondents say they were tested in the past 14 days, we also ask if they tested positive:
positivity <- covidcast_signal("fb-survey", "smoothed_wtested_positive_14d",
                               start_day = end_day, end_day = end_day,
                               geo_type = "state")
plot(positivity, range = c(1, 26),
     title = "% of tests that were positive, last 14 days") +
  plot_label + date_label
We can see quite a large variation in test positivity across the United States, with the highest values in areas that currently have many COVID cases. (A few states, in gray, do not have a reported value. If fewer than 100 survey respondents in one state say they were tested, we do not have sufficient data to accurately estimate the test positivity rate, and hence do not report an estimate for that state.)
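The suppression rule described above can be sketched as follows. This is a simplified, unweighted illustration that assumes we have per-state counts of tested respondents and of positive results; `positivity_estimate` and the counts are hypothetical, and the sketch does not reproduce the pipeline that generates the published signal.

```r
# Simplified sketch of the reporting rule: return NA when fewer than
# min_tested respondents in a state said they were tested.
positivity_estimate <- function(n_tested, n_positive, min_tested = 100) {
  ifelse(n_tested < min_tested, NA_real_, 100 * n_positive / n_tested)
}

# Made-up counts for three hypothetical states; not survey data.
counts <- data.frame(
  geo_value  = c("aa", "bb", "cc"),
  n_tested   = c(250, 99, 100),
  n_positive = c(25, 10, 5)
)
counts$pct_positive <- positivity_estimate(counts$n_tested, counts$n_positive)
print(counts)
```

The state with 99 tested respondents gets no estimate, which corresponds to the gray states on the map, while the state with exactly 100 is still reported.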
Test positivity only indirectly answers a crucial question: Does test availability meet demand for tests? When our survey respondents say they have not been tested in the past 14 days, we ask whether they wanted to be tested in that time. This also varies across the United States:
wanted_test <- covidcast_signal("fb-survey", "smoothed_wwanted_test_14d",
                                start_day = end_day, end_day = end_day,
                                geo_type = "state")
plot(wanted_test, range = c(5, 12),
     title = "% not tested who wanted a test, last 14 days") +
  plot_label + date_label
In short, while the survey data does not replace official public health reporting of COVID testing and case counts, it can provide insights not available any other way. By providing these signals to the public in the COVIDcast API, alongside numerous other data streams, we hope to provide researchers, public health officials, and journalists the information they need to form a more complete picture of the pandemic.
All of the maps and graphs above were built using data we make publicly available through our COVIDcast API, and the exact technical details are described in our signal documentation. The symptom and mask data is already featured on our COVIDcast interactive map, and other survey signals will soon be added as well. More information about the surveys, including preliminary results, is given on our survey site.
But we also know that many researchers will have questions that can’t be answered from these simple county- and state-level averages. What occupations have the highest rate of COVID-19? What age groups engage in the most behavior at risk of transmitting COVID-19? What symptoms are most common in people who report testing positive for COVID-19? Many of these questions require full access to the survey data, including demographics about each respondent, to answer. That is why we share de-identified survey data with academic and nonprofit researchers studying the pandemic. We encourage all researchers to submit a simple form to get access.
Of course, we require all those with access to protect the confidentiality of survey respondents, and permit the public release of only aggregate data that can’t be used to identify any respondent. And while Facebook supports this project by helping recruit participants via a notification at the top of their News Feed, the survey is conducted off of Facebook. Facebook never receives individual respondents’ answers to survey questions, only aggregate data that does not identify any individual respondents.
Between the data we make available through COVIDcast and the many researchers using our raw data to answer important questions, we hope this survey can play an important role in informing our national pandemic response. Armed with the right data, we can make the decisions needed to protect public health and permit safe reopening.
For more information about Delphi’s symptom surveys, and for media contact details, see our surveys page. For updates, you can follow CmuDelphi on Twitter.
Note. This post was updated on October 17, 2020 to correct an error in the scatterplot of mask usage and reported case rates. An error in our data processing system meant our reported case rates were half the size they should have been. This did not affect the trend in the scatterplot, only the scale of the X axis.
Acknowledgements: Many items in Wave 4 of our survey are based on work by the team at the Joint Program in Survey Methodology at the University of Maryland led by Frauke Kreuter; Adrianne Bradford and Samantha Chiu helped design items and gave feedback. Facebook’s survey team, including Sarah LaRocca and Katherine Morris, gave extensive feedback on Wave 4 of the survey and assisted in its deployment. Kelsey Mulcahy at Facebook Data For Good helped coordinate data access for numerous researchers. Taylor Arnold, Alex Reinhart, and Kathryn Mazaitis developed the code needed to process Wave 4. Delphi members Roni Rosenfeld and Ryan Tibshirani suggested survey revisions. We thank Jean Cox-Ganser, Paul Henneberger, Danielle Iuliano, Robert Kraut, and Jordan Weber for suggesting survey items and revisions for Wave 4.