Motivation

HIV (human immunodeficiency virus) is a crucial global public health concern. Since the beginning of the epidemic, 85.6 million people have been infected with the HIV virus and about 40.4 million people have died of HIV. Therefore, monitoring HIV surveillance data and understanding the risk factors of HIV infection is essential for HIV prevention and ending the HIV epidemic. This project aims to provide evidence-based information for public health practitioners to develop tailored HIV interventions, available to researchers from all over the world.

Initial Question

What is the global incidence of HIV? Which populations are inappropriately influenced by HIV? What are the risk factors for HIV infection? To answer these questions, we utilized the global HIV database provided by the US Census to understand trends in HIV from all countries. Since this database does not include information in the US, we utilized National Health and Nutrition Examination Survey (NHANES) datasets. An advantage of the NHANES dataset is that it includes variables such as demographic, socioeconomic status, etc., which also allows us to delve more in-depth into the potential risk factors associated with HIV infection.

Data Sources and Cleaning

HIV surveillance

Data Source

The United States Census HIV/AIDS Surveillance Data Base contains information for all countries and areas of the world with at least 5,000 population, with the exception of Northern America (including the United States) and U.S. territories. Data included in the Data Base are drawn from medical and scientific journals, professional papers, official statistics, State Department cables, newspapers, magazines, materials from the World Health Organization, conference abstracts, and the Internet. Sources are scanned and reviewed for data.

Data Cleaning

Since the dataset has inconsistency in sample size (some come with a sample size and some come without), we assigned the median value to those without sample size so that they could be included in the study without significantly influencing the sample size variable. Since the Data Base includes data from all countries worldwide with 5000+ population, we utilized an external country classification dataset from the World Bank to categorize the Data Base for analysis purposes.

incidence <- read.csv("surveillance/data/hiv_incidence.csv") |>
  mutate(
    # Replace missing INC_RATE with 0 or other appropriate value
    INC_RATE = replace(INC_RATE, is.na(INC_RATE), 0), 
    # Replace missing SAMPSIZE with median or other appropriate value
    SAMPSIZE = as.numeric(SAMPSIZE),
    SAMPSIZE = replace(SAMPSIZE, is.na(SAMPSIZE), median(SAMPSIZE, na.rm = TRUE)) 
  )

country_info <- readxl::read_excel("surveillance/CLASS.xlsx", sheet = "List of economies", col_names = TRUE)

incidence_grouped <- left_join(incidence, country_info, by = c("Country.Code" = "Code")) |> 
  janitor::clean_names()

categorize_decade <- function(year) {
  paste0(substr(year, 1, 3), "0s")
}

incidence_grouped <- incidence_grouped |> 
  mutate(decade = sapply(year, categorize_decade))

To prepare the dataset for trend analysis, we first ensure all case counts are numerical and then rectify any data inaccuracies by substituting missing or placeholder values with zeros. Next, we organize the data by continental regions, income groups, and decade intervals, which allows us to compute average HIV incidence rates and total case counts within these categories. Similarly, we categorize the data by specific population subgroups to identify and quantify HIV trends within those demographics.

# For trend analysis involving region (continental), income_group (i.e., high-income, low-income, middle-income countries), and decade (1980s, 1990s, 2000s, 2010s, 2020s).
trend_continental <- incidence_grouped |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases)) |>
  # Group and summarize data by region, income_group, and decade
  group_by(region, income_group, decade) |>
  summarize(
    avg_inc_rate = mean(inc_rate, na.rm = TRUE),
    total_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # to drop grouping structure after summarization
  )

# For trend analysis involving population subgroup, e.g., MSM, IVDU, Transgender individuals, etc.
trend_subgroup <- incidence_grouped |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases)) |>
  # Group and summarize data by population subgroup
  group_by(population_subgroup) |>
  summarize(
    avg_inc_rate = mean(inc_rate, na.rm = TRUE),
    total_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # to drop grouping structure after summarization
  )

Risk factor investigation

Data Source

The National Health and Nutrition Examination Survey is designed to evaluate the health and nutritional status of both adults and children. In 1999, the NHANES study transitioned into a continuous format, enabling investigators to conduct longitudinal studies. NHANES encompasses both interview data and physical examination data, which involved information from different perspectives and fields, providing researchers with a comprehensive dataset to examine the prevalence and risk factors associated with various diseases. In this study, we used the demographic data, HIV testing, and sexual behavior data from 1999 to 2016 to identify the potential sociodemographic factors associated with HIV infection status.

Data Cleaning

We will load the following libraries for utilization this section:

haven
tidyverse
purrr
plotly
modelr

The following variables are included in the analysis:

seqn: sequence
lbdhi/lbxhivc: HIV test results
ridageyr: age
riagendr: gender
ridreth1: race
xdmdeduc2: education
dmdmartl: martial status
sxq220 / sxq550: In the past 12 months, with how many men have you had anal or oral sex? (Only to men)
sxq150 / sxq490: In the past 12 months, with how many women have you had anal or oral sex? (Only to women) # Exploratory analysis

import NHANES datafiles

The analyzed data sets was downloaded from the National Health and Nutrition Examination Survey Website. Specifically, demographic, HIV test, and sexual behavior datas from 1999 to 2016 utilized to conducted the analysis. Participants aged between 20 and 49 with a completed survey were eligible and included to our study.

demo_list = 
  list.files(path = "NHANES/DEMO", full.names = TRUE) 
hiv_list = 
  list.files(path = "NHANES/HIV", full.names = TRUE) 
sxq_list = 
  list.files(path = "NHANES/SXQ", full.names = TRUE) 


load_files = function(x){
  
  filetype =
    str_extract(x, "DEMO|HIV|SXQ")
  
  year = 
    str_extract(x, "(?<=_)[A-Z]")
  
data = 
    read_xpt(x) |>
    janitor::clean_names()|>
    mutate(filetype = filetype, year = year)
    
}


demo_list_output = 
  map(demo_list, load_files) 

hiv_list_output = 
  map(hiv_list, load_files) 

sxq_list_output = 
  map(sxq_list, load_files)

join_pair = function(df1, df2) {
  left_join(df1, df2, by = c("seqn", "year"))
}


hiv_demo_list = 
  map2(hiv_list_output, demo_list_output, ~join_pair(.x, .y))
  
hiv_demo_sxq_df = 
  map2(hiv_demo_list, sxq_list_output, ~join_pair(.x, .y)) |>
  bind_rows() 

cleaned_joined_df = 
  hiv_demo_sxq_df |>
  filter(ridageyr >= 20 & ridageyr <= 49) |>
  mutate(hiv = case_when(
    lbdhi == 1 | lbxhivc == 1 ~ 1,
    lbdhi == 2 | lbxhivc == 2 ~ 2)
    )|> 
  mutate(MSM = case_when(
    year %in% c("A","B","C") ~ sxq220,
    year %in% c("D","E","F","G","H","I") ~ sxq550)
    )|>
   mutate(WSW = case_when(
    year %in% c("A","B","C") ~ sxq150,
    year %in% c("D","E","F","G","H","I") ~ sxq490)
    )|>
  mutate(year = recode(year, 
                       "A" = "1999-2000",
                       "B" = "2001-2002",
                       "C" = "2003-2004",
                       "D" = "2005-2006",
                       "E" = "2007-2008",
                       "F" = "2009-2010",
                       "G" = "2011-2012",
                       "H" = "2013-2014",
                       "I" = "2015-2016"),
         hiv = recode(hiv,
                            "1" = "Reactive",
                            "2" = "Non-reactive"),
         gender = recode(riagendr, 
                         "1" = "Male",
                         "2" = "Female"),
         age = ridageyr,
         race = recode(ridreth1,
                       "1" = "Mexican American",
                       "2" = "Other Hispanic",
                       "3" = "Non-Hispanic White",
                       "4" = "Non-Hispanic Black",
                       "5" = "Other Race - Including Multi-Racial"),
         education = recode(dmdeduc2,
                            "1" = "Less than 9th grade",
                            "2" = "9-11th grade",
                            "3" = "High school graduate or equivalent",
                            "4" = "Some college or AA degree",
                            "5" = "College graduate or above",
                            "7" = NA_character_,
                            "9" = NA_character_),
         marriage = recode(dmdmartl,
                           "1" = "Married", 
                           "2" = "Widowed",
                           "3" = "Divorced",
                           "4" = "Separated",
                           "5" = "Never married",
                           "6" = "Living with partner",
                           "77" = NA_character_,
                           "99" = NA_character_)
         ) |>
  mutate(samesexcontact = case_when(
    (MSM %in% c(0, 7777, 77777,9999, 99999) | WSW %in% c(0, 7777, 77777, 9999, 99999)) ~ 0,
    (MSM >= 1 & MSM <= 600) | (WSW >= 1 & WSW <= 600) ~ 1
  )) |>
  select(seqn, hiv, age, gender, race, education, marriage, year, samesexcontact)

HIV surveillance

Incidence map

we showed HIV incidence by country.

# A map reports the HIV incidence rates in each country
map_inc = incidence_grouped |>
  plot_ly(type = 'choropleth', locations = ~country_code,
               z = ~inc_rate, text = ~paste(country, ': ', inc_rate, '%'),
               hoverinfo = 'text',color = ~inc_rate, colors = "Reds") |>
  layout(title = 'HIV Incidence Rates by country',
         geo = list(projection = list(type = 'orthographic')))

map_inc

Regions such as the Middle East, Sub-Saharan Africa, and South Asia demonstrate relatively higher HIV incidence rates compared to other areas. It is important to consider that regions depicted with no or minimal HIV incidence rates might not necessarily indicate a true low prevalence; this could be attributed to insufficient research or underreporting in those regions. Additionally, the database from which this map is generated does not include data from North American countries, hence their HIV incidence rates are not represented in this visualization.

Heatmap for Incidence Rate by Region, Decade, and National Income Level

we categorized data by their continental region, their studied country’s national income level, and decade of which the study is done. We produced a heatmap for reported HIV incidence rate grouped by these categories.

trend_continental <- trend_continental |> 
  mutate(avg_inc_rate = as.numeric(avg_inc_rate)) 

heatmap_plot <- trend_continental |> 
  ggplot(aes(x = decade, y = region, fill = avg_inc_rate)) + 
  geom_tile(color = "white") +  
  scale_fill_gradient(low = "white", high = "red") +  
  facet_wrap(~income_group) +  
  theme_minimal() + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    axis.title = element_text(size = 12),  
    legend.title = element_text(size = 10)  
  ) +
  labs(
    title = "Heatmap of HIV Incidence Rate by Region, Decade, and Income Level",
    x = "Decade",
    y = "Region",
    fill = "Avg Inc Rate"
  )

print(heatmap_plot)

High, Upper middle income, and Lower middle income regions in Europe and Central Asia has the highest average incidence rate during 2000s and 2010s, followed by High income regions in Middle East & North Africa during 1990s and Lower middle income regions in South Asia during 2020s. Sub-Saharan Africa shows consistently higher incidence rates across all income levels and decades.

Bar Plot for Total Cases by Population Subgroup

we recoded the population subgroup according to the codebook on website(https://www2.census.gov/programs-surveys/international-programs/guidance/user-guide/HIV_AIDSGeneralTerms.pdf).
- Population Subgroups:
  - Children: Includes all children except for those who are patients with tuberculosis or transfusion recipients.
  - Intravenous Drug Users/Needle Sharers: Includes drug addicts.
  - Military/Armed Forces: Includes police forces.
  - Other Non-Representative: Includes high-risk individuals such as transgender individuals, and clients of sex workers who do not fall into any of the other subpopulations.
  - Other Patients: Includes all other patients except for pediatric patients, STI patients, TB patients, and Transfused patients.
  - Sex workers: Includes other highly sexually active women such as bar workers.
  - Two Known Mixed Groups: Includes two different subpopulations reported with one number. An example one number represents both pregnant women and TB patients.

We produced a barchart for reported number of HIV cases grouped by these recoded subgroups. I excluded those of general population not in this six categories.

## recode subpopulation
trend_subgroup = trend_subgroup |>
  mutate(
    subpop_pooled = ifelse(
     str_detect(population_subgroup, "(?i)police|military"), "Military/Armed Forces",
     ifelse(
       str_detect(population_subgroup, "(?i)child|pediatric"), "Children",
     ifelse(
       str_detect(population_subgroup, "(?i)drug|IVDU") & !str_detect(population_subgroup, "(?i)STI|homo|prisoner|partner|sex worker"),  "Intravenous Drug Users/Needle Sharers",
     ifelse(
       str_detect(population_subgroup, "(?i)sex worker|bar") & !str_detect(population_subgroup,"(?i)trans|IVDU|homo|MSM|drug|client"),"Sex workers",
     ifelse(
       str_detect(population_subgroup, "(?i)transgender|homo|MSM|gay|TB|STI|sexually|prisoner|infants born|HIV2+|high risk|of HIV") & !str_detect(population_subgroup, "(?i)sex worker|IVDU|drug|Testing center attendees"), "Other Non-Representative",
      ifelse(
       str_detect(population_subgroup, "(?i)pts.|pregnant|Mothers") & !str_detect(population_subgroup, "(?i)sex worker|IVDU|STI|homo|TB"), "Other Patients",
      ifelse(
       str_detect(population_subgroup, "(?i)Blood|port|Testing center attendees|employer|contractor|textile|Wives|HIV-") , "General Population","Two Known Mixed Groups"))))))))

# unique(incidence$population_subgroup[str_detect(incidence$population_subgroup,"patient")]) for categorization
# unique(incidence$population_subgroup[incidence$population_subgroup_pooled=="Two Known Mixed Groups"])
trend_subgroup = trend_subgroup |>
  mutate(subpop_pooled = ifelse(
    population_subgroup %in% c("Bisexuals","Clients of sex workers","High risk individuals","Partners of drug users","Blood transfusion recipients"),"Other Non-Representative",subpop_pooled),
    subpop_pooled = ifelse(
      population_subgroup == "Sex workers & clients","Sex workers",subpop_pooled),
    subpop_pooled = ifelse(
      population_subgroup == "Patients","Other Patients",subpop_pooled),
    subpop_pooled = ifelse(
      population_subgroup %in% c("Heterosexuals","Adults","Individuals","Women","Various groups","Residents","Unspecified population","Truck drivers","Others","Adults - circumcised","Adults - uncircumcised","Factory workers", "General population", "Low risk groups","Fishermen","Workers","Esquineros","Trucking companies employees","Youths","Hospitality girls","Adolescents","Volunteers","Controls","Rural population","Employees","Commercial bank employees"),"General Population",subpop_pooled))

#check if I recoded it correctlt
check=trend_subgroup |>
  filter(subpop_pooled != "Two Known Mixed Groups") |>
  group_by(subpop_pooled,population_subgroup) |>
  summarise(n=n())


trend_subgroup |> 
  filter(subpop_pooled != 'General Population')|>
  group_by(subpop_pooled) |>
  summarise(total = sum(total_cases))|>
  mutate(subpop_pooled = fct_reorder(subpop_pooled,total)) |>
  ggplot(aes(y = subpop_pooled, x = total, fill = subpop_pooled)) + 
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Total HIV Cases by Population Subgroup",
       y = "Population Subgroup",
       x = "Total Cases") +
  theme(axis.text.y = element_text(size = 5, angle = 0),
        legend.position = "none")

This horizontal bar chart displays the distribution of total HIV cases by different population subgroups, with “Other Non-Representative” having the highest count and “Children” having the lowest.

Risk Factor investigation

To visualize the association between the selected potential risk factors and HIV test results, we specifically calculated the percent of participants whose HIV-antibody test showing (lbdhi or lbxhivc) ‘positive’ among all the participants who reported to have a ‘positive’ or ‘negative’ value in the antibody test. We further excluded the participants whose status is ‘missing’ or ‘indeterminate’ in the antibody test in our data visualization section.

Then, we plot each bar chart illustrating the integrated (from 1999 to 2016) percentage of ‘positive’ participants on the y-axis and the risk factor of interest on the x-axis, correlated with a line plot indicating the change of the ‘positive’ participants for each risk factor to assess the association between HIV and each risk factor in a rough way. Besides, the mean age of the positive clusters among each sub-cohort was also calculated for each characteristic/risk factor.

combine_plot = function(bar, line, indicator){
  title_g = paste("Percent of HIV-Antibody Positive Patients from 1999 to 2016 by", indicator)
  
  combine_bar = 
    subplot(bar, line, margin = 0.1) %>% 
    layout(title = title_g)
  
  annotations = list( 
  list( 
    x = 0.17,  
    y = 0.98,  
    text = "Total Percent",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
    x = 0.86,  
    y = 0.98,  
    text = "Percent by Year",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  )) 
  
  combine_bar = combine_bar %>% 
  layout(annotations = annotations)
  
}

bar_race = 
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(race) %>%
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(race = fct_reorder(race, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~race, y = ~hiv_percent, color = ~race, type = "bar", text = ~text_label)

line_race =
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(year, race) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(race = fct_reorder(race, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~race, type = "scatter", mode = "lines+markers",
          text = ~text_label) 


combine_race = combine_plot(bar_race, line_race, "Race")

combine_race

The bar plot shown the percent of patients with a positive result from the HIV antibody test in each race when we integrated all datasets from 1999 to 2016. It indicated that the race Non-Hispanic Black has reported the highest percent, while the ‘Other Race’ had the lowest percent. From the bar plot, disparity among the races.
The line plot shows the change in the reported, positive percent of HIV-antibody test from 1999 to 2016. We could observe that the trend in each race are different, but the Non-Hispanic Black and Other Hispanic had a decreasing trend over 1999 to 2016. From the two plots, we found that the mean age in each race and each year were also different.

Race and the Percent of ‘Positive’ in HIV Antibody Test

bar_race = 
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(race) %>%
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(race = fct_reorder(race, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~race, y = ~hiv_percent, color = ~race, type = "bar", text = ~text_label)

line_race =
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(year, race) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(race = fct_reorder(race, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~race, type = "scatter", mode = "lines+markers",
          text = ~text_label) 


combine_race = combine_plot(bar_race, line_race, "Race")

combine_race

The bar plot displays the percentage of patients with a positive result from the HIV antibody test for each racial group when we integrated all datasets from 1999 to 2016. It indicates that the Non-Hispanic Black race reported the highest percentage, while the ‘Other Race’ had the lowest percentage. This bar plot highlights disparities among the races. The line plot illustrates the change in the reported positive percentage of HIV antibody tests from 1999 to 2016. We can observe that the trends in each racial group are different, with Non-Hispanic Black and Other Hispanic showing a decreasing trend over this period. From the two plots, it is evident that the mean age varies for each race and each year were also different.

Sex and the Percent of ‘Positive’ in HIV Antibody Test

bar_gender = 
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(gender) %>%
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~gender, y = ~hiv_percent, color = ~gender, type = "bar", text = ~text_label)


line_gender = 
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(year, gender) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~gender, type = "scatter", mode = "lines+markers",
          text = ~text_label)

combine_gender = combine_plot(bar_gender, line_gender, "Gender")
combine_gender

Both the bar plot and the line plot indicate that males have reported a significantly higher percentage than females. Furthermore, the line plot reveals a decreasing trend in positive percentages for both sex. From the two plots, it is evident that the mean age in females appears to be slightly higher that among males among patients with a positive test result.

Education Level and the Percent of ‘Positive’ in HIV Antibody Test

bar_edu = 
cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(education) %>%
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(education = forcats::fct_relevel(education, c("Less than 9th grade", "9-11th grade", 
                                         "High school graduate or equivalent", 
                                         "Some college or AA degree",
                                         "College graduate or above"))) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~education, y = ~hiv_percent, color = ~education, type = "bar", text = ~text_label)


line_edu =
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(year, education) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(education = forcats::fct_relevel(education, c("Less than 9th grade", "9-11th grade", 
                                         "High school graduate or equivalent", 
                                         "Some college or AA degree",
                                         "College graduate or above"))) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~education, type = "scatter", mode = "lines+markers",
          text = ~text_label) 


combine_edu = combine_plot(bar_edu, line_edu, "Education")
combine_edu

The bar plot indicates that the group with an education level of ‘9-11th grade’ has reported the highest percentage, while ‘College graduate or above’ had the lowest percentage. Disparity among education levels was also observed. From the line plot, the trends from 1999 to 2016 in each education level are distinct, but with an overall decreasing trend. These two plots also indicate that the mean age varies for each race and each year, and the overall trend is unclear.

Marital status and the Percent of ‘Positive’ in HIV Antibody Test

bar_mar = 
  cleaned_joined_df %>% 
  drop_na(hiv) %>% 
  group_by(marriage) %>%
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(marriage = fct_reorder(marriage, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~marriage, y = ~hiv_percent, color = ~marriage, type = "bar", text = ~text_label) 


line_mar =
  cleaned_joined_df %>% 
  drop_na(hiv, marriage) %>% 
  group_by(year, marriage) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(marriage = fct_reorder(marriage, hiv_percent)) %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~marriage, type = "scatter", mode = "lines+markers",
          text = ~text_label) 

combine_mar = combine_plot(bar_mar, line_mar, "Marital Status")
combine_mar

The bar plot shows that ‘widowed’ individuals have reported the highest percentage, while ‘married’ individuals had the lowest percentage. From the line plot, it is evident that the trends from 1999 to 2016 in each marital status group are different, with an overall decreasing trend (except for ‘widowed’, which appears to be an outlier). These two plots also indicate that the mean age varies for each race and each year, and the overall trend is unclear.

Same-sex Sexual Behavior and the Percent of ‘Positive’ in HIV Antibody Test

bar_sexbehave = 
  cleaned_joined_df %>%
  drop_na(hiv) %>% 
  mutate(samesexcontact = recode(samesexcontact, "0" = "No same-sex sexual behavior", 
                                  "1" = "Has same-sex sexual behavior")) %>% 
  group_by(samesexcontact) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>% 
  ungroup() %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~samesexcontact, y = ~hiv_percent, color = ~samesexcontact, type = "bar", text = ~text_label) 

line_sexbehave =
  cleaned_joined_df %>% 
  drop_na(hiv, samesexcontact) %>% 
  mutate(samesexcontact = recode(samesexcontact, "0" = "No same-sex sexual behavior", 
                                  "1" = "Has same-sex sexual behavior")) %>% 
  group_by(year, samesexcontact) %>% 
  summarize(total_par = n(), 
            age_mean = round(mean(age[hiv == "Reactive"]), 2), 
            hiv_percent = round(sum(hiv == "Reactive") / total_par * 100, 3)) %>%
  ungroup() %>% 
  mutate(text_label = str_c("mean age: ", age_mean, ", Positive HIV Percent: ", hiv_percent, "%")) %>% 
  plot_ly(x = ~year, y = ~hiv_percent, color = ~samesexcontact, type = "scatter", mode = "lines+markers",
          text = ~text_label) 


combine_sexbehave = combine_plot(bar_sexbehave, line_sexbehave, "Sexual Behavior")
combine_sexbehave

Both the bar plot and the line plot indicate that the population with same-sex sexual behavior has reported a much higher percentage than the population without same-sex sexual behavior. The line plot indicates that the positive percentage fluctuated significantly among the group with same-sex sexual behavior, whereas in the group without same-sex sexual behavior, the positive percentage in each year did not exhibit as much fluctuation. These two plots also suggest that the mean age in the group with same-sex sexual behavior appears to be higher than that in the group without same-sexual behavior.

Additional analysis

HIV surveillance-Time-Series Analysis(ARIMA Model)

Time series analysis is appropriate for examining how data points indexed in time order (like yearly HIV incidence rates) change over time. It is suitable for analyzing trends, seasonal patterns, or forecasting future values. Since we wanted to assess HIV incidence rates by year, time series analysis is a good match.

Data Preparation

To lay the groundwork for our analysis, we arrange our dataset in chronological order to observe the progression of HIV cases over time. We then standardize the case numbers, ensuring they’re all in numerical form and address any missing values by replacing placeholders with zeroes. Following this, we structure the dataset in a granular fashion, with each record representing a unique combination of year, region, income group, and population subgroup. For each of these specific intersections, we calculate the mean incidence rate of HIV and the sum of cases. This methodical organization of data is designed to facilitate an in-depth analysis of HIV trends, allowing us to examine the evolution and patterns of the disease across different regions, economic strata, and demographic groups year by year.

sorted_data <- incidence_grouped |> 
  arrange(year) |>
  # Convert 'no_cases' to numeric (if it's not already)
  mutate(no_cases = as.numeric(no_cases)) |>
  # -1 is a placeholder in the no_cases and no_deaths column for NA. Replace -1 with 0 here.
  mutate(no_cases = ifelse(no_cases == -1, 0, no_cases))

# Transform the dataset to have one observation per time point per subgroup
grouped_data <- sorted_data |> 
  group_by(year, region, income_group, population_subgroup) |> 
  summarize(
    mean_inc_rate = mean(inc_rate, na.rm = TRUE),
    sum_no_cases = sum(no_cases, na.rm = TRUE),
    .groups = 'drop'  # This ensures the resulting tibble is not grouped
  )

Model Building

library(forecast)
library(tseries)
library(purrr)


# Create a list of unique combinations of region, income_group, and population_subgroup
unique_combinations <- grouped_data |>
  distinct(region, income_group, population_subgroup)

# Function to perform time series analysis on a subset of data
perform_time_series_analysis <- function(subset_data) {
  if (length(subset_data$mean_inc_rate) < 50) {
    warning(paste("Not enough data points to fit ARIMA model for", unique(subset_data$region), unique(subset_data$income_group), unique(subset_data$population_subgroup)))
    return(NULL)
  }
  
  ts_data <- ts(subset_data$mean_inc_rate, start = min(subset_data$year), frequency = 1)
  
  # Check for stationarity
  adf_test <- adf.test(ts_data, alternative = "stationary")
  
  # Fit ARIMA model
  arima_model <- auto.arima(ts_data)
  
  # Diagnostics
  diagnostics <- list(
    checkresiduals(arima_model, plot = FALSE)
  )
  
  # Return the results
  return(list(arima_model = arima_model, adf_test = adf_test, diagnostics = diagnostics))
}

# Modify the threshold here
threshold <- 20  # or another number that makes sense for your data

# List to store results
time_series_results <- list()

# Loop over each unique combination
for(i in seq_len(nrow(unique_combinations))) {
  # Filter the data for the current combination
  current_data <- grouped_data |>
    filter(
      region == unique_combinations$region[i],
      income_group == unique_combinations$income_group[i],
      population_subgroup == unique_combinations$population_subgroup[i]
    )
  
  # Perform time series analysis if data points are above the threshold
  if (nrow(current_data) >= threshold) {
    ts_data <- ts(current_data$mean_inc_rate, start = min(current_data$year), frequency = 1)
    
    # Check for stationarity
    adf_test <- adf.test(ts_data, alternative = "stationary")
    
    # Fit ARIMA model
    arima_model <- auto.arima(ts_data)
    
    # Diagnostics
    diagnostics <- list(
      checkresiduals(arima_model, plot = FALSE)
    )
    
    # Store the result with a unique name
    time_series_results[[paste(unique_combinations$region[i], unique_combinations$income_group[i], unique_combinations$population_subgroup[i], sep = "_")]] <- list(arima_model = arima_model, adf_test = adf_test, diagnostics = diagnostics)
  } else {
    warning(paste("Not enough data points to fit ARIMA model for combination", i, ":", unique_combinations$region[i], unique_combinations$income_group[i], unique_combinations$population_subgroup[i]))
  }
}

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691
## 
## Model df: 2.   Total lags used: 5

# Check how many models were successfully fitted
length(time_series_results)

## [1] 1

# View results if any models were fitted
if (length(time_series_results) > 0) {
  time_series_results
}

## $`Sub-Saharan Africa_Lower middle income_Sex workers`
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$arima_model
## Series: ts_data 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2    drift
##       -1.5900  0.9355  -1.0622
## s.e.   0.4624  0.5196   0.5354
## 
## sigma^2 = 70.67:  log likelihood = -89.56
## AIC=187.12   AICc=189.12   BIC=192
## 
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$adf_test
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -3.0323, Lag order = 2, p-value = 0.1791
## alternative hypothesis: stationary
## 
## 
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$diagnostics
## $`Sub-Saharan Africa_Lower middle income_Sex workers`$diagnostics[[1]]
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691

Results

ARIMA Model:

Model Type: ARIMA(0,1,2) with drift indicates that the model is an ARIMA model with no autoregressive terms (p=0), one differencing (d=1 to make the series stationary), and two moving average terms (q=2).
Coefficients:
- ma1: The first moving average coefficient is -1.5900, with a standard error of 0.4624. This is a significant coefficient given the magnitude compared to its standard error.
- ma2: The second moving average coefficient is 0.9355, with a standard error of 0.5196. This coefficient is also significant.
- drift: The coefficient for the drift term is -1.0622, with a standard error of 0.5354. This suggests a downward trend in the series.
Model Fit:
- sigma^2: The estimated variance of the residuals is 70.67.
- log likelihood: The log-likelihood of the model is -89.56, which is used to calculate AIC and BIC.
- AIC, AICc, BIC: These are information criteria used to compare models, with lower values generally indicating a better fit. The AICc is an adjusted version of the AIC for small sample sizes.
Augmented Dickey-Fuller Test:
- This test checks for the presence of unit roots, which are a sign of non-stationarity in the time series.
- The Dickey-Fuller value is -3.0323 with a p-value of 0.1791. Since the p-value is greater than the common significance level of 0.05, the test does not reject the null hypothesis of a unit root being present, suggesting that the time series may not be stationary despite the differencing.
Ljung-Box Test:
- This test is used to determine if there are significant autocorrelations in the residuals at lag k. Ideally, we want the p-value to be above 0.05 to conclude that there are no significant autocorrelations.
- The test statistic is Q* = 8.4892 with a p-value of 0.03691. Since the p-value is less than 0.05, this suggests that there is evidence of significant autocorrelation at lag k within the residuals, which means the model may not be adequately capturing all the patterns in the data.

Interpretations of results

The ARIMA model suggests a decreasing trend in the incidence rates for the subgroup.
The ADF test suggests that the series may not be completely stationary, indicating that further differencing or a different model specification may be necessary.
The significant Ljung-Box test implies that there are autocorrelations in the residuals that the model is not capturing, which could mean that important predictors or higher-order lags might be missing from the model. Given the limited information on the Data Base we are using, we are unable to progress further. However, future research may consider incorporating other predictors for the model.

Model Diagnosis and Time Series Trend Visualization

Select which model/combination to visualize

# Loop through the results and print the summary of each ARIMA model
for (combination in names(time_series_results)) {
  cat("\nCombination:", combination, "\n")
  print(summary(time_series_results[[combination]]$arima_model))
}

## 
## Combination: Sub-Saharan Africa_Lower middle income_Sex workers 
## Series: ts_data 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2    drift
##       -1.5900  0.9355  -1.0622
## s.e.   0.4624  0.5196   0.5354
## 
## sigma^2 = 70.67:  log likelihood = -89.56
## AIC=187.12   AICc=189.12   BIC=192
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE    MAPE      MASE      ACF1
## Training set -0.6193829 7.732668 5.426826 1.645182 139.093 0.7643619 0.4777973

Based on the summary of the ARIMA model for the combination “Sub-Saharan Africa_Lower middle income_Sex workers,” it looks like an interesting case to visualize. The ARIMA(0,1,2) model with drift indicates that the time series is differenced once (to make it stationary) and includes two moving average terms. The drift term suggests a linear trend over time.

Time Series Trend Visualization

# Extracting the data and model for "Sub-Saharan Africa_Lower middle income_Sex workers"
selected_model <- time_series_results[["Sub-Saharan Africa_Lower middle income_Sex workers"]]$arima_model
selected_data <- grouped_data %>%
  filter(
    region == "Sub-Saharan Africa",
    income_group == "Lower middle income",
    population_subgroup == "Sex workers"
  )
# Plotting the actual time series data and the fitted model
ggplot(selected_data, aes(x = year, y = mean_inc_rate)) +
  geom_line() +
  geom_line(aes(y = fitted(selected_model)), color = "red") +
  labs(title = "Time Series Trend and ARIMA Model for Sub-Saharan Africa, Lower Middle Income, Sex Workers",
       x = "Year",
       y = "Mean Incidence Rate") +
  theme_minimal()

Black Line (Actual Data): Represents the actual observed values of the mean incidence rate over time. This line shows the fluctuations in the data across years.
Red Line (Fitted ARIMA Model): Represents the values as fitted by the ARIMA model. The red line seems to follow the general downward trend of the actual data, indicating that the model is capturing the overall trend well.
Interpretation of Trend: There is a notable decrease in the mean incidence rate from the early 1990s, which then stabilizes into a more fluctuating pattern from the 2000s onwards. The red line (fitted values) is capturing this trend, suggesting that the ARIMA model is accounting for the major trend in the data.

Diagnostic Plots for the ARIMA Model

# Diagnostic plots for the ARIMA model of the selected combination
checkresiduals(selected_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.4892, df = 3, p-value = 0.03691
## 
## Model df: 2.   Total lags used: 5

Top Plot (Residuals vs. Time): This plot shows the residuals of the model over time. Ideally, the residuals should be randomly scattered around zero, indicating that the model has captured most of the trend and seasonality. In your graph, residuals do not show any clear pattern over time, which is good, but there seem to be certain years with higher deviations from zero, which could indicate model misspecification or the presence of outliers/anomalies.
Bottom Left Plot (ACF of Residuals): This plot shows the autocorrelations of the residuals at different lags. Ideally, we want to see these autocorrelations fall within the blue dashed lines, which represent the confidence intervals. If they are within the blue lines, it suggests that there is no significant autocorrelation in the residuals, and the model has captured the time series’ structure well. Our plot shows some autocorrelations at different lags crossing the confidence bounds, which might suggest that the model could be improved to better capture the time series dynamics.
Bottom Right Plot (Histogram and Q-Q Plot of Residuals): The histogram with the overlaid density plot and the Q-Q plot assess the normality of the residuals. The histogram seems to show that residuals may be normally distributed, but the Q-Q plot indicates some deviations from normality, particularly in the tails. This suggests that the residuals have some heavier tails than the normal distribution, which can occur if there are outliers or extreme values in our data.

Risk Factor Investigation-logistic regression

Set the reference group

cleaned_regression_df= 
   cleaned_joined_df |> 
   mutate(
      hiv_outcome = case_when(
        hiv == "Reactive" ~ 1,
        hiv == "Non-reactive" ~ 0), 
      education = forcats::fct_relevel(education, c("Less than 9th grade", "9-11th grade", "High school graduate or equivalent", "Some college or AA degree", "College graduate or above")),
      race = forcats::fct_relevel(race, c("Non-Hispanic White", "Mexican American", "Non-Hispanic Black", "Other Hispanic", "Other Race - Including Multi-Racial")),
      marriage = forcats::fct_relevel(marriage, c("Married", "Widowed", "Divorced", "Separated", "Never married", "Living with partner"))
         )

I set “Less than 9th grade” as the reference group for the education variable; “Non-Hispanic White” as the reference group for the race variable; “Married” as the reference group for the marriage variable.

Bivariate analysis

variables = c("age","gender","education","race","samesexcontact","marriage")

fit_and_summarize <- function(var) {
  model = glm(as.formula(paste("hiv_outcome~", var)), data = cleaned_regression_df, family = binomial()) |>
  broom::tidy()
}

model_summaries = 
  map(variables, fit_and_summarize) |>
  bind_rows() |>
  filter(p.value <= 0.008 & term != "(Intercept)")

model_summaries |>
  select(term, estimate, p.value) |>
  knitr::kable(digits = 3)

term	estimate	p.value
age	0.037	0.001
genderMale	1.107	0.000
raceNon-Hispanic Black	2.302	0.000
samesexcontact	2.854	0.000
marriageNever married	1.407	0.000
marriageLiving with partner	1.004	0.001

Because I am conducting 6 logistic tests, the Bonferroni-corrected significance level would be 0.05/6 = 0.008. Variables age, gender, race, samesexcontact, marriage were found significant associated with HIV infection status, at 0.8% level of significance. Therefore, these variables were included to the final model.

The final model

fit_regression = cleaned_regression_df |> 
  glm(hiv_outcome ~ samesexcontact + gender + age + race + marriage, data = _, family = binomial()) 

fit_regression|> 
  broom::tidy() |> 
  mutate(OR = exp(estimate)) |> 
  select(term, estimate, OR, p.value)|> 
  knitr::kable(digits = 3)

term	estimate	OR	p.value
(Intercept)	-10.460	0.000	0.000
samesexcontact	2.834	17.020	0.000
genderMale	1.820	6.171	0.000
age	0.056	1.058	0.000
raceMexican American	0.455	1.576	0.324
raceNon-Hispanic Black	2.072	7.941	0.000
raceOther Hispanic	1.245	3.473	0.009
raceOther Race - Including Multi-Racial	-14.286	0.000	0.978
marriageWidowed	2.036	7.661	0.066
marriageDivorced	1.138	3.121	0.036
marriageSeparated	1.406	4.080	0.020
marriageNever married	1.341	3.823	0.002
marriageLiving with partner	1.033	2.811	0.040

Logistic regression results:

The logistic regression analysis reveals several key risk factors associated with HIV in the US.
- Individuals reporting same-sex contacts exhibit a significant increase in the odds of being HIV positive (OR = 17.020), emphasizing the importance of considering sexual behavior in HIV risk assessment.
- Gender is also a significant predictor, with males having 6.171 times higher odds of being HIV positive compared to females.
- Age demonstrates a positive association, indicating a slight increase in HIV odds with 1-year increase in age (OR = 1.058).
- Among racial categories, Non-Hispanic Black individuals show the highest risk (OR = 7.941), while Mexican Americans do not exhibit a statistically significant association.
- Regarding marital status, divorced, separated, and never married individuals display elevated odds of being HIV positive, suggesting that certain relationship statuses may contribute to increased vulnerability.
These findings underscore the complex interplay of demographic factors in HIV risk and contribute valuable insights for targeted prevention and intervention strategies.

Bootstrap

bootstrap_df =
  cleaned_regression_df |> 
  bootstrap(n = 500) |> 
  mutate(
    models = map(.x = strap, ~glm(hiv_outcome ~ samesexcontact + gender + age + race + marriage, data = .x, family = binomial()) ),
    results = map(models, broom::tidy)) |> 
  select(-strap, -models) |> 
  unnest(results) |> 
  group_by(term) |> 
  mutate(OR = exp(estimate))

Distribution of estimates under repeated sampling

bootstrap_df|> 
  filter(term == "age") |> 
  ggplot(aes(x = estimate))+
  geom_density() +
  facet_wrap(.~ term)

bootstrap_df|> 
  filter(term != "age") |> 
  ggplot(aes(x = estimate))+
  geom_density() +
  facet_wrap(.~ term)

This is the uncertainty that we would get in this dataset if we do repeated sampling

Bootstrap mean OR and 95% CI

bootstrap_df|> 
  summarize(
    mean_OR = mean(OR),
    CI_lower = quantile(OR, 0.025),
    CI_upper = quantile(OR, 0.975)
    )|> 
  knitr::kable(digits = 3)

term	mean_OR	CI_lower	CI_upper
(Intercept)	0.000	0.000	0.000
age	1.059	1.033	1.089
genderMale	6.723	3.314	13.090
marriageDivorced	3.759	0.914	10.648
marriageLiving with partner	3.238	0.857	9.094
marriageNever married	4.566	1.775	11.174
marriageSeparated	4.575	0.812	13.766
marriageWidowed	10.501	0.000	47.665
raceMexican American	1.693	0.546	3.541
raceNon-Hispanic Black	8.629	4.318	16.571
raceOther Hispanic	3.705	1.187	8.266
raceOther Race - Including Multi-Racial	0.000	0.000	0.000
samesexcontact	18.254	10.022	32.351

Bootstrap Confidence Intervals:

The bootstrap analysis yields informative insights into the association between various sociodemographic factors and HIV outcomes.
- Notably, the wide 95% confidence intervals (CI) for most variables, excluding age, suggest substantial uncertainty in the estimated odds ratios. Same-sex behavior stands out with a mean odds ratio of 18.499 and a relatively narrow CI (10.862 to 31.096), indicating a robust positive association with HIV risk.
- In contrast, factors like “Male” “Divorced” “Living with partner” “Never married” “Separated” “Widowed” “Mexican American” “Non-Hispanic Black” and “Other Hispanic” exhibit wide CIs, signifying considerable variability in their associations with HIV.
The broad 95% CI underscores the uncertainty in the magnitude of these associations and suggests a need for caution in drawing definitive conclusions. This may be attributed to sample variability or other unaccounted factors influencing the observed relationships.

Discussion & Conclusion

HIV surveillance

The study has identified meaningful patterns in the incidence rates of HIV among various subgroups in Sub-Saharan Africa. The time series analysis, particularly the ARIMA(0,1,2) model with drift, has captured a notable decreasing trend in HIV incidence rates since the early 1990s. This trend stabilizes somewhat in the 2000s, with some fluctuation, but does not show a significant increase, indicating a potential effectiveness of interventions and awareness over time. Diagnostic plots suggest that while the model fits well overall, there are indications of outliers or anomalous events that could be affecting the residuals, pointing to areas where the model could be refined.

The overall conclusion is that HIV incidence rates in Sub-Saharan Africa show a decreasing trend over the years among sex workers, with some variability that might be explained by regional and economic factors. The analysis underscores the importance of tailored public health interventions and the need for continuous monitoring and adaptation to sustain and enhance the progress made in HIV prevention and treatment.

Risk Factor Investigation

The logistic regression and bootstrap analyses indicate that several demographic factors are associated with HIV outcomes, same-sex behavior showing a particularly strong positive association. However, the wide 95% confidence intervals for most variables, excluding age, show significant uncertainty in the estimated odds ratios, highlighting the limitations of the study. These broad intervals may stem from sample variability or unaccounted confounders, necessitating caution in drawing definitive conclusions.

Therefore, future research should explore larger and more diverse datasets to enhance the generalizability of findings and address potential sources of variability. Additionally, a thorough investigation into the complex interplay between demographic variables and HIV risk, considering potential interactions and subgroup analyses, could provide a more nuanced understanding of these associations. Moreover, the study could benefit from incorporating behavioral and contextual factors that might contribute to HIV risk.

P8105: Project Report

Junjie Hu, Ruixi Li, Xuan Lu, Tianyun Jiang, Qinting Shen

Motivation

Initial Question

Data Sources and Cleaning

HIV surveillance

Data Source

Data Cleaning

Risk factor investigation

Data Source

Data Cleaning

import NHANES datafiles

HIV surveillance

Risk Factor investigation

Additional analysis

HIV surveillance-Time-Series Analysis(ARIMA Model)

Data Preparation

Model Building

Results

Interpretations of results

Model Diagnosis and Time Series Trend Visualization

Risk Factor Investigation-logistic regression

Set the reference group

Bivariate analysis

The final model

Bootstrap

Distribution of estimates under repeated sampling

Bootstrap mean OR and 95% CI

Discussion & Conclusion

HIV surveillance

Risk Factor Investigation

P8105: Project Report

Junjie Hu, Ruixi Li, Xuan Lu, Tianyun Jiang, Qinting Shen

Motivation

Related Work

Initial Question

Data Sources and Cleaning

HIV surveillance

Data Source

Data Cleaning

Risk factor investigation

Data Source

Data Cleaning

import NHANES datafiles

HIV surveillance

Risk Factor investigation

Additional analysis

HIV surveillance-Time-Series Analysis(ARIMA Model)

Data Preparation

Model Building

Results

Interpretations of results

Model Diagnosis and Time Series Trend Visualization

Risk Factor Investigation-logistic regression

Set the reference group

Bivariate analysis

The final model

Bootstrap

Distribution of estimates under repeated sampling

Bootstrap mean OR and 95% CI

Discussion & Conclusion

HIV surveillance

Risk Factor Investigation