CDC Data Exercise

Author

Erick Mollinedo

Published

February 7, 2024

Salmonella paratyphi, Salmonella typhi, and other Salmonellosis infections in 2019

These are the packages used for this exercise:

library(here)
here() starts at C:/Users/molli/OneDrive/Documentos/UGA/Spring 2024/MADA/erickmollinedo-MADA-portfolio
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This dataset contains the provisional cases of Salmonellosis for the year 2019 in the United States regions, the US territories and among non-US residents. Salmonellosis is one of the nationally notifiable diseases tracked by the National Notifiable Diseases Surveillance System (NNDSS). The state health departments report cases to the Centers for Disease Control and Prevention (CDC) on a weekly basis. This dataset was obtained from the CDC data website https://data.cdc.gov/, and the original dataset was downloaded from this link.

Cleaning the dataset

The following code chunk loads the dataset into the salmonella object.

#Load the dataset into the `salmonella` object
salmonella <- read_csv(here("cdcdata-exercise", "data", "Salmonella_CDC_2019.csv"))
Rows: 1470 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Reporting Area, Salmonella Paratyphi infection§, Current week, fla...
dbl  (8): MMWR Year, MMWR Week, Salmonella Paratyphi infection§, Current wee...
lgl  (6): Salmonella Paratyphi infection§, Previous 52 weeks Max†, Salmonell...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Explore the dimensions of the dataset
nrow(salmonella)
[1] 1470
ncol(salmonella)
[1] 29

This dataset has 1470 observations and 29 variables. I am only interested in some of these variables, since others are repetitive or have many non-reported (blank or missing) values. In this part of the code I drop most of the variables, keeping only the reporting area, the week, and the weekly reported cases of Salmonella typhi, Salmonella paratyphi and Salmonellosis (which represents cases of Salmonella other than S. paratyphi and S. typhi).

#Using the `select()` function to choose only 5 variables
salmonella <- salmonella %>% select(c(`Reporting Area`, `MMWR Week`, `Salmonella Paratyphi infection§, Current week`, `Salmonella Typhi infection¶, Current week`, `Salmonellosis (excluding Salmonella Paratyphi infection and Salmonella Typhoid infection)**, Current week`))

Here I rename the variables so they are easier to read, then replace the NA values with 0 so that weeks without a reported count are treated as zero cases.

#First renaming all the columns using the `rename()` function
salmonella <- salmonella %>% rename(Region = `Reporting Area`,
                                    Week = `MMWR Week`,
                                    `S. paratyphi` = `Salmonella Paratyphi infection§, Current week`,
                                    `S. typhi` = `Salmonella Typhi infection¶, Current week`,
                                    `Other Salmonella` = `Salmonellosis (excluding Salmonella Paratyphi infection and Salmonella Typhoid infection)**, Current week`)

#Then changing all `NAs` to `0`
na_index <- is.na(salmonella)
salmonella[na_index] <- 0
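The base R assignment above applies to every column of the data frame. As a hedged alternative, the sketch below restricts the replacement to numeric columns, so a missing value in a text column such as Region would be left untouched. The `toy` tibble is hypothetical and only stands in for the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real `salmonella` tibble
toy <- tibble(
  Region = c("PACIFIC", NA, "MOUNTAIN"),
  `S. typhi` = c(1, NA, 2),
  `Other Salmonella` = c(NA, 10, 4)
)

# Replace NA with 0 only in numeric columns; a missing Region stays NA
toy_clean <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

toy_clean
```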

Finally, I decided to keep only the records that belong to one of the 9 Census Bureau divisions (called regions throughout this exercise). This means removing the rows for individual US states, US territories, and aggregate categories such as TOTAL and US RESIDENTS. But first I wanted to check whether any of the location names contain typos.

#First exploring the unique values of the `Region` variable using `unique()`
unique(salmonella$Region) %>% sort(decreasing = F)
 [1] "ALABAMA"                  "ALASKA"                  
 [3] "AMERICAN SAMOA"           "ARIZONA"                 
 [5] "ARKANSAS"                 "CALIFORNIA"              
 [7] "COLORADO"                 "CONNECTICUT"             
 [9] "DELAWARE"                 "DISTRICT OF COLUMBIA"    
[11] "EAST NORTH CENTRAL"       "EAST SOUTH CENTRAL"      
[13] "FLORIDA"                  "GEORGIA"                 
[15] "GUAM"                     "HAWAII"                  
[17] "IDAHO"                    "ILLINOIS"                
[19] "INDIANA"                  "IOWA"                    
[21] "KANSAS"                   "KENTUCKY"                
[23] "LOUISIANA"                "MAINE"                   
[25] "MARYLAND"                 "MASSACHUSETTS"           
[27] "MICHIGAN"                 "MIDDDLE ATLANTIC"        
[29] "MIDDLE ATLANTIC"          "MINNESOTA"               
[31] "MISSISSIPPI"              "MISSOURI"                
[33] "MONTANA"                  "MOUNTAIN"                
[35] "NEBRASKA"                 "NEVADA"                  
[37] "NEW ENGLAND"              "NEW HAMPSHIRE"           
[39] "NEW JERSEY"               "NEW MEXICO"              
[41] "NEW YORK"                 "NEW YORK CITY"           
[43] "NON-US RESIDENTS"         "NORTH CAROLINA"          
[45] "NORTH DAKOTA"             "NORTHERN MARIANA ISLANDS"
[47] "OHIO"                     "OKLAHOMA"                
[49] "OREGON"                   "PACIFIC"                 
[51] "PENNSYLVANIA"             "PUERTO RICO"             
[53] "RHODE ISLAND"             "SOUTH ATLANTIC"          
[55] "SOUTH CAROLINA"           "SOUTH DAKOTA"            
[57] "TENNESSEE"                "TEXAS"                   
[59] "TOTAL"                    "U.S. VIRGIN ISLANDS"     
[61] "US RESIDENTS"             "US TERRITORIES"          
[63] "UTAH"                     "VERMONT"                 
[65] "VIRGINIA"                 "WASHINGTON"              
[67] "WEST NORTH CENTRAL"       "WEST SOUTH CENTRAL"      
[69] "WEST VIRGINIA"            "WISCONSIN"               
[71] "WYOMING"                 

As seen above, there are two Middle Atlantic values: MIDDLE ATLANTIC and MIDDDLE ATLANTIC, so I corrected the latter one.

#Rename all `MIDDDLE ATLANTIC` observations to `MIDDLE ATLANTIC` using `mutate()` and `recode()`
salmonella <- salmonella %>% mutate(Region = recode(Region, `MIDDDLE ATLANTIC` = "MIDDLE ATLANTIC"))

#Check again if the operation worked out using the `unique()` function
unique(salmonella$Region) %>% sort(decreasing = F)
 [1] "ALABAMA"                  "ALASKA"                  
 [3] "AMERICAN SAMOA"           "ARIZONA"                 
 [5] "ARKANSAS"                 "CALIFORNIA"              
 [7] "COLORADO"                 "CONNECTICUT"             
 [9] "DELAWARE"                 "DISTRICT OF COLUMBIA"    
[11] "EAST NORTH CENTRAL"       "EAST SOUTH CENTRAL"      
[13] "FLORIDA"                  "GEORGIA"                 
[15] "GUAM"                     "HAWAII"                  
[17] "IDAHO"                    "ILLINOIS"                
[19] "INDIANA"                  "IOWA"                    
[21] "KANSAS"                   "KENTUCKY"                
[23] "LOUISIANA"                "MAINE"                   
[25] "MARYLAND"                 "MASSACHUSETTS"           
[27] "MICHIGAN"                 "MIDDLE ATLANTIC"         
[29] "MINNESOTA"                "MISSISSIPPI"             
[31] "MISSOURI"                 "MONTANA"                 
[33] "MOUNTAIN"                 "NEBRASKA"                
[35] "NEVADA"                   "NEW ENGLAND"             
[37] "NEW HAMPSHIRE"            "NEW JERSEY"              
[39] "NEW MEXICO"               "NEW YORK"                
[41] "NEW YORK CITY"            "NON-US RESIDENTS"        
[43] "NORTH CAROLINA"           "NORTH DAKOTA"            
[45] "NORTHERN MARIANA ISLANDS" "OHIO"                    
[47] "OKLAHOMA"                 "OREGON"                  
[49] "PACIFIC"                  "PENNSYLVANIA"            
[51] "PUERTO RICO"              "RHODE ISLAND"            
[53] "SOUTH ATLANTIC"           "SOUTH CAROLINA"          
[55] "SOUTH DAKOTA"             "TENNESSEE"               
[57] "TEXAS"                    "TOTAL"                   
[59] "U.S. VIRGIN ISLANDS"      "US RESIDENTS"            
[61] "US TERRITORIES"           "UTAH"                    
[63] "VERMONT"                  "VIRGINIA"                
[65] "WASHINGTON"               "WEST NORTH CENTRAL"      
[67] "WEST SOUTH CENTRAL"       "WEST VIRGINIA"           
[69] "WISCONSIN"                "WYOMING"                 
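The misspelled label was spotted by reading the list. As a rough sketch of how near-duplicate labels could be flagged automatically, the example below uses base R's `adist()` to compute edit distances and reports pairs that differ by a single character. The `regions` vector is a hypothetical subset of the labels, and the threshold of 1 is an assumption; a looser threshold would also flag legitimate pairs such as EAST NORTH CENTRAL and WEST NORTH CENTRAL.

```r
# Hypothetical subset of the Region labels, including the misspelled one
regions <- c("MIDDLE ATLANTIC", "MIDDDLE ATLANTIC", "MOUNTAIN", "PACIFIC")

# Pairwise Levenshtein (edit) distances between labels
d <- adist(regions)

# Keep each pair once (upper triangle) where the labels differ by exactly 1 edit
close <- which(d == 1 & upper.tri(d), arr.ind = TRUE)

near_duplicates <- data.frame(
  label_a = regions[close[, "row"]],
  label_b = regions[close[, "col"]],
  distance = d[close]
)

near_duplicates
```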

The operation worked, so now I filter the data to keep only the 9 US regions and then check again that the filter worked.

#Using `filter()` to keep only the 9 Census Bureau designated regions
salmonella <- filter(salmonella, Region %in% c("NEW ENGLAND", "MIDDLE ATLANTIC", "EAST NORTH CENTRAL", "WEST NORTH CENTRAL",
                                                 "SOUTH ATLANTIC", "EAST SOUTH CENTRAL", "WEST SOUTH CENTRAL", "MOUNTAIN", "PACIFIC"))

#Check again if the operation worked out using the `unique()` function
unique(salmonella$Region) %>% sort(decreasing = F)
[1] "EAST NORTH CENTRAL" "EAST SOUTH CENTRAL" "MIDDLE ATLANTIC"   
[4] "MOUNTAIN"           "NEW ENGLAND"        "PACIFIC"           
[7] "SOUTH ATLANTIC"     "WEST NORTH CENTRAL" "WEST SOUTH CENTRAL"

Exploratory and Descriptive Analysis

First, I created a dataframe salmonella_summary that summarizes the number of infections of each type of Salmonella by each region.

#First I grouped the observations using `group_by()`, and then used `summarize()` with `sum()` to create the summary of infections for each type of Salmonella by each region
salmonella_summary <- salmonella %>% group_by(Region) %>% 
 summarize(`S. paratyphi` = sum(`S. paratyphi`),
            `S. typhi` = sum(`S. typhi`),
            `Other Salmonella` = sum(`Other Salmonella`))

#View the dataframe
salmonella_summary
# A tibble: 9 × 4
  Region             `S. paratyphi` `S. typhi` `Other Salmonella`
  <chr>                       <dbl>      <dbl>              <dbl>
1 EAST NORTH CENTRAL              0          1                504
2 EAST SOUTH CENTRAL              0          0                117
3 MIDDLE ATLANTIC                 1          8                322
4 MOUNTAIN                        1          2                283
5 NEW ENGLAND                     1          0                 50
6 PACIFIC                         0          0                 51
7 SOUTH ATLANTIC                  4         15                331
8 WEST NORTH CENTRAL              0          1                224
9 WEST SOUTH CENTRAL              5          1                242
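The same grouped sum can be written without repeating `sum()` for each column. The sketch below is one possible shorthand using `across()`, run on a hypothetical `toy` tibble with the same column names as the cleaned data.

```r
library(dplyr)

# Hypothetical toy data with the same column names as `salmonella`
toy <- tibble(
  Region = c("PACIFIC", "PACIFIC", "MOUNTAIN", "MOUNTAIN"),
  Week = c(1, 2, 1, 2),
  `S. paratyphi` = c(0, 1, 0, 0),
  `S. typhi` = c(1, 0, 2, 0),
  `Other Salmonella` = c(10, 5, 7, 3)
)

# Sum the three case columns within each region in a single across() call
toy_summary <- toy %>%
  group_by(Region) %>%
  summarize(across(c(`S. paratyphi`, `S. typhi`, `Other Salmonella`), sum))

toy_summary
```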

To create a table that shows the frequency of cases by region and their percentages, I decided to transpose the data frame, creating the salmonella_summary_transp object.

#Transpose data using the `data.frame()` function to create the data frame, then using `t()` to transpose column by rows
salmonella_summary_transp <- data.frame(cbind(names(salmonella_summary), t(salmonella_summary)))

#Since this function didn't properly name the columns, I manually set them using the `colnames()` function
colnames(salmonella_summary_transp) <- c("Bacteria",
                                         "East North Central", 
                                         "East South Central",
                                         "Middle Atlantic",
                                         "Mountain",
                                         "New England",
                                         "Pacific",
                                         "South Atlantic",
                                         "West North Central",
                                         "West South Central")

#Here I also specified that the rows shouldn't be named, using the `rownames()` and then set to NULL
rownames(salmonella_summary_transp) <- NULL

#I also deleted the first row of the new data frame, since it contained the name of the columns, I did this using base R.
salmonella_summary_transp <- salmonella_summary_transp[-1,]

#View the data frame
salmonella_summary_transp
          Bacteria East North Central East South Central Middle Atlantic
2     S. paratyphi                  0                  0               1
3         S. typhi                  1                  0               8
4 Other Salmonella                504                117             322
  Mountain New England Pacific South Atlantic West North Central
2        1           1       0              4                  0
3        2           0       0             15                  1
4      283          50      51            331                224
  West South Central
2                  5
3                  1
4                242

As seen above, the problem is that all the columns are character variables, so I changed them to numeric in the following code chunk.

#Use the `mutate_at()` function and then the as.numeric statement to change all the variables, except the first one to numeric type
salmonella_summary_transp <- salmonella_summary_transp %>% mutate_at(c("East North Central", 
                                         "East South Central",
                                         "Middle Atlantic",
                                         "Mountain",
                                         "New England",
                                         "Pacific",
                                         "South Atlantic",
                                         "West North Central",
                                         "West South Central"), as.numeric)

#Using `str()` to check if the dataframe was changed
str(salmonella_summary_transp)
'data.frame':   3 obs. of  10 variables:
 $ Bacteria          : chr  "S. paratyphi" "S. typhi" "Other Salmonella"
 $ East North Central: num  0 1 504
 $ East South Central: num  0 0 117
 $ Middle Atlantic   : num  1 8 322
 $ Mountain          : num  1 2 283
 $ New England       : num  1 0 50
 $ Pacific           : num  0 0 51
 $ South Atlantic    : num  4 15 331
 $ West North Central: num  0 1 224
 $ West South Central: num  5 1 242
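The character columns arise because `t()` converts the whole data frame, including the Region text, into a single character matrix. As a hedged alternative, the sketch below reshapes with `pivot_longer()` and `pivot_wider()` from tidyr, which keeps the counts numeric and carries the region names over as column names. Only two regions are included, with counts copied from the summary shown earlier; the column names `Bacteria` and `Cases` are chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the salmonella_summary output
summary_tbl <- tibble(
  Region = c("EAST NORTH CENTRAL", "MIDDLE ATLANTIC"),
  `S. paratyphi` = c(0, 1),
  `S. typhi` = c(1, 8),
  `Other Salmonella` = c(504, 322)
)

# Long format first (one row per Region-Bacteria pair), then wide by Region
transposed <- summary_tbl %>%
  pivot_longer(-Region, names_to = "Bacteria", values_to = "Cases") %>%
  pivot_wider(names_from = Region, values_from = Cases)

str(transposed)
```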

Finally, I created a table that summarizes the frequency and percentage of cases by each type of bacteria and by region under the salmonella_freq object.

salmonella_freq <- data.frame(salmonella_summary_transp %>% 
  group_by(Bacteria) %>% #Grouping by type of bacteria
  summarize(`East North Central` = paste0(sum(`East North Central`), "(", #To sum all cases of salmonella from this region
                                          round(sum(`East North Central`)/sum(salmonella_summary_transp$`East North Central`) *100,2), #To also estimate the percentage of cases for this region (The following lines of code repeat the two steps shown here)
                                          "%)"),
            `East South Central` = paste0(sum(`East South Central`), "(",
                                          round(sum(`East South Central`)/sum(salmonella_summary_transp$`East South Central`) *100,2),
                                          "%)"),
            `Middle Atlantic` = paste0(sum(`Middle Atlantic`), "(",
                                          round(sum(`Middle Atlantic`)/sum(salmonella_summary_transp$`Middle Atlantic`) *100,2),
                                          "%)"),
            `Mountain` = paste0(sum(`Mountain`), "(",
                                          round(sum(`Mountain`)/sum(salmonella_summary_transp$`Mountain`) *100,2),
                                          "%)"),
            `New England` = paste0(sum(`New England`), "(",
                                          round(sum(`New England`)/sum(salmonella_summary_transp$`New England`) *100,2),
                                          "%)"),
            `Pacific` = paste0(sum(`Pacific`), "(",
                                          round(sum(`Pacific`)/sum(salmonella_summary_transp$`Pacific`) *100,2),
                                          "%)"),
            `South Atlantic` = paste0(sum(`South Atlantic`), "(",
                                          round(sum(`South Atlantic`)/sum(salmonella_summary_transp$`South Atlantic`) *100,2),
                                          "%)"),
            `West North Central` = paste0(sum(`West North Central`), "(",
                                          round(sum(`West North Central`)/sum(salmonella_summary_transp$`West North Central`) *100,2),
                                          "%)"),
            `West South Central` = paste0(sum(`West South Central`), "(",
                                          round(sum(`West South Central`)/sum(salmonella_summary_transp$`West South Central`) *100,2),
                                          "%)")))

#View the table
salmonella_freq
          Bacteria East.North.Central East.South.Central Middle.Atlantic
1 Other Salmonella         504(99.8%)          117(100%)     322(97.28%)
2     S. paratyphi              0(0%)              0(0%)         1(0.3%)
3         S. typhi            1(0.2%)              0(0%)        8(2.42%)
     Mountain New.England  Pacific South.Atlantic West.North.Central
1 283(98.95%)  50(98.04%) 51(100%)    331(94.57%)        224(99.56%)
2    1(0.35%)    1(1.96%)    0(0%)       4(1.14%)              0(0%)
3     2(0.7%)       0(0%)    0(0%)      15(4.29%)           1(0.44%)
  West.South.Central
1        242(97.58%)
2           5(2.02%)
3            1(0.4%)

This table shows that in every region the large majority of Salmonellosis cases (at least 94%) belong to types of Salmonella other than S. typhi or S. paratyphi.
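The count-and-percentage formatting is the same for every region, so it can also be expressed once and applied to all region columns. The following is a minimal sketch of that idea using `across()`, not the method used above. It uses two regions with counts copied from the transposed data frame, and the rows keep their input order instead of being sorted by bacteria name.

```r
library(dplyr)

# Counts copied from the salmonella_summary_transp output (two regions only)
counts <- data.frame(
  Bacteria = c("S. paratyphi", "S. typhi", "Other Salmonella"),
  `East North Central` = c(0, 1, 504),
  `Middle Atlantic` = c(1, 8, 322),
  check.names = FALSE
)

# For every region column: "count(percent of that region's total%)"
freq <- counts %>%
  mutate(across(-Bacteria, ~ paste0(.x, "(", round(.x / sum(.x) * 100, 2), "%)")))

freq
```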

To visualize how the cases of each type of Salmonella change by week, I plotted the following figures. The first figure shows the number of Salmonella paratyphi cases by week, color-coded by US region.

#Using `ggplot()` and the `geom_col()` functions to plot the cases of S. paratyphi through time
ggplot(salmonella, aes(x= Week, y= `S. paratyphi`, fill= Region))+
  geom_col()+
  labs(x= "Week", y= "No. Cases")+
  scale_x_continuous(breaks = seq(1, 21, by= 1))

This figure shows the number of Salmonella typhi cases by week, color-coded by US region.

#Using `ggplot()` and the `geom_col()` functions to plot the cases of S. typhi through time
ggplot(salmonella, aes(x= Week, y= `S. typhi`, fill= Region))+
  geom_col()+
  labs(x= "Week", y= "No. Cases")+
  scale_x_continuous(breaks = seq(1, 21, by= 1))

Finally, the next figure shows the number of cases of other types of Salmonella (the majority of all cases) by week, color-coded by US region.

#Using `ggplot()` and the `geom_col()` functions to plot the cases of all other types of Salmonellosis through time
ggplot(salmonella, aes(x= Week, y= `Other Salmonella`, fill= Region))+
  geom_col()+
  labs(x= "Week", y= "No. Cases")+
  scale_x_continuous(breaks = seq(1, 21, by= 1))+
  scale_y_continuous(breaks = seq(0, 200, by= 20))
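The three plots share the same structure, so they could also be drawn as panels of a single figure. The sketch below shows one way to do that by reshaping to long format and using `facet_wrap()`. The `toy` tibble is hypothetical, and the free y-axis per panel is a design choice made here because the case counts differ greatly between bacteria types.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the same column names as `salmonella`
toy <- tibble(
  Region = rep(c("PACIFIC", "MOUNTAIN"), each = 3),
  Week = rep(1:3, times = 2),
  `S. paratyphi` = c(0, 1, 0, 0, 0, 1),
  `S. typhi` = c(1, 0, 0, 2, 0, 0),
  `Other Salmonella` = c(10, 5, 8, 7, 3, 12)
)

# One row per Region, Week and bacteria type
toy_long <- toy %>%
  pivot_longer(c(`S. paratyphi`, `S. typhi`, `Other Salmonella`),
               names_to = "Bacteria", values_to = "Cases")

# One panel per bacteria type, each with its own y-axis scale
p <- ggplot(toy_long, aes(x = Week, y = Cases, fill = Region)) +
  geom_col() +
  facet_wrap(~ Bacteria, ncol = 1, scales = "free_y") +
  labs(x = "Week", y = "No. Cases")

p
```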

This section contributed by MUTSA NYAMURANGA

Creating Synthetic Replicate Data

# make sure the packages are installed
# Load required packages
library(here)
library(dplyr)
library(ggplot2)
library(skimr)
library(gtsummary)

Here I set a seed so that my synthetic data are reproducible, which makes it possible to assess discrepancies with the original data.

set.seed(189)
n_observations <- 189
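As a small illustration of what the seed does, the sketch below draws the same random sample twice after resetting the seed to the same value; the two draws should be identical, so the final line should return TRUE. The object names are hypothetical.

```r
# Reset the seed before each draw so both draws start from the same state
set.seed(189)
first_draw <- sample(1:21, 5, replace = TRUE)

set.seed(189)
second_draw <- sample(1:21, 5, replace = TRUE)

# Same seed, same call, same result
identical(first_draw, second_draw)
```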

Analyzing Original Data Set

Although I have viewed Erick’s code and his analysis, I would also like to understand what he looked at and how he got there. Taking a look at the data myself will help me create the correct data frame for replication.

#Skim the data structure to analyze observations and variable types
skimr::skim(salmonella)
Data summary
Name salmonella
Number of rows 189
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 7 18 0 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Week 0 1 11.00 6.07 1 6 11 16 21 ▇▆▆▆▆
S. paratyphi 0 1 0.06 0.41 0 0 0 0 5 ▇▁▁▁▁
S. typhi 0 1 0.15 0.41 0 0 0 0 2 ▇▁▁▁▁
Other Salmonella 0 1 11.24 11.17 0 1 8 18 52 ▇▃▂▁▁
#Collect distribution of variable observations
gtsummary::tbl_summary(salmonella, statistic = list(
  all_continuous() ~ "{mean}/{median}/{min}/{max}/{sd}",
  all_categorical() ~ "{n} / {N} ({p}%)"
))
Characteristic N = 189¹
Region
    EAST NORTH CENTRAL 21 / 189 (11%)
    EAST SOUTH CENTRAL 21 / 189 (11%)
    MIDDLE ATLANTIC 21 / 189 (11%)
    MOUNTAIN 21 / 189 (11%)
    NEW ENGLAND 21 / 189 (11%)
    PACIFIC 21 / 189 (11%)
    SOUTH ATLANTIC 21 / 189 (11%)
    WEST NORTH CENTRAL 21 / 189 (11%)
    WEST SOUTH CENTRAL 21 / 189 (11%)
Week 11.0/11.0/1.0/21.0/6.1
S. paratyphi
    0 181 / 189 (96%)
    1 7 / 189 (3.7%)
    5 1 / 189 (0.5%)
S. typhi
    0 165 / 189 (87%)
    1 20 / 189 (11%)
    2 4 / 189 (2.1%)
Other Salmonella 11/8/0/52/11
¹ n / N (%); Mean/Median/Minimum/Maximum/SD
#Distributions Within each variable
table(salmonella$Region)

EAST NORTH CENTRAL EAST SOUTH CENTRAL    MIDDLE ATLANTIC           MOUNTAIN 
                21                 21                 21                 21 
       NEW ENGLAND            PACIFIC     SOUTH ATLANTIC WEST NORTH CENTRAL 
                21                 21                 21                 21 
WEST SOUTH CENTRAL 
                21 
table(salmonella$Week)

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 
 9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9 
table(salmonella$`S. paratyphi`)

  0   1   5 
181   7   1 
table(salmonella$`S. typhi`)

  0   1   2 
165  20   4 
table(salmonella$`Other Salmonella`)

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
41  9  5  8  6  4 12  7  6  4  3  3  4 10  4  5  3  5  4  5  5  5  1  3  3  2 
26 27 28 29 30 31 32 34 35 36 38 39 47 48 52 
 1  3  3  1  2  1  1  1  3  1  1  1  1  1  1 

Synthesis

Here I create the synthetic data frame based on elements of Erick’s analysis.

# Create synthetic data frame similar to the original
syn_salmonella <- data.frame(

  Region = sample(c("NEW ENGLAND", "MIDDLE ATLANTIC", "EAST NORTH CENTRAL", "WEST NORTH CENTRAL",
                    "SOUTH ATLANTIC", "EAST SOUTH CENTRAL", "WEST SOUTH CENTRAL", "MOUNTAIN", "PACIFIC"),
                  n_observations, replace = TRUE),
  Week = sample(1:21, n_observations, replace = TRUE),
  `S. paratyphi` = sample(c(0, 1), n_observations, replace = TRUE, prob = c(0.8, 0.2)),
  `S. typhi` = sample(c(0, 1), n_observations, replace = TRUE, prob = c(0.9, 0.1)),  
  `Other Salmonella` = sample(c(0, 1), n_observations, replace = TRUE, prob = c(0.7, 0.3)) 
)
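The probabilities above are chosen by hand. As an alternative sketch (not the approach used in this section), the synthetic values could be drawn from the frequencies observed in the original data: the `table()` counts for S. paratyphi and S. typhi, and a negative binomial distribution for Other Salmonella whose mean (11.24) and standard deviation (11.17) are taken from the skim output. The negative binomial choice and the object name `syn_alt` are assumptions made for illustration.

```r
set.seed(189)

regions <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "EAST NORTH CENTRAL",
             "WEST NORTH CENTRAL", "SOUTH ATLANTIC", "EAST SOUTH CENTRAL",
             "WEST SOUTH CENTRAL", "MOUNTAIN", "PACIFIC")

# One row per Region-Week combination, as in the original (9 x 21 = 189 rows)
syn_alt <- expand.grid(Region = regions, Week = 1:21, stringsAsFactors = FALSE)
n <- nrow(syn_alt)

# Sample the observed values with their observed frequencies (from table())
syn_alt$`S. paratyphi` <- sample(c(0, 1, 5), n, replace = TRUE,
                                 prob = c(181, 7, 1) / 189)
syn_alt$`S. typhi` <- sample(c(0, 1, 2), n, replace = TRUE,
                             prob = c(165, 20, 4) / 189)

# Assumption: negative binomial counts matching the observed mean and sd;
# size is solved from var = mu + mu^2 / size
mu <- 11.24
size <- mu^2 / (11.17^2 - mu)
syn_alt$`Other Salmonella` <- rnbinom(n, size = size, mu = mu)

str(syn_alt)
```

Building the rows with `expand.grid()` keeps exactly 21 weekly rows per region, which matches the balanced layout seen in the original `table()` output.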

I take a look at the data structure to make sure everything has been created correctly.

str(syn_salmonella)
'data.frame':   189 obs. of  5 variables:
 $ Region          : chr  "MIDDLE ATLANTIC" "EAST NORTH CENTRAL" "EAST NORTH CENTRAL" "EAST SOUTH CENTRAL" ...
 $ Week            : int  8 19 6 1 7 16 11 10 10 14 ...
 $ S..paratyphi    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ S..typhi        : num  0 0 1 0 1 0 0 0 0 0 ...
 $ Other.Salmonella: num  0 1 0 1 1 0 1 0 0 0 ...
colnames(syn_salmonella)
[1] "Region"           "Week"             "S..paratyphi"     "S..typhi"        
[5] "Other.Salmonella"
ncol(syn_salmonella)
[1] 5
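The column names changed from `S. paratyphi` to `S..paratyphi` because `data.frame()` converts names into syntactically valid R names by default. The short sketch below shows that behaviour and how `check.names = FALSE` keeps the names as written; the object names are hypothetical.

```r
# Default behaviour: spaces in names are replaced with dots
df_default <- data.frame(`S. paratyphi` = c(0, 1))

# check.names = FALSE keeps the name exactly as typed
df_kept <- data.frame(`S. paratyphi` = c(0, 1), check.names = FALSE)

names(df_default)  # "S..paratyphi"
names(df_kept)     # "S. paratyphi"
```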

Summary Tables

I then create summary tables for the data.

# Summary table similar to original
syn_salmonella_summary <- syn_salmonella %>%
  group_by(Region) %>%
  summarize(`S..paratyphi` = sum(`S..paratyphi`, na.rm = TRUE),  # Add na.rm = TRUE if there are NA values
            `S..typhi` = sum(`S..typhi`, na.rm = TRUE),
            `Other.Salmonella` = sum(`Other.Salmonella`, na.rm = TRUE))

# Transpose summary data frame
syn_salmonella_summary_transp <- data.frame(t(syn_salmonella_summary[-1]))
colnames(syn_salmonella_summary_transp) <- syn_salmonella_summary$Region

# The columns are already numeric because Region was dropped before transposing,
# so no type conversion is needed here
# Table of frequencies and percentages (each count as a share of its row total)
syn_salmonella_freq <- syn_salmonella_summary_transp %>%
  mutate(Total = rowSums(.)) %>%
  mutate(across(-Total, ~paste0(.x, " (", round(.x / Total * 100, 2), "%)"))) %>%
  select(-Total) %>%
  rbind(as.character(colSums(syn_salmonella_summary_transp))) # Append a row of column totals
syn_salmonella_summary_transp
                 EAST NORTH CENTRAL EAST SOUTH CENTRAL MIDDLE ATLANTIC MOUNTAIN
S..paratyphi                      8                  5               4        6
S..typhi                          5                  3               0        2
Other.Salmonella                  6                  5               5        4
                 NEW ENGLAND PACIFIC SOUTH ATLANTIC WEST NORTH CENTRAL
S..paratyphi               5       5              1                  7
S..typhi                   1       2              2                  1
Other.Salmonella           5       9              4                  5
                 WEST SOUTH CENTRAL
S..paratyphi                      3
S..typhi                          2
Other.Salmonella                  7
syn_salmonella_summary
# A tibble: 9 × 4
  Region             S..paratyphi S..typhi Other.Salmonella
  <chr>                     <dbl>    <dbl>            <dbl>
1 EAST NORTH CENTRAL            8        5                6
2 EAST SOUTH CENTRAL            5        3                5
3 MIDDLE ATLANTIC               4        0                5
4 MOUNTAIN                      6        2                4
5 NEW ENGLAND                   5        1                5
6 PACIFIC                       5        2                9
7 SOUTH ATLANTIC                1        2                4
8 WEST NORTH CENTRAL            7        1                5
9 WEST SOUTH CENTRAL            3        2                7

Plotting similar to original

Finally, I create plots similar to the plots made by Erick in his analysis.

# Plot for S. paratyphi cases by week and region
ggplot(syn_salmonella, aes(x = Week, y = `S..paratyphi`, fill = Region)) +
  geom_col() +
  labs(x = "Week", y = "No. Cases") +
  scale_x_continuous(breaks = seq(1, 21, by = 1))

# Plot for S. typhi cases by week and region
ggplot(syn_salmonella, aes(x = Week, y = `S..typhi`, fill = Region)) +
  geom_col() +
  labs(x = "Week", y = "No. Cases") +
  scale_x_continuous(breaks = seq(1, 21, by = 1))

# Plot for Other Salmonella cases by week and region
ggplot(syn_salmonella, aes(x = Week, y = `Other.Salmonella`, fill = Region)) +
  geom_col() +
  labs(x = "Week", y = "No. Cases") +
  scale_x_continuous(breaks = seq(1, 21, by = 1)) +
  scale_y_continuous(breaks = seq(0, 200, by = 20))

Data Comparison

I believe that the data are quite similar in terms of volume (both have 189 observations across the same 9 regions), but the differences come in the distribution across weeks. The synthetic data cannot replicate the original on a week-to-week basis unless that is specified, but in that case we would essentially be copying and pasting the original data.
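To make the comparison more concrete, the regional totals from the two summary tables can be placed side by side. The sketch below does this for Other Salmonella only, with values copied from the salmonella_summary and syn_salmonella_summary outputs; the column names are chosen for illustration.

```r
# Regional totals copied from salmonella_summary and syn_salmonella_summary
comparison <- data.frame(
  Region = c("EAST NORTH CENTRAL", "EAST SOUTH CENTRAL", "MIDDLE ATLANTIC",
             "MOUNTAIN", "NEW ENGLAND", "PACIFIC", "SOUTH ATLANTIC",
             "WEST NORTH CENTRAL", "WEST SOUTH CENTRAL"),
  original_other = c(504, 117, 322, 283, 50, 51, 331, 224, 242),
  synthetic_other = c(6, 5, 5, 4, 5, 9, 4, 5, 7)
)

# Difference per region and overall totals for each data set
comparison$difference <- comparison$synthetic_other - comparison$original_other
totals <- colSums(comparison[, c("original_other", "synthetic_other")])

comparison
totals
```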