January 2016

"Big" thoughts

  • Big data is a subset of multidimensional data. Both contribute to telling a compelling story with your data, but neither is, on its own, a necessary ingredient.
  • Some of the best and most useful data visualizations are the ones that we make for ourselves.

John Tukey

"The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey, 1977

Example: boxplots
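
A small sketch, on a made-up vector rather than the talk's data, of how a boxplot surfaces the unexpected. Base R's `boxplot.stats()` returns the hinges and whisker ends, and reports any point more than 1.5 hinge-spreads beyond the hinges as an outlier.

```r
## Illustrative values only (an assumption, not from the NHANES data)
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 40)

stats <- boxplot.stats(x)
stats$stats  ## lower whisker, lower hinge, median, upper hinge, upper whisker: 2 4 6 8 9
stats$out    ## points beyond the whiskers: 40

## The plot draws the same summary and marks 40 as a separate point
boxplot(x, horizontal = TRUE)
```

The summary numbers alone are easy to skim past; the isolated point in the picture is not.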

Data for this talk

library(NHANES)
data(NHANES)
set.seed(123) ## why set the seed?
str(NHANES)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10000 obs. of  76 variables:
##  $ ID              : int  51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int  34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int  409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int  30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num  1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int  6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num  87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num  165 165 165 105 168 ...
##  $ BMI             : num  32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int  70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int  113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int  85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int  114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int  88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int  114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int  88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int  112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int  82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num  1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num  3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int  352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num  NA NA NA NA 0.094 ...
##  $ UrineVol2       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int  0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int  15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int  NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int  NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int  NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int  4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int  NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int  NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int  NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int  NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int  0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int  18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int  17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int  16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int  8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int  1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...

Modified data

library(dplyr)
NHANES_ltd <- select(NHANES[sample(nrow(NHANES), 500),],  ## subset for lighter-weight figures
                     Age, Gender, Education, HHIncomeMid, Height, BMI_WHO, SexAge, AgeFirstMarij) %>%
  mutate(Education = as.ordered(Education),
         BMI_WHO = as.ordered(BMI_WHO))
str(NHANES_ltd)
## Classes 'tbl_df', 'tbl' and 'data.frame':    500 obs. of  8 variables:
##  $ Age          : int  67 21 23 21 13 80 5 3 10 30 ...
##  $ Gender       : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 1 2 1 ...
##  $ Education    : Ord.factor w/ 5 levels "8th Grade"<"9 - 11th Grade"<..: 1 4 3 3 NA 3 NA NA NA 5 ...
##  $ HHIncomeMid  : int  17500 NA 100000 22500 7500 22500 12500 50000 70000 100000 ...
##  $ Height       : num  169 184 142 176 162 ...
##  $ BMI_WHO      : Ord.factor w/ 4 levels "12.0_18.5"<"18.5_to_24.9"<..: 2 2 3 2 2 2 1 1 2 4 ...
##  $ SexAge       : int  17 NA NA NA NA NA NA NA NA 18 ...
##  $ AgeFirstMarij: int  NA NA NA NA NA NA NA NA NA NA ...
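
One answer to the `## why set the seed?` comment: `sample()` draws random rows, so without a fixed seed every run of the slides would produce a different 500-row subset and different figures. A minimal sketch of the effect, using row indices only:

```r
set.seed(123)
first <- sample(10000, 5)   ## five row indices out of 10000

set.seed(123)               ## reset the generator to the same state
second <- sample(10000, 5)

identical(first, second)    ## TRUE: same seed, same draw

third <- sample(10000, 5)   ## no reseed, so the generator has moved on
```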

Outline

Multivariate plots

  • CSV fingerprint
  • Pairs plot options
  • Table plot
  • Parallel plots

Lower-variate plots

  • Faceting
  • Heat maps, contour plots
  • Graphical inference
  • Scatter plots with smooths, marginal histograms

Overview of your data

CSV fingerprint

Standard pairs plot / scatterplot matrix

plot(NHANES_ltd)

select(NHANES_ltd, Age, Height, SexAge, AgeFirstMarij) %>% 
  pairs()
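
Base `pairs()` also accepts custom panel functions. The sketch below is one possible variation, not the talk's own code: it draws a fresh 500-row sample, puts a smoother in the lower triangle with the built-in `panel.smooth`, and prints the correlation in the upper triangle. `panel_cor` is a hypothetical helper written for this example.

```r
library(NHANES)
data(NHANES)

set.seed(123)
num_vars <- as.data.frame(NHANES)[sample(nrow(NHANES), 500),
                                  c("Age", "Height", "SexAge", "AgeFirstMarij")]

## Hypothetical helper (not from the talk): print the pairwise correlation,
## using only the rows where both variables are observed
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))  ## unit coordinates inside the panel
  on.exit(par(old))                ## restore the panel's coordinates afterwards
  r <- cor(x, y, use = "pairwise.complete.obs")
  text(0.5, 0.5, sprintf("r = %.2f", r))
}

pairs(num_vars,
      lower.panel = panel.smooth,  ## scatter plot with a lowess smooth
      upper.panel = panel_cor)
```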

Generalized pairs plot

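The standard pairs plot is useful on its own, but it is designed for quantitative variables. The generalized pairs plot extends it to mixed quantitative and categorical data, choosing a suitable panel type for each pairing.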

Emerson, J. W., Green, W. A., Schloerke, B., Crowley, J., Cook, D., Hofmann, H., and Wickham, H. (2013). The generalized pairs plot. Journal of Computational and Graphical Statistics, 22(1):79–91.

http://bit.ly/gpairs

library(ggplot2)
library(GGally)
print(select(NHANES_ltd, Age, Gender, Height, SexAge, AgeFirstMarij) %>% 
  ggpairs())

Tableplots

Tennekes, M., de Jonge, E., and Daas, P. J. H. (2013). Visualizing and inspecting large datasets with tableplots. Journal of Data Science, 11:43–58.

http://bit.ly/tabplot

library(tabplot)
NHANES_ltd2 <- select(NHANES, 
                     Age, Education, HHIncomeMid, Height, BMI_WHO, SexAge, AgeFirstMarij) %>%
  mutate(Education = as.ordered(Education),
         BMI_WHO = as.ordered(BMI_WHO))
tableplot(NHANES_ltd2, sortCol=Age)

tableplot(NHANES_ltd2, sortCol=BMI_WHO)

tableplot(NHANES_ltd2, sortCol=Education)

Graphical inference

Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16(6).

http://bit.ly/graphical_inference

Do college graduates become sexually active later than otherwise similar individuals with less than a college education?

Can you see the difference?

library(nullabor)
## Hide the real data among 19 copies with SexAge permuted, one panel per sample
qplot(Education, SexAge, data=NHANES_ltd) %+% lineup(null_permute('SexAge'), NHANES_ltd) +
  facet_wrap(~.sample) + geom_boxplot() + theme(axis.text.x  = element_text(angle=90, vjust=0.5))
## decrypt("OlCE bQTQ Aw GWPATAWw vr")

decrypt("OlCE bQTQ Aw GWPATAWw vr")
## [1] "True data in position 20"
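
What the lineup does can be sketched by hand in base R. This is an illustration on simulated data with hypothetical names, not nullabor's implementation: permuting the response breaks any link with the grouping variable while keeping its marginal distribution, so the permuted panels show what "no difference" looks like.

```r
set.seed(2016)

## Simulated stand-in data (an assumption): group B is shifted up by 1
real <- data.frame(group = factor(rep(c("A", "B"), each = 40)),
                   y     = c(rnorm(40, mean = 0), rnorm(40, mean = 1)))

n_panels <- 20
true_pos <- sample(n_panels, 1)  ## where the real data hides

## Every other panel gets the same y values, shuffled across groups
panels <- lapply(seq_len(n_panels), function(i) {
  if (i == true_pos) real else transform(real, y = sample(y))
})

op <- par(mfrow = c(4, 5), mar = c(2, 2, 2, 1))
for (i in seq_len(n_panels)) {
  boxplot(y ~ group, data = panels[[i]], main = i)
}
par(op)

true_pos  ## reveal the answer only after picking a panel
```

If viewers can pick the real panel out of the 20 more often than chance would allow, the difference between groups is visible and not just an artifact of looking for one.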

Thank you!