[R for Data Science] 3 Data visualization

library(tidyverse)

# Do cars with big engines use more fuel than cars with small engines? 
# the relationship between engine size and fuel efficiency 
# mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

mpg
# displ, a car’s engine size, in litres.
# hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg)

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))

# The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel.
# mapping - This defines how variables in your dataset are mapped to visual properties.
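
# A rough numeric check of the trend seen in the scatterplot (a sketch, not part of the book's code): if bigger engines go with lower highway mileage, the correlation between displ and hwy should be negative. The name displ_hwy_cor is only an illustrative variable.

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine size and highway fuel efficiency
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
print(displ_hwy_cor)
```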


# ggplot(data = <DATA>) + 
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# 3.2.4 Exercises
# 1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
# An empty plot: ggplot() creates the canvas, but no layers have been added yet.
# 2. How many rows are in mpg? How many columns?
str(mpg) 
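
# str() reports the dimensions along with the column summaries. As a more direct alternative, the counts can be read off with nrow() and ncol(); for mpg this should give 234 rows and 11 columns. The names n_rows and n_cols are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations
n_cols <- ncol(mpg)  # number of variables
cat("rows:", n_rows, "columns:", n_cols, "\n")
```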
# 3. What does the drv variable describe? Read the help for ?mpg to find out.
?mpg
# the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd

# 4. Make a scatterplot of hwy vs cyl.
# "hwy vs cyl" puts hwy on the y-axis and cyl on the x-axis.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cyl, y = hwy))

# 5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) +
    geom_point(mapping = aes(x = class, y = drv))
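# Not useful, since both variables are categorical: many observations fall on the same point, so the plot cannot show how many cars belong to each combination.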
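
# To make the overplotting concrete, here is a small sketch (not from the book) that tabulates how many cars share each class/drv combination, then uses geom_count() as one possible alternative that sizes each point by the number of observations. combo_counts, n_visible_points and p_count are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Number of observations for each combination of class and drv
combo_counts <- table(class = mpg$class, drv = mpg$drv)
print(combo_counts)

# The scatterplot can show at most one visible point per non-empty combination
n_visible_points <- sum(combo_counts > 0)
cat(nrow(mpg), "observations collapse onto", n_visible_points, "points\n")

# geom_count() maps the number of overlapping observations to point size
p_count <- ggplot(data = mpg, mapping = aes(x = class, y = drv)) +
    geom_count()
```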

# 3.3 Aesthetic mappings

# An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. Since the word "value" already describes data, let's use the word “level” to describe aesthetic properties.

ggplot(data = mpg)+
    geom_point(mapping = aes(x = displ, y = hwy, color = class))

# ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.

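# Mapping an unordered variable (class) to an ordered aesthetic (size) triggers a warning.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, size = class))

# alpha controls the transparency of the points.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# shape: ggplot2 only uses six shapes at a time by default, so the seventh class (suv) goes unplotted.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, shape = class))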

# When you simply want to make all the points blue

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# If the color is placed inside aes(), the points are not drawn in that color; the value is treated as a variable instead.

# 3.3.1 Exercises
# 1. What’s gone wrong with this code? Why are the points not blue?

# The argument color = "blue" is included within the mapping argument, so it is treated as an aesthetic, that is, a mapping between a variable and a visual property. The string "blue" is interpreted as a categorical variable with a single value, and ggplot2 picks a colour for it from its default scale.
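
# The exercise's code is not reproduced in these notes. The sketch below is a reconstruction, assuming the exercise puts color = "blue" inside aes(), and compares it with the corrected version. layer_data() returns the data ggplot2 computes for a layer, so the colours actually used can be inspected without drawing anything. p_mapped and p_set are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# "blue" inside aes(): treated as a variable with the single value "blue"
p_mapped <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): sets the colour of every point manually
p_set <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Colours that ggplot2 computes for the points in each plot
print(unique(layer_data(p_mapped)$colour))
print(unique(layer_data(p_set)$colour))
```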
# 2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

# Those with <chr> above their columns are categorical, while those with <dbl> or <int> are continuous.
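
# The same information can be pulled out programmatically. This is a minimal sketch that follows the rule above (character columns as categorical, numeric and integer columns as continuous); col_types, categorical_cols and continuous_cols are illustrative names. Note that integer columns such as cyl or year could reasonably be treated as categorical too.

```r
library(ggplot2)  # provides the mpg dataset

# Storage class of every column in mpg
col_types <- sapply(mpg, class)
print(col_types)

categorical_cols <- names(col_types)[col_types == "character"]
continuous_cols <- names(col_types)[col_types %in% c("numeric", "integer")]
```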

# 3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

# When a continuous value is mapped to shape, it gives an error. Though we could split a continuous variable into discrete categories and use a shape aesthetic, this would conceptually not make sense. A numeric variable has an order, but shapes do not. It is clear that smaller points correspond to smaller values, or once the color scale is given, which colors correspond to larger or smaller values. But it is not clear whether a square is greater or less than a circle.

# 4. What happens if you map the same variable to multiple aesthetics?

# It works, but the information is redundant, so in most cases avoid mapping a single variable to multiple aesthetics.

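# 5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

# stroke controls the width of the border for shapes that have one, such as shape 21, whose inside (fill) and outside (colour) can be set separately.
ggplot(mtcars, aes(wt, mpg)) +
    geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)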

# 6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
    geom_point()

# Aesthetics can also be mapped to expressions like displ < 5. In this case, the result of displ < 5 is a logical variable which takes values of TRUE or FALSE.


# 3.4 Common problems

# One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of a line, not at the start.

# 3.5 Facets
# One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))+
    facet_wrap(~ class, nrow = 2)

# To facet your plot on the combination of two variables, add facet_grid() to your plot call

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ cyl)

# 3.5.1 Exercises
# 1. What happens if you facet on a continuous variable?

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_grid(. ~ cty)

# The continuous variable is converted to a categorical variable, and the plot contains a facet for each distinct value
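
# A quick way to confirm this (a sketch, not from the book): count the distinct values of cty and compare that with the number of panels ggplot2 actually builds. layer_data() exposes a PANEL column that identifies the facet of each point. p_cty, n_distinct_cty and n_panels are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

p_cty <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_grid(. ~ cty)

# One facet is created per distinct value of the faceting variable
n_distinct_cty <- length(unique(mpg$cty))
n_panels <- length(unique(layer_data(p_cty)$PANEL))
cat("distinct cty values:", n_distinct_cty, "panels:", n_panels, "\n")
```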

# 2. What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

# The empty cells (facets) in this plot are combinations of drv and cyl that have no observations. These are the same locations in the scatter plot of drv and cyl that have no points.
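
# The empty combinations can also be listed directly. This is a small sketch using a cross-tabulation of drv and cyl: cells with a count of zero correspond to the empty facets. drv_cyl_counts and empty_cells are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Number of cars for each combination of drive train and cylinder count
drv_cyl_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
print(drv_cyl_counts)

# Row/column positions of the combinations with no observations
empty_cells <- which(drv_cyl_counts == 0, arr.ind = TRUE)
print(empty_cells)
```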

# 3. What plots does the following code make? What does . do?
# The symbol . ignores that dimension when faceting. For example, drv ~ . facets by the values of drv in rows (along the y-axis), while . ~ cyl facets by the values of cyl in columns (along the x-axis).
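
# The exercise's code is not included in these notes. The sketch below is a reconstruction, assuming the two plots facet the displ/hwy scatterplot with drv ~ . and . ~ cyl, and counts the resulting panels. p_rows, p_cols and count_panels are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# drv ~ . : one row of panels per value of drv, no column variable
p_rows <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ .)

# . ~ cyl : one column of panels per value of cyl, no row variable
p_cols <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(. ~ cyl)

# Helper (hypothetical name): number of panels that contain data
count_panels <- function(p) length(unique(layer_data(p)$PANEL))
cat("drv ~ . panels:", count_panels(p_rows), "\n")
cat(". ~ cyl panels:", count_panels(p_cols), "\n")
```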

# 4. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) + 
    facet_wrap(~ class, nrow = 2)

# The advantage is the ability to encode more distinct categories: each class gets its own panel instead of a colour that has to be told apart from the others.
# The disadvantage is the difficulty of comparing values between categories, since the observations for each category are on different plots.

# 5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?

# The nrow and ncol arguments set the number of rows and columns used to lay out the facets. They are needed because facet_wrap() only facets on one variable.

# The nrow and ncol arguments are unnecessary for facet_grid() since the number of unique values of the variables specified in the function determines the number of rows and columns.

# 6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

# There will be more space for columns if the plot is laid out horizontally (landscape).

# 3.6 Geometric objects

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))

# A different line type for each value of a variable
ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
    geom_smooth(
        mapping = aes(x = displ, y = hwy, color = drv),
        show.legend = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)


# 3.6.1 Exercises
# 1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

# line chart: geom_line()
# boxplot: geom_boxplot()
# histogram: geom_histogram()
# area chart: geom_area()

# 2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
    geom_point() +
    geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'


# 3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
# show.legend = FALSE hides the legend for that layer; removing it brings the legend back.
ggplot(data = mpg) +
    geom_smooth(
        mapping = aes(x = displ, y = hwy, colour = drv),
        show.legend = FALSE
    )
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'


# 4. What does the se argument to geom_smooth() do?

# It controls the confidence band around each smoothed line: se = TRUE (the default) draws it, se = FALSE hides it.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
    geom_point() +
    geom_smooth(se = TRUE)

# 5. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_point() + 
    geom_smooth()

ggplot() + 
    geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
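
# One way to check the answer without relying on the eye (a sketch, not from the book): build both plots and compare the data ggplot2 computes for each layer. If the two specifications are equivalent, the point layers and the smooth layers should match. p_global, p_local, same_points and same_smooth are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Data and mapping declared once, inherited by both layers
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth()

# Same data and mapping repeated inside each layer
p_local <- ggplot() +
    geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data of layer 1 (points) and layer 2 (smooth)
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
cat("points match:", same_points, "smooth matches:", same_smooth, "\n")
```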


# 6. Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth(col = "blue", se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth(mapping = aes(group = drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(aes(col = drv)) +
    geom_smooth(aes(col = drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(aes(col = drv)) +
    geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(aes(col = drv)) +
    geom_smooth(aes(linetype = drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(size = 4, col = "white") +
    geom_point(aes(col = drv))


# 3.7 Statistical transformations
ggplot(data = diamonds) +
    geom_bar(mapping = aes( x = cut ))


ggplot(data = diamonds) +
    stat_count(mapping = aes( x = cut ))

ggplot(data = diamonds) + 
    stat_summary(
        mapping = aes(x = cut, y = depth),
        fun.min = min,
        fun.max = max,
        fun = median
    )


# 3.7.1 Exercises

# What does geom_col() do? How is it different to geom_bar()?

# The geom_col() function has different default stat than geom_bar(). The default stat of geom_col() is stat_identity(), which leaves the data as is. The geom_col() function expects that the data contains x values and y values which represent the bar height.
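
# To illustrate the difference, here is a minimal sketch (not from the book): geom_bar() counts the rows of diamonds itself, while geom_col() is given counts computed beforehand with table(). The bar heights should agree. cut_counts, p_bar, p_col, bar_heights and col_heights are illustrative names; Freq is the column name that as.data.frame() gives to table counts.

```r
library(ggplot2)  # provides the diamonds dataset

# Pre-computed counts: one row per cut, with the count in column Freq
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# geom_bar(): stat_count() computes the bar heights from the raw data
p_bar <- ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))

# geom_col(): stat_identity() uses the y values exactly as supplied
p_col <- ggplot(data = cut_counts) +
    geom_col(mapping = aes(x = cut, y = Freq))

bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
print(rbind(geom_bar = bar_heights, geom_col = col_heights))
```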


# In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

# If group = 1 is not included, then all the bars in the plot will have the same height, a height of 1, because the proportions are calculated within each cut rather than across all cuts.
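
# The effect can be seen in the computed bar heights. This is a sketch assuming ggplot2 3.3.0 or later, where after_stat(prop) replaces the older ..prop.. notation; p_no_group, p_grouped, props_no_group and props_grouped are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset

# Without group = 1: each cut is its own group, so every proportion is 1
p_no_group <- ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1: all cuts share one group, so proportions are of the total
p_grouped <- ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

props_no_group <- layer_data(p_no_group)$y
props_grouped <- layer_data(p_grouped)$y
print(props_no_group)
print(props_grouped)
```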

# 3.8 Position adjustments

ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, col = cut))
ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = cut))

# If you map the fill aesthetic to another variable, like clarity, the bars are automatically stacked.

ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity))

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
    geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

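# In the scatterplot of displ and hwy, the values are rounded, so the points fall on a grid and many of them overlap (overplotting). You can avoid this gridding by setting the position adjustment to “jitter”. position = "jitter" adds a small amount of random noise to each point.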

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
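
# For more control over the noise, position_jitter() can be used in place of the "jitter" shorthand. The sketch below picks arbitrary width, height and seed values for illustration (they are not from the book) and measures how far the points were moved. p_jitter, jittered, max_x_shift and max_y_shift are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# width/height bound the noise in each direction; seed makes it reproducible
p_jitter <- ggplot(data = mpg) +
    geom_point(
        mapping = aes(x = displ, y = hwy),
        position = position_jitter(width = 0.1, height = 0.5, seed = 42)
    )

# Positions after the jitter has been applied
jittered <- layer_data(p_jitter)

# Largest displacement from the original values in each direction
max_x_shift <- max(abs(jittered$x - mpg$displ))
max_y_shift <- max(abs(jittered$y - mpg$hwy))
cat("max x shift:", max_x_shift, "max y shift:", max_y_shift, "\n")
```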

# 3.9 Coordinate systems

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
    geom_boxplot()

# coord_flip() switches the x and y axes.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
    geom_boxplot() +
    coord_flip()


bar <- ggplot(data = diamonds) + 
    geom_bar(
        mapping = aes(x = cut, fill = cut), 
        show.legend = FALSE,
        width = 1
    ) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()

# 3.10 The layered grammar of graphics

# ggplot(data = <DATA>) + 
#     <GEOM_FUNCTION>(
#         mapping = aes(<MAPPINGS>),
#         stat = <STAT>, 
#         position = <POSITION>
#     ) +
#     <COORDINATE_FUNCTION> +
#     <FACET_FUNCTION>