[Machine Learning with R] Managing and Understanding Data Part.2

Exploring numeric variables

The summary() function displays several common summary statistics.

> summary(usedcars$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2000    2008    2009    2009    2010    2012

- We can infer that the year variable indicates the year of manufacture rather than the year the advertisement was posted, since we know the vehicles were recently listed for sale.

> summary(usedcars[c("price","mileage")])
     price          mileage      
 Min.   : 3800   Min.   :  4867  
 1st Qu.:10995   1st Qu.: 27200  
 Median :13592   Median : 36385  
 Mean   :12962   Mean   : 44261  
 3rd Qu.:14904   3rd Qu.: 55125  
 Max.   :21992   Max.   :151479

- The summary statistics can be divided into two types: measures of center and measures of spread.

Measuring the central tendency - mean and median

- In statistics, the average is also known as the mean, a measurement defined as the sum of all values divided by the number of values.

> (36000 + 44000 + 56000)/ 3
[1] 45333.33
> mean(c(36000 , 44000 , 56000))
[1] 45333.33

- The mean price is 12962 and the mean mileage is 44261. What does this tell us about our data? Since the average price is relatively low, we might expect that the data includes economy-class cars. Of course, the data could also include late-model luxury cars with high mileage, but the relatively low mean mileage doesn't provide evidence to support this hypothesis. On the other hand, it doesn't provide evidence to rule out the possibility either.

- Another commonly used measure of central tendency is the median, which is the value that occurs halfway through an ordered list of values.

> median(c(36000 , 44000 , 56000))
[1] 44000
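
- When the list has an even number of values there is no single middle value, and R's median() averages the two middle values instead. A minimal sketch, reusing the three values above plus one hypothetical fourth value (58000 is an assumption added for illustration):

```r
# Three values from the example above plus one hypothetical value (58000)
values_even <- c(36000, 44000, 56000, 58000)

# With an even count, median() averages the two middle values (44000 and 56000)
median_even <- median(values_even)
print(median_even)  # 50000
```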

Measuring spread - quartiles and the five-number summary

- The mean and median describe only the central tendency, not the diversity of the measurements. To examine the spread of the data, we need to know how tightly or loosely the values are spaced. Knowing about the spread provides a sense of the data's highs and lows, and whether most values are like or unlike the mean and median.

The five-number summary
- It is a set of five statistics that roughly depict the spread of a dataset.

1. Minimum
2. First quartile, or Q1
3. Median, or Q2
4. Third quartile, or Q3
5. Maximum

- The span between the min and max value is known as the range.

> range(usedcars$price)
[1]  3800 21992
> diff(range(usedcars$price))
[1] 18192

- The middle 50 percent of the data, between Q1 and Q3, is of particular interest because it is itself a simple measure of spread. The difference between Q1 and Q3 is known as the interquartile range (IQR).

> IQR(usedcars$price)
[1] 3909.5

> quantile(usedcars$price)
     0%     25%     50%     75%    100% 
 3800.0 10995.0 13591.5 14904.5 21992.0
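
- The IQR reported above is simply Q3 minus Q1 from the quantile() output (14904.5 - 10995.0 = 3909.5). A minimal sketch of the same relationship, using a small hypothetical price vector rather than the usedcars data:

```r
# Hypothetical prices (not from the usedcars data)
price <- c(5000, 9000, 11000, 13000, 14000, 15000, 22000)

# quantile() with no probs argument returns the five-number summary
q <- quantile(price)
print(q)

# IQR computed by hand as Q3 - Q1, then compared with the built-in IQR()
iqr_manual <- unname(q["75%"] - q["25%"])
print(iqr_manual)   # 4500
print(IQR(price))   # 4500

# Range computed as maximum - minimum
price_range <- diff(range(price))
print(price_range)  # 17000
```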

- If we specify an additional probs parameter using a vector denoting cut points, we can obtain arbitrary quantiles, such as the 1st and 99th percentiles.

> quantile(usedcars$price, probs = c(0.01, 0.99))
      1%      99% 
 5428.69 20505.00 
> quantile(usedcars$price, seq(from = 0 , to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100% 
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0

- For mileage, the difference between Q3 and the maximum (151479 - 55125 = 96354) is far greater than that between the minimum and Q1 (27200 - 4867 = 22333). In other words, the larger values are far more spread out than the smaller values.

- This finding explains why the mean mileage (44261) is much greater than the median (36385). Because the mean is sensitive to extreme values, it is pulled higher, while the median stays in relatively the same place.
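
- The pull of extreme values on the mean can be illustrated with a small right-skewed vector. The mileage values below are hypothetical, not taken from usedcars:

```r
# Hypothetical mileage values with one extreme value on the high end
mileage <- c(20000, 25000, 30000, 35000, 150000)

mean_mileage <- mean(mileage)      # pulled upward by the single extreme value
median_mileage <- median(mileage)  # middle value of the ordered data

print(mean_mileage)    # 52000
print(median_mileage)  # 30000
```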

Visualizing numeric variables - boxplots

- A common visualization of the five-number summary is a boxplot or box-and-whiskers plot.

- The boxplot displays the center and spread of a numeric variable in a format that allows you to quickly obtain a sense of the range and skew of a variable, or compare it to other variables.

> boxplot(usedcars$price, main = "Boxplot of Used Car Prices",
+         ylab = "Price ($)")
> boxplot(usedcars$mileage, main = "Boxplot of Used Car Mileage",
+         ylab = "Odometer (mi.)")

- The horizontal lines forming the box in the middle of each figure represent Q1, Q2 (the median), and Q3 when reading the plot from bottom to top. The median is denoted by the dark line.
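
- The values that boxplot() draws can also be inspected numerically with boxplot.stats(). A rough sketch on a hypothetical vector (not the usedcars data). Note that, by R's default, the whiskers stop at the most extreme value within 1.5 times the box height, and points beyond that are returned separately as outliers, so the whisker ends are not always the minimum and maximum:

```r
# Hypothetical values with one extreme point (50)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

# stats: lower whisker, lower edge of box, median, upper edge of box, upper whisker
print(bs$stats)  # 1.0 3.0 5.5 8.0 9.0

# out: points drawn individually beyond the whiskers
print(bs$out)    # 50
```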

Visualizing numeric variables - histograms

- A histogram is another way to graphically depict the spread of a numeric variable. It is similar to a boxplot in that it divides the variable's values into a predefined number of portions, or bins, that act as containers for values.

> hist(usedcars$price, main = "Histogram of Used Car Prices",
+      xlab = "Price ($)")
> hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
+      xlab = "Odometer (mi.)")

- The heights indicate the count, or frequency, of values falling within each of the equally-sized bins partitioning the values.
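
- The counts behind the bars can be inspected without drawing anything by passing plot = FALSE to hist(). A minimal sketch, assuming a hypothetical price vector and hand-picked bin edges of 5000 (neither comes from the usedcars data):

```r
# Hypothetical prices (not from the usedcars data)
price <- c(4500, 7200, 9800, 10500, 11900, 12400, 13100, 13800, 14600, 21000)

# plot = FALSE returns the bin information instead of drawing the figure
h <- hist(price, breaks = seq(from = 0, to = 25000, by = 5000), plot = FALSE)

print(h$breaks)  # bin edges: 0 5000 10000 15000 20000 25000
print(h$counts)  # bar heights: 1 2 6 0 1
```

- Each count is the height of one bar, and the counts sum to the number of values.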

Measuring spread - variance and standard deviation

- The spread is measured by a statistic called the standard deviation. In order to calculate the standard deviation, we must first obtain the variance, which is defined as the average of the squared difference between each value and the mean value.

> var(usedcars$price)
[1] 9749892
> sd(usedcars$price)
[1] 3122.482
> var(usedcars$mileage)
[1] 728033954
> sd(usedcars$mileage)
[1] 26982.1

- When interpreting the variance, larger numbers indicate that the data are spread more widely around the mean. The standard deviation indicates, on average, how much each value differs from the mean.
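
- One detail worth knowing is that R's var() and sd() use the sample formulas, which divide by n - 1 rather than n. A small sketch that computes both versions by hand, reusing the three values from the mean example above:

```r
# Three values from the mean example
x <- c(36000, 44000, 56000)
n <- length(x)

# Squared difference between each value and the mean
squared_diff <- (x - mean(x))^2

var_population <- sum(squared_diff) / n        # plain average of squared differences
var_sample     <- sum(squared_diff) / (n - 1)  # sample variance, as used by var()

print(var_population)
print(var_sample)
print(var(x))            # matches var_sample
print(sqrt(var_sample))  # matches sd(x)
print(sd(x))
```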

Exploring categorical variables

- In contrast to numeric data, categorical data is examined using tables rather than summary statistics. A table that presents a single categorical variable is known as a one-way table.

> table(usedcars$year)

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
   3    1    1    1    3    2    6   11   14   42   49   16    1 
> table(usedcars$model)

 SE SEL SES 
 78  23  49 
> table(usedcars$color)

 Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
    35     17      1     16      5     25     32     16      3

> model_table <- table(usedcars$model)
> prop.table(model_table)

       SE       SEL       SES 
0.5200000 0.1533333 0.3266667

> color_table <- table(usedcars$color)
> color_pct <- prop.table(color_table) * 100
> round(color_pct, digits = 1)

 Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
  23.3   11.3    0.7   10.7    3.3   16.7   21.3   10.7    2.0

Visualizing relationships - scatterplots

- A scatterplot is a diagram that visualizes a bivariate relationship. Patterns in the placement of dots reveal underlying associations between the two features.

- To use plot(), we need to specify x and y vectors containing the values used to position the dots on the figure. Convention dictates that the y variable is the one that is presumed to depend on the other (the dependent variable).

- Our hypothesis is that price depends on the odometer mileage. Therefore, we will use price as the y, or dependent, variable.

> plot(x = usedcars$mileage, y = usedcars$price,
+      main = "Scatterplot of Price vs. Mileage",
+      xlab = "Used Car Odometer (mi.)",
+      ylab = "Used Car Price ($)")

- To read the plot, examine how values of the y axis variable change as the values on the x axis increase.

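- On the right side of the plot, we can find outliers that have both high mileage and a high price, which may be evidence that luxury cars are also included. We can also guess that the cars priced above $20,000 with low mileage are relatively new.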

- The relationship between price and mileage is known as a negative association, since price tends to decrease as mileage increases. The strength of a linear association between two variables is measured by a statistic known as correlation.
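
- In R, the correlation can be computed with cor(), which returns the Pearson correlation by default. A minimal sketch on hypothetical mileage and price pairs (not the usedcars data) in which price falls as mileage rises:

```r
# Hypothetical mileage/price pairs (not from the usedcars data)
mileage <- c(10000, 30000, 50000, 70000, 90000)
price   <- c(18000, 15500, 14000, 11000, 9500)

# Pearson correlation: values near -1 indicate a strong negative linear association
r <- cor(mileage, price)
print(r)
```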

Examining relationships - two-way cross-tabulations

- To examine a relationship between two nominal variables, a two-way cross-tabulation is used (also known as a crosstab or a contingency table). It allows you to examine how the values of one variable vary by the values of another.

> install.packages("gmodels")
> library(gmodels)

> usedcars$conservative <- usedcars$color %in% c("Black","Gray","Silver","White")
> usedcars$conservative
  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [13] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [25] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
 [49]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
 [73]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [85]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [97] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
[109] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
[121]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[133]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
[145]  TRUE  TRUE FALSE FALSE FALSE FALSE

- The code above creates a binary indicator (often called a dummy variable) indicating whether or not the car's color is conservative by our definition.

- The %in% operator returns TRUE or FALSE for each value in the vector on the left-hand side of the operator, depending on whether the value is found in the vector on the right-hand side.

- In plain language, usedcars$color %in% c("Black","Gray","Silver","White") asks, "Is the used car color in the set of black, gray, silver, and white?"
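
- The behavior of %in% is easiest to see on a short vector. A minimal sketch, using a hypothetical four-element color vector in place of usedcars$color:

```r
# Hypothetical colors standing in for usedcars$color
colors <- c("Red", "Black", "Yellow", "Silver")

# One TRUE/FALSE per element on the left-hand side
conservative <- colors %in% c("Black", "Gray", "Silver", "White")
print(conservative)  # FALSE  TRUE FALSE  TRUE

# Counting the TRUE and FALSE values, as table() does for the full data
print(table(conservative))
```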

> table(usedcars$conservative)

FALSE  TRUE 
   51    99

- We see that about two-thirds of the cars (99 of 150) have conservative colors, while one-third do not.

- Next, we use a cross-tabulation to see how the proportion of conservatively colored cars varies by model. Since we're assuming that the model of car dictates the choice of color, we'll treat conservative as the dependent (y) variable.

> CrossTable(x = usedcars$model, y = usedcars$conservative)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
               | usedcars$conservative 
usedcars$model |     FALSE |      TRUE | Row Total | 
---------------|-----------|-----------|-----------|
            SE |        27 |        51 |        78 | 
               |     0.009 |     0.004 |           | 
               |     0.346 |     0.654 |     0.520 | 
               |     0.529 |     0.515 |           | 
               |     0.180 |     0.340 |           | 
---------------|-----------|-----------|-----------|
           SEL |         7 |        16 |        23 | 
               |     0.086 |     0.044 |           | 
               |     0.304 |     0.696 |     0.153 | 
               |     0.137 |     0.162 |           | 
               |     0.047 |     0.107 |           | 
---------------|-----------|-----------|-----------|
           SES |        17 |        32 |        49 | 
               |     0.007 |     0.004 |           | 
               |     0.347 |     0.653 |     0.327 | 
               |     0.333 |     0.323 |           | 
               |     0.113 |     0.213 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        51 |        99 |       150 | 
               |     0.340 |     0.660 |           | 
---------------|-----------|-----------|-----------|
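# N: raw count
# Chi-square contribution (relative to the expected count)
# N / Row Total: proportion within the row (read across)
# N / Col Total: proportion within the column (read down)
# N / Table Total: proportion of the whole table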

- The legend at the top (labeled Cell Contents) indicates how to interpret each value.

- The rows indicate the car's model, and the columns indicate whether or not the car's color is conservative.

- The Chi-square values refer to the cell's contribution in the Pearson's Chi-squared test for independence between two variables. This test measures how likely it is that the difference in cell counts in the table is due to chance alone. If the probability is very low, it provides strong evidence that the two variables are associated.

> CrossTable(x = usedcars$model, y = usedcars$conservative, chisq = TRUE)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
               | usedcars$conservative 
usedcars$model |     FALSE |      TRUE | Row Total | 
---------------|-----------|-----------|-----------|
            SE |        27 |        51 |        78 | 
               |     0.009 |     0.004 |           | 
               |     0.346 |     0.654 |     0.520 | 
               |     0.529 |     0.515 |           | 
               |     0.180 |     0.340 |           | 
---------------|-----------|-----------|-----------|
           SEL |         7 |        16 |        23 | 
               |     0.086 |     0.044 |           | 
               |     0.304 |     0.696 |     0.153 | 
               |     0.137 |     0.162 |           | 
               |     0.047 |     0.107 |           | 
---------------|-----------|-----------|-----------|
           SES |        17 |        32 |        49 | 
               |     0.007 |     0.004 |           | 
               |     0.347 |     0.653 |     0.327 | 
               |     0.333 |     0.323 |           | 
               |     0.113 |     0.213 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        51 |        99 |       150 | 
               |     0.340 |     0.660 |           | 
---------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  0.1539564     d.f. =  2     p =  0.92591

- The probability is about 93%, suggesting that it is very likely that the variations in cell count are due to chance alone, and not due to a true association between model and color.
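
- The chi-square contributions and the test result can be reproduced from the six cell counts alone. The sketch below types the counts from the table above into a matrix (the object names are chosen here for illustration) and uses base R's chisq.test() rather than CrossTable():

```r
# Cell counts copied from the CrossTable output above
counts <- matrix(c(27, 51,
                    7, 16,
                   17, 32),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(model = c("SE", "SEL", "SES"),
                                 conservative = c("FALSE", "TRUE")))

# Expected count per cell if model and color were independent:
# (row total * column total) / table total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Each cell's chi-square contribution: (observed - expected)^2 / expected
contribution <- (counts - expected)^2 / expected
print(round(contribution, 3))

# The test statistic is the sum of the contributions
chi_sq <- sum(contribution)
deg_free <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
print(chi_sq)   # about 0.154
print(p_value)  # about 0.926

# The same test using the built-in function
result <- chisq.test(counts)
print(result)
```

- The rounded contributions match the second line of each cell in the table, and the statistic and p-value match the Pearson's Chi-squared test output.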

*denoting

*arbitrary

*robust

*convention

*bivariate

*profound
