[Machine Learning with R] Managing and Understanding Data Part.1
Vectors
- The fundamental R data structure is the vector, which stores an ordered set of values called elements.
- A vector can contain any number of elements, but all the elements must be of the same type!
Several Vector Types
1. integer ( numbers without decimals )
2. numeric ( numbers with decimals )
3. character ( text data )
4. logical ( TRUE or FALSE values )
Two Special Values
1. NULL ( indicates the absence of any value )
2. NA ( indicates a missing value )
subject_name <- c("Jone Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
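As a quick check of the "same type" rule, the sketch below inspects the three vectors with class(). The `mixed` vector is a hypothetical addition, included only to show what R does when types are combined.

```r
subject_name <- c("Jone Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

# class() reports the type of each vector
class(subject_name)   # "character"
class(temperature)    # "numeric"
class(flu_status)     # "logical"

# Hypothetical example: mixing types coerces everything to one common type
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"
```

Because a vector holds a single type, R silently converts the number and the logical value to text here rather than raising an error.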
- The elements of a vector can be accessed by position, counting from 1 (not 0), by placing the item's number in square brackets after the name of the vector (ex. temperature[2])
> temperature[2]
[1] 98.6
- A range of values can be obtained using the colon operator.
> temperature[2:3]
[1] 98.6 101.4
- Items can be excluded by specifying a negative item number.
> temperature[-2]
[1] 98.1 101.4
- It is also sometimes useful to specify a logical vector indicating whether each item should be included.
> temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6
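In practice the logical vector is usually computed from a condition rather than typed by hand. A minimal sketch, assuming a cutoff of 100 degrees (the cutoff and the name `is_fever` are illustrative, not from the original notes):

```r
temperature <- c(98.1, 98.6, 101.4)

# A comparison returns one TRUE/FALSE per element
is_fever <- temperature > 100   # FALSE FALSE TRUE

# Using that logical vector as an index keeps only the TRUE positions
temperature[is_fever]           # 101.4
```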
Factors
- Features that represent a characteristic with categories of values are known as nominal. Although it is possible to use a character vector to store nominal data, R provides a data structure known as a factor specifically for this purpose.
- A factor is a special case of a vector that is used solely for representing nominal variables.
* Why not use character vectors?
- One advantage of factors is that they are generally more efficient than character vectors because the category labels are stored only once.
> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE FEMALE MALE
Levels: FEMALE MALE
- R printed additional information indicating the levels of the gender factor. The levels comprise the set of possible categories the data could take, in this case MALE or FEMALE. We can also add levels that may not appear in the data!
> blood <- factor(c("O","AB","A"),
+ levels = c("A","B","AB","O"))
> blood
[1] O AB A
Levels: A B AB O
- Storing the additional level allows for the possibility of adding data with the other blood type (B) in the future.
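The effect of the extra level can be seen with a few base R functions. This is a small sketch using the same blood factor; the choice of levels(), table(), and as.integer() is for illustration.

```r
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# The labels are stored once, in the order given
levels(blood)       # "A" "B" "AB" "O"

# table() counts every level, including B, which has no data yet
table(blood)

# Internally each element is an integer index into the levels
as.integer(blood)   # 4 3 1
```

The count of 0 for B shows that the level exists even though no patient has that blood type.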
Lists
- A list is used for storing an ordered set of values. Unlike a vector, which requires all elements to be the same type, a list allows different types of values to be collected. Due to this flexibility, lists are often used to store various types of input and output data and sets of configuration parameters for machine learning models.
subject_name[1]
temperature[1]
flu_status[1]
gender[1]
blood[1]
# we will combine these values into one object
# you have the option of providing a name for each value in the list
subject1 <- list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1])
- Printing a patient's data is now a matter of typing a single command!
> subject1
$fullname
[1] "Jone Doe"
$temperature
[1] 98.1
$flu_status
[1] FALSE
$gender
[1] MALE
Levels: FEMALE MALE
$blood
[1] O
Levels: A B AB O
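- Values in a list can be accessed by position with square brackets, as with vectors, but the result is itself a list. It is often easier to access a value such as temperature directly, by appending a $ and the value's name to the name of the list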
> subject1[2]
$temperature
[1] 98.1
> subject1$temperature
[1] 98.1
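The difference between the two results above is easy to miss, so here is a short sketch comparing single brackets, double brackets, and $. The list is a reduced copy of subject1, and the variable names `by_single`, `by_double`, and `by_name` are made up for illustration.

```r
subject1 <- list(fullname = "Jone Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

by_single <- subject1[2]              # still a list, with one element
by_double <- subject1[[2]]            # the value itself
by_name <- subject1[["temperature"]]  # same value, looked up by name

class(by_single)   # "list"
class(by_double)   # "numeric"

# Several items can be pulled out at once by name
subject1[c("temperature", "flu_status")]
```

Double brackets and $ both return the stored value, which is what is usually needed for further calculation.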
Data Frames
- A data frame is a structure analogous to a spreadsheet or database, since it has both rows and columns of data. It is literally a list of vectors (or factors) of equal length, so it combines aspects of both vectors and lists.
> pt_data <- data.frame(subject_name, temperature, flu_status, gender, blood, stringsAsFactors = FALSE)
- stringsAsFactors = FALSE : In R versions before 4.0.0, R automatically converts every character vector to a factor unless this option is specified (since R 4.0.0 the default is already FALSE). Not all character data is categorical, for example the patient names.
> pt_data
subject_name temperature flu_status gender blood
1 Jone Doe 98.1 FALSE MALE O
2 Jane Doe 98.6 FALSE FEMALE AB
3 Steve Graves 101.4 TRUE MALE A
- A data frame has two dimensions and is therefore displayed in matrix format. The data frame has one column for each vector of patient data and one row for each patient. In machine learning, the columns are the features or attributes and the rows are the examples.
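The two dimensions and the per-column types can be confirmed directly. A minimal sketch that rebuilds pt_data so the block runs on its own; the use of dim() and sapply() is an illustrative choice.

```r
pt_data <- data.frame(
  subject_name = c("Jone Doe", "Jane Doe", "Steve Graves"),
  temperature = c(98.1, 98.6, 101.4),
  flu_status = c(FALSE, FALSE, TRUE),
  gender = factor(c("MALE", "FEMALE", "MALE")),
  blood = factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O")),
  stringsAsFactors = FALSE
)

dim(pt_data)    # 3 5  (rows = examples, columns = features)
nrow(pt_data)   # 3
ncol(pt_data)   # 5

# Each column keeps its own type, as in a list of vectors
sapply(pt_data, class)
```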
> pt_data$subject_name
[1] "Jone Doe" "Jane Doe" "Steve Graves"
> pt_data[c("temperature", "flu_status")]
temperature flu_status
1 98.1 FALSE
2 98.6 FALSE
3 101.4 TRUE
> pt_data[2:3]
temperature flu_status
1 98.1 FALSE
2 98.6 FALSE
3 101.4 TRUE
- To extract values from the data frame, we can use methods like those we learned for accessing values in vectors, with one important exception: because the data frame is two-dimensional, you need to specify the position of both the rows and the columns you would like to extract, in the format [rows, columns], counting from 1.
> pt_data[1,2]
[1] 98.1
> pt_data[c(1,3), c(2,4)]
temperature gender
1 98.1 MALE
3 101.4 MALE
- To extract all of the rows or columns, rather than listing every one, simply leave the row or column portion blank.
> pt_data[,1]
[1] "Jone Doe" "Jane Doe" "Steve Graves"
> pt_data[1,]
subject_name temperature flu_status gender blood
1 Jone Doe 98.1 FALSE MALE O
> pt_data[,]
subject_name temperature flu_status gender blood
1 Jone Doe 98.1 FALSE MALE O
2 Jane Doe 98.6 FALSE FEMALE AB
3 Steve Graves 101.4 TRUE MALE A
- Columns can be accessed by name, rather than position, and negative signs can be used to exclude rows or columns of data.
> pt_data[c(1,3), c("temperature", "gender")]
temperature gender
1 98.1 MALE
3 101.4 MALE
> pt_data[-2, c(-1,-3,-5)]
temperature gender
1 98.1 MALE
3 101.4 MALE
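The logical indexing shown earlier for vectors also works in the row position of a data frame. A sketch under the assumption that we want patients with a temperature above 100 (the cutoff and the name `fever_rows` are illustrative):

```r
pt_data <- data.frame(
  subject_name = c("Jone Doe", "Jane Doe", "Steve Graves"),
  temperature = c(98.1, 98.6, 101.4),
  flu_status = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep rows where the condition is TRUE; blank column part keeps all columns
fever_rows <- pt_data[pt_data$temperature > 100, ]
fever_rows$subject_name   # "Steve Graves"

# A logical column can be used as the row selector directly
pt_data[pt_data$flu_status, c("subject_name", "temperature")]
```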
Matrixes and Arrays
- A matrix is a data structure that represents a two-dimensional table, with rows and columns of data. It can contain any single type of data, although matrices are most often used for mathematical operations and therefore typically store only numeric data.
- In the example below, R loads the first column of the matrix first, then loads the second column. This is called column-major order.
> m <- matrix(c('a','b','c','d'), nrow = 2)
> m
[,1] [,2]
[1,] "a" "c"
[2,] "b" "d"
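- This is equivalent to the matrix produced using ncol = 2, because four values fill a 2 x 2 matrix either way. With six values, nrow = 2 and ncol = 2 produce different shapes: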
> m <- matrix(c('a','b','c','d','e','f'), nrow = 2)
> m
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
> m <- matrix(c('a','b','c','d','e','f'), ncol = 2)
> m
[,1] [,2]
[1,] "a" "d"
[2,] "b" "e"
[3,] "c" "f"
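Column-major filling is only the default. As a sketch, the byrow argument of matrix() (not covered in the notes above) fills the same values row by row instead; the names `m_col` and `m_row` are illustrative.

```r
values <- c("a", "b", "c", "d", "e", "f")

m_col <- matrix(values, nrow = 2)                # default: fill by column
m_row <- matrix(values, nrow = 2, byrow = TRUE)  # fill by row instead

m_col[1, ]   # "a" "c" "e"
m_row[1, ]   # "a" "b" "c"

dim(m_row)   # 2 3
```

Both matrices have the same shape; only the order in which the values are placed differs.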
- An array is a multi-dimensional table of data. Where a matrix has rows and columns of values, an array has rows, columns, and any number of additional layers of values.
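A small sketch of an array with one extra dimension, built with the base array() function. The values 1:24 and the shape 2 x 3 x 4 are arbitrary choices for illustration.

```r
# 2 rows, 3 columns, 4 layers, filled in column-major order
a <- array(1:24, dim = c(2, 3, 4))

dim(a)        # 2 3 4

# Indexing takes one position per dimension: [row, column, layer]
a[1, 2, 3]    # 15

# Leaving positions blank returns a whole layer as a 2 x 3 matrix
a[, , 1]
```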
Importing and Saving data from CSV files
- A tabular data file is structured in matrix form, in such a way that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the columns of data. This is called a header line.
- By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. So if a CSV file does not have a header, specify the option header = FALSE .
mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = FALSE)
write.csv(pt_data, file = "pt_data.csv")
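To see both functions working together, the sketch below writes a small data frame to a temporary file and reads it back. The data frame `pt_small` and the temporary path are stand-ins for illustration. It also passes row.names = FALSE, an assumption on my part: without it, write.csv() saves the row names as an extra unnamed first column, which read.csv() then reads back as a column named X.

```r
pt_small <- data.frame(
  subject_name = c("Jone Doe", "Jane Doe"),
  temperature = c(98.1, 98.6),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing is left in the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(pt_small, file = csv_path, row.names = FALSE)

# The file has a header line, so the default header = TRUE applies
pt_back <- read.csv(csv_path, stringsAsFactors = FALSE)
str(pt_back)

unlink(csv_path)   # clean up the temporary file
```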
Exploring and understanding data
- The better you understand your data, the better you will be able to match a machine learning model to your learning problem.
- We will explore the 'usedcars.csv' dataset, which contains actual data about used cars recently advertised for sale on a popular U.S. website.
> usedcars <- read.csv("C:/Users/USER/Desktop/4-1/Business Anlaytics/Week3_R_Practice/Week3_R_Practice/usedcars.csv", stringsAsFactors = FALSE)
# That path is quite long...
- One of the first questions to ask in your investigation should be about how the data is organized. If you are fortunate, your source will provide a data dictionary, a document that describes the data's features.
- The str() function provides a method for displaying the structure of a data frame, or any R data structure including vectors and lists.
> str(usedcars)
'data.frame': 150 obs. of 6 variables:
$ year : int 2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
$ model : chr "SEL" "SEL" "SEL" "SEL" ...
$ price : int 21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
$ mileage : int 7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
$ color : chr "Yellow" "Gray" "Silver" "Gray" ...
$ transmission: chr "AUTO" "AUTO" "AUTO" "AUTO" ...
- We can now presume that we have examples of n = 150 automobiles for sale.
- Applying a bit of subject-area knowledge to the feature names and values allows us to make some assumptions about what the variables represent.
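Since the CSV file itself may not be at hand, the sketch below applies the same structural checks to a three-row stand-in. The data frame `usedcars_sample` is hypothetical; its values are only the first three entries of each column as printed in the str() output above.

```r
# Stand-in built from the first three values shown by str(usedcars)
usedcars_sample <- data.frame(
  year = c(2011L, 2011L, 2011L),
  model = c("SEL", "SEL", "SEL"),
  price = c(21992L, 20995L, 19995L),
  mileage = c(7413L, 10926L, 7351L),
  color = c("Yellow", "Gray", "Silver"),
  transmission = c("AUTO", "AUTO", "AUTO"),
  stringsAsFactors = FALSE
)

str(usedcars_sample)     # same column names and types as the full dataset
nrow(usedcars_sample)    # 3 examples here (150 in the full data)
names(usedcars_sample)   # the feature names to interpret
```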
* analogous
* sleuthing
* prudent