Random forest clustering

2/13/2024

Feature selection works by removing irrelevant features and redundancy. Some studies apply two approaches to find relevant variables and remove redundancy in the dataset, i.e., using feature extraction and feature selection.

Hence, an approach to remove redundancy of data is required. In, if the same or similar characteristics or variables are used in the classification process, the accuracy of the model could be decreased. Redundant variables are variables or features that are similar. However, the Random Forest algorithm is only able to detect significant variables, but not redundant variables. It is also suitable for use on datasets that have more numbers of variables than the number of samples. This is because this algorithm can look for essential variables in a dataset. In the work of Moorthy and Mohamad, the Random Forest algorithm was used for gene selection and classification. Therefore, the variables or relevant features must first be determined. However, to get an accurate model using classification, a lot of sample data and variables that have correlation with classes in the dataset are required. The main problem of microarray data is that it has more variables than the samples. These algorithms have advantages and weaknesses. Several algorithms have been proposed, with some papers examining each algorithm separately in a closed condition. Researchers have performed cancer detection based on microarray data classification. To overcome this problem, a reduction process is conducted. The bigger the size of the data and the number of fixed observations, then the accuracy of classification at a certain point will be smaller. Therefore, microarray has high data dimensionality.

Microarray analysis plays an essential role in diagnosing a disease because it can be used to look at the level of gene expression in a particular cell sample and examine thousands of genes simultaneously. There are many ways to detect cancer, including one that is known as the microarray technique. According to the World Health Organization (WHO) data in 2015, 8.8 million deaths were caused by cancer, with the number set to increase every year if diagnosis was not resolved early. Keywords: Classification, Clustering, Dimensional Reduction, Microarray, Random ForestĬancer is a known deadly disease around the world.

The accuracy of the proposed approach is therefore higher than the approach using Random Forest without clustering. Based on the simulation, the accuracy of the proposed approach for each dataset, namely Colon, Lung Cancer, and Prostate Tumor, achieved 85.87%, 98.9%, and 89% accuracy, respectively. Next, the Random Forest algorithm is used. All best elements of each cluster are selected and used as features in the classification process. The result of clustering is ranked using the Relief algorithm such that the best scoring element for each cluster is obtained. The proposed approach can be used to categorize features that have the same characteristics in one cluster, so that redundancy in microarray data is removed. In this paper, we used k-means algorithm as the clustering approach for feature selection. There are two types of dimensional reduction, namely feature selection and feature extraction. Dimensional reduction can eliminate redundancy of data thus, features used in classification are features that only have a high correlation with their class. Therefore, to classify microarray data, a dimensional reduction process is required. However, microarray data have very little sample data and high data dimensionality. Microarray analysis allows the examination of levels of gene expression in specific cell samples, where thousands of genes can be analyzed simultaneously. Abstract: Microarray data plays an essential role in diagnosing and detecting cancer.

0 Comments

Random forest clustering

Leave a Reply.

Author

Archives

Categories