1. The aim of our project is to analyze the Breast Cancer Wisconsin (Diagnostic) dataset by fitting several classification models to it and comparing their misclassification rates.
a. We plan to use Decision Tree, Bagging, Random Forest, Naïve Bayes, and Support Vector Machine classifiers and compare their results to find the most accurate model.
b. We have chosen the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository for this analysis.
c. This is a secondary dataset, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 .
2. The Breast Cancer Wisconsin (Diagnostic) dataset has 32 attributes (ID, diagnosis, and 30 real-valued input features). Attribute information:
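The comparison in item 1 rests on a single metric, so a minimal sketch of it may help. The misclassification rate is the fraction of cases whose predicted label differs from the true one (1 minus accuracy). The function names and the toy labels below are hypothetical and chosen only for illustration; they are not results from the dataset or from any of the listed models.

```python
from typing import Dict, List, Sequence, Tuple


def misclassification_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)


def rank_models(
    y_true: Sequence[str], predictions: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Return (model name, misclassification rate) pairs, lowest rate first."""
    rates = {
        name: misclassification_rate(y_true, pred)
        for name, pred in predictions.items()
    }
    return sorted(rates.items(), key=lambda item: item[1])


# Toy labels for illustration only (not real results).
y_true = ["M", "B", "B", "M", "B", "B", "M", "B"]
toy_predictions = {
    "model_a": ["M", "B", "B", "B", "B", "B", "M", "B"],  # 1 error out of 8
    "model_b": ["M", "M", "B", "B", "B", "B", "B", "B"],  # 3 errors out of 8
}

ranking = rank_models(y_true, toy_predictions)
for name, rate in ranking:
    print(f"{name}: {rate:.3f}")
```

In the real project, each entry in the dictionary would hold one fitted model's predictions on the same held-out test set, so that the rates are directly comparable.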
a. ID number
b. Diagnosis (M = malignant, B = benign) – target variable
c. Ten real-valued features are computed for each cell nucleus: radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area - 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension (“coastline approximation” - 1). The mean, standard error, and “worst” value (mean of the three largest values) of each feature are recorded for every image, which gives the 30 input features.
d. Class distribution: 357 benign, 212 malignant
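The attribute and class counts above can be checked with a little arithmetic. The sketch below assumes, following the UCI description of this file, that the 30 inputs are the mean, standard error, and "worst" value of each of the ten listed features. The column names are hypothetical labels of our own, not names taken from the data file. The sketch also derives the error rate of a trivial rule that always predicts the majority class, which gives a floor for the model comparison.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]  # assumed: mean, standard error, worst

# Hypothetical column names: id, diagnosis, then 10 features x 3 statistics.
columns = ["id", "diagnosis"] + [
    f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
]

# Class distribution quoted in the list above.
class_counts = {"B": 357, "M": 212}
n_samples = sum(class_counts.values())

# Always predicting the majority class misclassifies every minority case.
majority_class = max(class_counts, key=class_counts.get)
baseline_error = 1 - class_counts[majority_class] / n_samples

print(len(columns))   # total number of attributes
print(n_samples)      # total number of instances
print(majority_class, round(baseline_error, 4))
```

Any model we keep should do better than this baseline, which misclassifies 212 of the 569 cases (roughly 37%).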
3. Diagnosis is the field we are predicting. It takes two values: B = benign and M = malignant.
a. Benign – Benign tumors may grow larger but do not spread to other parts of the body. They are also called nonmalignant.
b. Malignant – Malignant tumors are cancerous: they can invade nearby tissue and spread to other parts of the body.
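Some modeling libraries expect a numeric target, so the two diagnosis codes are often mapped to integers before fitting. A minimal sketch, assuming the common convention of 1 for malignant and 0 for benign (the function name and the convention are our own choices, not something fixed by the dataset):

```python
from typing import Iterable, List

# Assumed convention: malignant is the positive class.
DIAGNOSIS_CODES = {"M": 1, "B": 0}


def encode_diagnosis(labels: Iterable[str]) -> List[int]:
    """Map 'M'/'B' diagnosis codes to 1/0, rejecting anything else."""
    encoded = []
    for label in labels:
        key = label.strip().upper()  # tolerate stray whitespace and lowercase
        if key not in DIAGNOSIS_CODES:
            raise ValueError(f"unexpected diagnosis code: {label!r}")
        encoded.append(DIAGNOSIS_CODES[key])
    return encoded


print(encode_diagnosis(["M", "B", "b", " M "]))
```

Raising an error on unknown codes, rather than silently skipping them, makes data-entry problems visible before any model is trained.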
4. Breast cancer is the most commonly diagnosed cancer among women in the U.S. after skin cancer. It occurs in both men and women, but it is far more common in women. Survival rates have risen and mortality from the disease has steadily declined, primarily because of factors such as earlier detection, a more individualized approach to treatment, and a better understanding of the disease.