[{"metadata":{},"cell_type":"markdown","source":"# Midterm, Summer 2022, CSC 9010-035, Python, Machine Learning and Data
-
- Site Admin
- Posts: 899603
- Joined: Mon Aug 02, 2021 8:13 am
[{"metadata":{},"cell_type":"markdown","source":"# Midterm, Summer 2022, CSC 9010-035, Python, Machine Learning and Data
[{"metadata":{},"cell_type":"markdown","source":"# Midterm, Summer 2022, CSC 9010-035, Python, Machine Learning and Data Science"},{"metadata":{},"cell_type":"markdown","source":"## Thrust\n\nThis quiz is aimed to test your understanding of various classification algorithms\n - Logistic regression\n - Support Vector Machines\n - Decision Trees\n - Random Forest"},{"metadata":{},"cell_type":"markdown","source":"The class discussed all of these, however did not go into the details of all of the options that could be used.\n\nPart of what you will need to do is to examine the Sci-Kit Learn documentation to better understand how to use these algorithms.\n\nFor example, for Logistics Regression, you should look at:\n - https://scikit-learn.org/stable/modules ... gression\n - https://scikit-learn.org/stable/modules ... e","source":"# imports\n\n# NOTE: Most likely you will need to import more libraries\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\nfrom sklearn.decomposition import PCA\nfrom sklearn.metrics import ConfusionMatrixDisplay, accuracy_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import LinearSVC, SVC, NuSVC","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Facial Expressions\n\nThe data come from https://www.kaggle.com/competitions/cha ... data\n\nIt is over 30K images of faces that are labeled in one of six categories (see `expressions`) below.\n\nEach image is a 48 x 48 pixel gray-scale image. By gray-scale, we mean each pixel is between 0 and 255 and represents a color of gray from black = 0 to white = 255\n\nOrdinarily, we would use Convolutional Neural Networks to develop a classification estimator.\n\nBut not here. We haven't studied CNNs yet. Instead we will use `sklearn` estimatators only."},{"metadata":{},"cell_type":"markdown","source":"I have found that the time to run logistics estimation is way too long on this data.\n\nSo we will do two things to reduce the size:\n - Only look at data that corresponds to Angry, Happy and Neutral labels. 
{"metadata":{},"cell_type":"markdown","source":"## Facial Expressions\n\nThe data come from https://www.kaggle.com/competitions/cha ... data\n\nIt is over 30K images of faces, each labeled with one of seven categories (see `expressions` below).\n\nEach image is a 48 x 48 pixel gray-scale image. By gray-scale, we mean each pixel is a value between 0 and 255 representing a shade of gray, from black = 0 to white = 255.\n\nOrdinarily, we would use Convolutional Neural Networks to develop a classification estimator.\n\nBut not here. We haven't studied CNNs yet. Instead we will use `sklearn` estimators only."},
{"metadata":{},"cell_type":"markdown","source":"I have found that the time to run logistic regression estimation is way too long on this data.\n\nSo we will do two things to reduce the size:\n - Only look at data that corresponds to the Angry, Happy and Neutral labels. See the variable `expressions_subset`\n - Reduce the number of features from 2304 to 600 by using Principal Component Analysis\n - I'll provide a lecture on Saturday morning that describes Principal Component Analysis, but for now, it is just a way to reduce the number of features"},
{"metadata":{"trusted":true},"cell_type":"code","source":"expressions = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'}\nexpressions_subset = {0: 'Angry', 3: 'Happy', 6: 'Neutral'}","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### Read in the data\n\nNote that Pandas can read the data directly from the zip file"},
{"metadata":{"trusted":true},"cell_type":"code","source":"data_location_folder = \"Data\"\n\ndata_location = data_location_folder + \"/icml_face_data.csv.zip\"\nimages_plus_target_df = pd.read_csv(data_location)\nimages_plus_target_df.head()","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### How many items by emotion/target?"},
{"metadata":{"trusted":true},"cell_type":"code","source":"images_plus_target_df['emotion'].value_counts()","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### How many items by Usage?\n\nWe will combine `PublicTest` and `Training` into something we will call `Training`, and `PrivateTest` will be our test set"},
{"metadata":{"trusted":true},"cell_type":"code","source":"images_plus_target_df[' Usage'].value_counts()","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"images_plus_target_Training_df = images_plus_target_df[images_plus_target_df['emotion'].isin(expressions_subset.keys()) &\n                                                       images_plus_target_df[' Usage'].isin({\"Training\", \"PublicTest\"})]\n\nimages_plus_target_Test_df = images_plus_target_df[images_plus_target_df['emotion'].isin(expressions_subset.keys()) &\n                                                   images_plus_target_df[' Usage'].isin({\"PrivateTest\"})]","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"images_plus_target_Training_df['emotion'].value_counts()","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"images_plus_target_Test_df['emotion'].value_counts()","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### Convert the data to Numpy\n\nThe pixels are currently encoded as one long string with spaces as separators.\n\nHere we create a numpy array of 2304 columns, and also extract the target"},
{"metadata":{"trusted":true},"cell_type":"code","source":"def convert_to_numpy(image_df):\n    X = np.zeros(shape=(len(image_df), 48 * 48))\n    for i, row in enumerate(image_df.index):\n        image = np.fromstring(image_df.loc[row, ' pixels'], dtype=int, sep=' ')\n        X[i] = image  # fill row i with this image's 2304 pixel values\n\n    # Map the numeric emotion codes to their string labels\n    y = np.array([expressions[i] for i in image_df['emotion'].to_numpy()])\n    return X, y","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"X_training_all, y_training_all = convert_to_numpy(images_plus_target_Training_df)\nX_test_all, y_test_all = convert_to_numpy(images_plus_target_Test_df)\nprint(f\"Number of training instances: {len(X_training_all)}\")\nprint(f\"Number of test instances: {len(X_test_all)}\")","execution_count":null,"outputs":[]},
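{"metadata":{},"cell_type":"markdown","source":"A quick sanity check on the conversion (a minimal sketch, not part of the assignment): the feature matrices should have 2304 columns and the targets should be the string labels from `expressions_subset`."},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Sanity check: expected shapes and string labels after conversion\nassert X_training_all.shape[1] == 48 * 48  # 2304 pixel features per image\nassert set(y_training_all) == set(expressions_subset.values())  # string labels only\nprint(X_training_all.shape, X_test_all.shape)\nprint(np.unique(y_test_all, return_counts=True))","execution_count":null,"outputs":[]},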
{"metadata":{"trusted":true},"cell_type":"code","source":"def plot_examples(X: np.ndarray, y: np.ndarray, emotion: str = 'Angry', extra_title: str = ''):\n    fig, axs = plt.subplots(1, 5, figsize=(25, 12))\n    fig.subplots_adjust(hspace=.2, wspace=.2)\n    axs = axs.ravel()\n    # Show the first five images whose label matches `emotion`\n    for i, idx in enumerate(np.where(y == emotion)[0][:5]):\n        axs[i].imshow(X[idx].reshape(48, 48), cmap='gray')\n        axs[i].set_title(emotion + ((' ' + extra_title) if extra_title else ''))\n        axs[i].set_xticklabels([])\n        axs[i].set_yticklabels([])","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### Apply Principal Components to reduce the number of features from 2304 to 600\n\nNote, we use the training data to develop the principal components, and then apply them to both the training data and the test data.\n\n**NOTE: If CPU time is an issue for you, try using only 500 instead of 600 components**"},
{"metadata":{"trusted":true},"cell_type":"code","source":"pca = PCA(n_components=600)  # drop to 500 if CPU time is an issue\npca.fit(X_training_all)\nX_training_pca = pca.transform(X_training_all)\nX_test_pca = pca.transform(X_test_all)","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"### PCA loses information\n\nGoing from 2304 features to `n_components` features reduces the fidelity. In PCA, there is a concept of 'explained variance'.\n\nFor 600 components, we see that 98% of the variability is explained. For 500 components, it is 97.5%"},
{"metadata":{"trusted":true},"cell_type":"code","source":"print(f\"Total explained variance: {np.sum(pca.explained_variance_ratio_) * 100:.2f}%\")","execution_count":null,"outputs":[]},
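{"metadata":{},"cell_type":"markdown","source":"To see how the explained variance accumulates component by component (a sketch using the `pca` object fitted above; not required for the midterm), plot the cumulative curve. It makes the 500-vs-600 trade-off visible."},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Sketch: cumulative explained variance vs. number of components\ncumulative = np.cumsum(pca.explained_variance_ratio_)\nplt.figure(figsize=(8, 4))\nplt.plot(np.arange(1, len(cumulative) + 1), cumulative)\nplt.xlabel('Number of components')\nplt.ylabel('Cumulative explained variance ratio')\nplt.title('Explained variance by number of PCA components')\nplt.grid(True)\nplt.show()","execution_count":null,"outputs":[]},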
{"metadata":{},"cell_type":"markdown","source":"### How bad is the reduction?\n\nSee for yourself. Taking the reduced data, we can 'transform' it back to the original size, but without adding any more information.\n\nThe plots below show examples before and after the PCA transformation"},
{"metadata":{"trusted":true},"cell_type":"code","source":"X_transformed_back_from_pca = pca.inverse_transform(X_training_pca)\nfor emotion in expressions_subset.values():\n    plot_examples(X_training_all, y_training_all, emotion)\n    plot_examples(X_transformed_back_from_pca, y_training_all, emotion, extra_title='after PCA')","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"## The midterm\n\nThe idea is for you to explore using logistic regression, support vector machines, decision trees and random forests.\n\nI want you to explore at least 4 ways of running each one. As examples:\n\nLogistic regression has quite a few parameters:\n - l2 penalties (or none). When using l2 penalties, what is the correct `C` coefficient?\n - type of algorithm/solver to use\n - way of handling multi-class (number of classes > 2)\n - preprocessing - do you need to center/scale the data beforehand?\n   - My guess is no - all the features are on the same scale - but it should be verified\n\nAnd with Support Vector Machines (read: https://scikit-learn.org/stable/modules/svm.html), explore:\n - Different multi-class parameters\n - LinearSVC vs SVC\n - For SVC, different kernels\n - Different margin penalties (the `C` parameter)\n\nWith Classification trees (https://scikit-learn.org/stable/modules/tree.html#tree), sklearn has two types:\n - Decision Trees (https://scikit-learn.org/stable/modules ... Classifier)\n - Extra Tree Classification (https://scikit-learn.org/stable/modules ... Classifier). I have never used this one.\n\nAnd there are Random Forests (https://scikit-learn.org/stable/modules ... Classifier)\n\n**Read the documentation, select the 4+ ways you want to explore each of these 4 classifiers, AND WRITE UP NOTES IN MARKDOWN CELLS.**\n\n**Your interpretation and conclusions are really important.**"},
{"metadata":{},"cell_type":"markdown","source":"## Some examples, just to get you started, are below."},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Basic Logistic Regression example\n\n# lr = LogisticRegression(C=1, multi_class='ovr', solver='liblinear')\nlr = LogisticRegression(C=100000, multi_class='ovr', solver='liblinear')\n\nlr.fit(X_training_pca, y_training_all)","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"lr.score(X_test_pca, y_test_all)","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"y_test_pred = lr.predict(X_test_pca)\nprint(f\"Accuracy: {accuracy_score(y_true=y_test_all, y_pred=y_test_pred) * 100.0:.2f}%\")\nConfusionMatrixDisplay.from_predictions(y_true=y_test_all, y_pred=y_test_pred, labels=list(expressions_subset.values()))\nplt.show()","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Another Logistic Regression example\nlr_mn = LogisticRegression(penalty=\"none\", multi_class=\"multinomial\", max_iter=2000, solver='sag', tol=0.00001)\nlr_mn.fit(X_training_pca, y_training_all)","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"lr_mn.n_iter_","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"lr_mn.score(X_test_pca, y_test_all)","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"## SVM"},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Basic example. This takes a long time to run. Explore other options?\n\nsvc = LinearSVC(multi_class='crammer_singer', dual=False)\nsvc.fit(X_training_pca, y_training_all)\ny_test_pred = svc.predict(X_test_pca)\nprint(f\"Accuracy: {accuracy_score(y_true=y_test_all, y_pred=y_test_pred) * 100.0:.2f}%\")\nConfusionMatrixDisplay.from_predictions(y_true=y_test_all, y_pred=y_test_pred, labels=list(expressions_subset.values()))\nplt.show()","execution_count":null,"outputs":[]},
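{"metadata":{},"cell_type":"markdown","source":"## Trees and forests\n\nNo starter example was given for the tree-based classifiers, so here is a minimal sketch, assuming the same `X_training_pca` / `y_training_all` variables from above. The parameter values (`max_depth`, `n_estimators`, `random_state`) are placeholders, not recommendations; exploring them is part of the midterm."},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Minimal Decision Tree sketch; max_depth=10 is a placeholder to explore\nfrom sklearn.tree import DecisionTreeClassifier\n\ndt = DecisionTreeClassifier(max_depth=10, random_state=42)\ndt.fit(X_training_pca, y_training_all)\nprint(f\"Decision tree accuracy: {dt.score(X_test_pca, y_test_all) * 100.0:.2f}%\")","execution_count":null,"outputs":[]},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Minimal Random Forest sketch; n_estimators=100 is the sklearn default\nfrom sklearn.ensemble import RandomForestClassifier\n\nrf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)\nrf.fit(X_training_pca, y_training_all)\nprint(f\"Random forest accuracy: {rf.score(X_test_pca, y_test_all) * 100.0:.2f}%\")","execution_count":null,"outputs":[]},
{"metadata":{},"cell_type":"markdown","source":"If you would rather automate the '4+ ways' exploration, a cross-validated grid search is one option. The grid below is a small, assumed example for logistic regression; substitute the parameters you actually decide to study."},
{"metadata":{"trusted":true},"cell_type":"code","source":"# Sketch: cross-validated search over a small logistic regression grid\nfrom sklearn.model_selection import GridSearchCV\n\nparam_grid = {'C': [0.01, 1, 100], 'solver': ['liblinear', 'lbfgs']}\nsearch = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=3, n_jobs=-1)\nsearch.fit(X_training_pca, y_training_all)\nprint(search.best_params_)\nprint(f\"Best CV accuracy: {search.best_score_ * 100.0:.2f}%\")","execution_count":null,"outputs":[]}]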