Before you're ready to feed a dataset into your machine learning model of choice, it's important to do some preprocessing so the data behaves nicely for your model. In an ideal world, you'll have a perfectly clean dataset with no errors or missing values. In the real world, however, such datasets are rare.

In this post, I'll be cleaning a dataset from Kaggle on Mental Health in Tech. I took a look at this dataset with my friend Florian (he's awesome, check him out here) earlier this year and decided it would serve as a good example for this post.

It's hard to know what to do if you don't know what you're working with, so let's load our dataset and take a peek. You can follow along in a Jupyter Notebook if you'd like.

    pd.set_option("display.max_columns", num_columns)

The pandas head() function returns the first 5 rows of your dataframe by default, but I wanted to see a bit more to get a better idea of the dataset.

While we're at it, let's take a look at the shape of the dataframe too. Running df.shape returns the dimensionality of the dataframe (in this case, the number of rows and columns), which tells you how many examples and features you are working with. Pro tip: you can return the number of observations in your dataset with df.shape[0] and the number of features with df.shape[1]. This is often helpful to do when building a model.

The first thing to understand is what features our dataset contains. You can read a description of each feature for this dataset on Kaggle. Rather than inspecting each feature one-by-one, I opted for the lazier route and ran the following:

    feature_names = df.columns.tolist()

This printed out a cell in my notebook with a ton of information about my dataset that I could easily consume by just scrolling through. As I looked through the dataset, I kept an eye out for things such as duplicate labels, erroneous or null values, and anything else that didn't quite seem right. While many columns were fine as is, a couple of columns needed cleaning.

Here's a list of things I found that needed attention before feeding this data into a machine learning algorithm.

- The Age column contained people who had not been born yet (negative numbers).
- The Age column also contained children (such as a 5 year old) who are unlikely to be taking a survey about their workplace.
- The Age column contained someone who is 99999999999 years old; they should be in the Guinness World Records book or something.
- There are 49 different values listed for Gender. Moreover, "Male" and "male" are ostensibly the same but are currently being treated as two distinct categories.
- self_employed and work_interfere contain some null values.

For numerical features, it can also be helpful to quickly examine any possible correlations between features. You can use the pandas scatter_matrix to easily visualize your data.

If we're using a supervised machine learning technique, we need to make a distinction in the data between features and labels for each observation. Ultimately, this depends on what you're looking to predict or classify. In this case, I've decided that I'd like to build a classifier to predict whether or not someone will seek treatment.

Let's start with figuring out what to do with the null values found in self_employed and work_interfere. In both of these cases, the column contains categorical data. It is important to note the distinction between treating null values for categorical and numerical data, as the treatment will depend on the type of data present.

There's not exactly a formulaic way to approach missing data, as the treatment largely depends on the context and the nature of the data. For example, are the data missing at random, or is there a hidden relationship between missing data and some other predictor? One option in dealing with missing data is to simply ignore or remove the rows in which data is missing, discarding them from our analysis.
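To make the cleanup issues named in this post concrete (impossible ages, inconsistent Gender labels, and nulls in self_employed and work_interfere), here is a minimal sketch of how they might be handled with pandas. It is not the code from the original notebook. The sample rows are made up, the 18 to 100 age bounds and the "Unknown" placeholder are assumptions chosen for illustration, and the alias map only covers the spellings in the sample rather than all 49 Gender values.

```python
import pandas as pd

# Made-up rows standing in for the survey data (illustrative only).
sample = pd.DataFrame({
    "Age": [29, -5, 5, 99999999999, 41, 35],
    "Gender": ["Male", "male", "Female ", "F", "M", "female"],
    "self_employed": ["No", None, "Yes", "No", None, "No"],
    "work_interfere": ["Often", None, "Rarely", None, "Never", None],
})

# Assumed bounds for a plausible working-age respondent.
MIN_AGE, MAX_AGE = 18, 100

# Hypothetical alias map; the real column would need many more entries.
GENDER_ALIASES = {"m": "male", "f": "female"}

# Categorical columns that contain nulls.
NULLABLE_COLUMNS = ["self_employed", "work_interfere"]


def clean_survey(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the survey dataframe."""
    out = df.copy()

    # Collapse case and whitespace variants, then map short aliases,
    # so that "Male", "male" and "M" end up in one category.
    out["Gender"] = (
        out["Gender"].str.strip().str.lower().replace(GENDER_ALIASES)
    )

    # Keep the rows but mark missing categorical answers explicitly.
    # This is an alternative to dropping them, not the post's choice.
    out[NULLABLE_COLUMNS] = out[NULLABLE_COLUMNS].fillna("Unknown")

    # Drop negative, child, and absurdly large ages.
    out = out[out["Age"].between(MIN_AGE, MAX_AGE)]

    return out.reset_index(drop=True)


cleaned = clean_survey(sample)
print(cleaned)

# The row-removal option: discard any row with a null in these columns.
dropped = sample.dropna(subset=NULLABLE_COLUMNS)
print(len(dropped), "rows remain after dropping nulls")
```

Filling with a placeholder keeps every respondent at the cost of an extra category, while dropna shrinks the dataset. Which one is appropriate depends on whether the values are missing at random.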