In this part, we put what we have learned so far about data and building machine learning applications into practice. We are using data from Kickstarter projects, and a sample of the records is shown below.

What we can do here is predict whether a Kickstarter project will succeed. We get the outcome from the state column. To predict the outcome we can use features such as category, currency, funding goal, country, and when it was launched.

Alright, we found out that there are six different values of state. Now, we can look at how many records there are for each.

We will simplify this by (a sketch follows the list):
- Dropping projects that are "live"
- Counting "successful" states as outcome = 1
- Combining every other state as outcome = 0
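A minimal sketch of this preparation, assuming the Kickstarter data is loaded into a dataframe named ks (the file name is an assumption; the column names follow the public Kickstarter projects dataset):

```python
import pandas as pd

# Load the Kickstarter projects data (file name is an assumption).
ks = pd.read_csv('ks-projects-201801.csv', parse_dates=['launched'])

# Drop live projects, then encode the target: 1 for "successful", 0 otherwise.
ks = ks.query('state != "live"')
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
```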

Now, we convert the launched feature into categorical features we can use in a model by accessing date and time values through the .dt attribute on the timestamp column.
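For example, a sketch that splits the timestamp into hour, day, month, and year columns:

```python
# Extract date/time parts from the launch timestamp via the .dt accessor.
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
```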

Then, we need to convert the categorical variables (category, currency, and country) into integers so our model can use the data, using scikit-learn's LabelEncoder.
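A sketch of the label encoding, keeping the encoded columns separate from the originals:

```python
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each categorical column.
encoded = ks[cat_features].apply(encoder.fit_transform)
```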

Now, we combine these encoded features with the other important columns into a new dataframe and use that to train a model.
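For instance, a sketch where the column selection follows the features listed above:

```python
# Join the numerical and timestamp features with the encoded categoricals and the target.
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
```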

Creating training, validation, and test splits
We need to create data sets for training, validation, and testing.
We’ll use 10% of the data as a validation set, 10% for testing, and the other 80% for training.

Note: we have to be careful that each data set has the same proportion of target classes.
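A simple fraction-based split sketch, assuming the rows are already in random order:

```python
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

# 80% train, 10% validation, 10% test.
train = data[:-2 * valid_size]
valid = data[-2 * valid_size:-valid_size]
test = data[-valid_size:]
```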

This looks good: each set has around 35% true outcomes, likely because the data was well randomized beforehand. If the class proportions were not consistent across the splits, we could instead create stratified splits with sklearn.model_selection.StratifiedShuffleSplit.
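A sketch of one stratified split (splitting the holdout further into validation and test would follow the same pattern):

```python
from sklearn.model_selection import StratifiedShuffleSplit

# One 80/20 split that preserves the proportion of outcome classes.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, holdout_idx = next(splitter.split(data, data['outcome']))
train, holdout = data.iloc[train_idx], data.iloc[holdout_idx]
```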
Training a LightGBM model
LightGBM is a tree-based gradient boosting model that typically performs very well on tabular data, often competitive with XGBoost. It's also relatively fast to train.
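A training sketch; the hyperparameters (num_leaves, num_boost_round, and the early-stopping setting) are illustrative choices, not tuned values:

```python
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc'}
bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])
```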

Making predictions & evaluating the model
We will make predictions on the test set with the model and see how well it performs.
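Using AUC as the metric, a sketch of the final evaluation looks like this:

```python
from sklearn.metrics import roc_auc_score

# Predict probabilities on the test set and compute AUC.
ypred = bst.predict(test[feature_cols])
score = roc_auc_score(test['outcome'], ypred)
print(f"Test AUC score: {score:.4f}")
```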

Note: An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.
Summary
To recap, we encoded the target from the state column, extracted date and time features from launched, label encoded the categorical columns, split the data into training, validation, and test sets, trained a LightGBM model, and evaluated it on the test set.



Count Encoding
Count encoding replaces each categorical value with the number of times it appears in the dataset. For example, if the value “GB” occurred 10 times in the country feature, then each “GB” would be replaced with the number 10.
We will use the category_encoders package to get this encoding. The encoder itself is available as CountEncoder. This encoder and the others in category_encoders work like scikit-learn transformers, with .fit and .transform methods.
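A sketch of count encoding the three categorical columns, fitting on the training split (the _count suffix is just an illustrative naming choice):

```python
import category_encoders as ce

cat_features = ['category', 'currency', 'country']

# Fit the count encoder on the training data, then add the encoded columns.
count_enc = ce.CountEncoder(cols=cat_features)
count_enc.fit(train[cat_features])
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))
```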

Target Encoding
Target encoding replaces a categorical value with the average value of the target for that value of the feature.
The category_encoders package provides TargetEncoder for target encoding. The implementation is similar to CountEncoder.
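A sketch; note that the encoder sees the target during fit, so it is fit on the training split only to avoid leaking validation and test labels:

```python
# Fit on training data with the target, then transform each split.
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
```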

CatBoost Encoding
This is similar to target encoding in that it's based on the target probability for a given value. However, with CatBoost encoding, the target probability for each row is calculated only from the rows before it.
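The usage follows the same fit/transform pattern as the other encoders, for example:

```python
# CatBoost encoding: same interface as the other encoders in category_encoders.
cb_enc = ce.CatBoostEncoder(cols=cat_features)
cb_enc.fit(train[cat_features], train['outcome'])
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))
```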

Interactions
One of the easiest ways to create new features is by combining categorical variables.
For example, if one record has the country "CA" and the category "Music", you can create a new value "CA_Music". Pandas lets us simply add string columns together like normal Python strings.

Then, label encode the interaction feature and add it to our data.
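A sketch of building and encoding the interaction (the category_country column name is an illustrative choice):

```python
from sklearn.preprocessing import LabelEncoder

# Combine country and category into one string feature, then label encode it.
interactions = ks['country'] + "_" + ks['category']
label_enc = LabelEncoder()
data = data.assign(category_country=label_enc.fit_transform(interactions))
```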

Number of projects in the last week
To count the number of projects launched in the preceding week for each record, we'll use the .rolling method on a series with the "launched" column as the index. We'll create the series, using ks.launched as the index and ks.index as the values, then sort the times. Using a time series as the index allows us to define the rolling window size in terms of hours, days, weeks, etc.

With a timeseries index, you can use .rolling to select time periods as the window. For example, launched.rolling('7d') creates a rolling window that contains all the data in the previous 7 days.
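A sketch of this rolling count (subtracting one so the current project is not counted in its own window):

```python
# Series of row ids indexed by launch time, sorted so the window is well defined.
launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()

# Number of other projects launched in the previous 7 days.
count_7_days = launched.rolling('7d').count() - 1
```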


Now that we have the counts, we need to adjust the index so we can join it with the other training data.
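For instance, putting the counts back on the original row index before joining (assuming data shares the index of ks):

```python
# Switch the index from launch time back to the original row ids, then align and join.
count_7_days.index = launched.values
count_7_days = count_7_days.reindex(ks.index)
data = data.join(count_7_days)
```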


Time since the last project in the same category
A handy method for performing operations within groups is to use .groupby followed by .transform. The .transform method takes a function and passes a series or dataframe to that function for each group, returning a dataframe with the same indices as the original. In our case, we'll group by "category" and use transform to calculate the time differences within each category.

We get NaNs here for projects that are the first in their category. We'll need to fill those in with something like the mean or median. We'll also need to reset the index so we can join it with the other data.
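A sketch of the whole step, filling the missing values with the median and realigning the index:

```python
def time_since_last_project(series):
    # Hours between consecutive launches within a group.
    return series.diff().dt.total_seconds() / 3600.

df = ks[['category', 'launched']].sort_values('launched')

# Within each category, compute the gap since the previous launch.
timedeltas = df.groupby('category').transform(time_since_last_project)

# The first project in each category has no predecessor: fill with the median,
# then restore the original row order so it can be joined with `data`.
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(data.index)
```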

Transforming numerical features
Some models work better when numerical features are less skewed. Common transformations for this are the square root and natural logarithm; they can also help constrain outliers.
Here we’ll transform the goal feature using the square root and log functions, then fit a model to see if it helps.
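A sketch of the two transformed versions of the goal column (log1p is used so a goal of zero is handled gracefully):

```python
import numpy as np

# Square-root and log transforms of the funding goal.
data = data.assign(goal_sqrt=np.sqrt(ks['goal']),
                   goal_log=np.log1p(ks['goal']))
```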


Note: The log transformation won't help our model, since tree-based models are invariant to monotonic transformations of the features. However, it could help if we were using a linear model or neural network.
Other transformations include squares and other powers, exponentials, and so on. These might help the model discriminate, similar to how the kernel trick works for SVMs. Again, it takes a bit of experimentation to see what works. One method is to create a bunch of new features and later choose the best ones with feature selection algorithms.
Summary
In this part we created new features: categorical interactions, the count of projects launched in the previous week, the time since the last project in the same category, and transformed versions of numerical features.

Univariate Feature Selection
From the scikit-learn feature selection module, feature_selection.SelectKBest returns the K best features given some scoring function. For our classification problem, the module provides three different scoring functions: χ², the ANOVA F-value, and the mutual information score. The F-value measures the linear dependency between the feature variable and the target.
With SelectKBest, we define the number of features to keep based on the score from the scoring function. Using .fit_transform(features, target), we get back an array with only the selected features.
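A sketch keeping the five highest-scoring features under the F-value criterion (k=5 is an arbitrary illustrative choice, and the selector is fit on the training split only):

```python
from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = data.columns.drop('outcome')

# Keep the 5 features with the highest ANOVA F-value.
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(train[feature_cols], train['outcome'])
```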


To get back the original columns, we can use .inverse_transform, which returns an array with the shape of the original data in which the dropped columns are filled with zeros. We can then find the selected columns by choosing the features whose variance is non-zero.
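For example:

```python
# Map the selected features back to named columns; dropped columns come back as zeros.
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)

# Columns with non-zero variance are the ones SelectKBest kept.
selected_columns = selected_features.columns[selected_features.var() != 0]
```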

L1 regularization
Univariate methods consider only one feature at a time when making a selection decision. Instead, we can make our selection using all of the features by including them in a linear model with L1 regularization.
For regression problems you can use sklearn.linear_model.Lasso, or sklearn.linear_model.LogisticRegression for classification. These can be used along with sklearn.feature_selection.SelectFromModel to select the non-zero coefficients. Otherwise, the code is similar to the univariate tests.

Similar to the univariate tests, we get back an array with the selected features. Again, we will want to convert these to a DataFrame so we can get the selected columns.
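A sketch using an L1-penalized logistic regression (the regularization strength C=1 and the liblinear solver are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Fit an L1-regularized logistic regression; smaller C drops more features.
logistic = LogisticRegression(C=1, penalty="l1", solver="liblinear", random_state=7)
logistic.fit(train[feature_cols], train['outcome'])

# Keep only the features with non-zero coefficients.
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(train[feature_cols])

# Convert back to a DataFrame to recover the selected column names.
selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]
```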

In this case, with C=1, we're dropping the time_since_last_project column.

Thanks to Kaggle.