Things to Know about Machine Learning

Tutorial: Doc2Vec and t-SNE

February 15, 2019

This post is a tutorial on using doc2vec and t-SNE visualization in Python for disease clustering. Of course, the tutorial code can be used for any other types of … Read More

Data Science Statistical Significance Testing

Type I and Type II Errors

February 8, 2019

Two types of error are possible in a hypothesis test: Type I and Type II errors. A Type I error, also known as a “false positive”, is the error of rejecting … Read More

Data Science Statistical Significance Testing

A statistical hypothesis is an assumption about the parameters describing a population (not a sample). This assumption may be true or false. Hypothesis tests (also called significance tests) are to … Read More

Data Science Visualization

Bar Chart / Stacked Bar Chart

January 27, 2019

First of all, let’s take a look at the differences between a bar chart and a histogram. They look very similar, but they are different. Bar charts compare discrete variables while … Read More

Data Preprocessing Data Science Machine Learning

Feature Scaling

January 25, 2019

Feature scaling is one of the most important feature engineering steps for many machine learning algorithms (decision trees do not necessarily need feature scaling). Most of the algorithms require similar scales for numerical … Read More

Data Analysis Examples Data Science Machine Learning NLP

Analysis of Text: Tolstoy’s War and Peace

January 25, 2019

In this post, we will look at an example of text analysis using Tolstoy’s War and Peace. You can download a plain text file from http://www.gutenberg.org/ebooks/2600. … Read More

Deep Learning RNN

Introduction to Recurrent Neural Networks

January 21, 2019

Recurrent Neural Networks (RNNs) are a class of neural networks for modeling sequential data such as stock prices, an audio clip, a DNA sequence, a sequence of video frames, a … Read More

Ensemble Learning Machine Learning

Random Forests

January 15, 2019

Random forests are an ensemble method (generally bagging) that constructs multiple decision trees on random samples drawn with replacement from the training set. It is a supervised learning algorithm and can be … Read More

Data Science Machine Learning NLP

A General Approach to Preprocessing Text Data

January 13, 2019

You can find the whole process, from preprocessing to visualization, in another post, Analysis of Text: Tolstoy’s War and Peace, if you are interested in looking at the whole source … Read More

Data Science Visualization

Box Plot

January 11, 2019

A box plot (also called a box-and-whisker plot) is useful for visualizing the distribution of data and finding outliers. This plot displays the five-number summary: minimum, first quartile, median, third … Read More

Posts

Gradient Descent

Batch vs. Online Learning

Knowledge Graph Completion

What are Embeddings?

Batch Normalization

Gradient Boosting

The Normal Equation

Handling Categorical Values

How to Deal with Missing Values

Underfitting vs. Overfitting