To start learning Pandas, we first need to install Pandas and then Jupyter Notebook.
The first step in any notebook or script is to import pandas:
import pandas as pd
Creating Data
There are two core objects in pandas: the DataFrame and the Series.
DataFrame:
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value.
Each entry corresponds to a row (or record) and a column. DataFrame entries are not limited to integers; they can be strings too.

We can modify the labels of the rows and/or columns. The list of row labels used in a DataFrame is known as an Index.
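As a minimal sketch of the ideas above, the following builds a small DataFrame with string entries and custom row labels. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two reviewers commenting on two products.
df = pd.DataFrame(
    {
        "Bob": ["I liked it.", "It was awful."],
        "Sue": ["Pretty good.", "Bland."],
    },
    index=["Product A", "Product B"],  # custom row labels (the Index)
)
print(df)
```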

Series:
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list (in effect, a single column of a DataFrame).
We can assign row labels to a Series using an index, just as with a DataFrame. However, a Series does not have a column name; it only has one overall name.
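A short sketch of a Series with row labels and an overall name. The values, labels, and name here are invented for the example.

```python
import pandas as pd

# Hypothetical yearly sales figures for a single product.
s = pd.Series(
    [30, 35, 40],
    index=["2015 Sales", "2016 Sales", "2017 Sales"],  # row labels
    name="Product A",  # the one overall name of the Series
)
print(s)
```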

Reading data files
Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file.
We’ll use the pd.read_csv() function to read the data into a DataFrame.

We can use the shape attribute to check how large the resulting DataFrame is.
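A minimal sketch of reading CSV data and checking its size. To keep the example self-contained it reads from an in-memory string with made-up rows; in practice you would pass the path of your own file to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """country,points,price
Italy,87,
Portugal,87,15.0
US,90,14.0
"""

reviews = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
print(reviews.shape)
```

The empty price field in the first row is read as a missing value (NaN).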

Native accessors
Native Python objects provide good ways of indexing data. We can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.
Now, suppose we have a DataFrame named “reviews” that holds wine reviews, with a “country” column among others. To access that column as an attribute, we write reviews.country.
We can also access a column with the Python indexing operator “[]”: reviews['country'] returns the same Series, and reviews['country'][0] returns its first value.
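A small sketch of both access styles. The reviews DataFrame here is a made-up stand-in for the wine reviews data.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]}
)

by_attribute = reviews.country        # attribute-style access
by_key = reviews["country"]           # indexing-operator access
first_country = reviews["country"][0]  # a single value from the column

print(first_country)
```

The indexing operator also works for column names that are not valid Python identifiers (for example, names containing spaces), which attribute access cannot handle.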


Indexing in pandas
pandas has its own accessor operators, “loc” and “iloc”. For more advanced operations, these are the ones you’re supposed to be using.
Index-based selection:
The “iloc” operator selects data based on its numerical position in the data.
Note: iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9.
Here are some examples of this kind of access:
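A sketch of position-based selection with iloc, using a made-up four-row reviews DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "France"],
        "points": [87, 87, 90, 95],
    }
)

first_row = reviews.iloc[0]         # first row, returned as a Series
first_column = reviews.iloc[:, 0]   # every row of the first column
first_three = reviews.iloc[:3, 0]   # positions 0, 1, 2 (3 is excluded)
last_two = reviews.iloc[-2:]        # negative positions count from the end

print(first_three)
```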



Label-based selection:
The “loc” operator selects data by its index value (its label), not its position.
Note: loc indexes inclusively. So 0:10 will select entries 0,...,10.
Here are some examples of this type of access:
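A sketch of label-based selection with loc on a made-up reviews DataFrame. The comparison at the end illustrates the inclusive versus exclusive slicing difference between loc and iloc.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "France"],
        "points": [87, 87, 90, 95],
    }
)

first_country = reviews.loc[0, "country"]          # row label 0, column "country"
subset = reviews.loc[0:2, ["country", "points"]]   # row labels 0, 1 AND 2

# The same slice with iloc stops before position 2.
print(len(subset), len(reviews.iloc[0:2]))
```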


Manipulating the index
The set_index() method can be used to manipulate the index. Below is one example of it.
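A minimal sketch of set_index(), assuming a hypothetical title column that makes a better row label than the default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical data; the "title" column is an assumption for this example.
reviews = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [87, 90]})

# set_index() returns a new DataFrame that uses "title" as its row labels.
indexed = reviews.set_index("title")

print(indexed.loc["Wine B", "points"])
```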

Conditional selection
Let’s explain this section with an example. Suppose that we’re interested specifically in better-than-average wines produced in Italy.
We can start by checking whether each wine is Italian or not:

This result can then be used inside of “loc” to select the relevant data:

We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.
We can use the ampersand (&) to bring the two questions together:
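The three steps described in this section can be sketched as follows, on a made-up reviews DataFrame. Note the parentheses around each condition, which are needed because & binds more tightly than == and >=.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "France", "US"],
        "points": [92, 85, 95, 90],
    }
)

# Step 1: a boolean Series, True where the wine is Italian.
is_italian = reviews.country == "Italy"

# Step 2: use that Series inside loc to keep only the matching rows.
italian = reviews.loc[is_italian]

# Step 3: combine two conditions with &.
good_italian = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]

print(good_italian)
```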

Assigning data
Assigning data to a DataFrame is easy. There are some examples of it below:
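A sketch of two common kinds of assignment: a constant value for every row, and an iterable with one value per row. The column names critic and index_backwards are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

# Assign the same constant to every row of a new column.
reviews["critic"] = "everyone"

# Assign an iterable whose length matches the number of rows.
reviews["index_backwards"] = range(len(reviews), 0, -1)

print(reviews)
```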


Summary functions
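Summary functions restructure the data in some useful way.
For example, the describe() method generates a high-level summary of a column.
The mean() function returns the average of a numeric column.
The unique() function returns the distinct values in a column.
The value_counts() method returns each distinct value together with how often it occurs.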
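To make these summary functions concrete, here is a small sketch on a made-up reviews DataFrame. The taster_name column and its values are assumptions for the example.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "taster_name": ["Ann", "Bob", "Ann", "Ann"],
        "points": [80, 90, 100, 90],
    }
)

# High-level summary (count, mean, std, min, quartiles, max).
print(reviews.points.describe())

mean_points = reviews.points.mean()           # average score
tasters = reviews.taster_name.unique()        # distinct values
counts = reviews.taster_name.value_counts()   # distinct values with frequencies

print(mean_points)
print(counts)
```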

Maps
A map is a function that takes one set of values and “maps” them to another set of values. In data science we often need to create new representations from existing data, or to transform data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!
There are two mapping methods that you will use often: map() and apply().
map(): takes a function and applies it to every value in a Series. For example, suppose we want to remean the scores the wines received to 0, that is, subtract the mean score from every score.

apply(): the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

NOTE: map() and apply() return new, transformed Series and DataFrames, respectively. They don’t modify the original data they’re called on.
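A sketch of both mapping methods applied to the remeaning task, on a made-up reviews DataFrame. The helper name remean_points is hypothetical, and it copies the row before changing it so that the original data is clearly left alone.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [85, 90, 95]}
)

review_points_mean = reviews.points.mean()

# map(): transform each value of a single Series.
centered = reviews.points.map(lambda p: p - review_points_mean)


def remean_points(row):
    """Return a copy of the row with its points centered on the mean."""
    row = row.copy()
    row["points"] = row["points"] - review_points_mean
    return row


# apply(): call the function on every row of the DataFrame.
centered_df = reviews.apply(remean_points, axis="columns")

print(centered_df)
```

For a simple subtraction like this one, writing reviews.points - review_points_mean directly is shorter and faster; map() and apply() are more useful when the transformation cannot be expressed with built-in operators.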
Groupwise analysis
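One important function we have already seen is “value_counts()”, which is just a shortcut for a “groupby()” operation: grouping the reviews by points and counting the entries in each group produces the same counts.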
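A minimal sketch of that equivalence on made-up data. The two results hold the same counts, although value_counts() orders them by frequency while groupby() orders them by the group label.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame({"points": [87, 87, 90, 90, 90]})

# Group rows that share a points value, then count the entries in each group.
by_groupby = reviews.groupby("points")["points"].count()

# The shortcut.
by_value_counts = reviews["points"].value_counts()

print(by_groupby)
print(by_value_counts)
```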

Here, groupby() created a group of reviews which allotted the same point values to the given wines. Then, for each of these groups, we grabbed the points column and counted how many times it appeared.
For example, to get the cheapest wine in each point value category, we can do the following:
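A sketch of the cheapest price per point value, using made-up points and price values.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "points": [87, 87, 90, 90, 90],
        "price": [15.0, 12.0, 30.0, 25.0, 40.0],
    }
)

# For each points group, take the smallest price.
min_price = reviews.groupby("points")["price"].min()

print(min_price)
```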

By using “apply()” we can then manipulate the data in any way we see fit. For example, here’s one way of selecting the name of the first wine reviewed from each winery in the dataset:
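A sketch of that idea on made-up data. The winery and title columns are assumed to exist; selecting [["title"]] before apply() passes only that column to the function, which is one way to keep the example working the same across pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "winery": ["A", "A", "B"],
        "title": ["A Red 2010", "A White 2011", "B Rose 2012"],
    }
)

# Each group arrives as a small DataFrame; return the first title in it.
first_titles = reviews.groupby("winery")[["title"]].apply(
    lambda group: group["title"].iloc[0]
)

print(first_titles)
```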

Multi-indexes
A multi-index differs from a regular index in that it has multiple levels. For example:
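One common way a multi-index arises is grouping by more than one column. The sketch below uses made-up data; the name countries_reviewed and the country and province columns are assumptions for the example.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "US", "US", "US", "US"],
        "province": ["Tuscany", "Tuscany", "California",
                     "California", "California", "Oregon"],
        "points": [88, 91, 90, 92, 87, 89],
    }
)

# Grouping by two columns yields a two-level (country, province) index.
countries_reviewed = reviews.groupby(["country", "province"])["points"].agg(["count"])

print(countries_reviewed)
print(type(countries_reviewed.index))
```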


However, in general, the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:
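A sketch of reset_index() turning the levels of a multi-index back into ordinary columns. The data and the name countries_reviewed are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "US"],
        "province": ["Tuscany", "Tuscany", "Oregon"],
        "points": [88, 91, 89],
    }
)

countries_reviewed = reviews.groupby(["country", "province"])["points"].agg(["count"])

# The index levels become regular columns and the index becomes 0, 1, 2, ...
flat = countries_reviewed.reset_index()

print(flat)
```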

Sorting
Looking again at countries_reviewed, we can see that grouping returns data in index order, not in value order. To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this.

sort_values() defaults to an ascending sort, where the lowest values go first. However, we can change that order to descending by passing ascending=False:
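A sketch of both sort orders on a small made-up countries_reviewed table, built the same way as a grouped count of reviews per country and province.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "US", "US", "US", "US"],
        "province": ["Tuscany", "Tuscany", "California",
                     "California", "California", "Oregon"],
        "points": [88, 91, 90, 92, 87, 89],
    }
)

countries_reviewed = (
    reviews.groupby(["country", "province"])["points"].agg(["count"]).reset_index()
)

# Lowest counts first (the default).
ascending = countries_reviewed.sort_values(by="count")

# Highest counts first.
descending = countries_reviewed.sort_values(by="count", ascending=False)

print(descending)
```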

Dtypes
The data type for a column in a DataFrame or a Series is known as the dtype.
We use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the price column in the reviews DataFrame:
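A minimal sketch, with a made-up reviews DataFrame whose price column holds floating-point numbers.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame(
    {"country": ["Italy", "US"], "points": [87, 90], "price": [15.0, 14.0]}
)

price_dtype = reviews.price.dtype
print(price_dtype)
```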



Missing data
Background knowledge: entries with missing values are given the value NaN, short for “Not a Number”. For technical reasons these NaN values are always of the float64 dtype.
Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()).

Replacing missing values is a common operation, and “fillna()” is the method for it.
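A sketch of selecting and replacing missing entries on made-up data. The region column and the "Unknown" placeholder are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset, with gaps.
reviews = pd.DataFrame(
    {
        "country": ["Italy", None, "US"],
        "region": ["Etna", np.nan, "Napa"],
    }
)

# Rows where the country is missing.
missing_country = reviews[pd.isnull(reviews.country)]

# A new Series with every missing region replaced by a placeholder.
filled = reviews.region.fillna("Unknown")

print(missing_country)
print(filled)
```

fillna() returns a new Series here, so the gaps in reviews itself are still present afterwards.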

Renaming
The rename() function lets you change index names and/or column names.
For example, to change the points column in our dataset to score, we would do:
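A minimal sketch of that rename on a made-up reviews DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 90]})

# Map old column names to new ones; a new DataFrame is returned.
renamed = reviews.rename(columns={"points": "score"})

print(renamed.columns.tolist())
```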

Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names.
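A sketch of rename_axis() naming both axes. The names wines and fields are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews dataset.
reviews = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 90]})

# Name the row index "wines" and the column index "fields".
labeled = reviews.rename_axis("wines", axis="rows").rename_axis(
    "fields", axis="columns"
)

print(labeled)
```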

Combining
Pandas has three core methods for combining data from different DataFrames and/or Series. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.
The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.
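A sketch of concat() stacking two DataFrames that share the same columns. The two tables and their contents are made up for the example.

```python
import pandas as pd

# Hypothetical data: the same kind of records from two sources.
canadian = pd.DataFrame({"title": ["Video A"], "views": [10]})
british = pd.DataFrame({"title": ["Video B"], "views": [20]})

# Stack the rows of both tables (axis 0 is the default).
combined = pd.concat([canadian, british])

print(combined)
```

Each input keeps its own row labels, so the combined index here contains 0 twice; passing ignore_index=True to pd.concat() renumbers the rows instead.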

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common.
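A sketch of join() on two made-up DataFrames that share index labels. Because both have a views column, the lsuffix and rsuffix parameters are needed to tell the two apart; the suffixes chosen here are arbitrary.

```python
import pandas as pd

# Hypothetical data indexed by video title.
left = pd.DataFrame({"views": [10, 20]}, index=["Video A", "Video B"])
right = pd.DataFrame({"views": [5, 7]}, index=["Video A", "Video C"])

# Default is a left join: keep every row of `left`, match on the index.
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

print(joined)
```

"Video B" has no match in the right table, so its views_UK entry is NaN.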

Thanks to Kaggle.