Pandas

framework

python

data science

Pandas is a Python library for data wrangling and data cleaning.

Basic concept

Each observation is in its own row (preferably with an index value). Each feature/variable is in its own column and has a unique name.

Data Frames

import pandas as pd
my_df = pd.DataFrame(data=numpy_array, index=[0, 1, 2], columns=["col1", "col2", "col3"])
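
A minimal runnable sketch of the constructor call above. The 3x3 input array is made up for illustration; any two-dimensional array whose shape matches the index and column lists should work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical input: a 3x3 array holding the integers 0..8.
numpy_array = np.arange(9).reshape(3, 3)

my_df = pd.DataFrame(
    data=numpy_array,
    index=[0, 1, 2],                   # one label per row
    columns=["col1", "col2", "col3"],  # one unique name per column
)

print(my_df)
```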

Reshaping data

From multiple columns to two columns

pd.melt(my_df)

The result has two columns: "variable" holds the former column names and "value" holds the values.
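
A small sketch of melting, assuming a two-column frame invented for illustration. With default arguments, pd.melt names the output columns variable and value.

```python
import pandas as pd

# Hypothetical wide frame with two columns.
wide_df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Every (column name, cell value) pair becomes one row.
long_df = pd.melt(wide_df)

print(long_df)
```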

Spread rows into columns

my_df.pivot(columns='col1', values='col2')

The distinct values of col1 become the new column names, and the values of col2 fill the corresponding columns.
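
A sketch of pivoting, assuming a long-format frame invented for illustration. The extra row_id column and the index argument are additions that are not in the call above; they tell pandas which rows belong together, so the result has no empty cells.

```python
import pandas as pd

# Hypothetical long-format data: two observations, two measurements each.
long_df = pd.DataFrame({
    "row_id": [0, 0, 1, 1],
    "col1": ["a", "b", "a", "b"],
    "col2": [1, 2, 3, 4],
})

# Distinct values of col1 ("a", "b") become columns, filled from col2.
wide_df = long_df.pivot(index="row_id", columns="col1", values="col2")

print(wide_df)
```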

Append one DataFrame to another

pd.concat([my_df1, my_df2]) # below / vertically
pd.concat([my_df1, my_df2], axis=1) # to the right / horizontally
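
A short sketch of both directions, using small frames invented for illustration. Vertical concatenation keeps the original row labels, so they can repeat; passing ignore_index=True is one way to renumber them.

```python
import pandas as pd

# Hypothetical input frames.
my_df1 = pd.DataFrame({"col1": [1, 2]})
my_df2 = pd.DataFrame({"col1": [3, 4]})
my_df3 = pd.DataFrame({"col2": [5, 6]})

stacked = pd.concat([my_df1, my_df2])                        # row labels 0, 1, 0, 1
renumbered = pd.concat([my_df1, my_df2], ignore_index=True)  # row labels 0, 1, 2, 3
side_by_side = pd.concat([my_df1, my_df3], axis=1)           # columns col1, col2

print(stacked)
print(renumbered)
print(side_by_side)
```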

Reset the index to row numbers

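my_df.reset_index() # returns a new DataFrame; the old index becomes a column
my_df.reset_index(drop=True) # same, but the old index is discarded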
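
A sketch of what reset_index does to the old index, assuming a frame with string labels invented for illustration. The call returns a new frame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical frame with non-numeric row labels.
my_df = pd.DataFrame({"col1": [10, 20, 30]}, index=["a", "b", "c"])

kept = my_df.reset_index()              # old labels move into a column named "index"
dropped = my_df.reset_index(drop=True)  # old labels are thrown away

print(kept)
print(dropped)
```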

First and last rows

my_df.head(n=10) # first 10 rows
my_df.tail(n=10) # last 10 rows

Sample several rows

my_df.sample(n=10)
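
Sampling is random, so repeated calls usually return different rows. A sketch of making it reproducible with the random_state parameter, on a frame invented for illustration:

```python
import pandas as pd

# Hypothetical frame with 100 rows.
my_df = pd.DataFrame({"col1": range(100)})

# The same seed yields the same sample on every call.
sample_a = my_df.sample(n=10, random_state=0)
sample_b = my_df.sample(n=10, random_state=0)

print(sample_a.equals(sample_b))
```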

Select specific rows by position

my_df.iloc[10:20] # rows at positions 10 through 19
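
A sketch of positional slicing, assuming a 30-row frame invented for illustration. As with Python lists, the end position is excluded.

```python
import pandas as pd

# Hypothetical frame: 30 rows holding the values 100..129.
my_df = pd.DataFrame({"col1": range(100, 130)})

# Positions 10, 11, ..., 19 (position 20 is not included).
subset = my_df.iloc[10:20]

print(subset)
```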

Select specific columns

my_df["col1"]
my_df[["col1", "col2"]]
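
The two forms return different types: a single name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry. A sketch on a frame invented for illustration:

```python
import pandas as pd

# Hypothetical three-column frame.
my_df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

one_column = my_df["col1"]             # Series
two_columns = my_df[["col1", "col2"]]  # DataFrame
one_as_frame = my_df[["col1"]]         # DataFrame with a single column

print(type(one_column).__name__, type(two_columns).__name__, type(one_as_frame).__name__)
```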

Select rows based on condition

my_df[my_df['col1'] > 42]
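
The comparison inside the brackets produces a boolean Series (a mask), and indexing with it keeps only the rows where the mask is True. A sketch on data invented for illustration; combining conditions with & and parentheses is an addition to the example above.

```python
import pandas as pd

# Hypothetical data.
my_df = pd.DataFrame({"col1": [10, 50, 99], "col2": ["x", "y", "z"]})

mask = my_df["col1"] > 42  # False, True, True
selected = my_df[mask]

# Each condition needs its own parentheses; use & for "and", | for "or".
both = my_df[(my_df["col1"] > 42) & (my_df["col2"] == "z")]

print(selected)
print(both)
```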

Select rows based on a list of values in a column

my_df[my_df['col1'].isin(['value1', 'value2'])]

Remove duplicate rows

my_df.drop_duplicates()
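
By default a row counts as a duplicate only if every column matches, and the first occurrence is kept. A sketch on data invented for illustration, including the subset and keep parameters as one way to change that behaviour:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical.
my_df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["a", "a", "b"]})

unique_rows = my_df.drop_duplicates()

# Compare only col1, and keep the last occurrence instead of the first.
unique_col1 = my_df.drop_duplicates(subset=["col1"], keep="last")

print(unique_rows)
print(unique_col1)
```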

Select rows excluding certain values in a column

my_df[~my_df['col1'].isin(['value1', 'value2'])]
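
The ~ operator inverts a boolean mask, so the same isin test can drive both inclusion and exclusion. A sketch on data invented for illustration:

```python
import pandas as pd

# Hypothetical data.
my_df = pd.DataFrame({"col1": ["value1", "value2", "value3", "value1"]})
wanted = ["value1", "value2"]

in_list = my_df["col1"].isin(wanted)  # True, True, False, True
included = my_df[in_list]
excluded = my_df[~in_list]            # rows whose col1 is NOT in the list

print(included)
print(excluded)
```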

Replace values in a column

my_df['col1'] = my_df['col1'].replace(['old_value1', 'old_value2'], ['new_value1', 'new_value2'])
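
The two-list form pairs old and new values by position. A sketch of an equivalent dictionary form, which keeps each old value next to its replacement; the data is invented for illustration. Values that are not listed stay unchanged.

```python
import pandas as pd

# Hypothetical data.
my_df = pd.DataFrame({"col1": ["old_value1", "old_value2", "other"]})

mapping = {"old_value1": "new_value1", "old_value2": "new_value2"}
my_df["col1"] = my_df["col1"].replace(mapping)

print(my_df)
```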

Group by a column and calculate mean of another column

my_df.groupby('col1')['col2'].mean()
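
The result is a Series indexed by the distinct values of the grouping column. A sketch on data invented for illustration:

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
my_df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1.0, 3.0, 10.0]})

means = my_df.groupby("col1")["col2"].mean()

print(means)
```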

Plot a histogram of a column

my_df['col1'].plot.hist()

Plot a scatter plot of two columns

my_df.plot.scatter(x='col1', y='col2')