Pandas

framework
python
data science
Pandas is a Python library for data wrangling and data cleaning.
Basic concept
Each observation is in its own row (and preferably has an index-value). Each feature/variable is in its own column and has a unique name.
Data Frames
import pandas as pd  # numpy_array is assumed to be an existing 3x3 NumPy array
my_df = pd.DataFrame(data=numpy_array, index=[0,1,2], columns=["col1", "col2", "col3"])
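As a minimal sketch of the constructor above, the snippet below builds a small frame from a NumPy array. The array contents are made up for illustration; only the shape (3 rows, 3 columns) has to match the index and column lists.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 array standing in for numpy_array.
numpy_array = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])

# One row per observation, one named column per variable.
my_df = pd.DataFrame(data=numpy_array,
                     index=[0, 1, 2],
                     columns=["col1", "col2", "col3"])

print(my_df.shape)          # (number of rows, number of columns)
print(list(my_df.columns))  # the column names
```

If the lengths of index or columns do not match the array's shape, pandas raises a ValueError.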

Reshaping data

From multiple columns to two columns
pd.melt(my_df)
The result has two columns: "variable" holds the former column names and "value" holds the corresponding values.
Spread rows into columns
my_df.pivot(columns='col1', values='col2')
The distinct values of col1 become the new column names, and the values of col2 fill the corresponding columns.
Append one DataFrame to another
pd.concat([my_df1, my_df2]) # below / vertically
pd.concat([my_df1, my_df2], axis=1) # to the right / horizontally
Reset the index to row numbers
my_df.reset_index() # returns a new DataFrame; the old index is kept as a column, pass drop=True to discard it
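The reshaping calls above can be tried end to end on a tiny table. This is only a sketch: the frame, the column names ("city", "year", "temp") and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({"city": ["Oslo", "Rome"],
                     "2020": [1.0, 2.0],
                     "2021": [3.0, 4.0]})

# Wide -> long: "city" is kept as an identifier column; the remaining
# column names go into "year" and their cell values into "temp".
long_df = pd.melt(wide, id_vars="city", var_name="year", value_name="temp")

# Long -> wide: the distinct values of "year" become the column names.
wide_again = long_df.pivot(index="city", columns="year", values="temp")

# Stacking vertically repeats the original index labels (0, 1, 0, 1) ...
stacked = pd.concat([wide, wide])
# ... so reset_index(drop=True) renumbers the rows 0..n-1.
stacked = stacked.reset_index(drop=True)

print(long_df)
print(wide_again)
print(stacked)
```

Without var_name and value_name, melt falls back to the default column names "variable" and "value".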

Slice DataFrames

First and last rows
my_df.head(n=10) 
my_df.tail(n=10)
Sample several rows
my_df.sample(n=10)
Select specific rows by position (the end of the range is exclusive)
my_df.iloc[10:20]
Select specific columns
my_df["col1"]
my_df[["col1", "col2"]]
Select rows based on condition
my_df[my_df['col1'] > 42]
Select rows based on a list of values in a column
my_df[my_df['col1'].isin(['value1', 'value2'])]
Remove duplicate rows
my_df.drop_duplicates()
Select rows excluding certain values in a column
my_df[~my_df['col1'].isin(['value1', 'value2'])]
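To see what the selections above return, here is a small sketch on a made-up frame. The data is hypothetical; the calls are the ones listed in this section.

```python
import pandas as pd

# Hypothetical data; rows 1 and 2 are exact duplicates.
my_df = pd.DataFrame({"col1": [10, 50, 50, 99],
                      "col2": ["a", "b", "b", "c"]})

first_two = my_df.iloc[0:2]                # rows at positions 0 and 1
col_as_series = my_df["col1"]              # single brackets give a Series
cols_as_frame = my_df[["col1", "col2"]]    # a list of names gives a DataFrame

large = my_df[my_df["col1"] > 42]                      # boolean mask
in_list = my_df[my_df["col2"].isin(["a", "c"])]        # keep listed values
not_in_list = my_df[~my_df["col2"].isin(["a", "c"])]   # ~ inverts the mask
unique_rows = my_df.drop_duplicates()                  # keeps the first of each duplicate

print(large)
print(not_in_list)
print(unique_rows)
```

Each of these returns a new object and leaves my_df unchanged.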

Data Manipulation

Replace values in a column
my_df['col1'] = my_df['col1'].replace(['old_value1', 'old_value2'], ['new_value1', 'new_value2'])
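A short sketch of the replacement above on invented data. The two lists are matched by position, and values that are not listed are left as they are. The dict form shown at the end is an alternative spelling, not something the original note uses.

```python
import pandas as pd

# Hypothetical column with two values to replace and one to keep.
my_df = pd.DataFrame({"col1": ["old_value1", "keep_me", "old_value2"]})

# List form: the i-th old value is mapped to the i-th new value.
my_df["col1"] = my_df["col1"].replace(["old_value1", "old_value2"],
                                      ["new_value1", "new_value2"])

# Equivalent dict form, applied here to a fresh copy of the same data.
other = pd.Series(["old_value1", "keep_me", "old_value2"])
other = other.replace({"old_value1": "new_value1",
                       "old_value2": "new_value2"})

print(my_df)
```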

Data Aggregation

Group by a column and calculate mean of another column
my_df.groupby('col1')['col2'].mean()
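As an illustration of the aggregation above, the sketch below groups a made-up frame by col1 and averages col2. The agg call with several functions is an optional extension, added here only to show that more than one statistic can be computed per group.

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
my_df = pd.DataFrame({"col1": ["a", "a", "b"],
                      "col2": [1.0, 3.0, 10.0]})

# Result is a Series indexed by the distinct values of col1.
means = my_df.groupby("col1")["col2"].mean()

# Several statistics at once give a DataFrame with one column per function.
summary = my_df.groupby("col1")["col2"].agg(["mean", "count"])

print(means)
print(summary)
```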

Data Visualization

Plot a histogram of a column (plotting requires matplotlib to be installed)
my_df['col1'].plot.hist()
Plot a scatter plot of two columns
my_df.plot.scatter(x='col1', y='col2')