import pandas as pd \

quote from Wed McKinzey book.

Easy Questions

create a dataframe with uniform and normal…

axis=0 is columns
`data.isnull().sum() is null generates a boolean dataframe of same shape sum is an aggregate method which sums across the default axis of columns (axis=1) rows are axis=0 results in a pandas series (1 dimensional dataframe)

What is a `DataFrame` object?

Seeing the actual data is always important. What are some methods to get a quick sight of your raw data?

.head() .info()

DataFrame Creation: How would you create a DataFrame from a CSV file in Pandas?

The following is the first 5 rows of a dataset `alex-boba.csv`. Alex visits various boba tea shops, sampling drinks with diverse characteristics such as price, volume, sugar level, and whether they are organic, and subsequently rates each drink on a scale of 1 to 10.

Price ($)	Volume (ml)	Sugar Level (%)	Organic	Rating (out of 10)
4.74	399	1.9	No	10
5.58	477	61.8	Yes	4
5.01	543	61.2	No	1
4.72	585	61.7	Yes	6
4.12	447	94.4	Yes	1

↳ How would you load in the dataset in `pandas`, and then see a view of the dataset like in the table above in order to validate it?

boba = pd.read_csv('alex-boba.csv')
boba.head() # 5 by default

↳ Alex is a pretty careful guy, but just in case, you want to validate that their are no missing entries in his boba dataset. How can you do this?

boba.info()

This displays a summary of the boba dataset. It lists each column, along with the non-null count, and the datatype. The output might look something like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Price ($)          20 non-null     float64
 1   Volume (ml)        20 non-null     int64  
 2   Sugar Level (%)    20 non-null     float64
 3   Organic            20 non-null     object 
 4   Rating (out of 10) 20 non-null     int64  
 
dtypes: float64(2), int64(2), object(1)
memory usage: 928.0+ bytes

↳ Read the output above. Are there any nulls in the dataset? What is the datatype of values in the “Organic” field?

There are no nulls, since every column has a non-null count of 20.
The ‘Organic’ field has an ‘object’ datatype, which typically corresponds to strings or mixed types.
The other variables are float64 and int64. Read here if you don’t know what that means.

↳ What does `boba.isnull().sum()` do? What would the expected output for this dataset look like, given that there are no null values?

The expression boba.isnull().sum() performs two operations on the boba DataFrame:

boba.isnull(): This creates a new DataFrame of the same shape as boba, where each element is True if the corresponding element in boba is a missing value (NaN or None), and False otherwise.
.sum(): This method then sums these True (treated as 1) and False (treated as 0) values for each column. Essentially, it counts the number of missing values (NaN or None) in each column.

Given that there are no null values in the dataset, the expected output would be a Series where each entry corresponds to a column in the boba DataFrame, and the value for each column would be 0, indicating that there are no missing values in any column. The output would look something like this:

Price ($)            0
Volume (ml)          0
Sugar Level (%)      0
Organic              0
Rating (out of 10)   0
dtype: int64

↳ In `pandas`, what is the default axis for methods such as `.sum()`. What even is an axis in `pandas`?

In Pandas, the default axis for the .sum() method is axis=0.
This means that the method will sum the data along the vertical axis, or in other words, it will sum data down rows for each column.
When applied to a DataFrame, it returns the sum of each column.

Okay, now that the `boba` dataset is all loaded in, explain how you could replace the values in the `boba rating` column . It should be `bad` if the boba rating is less than 6, and `good` otherwise.

bins = (0, 6.5, 10)
bin_names = ['bad', 'good']
boba['rating'] = pd.cut(boba['rating'], bins=bins, labels=bin_names)

↳ How can you confirm that the `ratings` column has been transformed into only ‘bad’ and ‘good’ values? How can you verify that `good` is ranked higher than `bad`?

boba['rating'].unique()

This will output:

['bad', 'good'] 
Categories (2, object): ['bad' < 'good']

Showing that the column only contains bad and good values. Pandas is also super helpful, and shows the internal order of the categories (although this is optional). Remember, categorical data can be ordinal, meaning that the categories have an ‘ordering’.

This ordering can be important in certain operations, like sorting or when the categorical data is used in a machine learning model, as it dictates how the categories are interpreted in relation to each other.

Clearly, in this case, it is important that 'bad' < 'good'.

If you need to select a single column from a DataFrame, what syntax would you use?

↳ How would you filter rows in a DataFrame where a specific column’s value is greater than a certain threshold?

↳ What command would you use to check for missing values in a DataFrame?

↳ How can you fill missing values in a DataFrame with the mean of their respective columns?

Fill missing data

↳ What is the command to group data in a DataFrame and calculate the sum of a specific column for each group?

Data aggregation

↳ Can you demonstrate how to perform an inner join between two DataFrames in Pandas?

↳ How would you sort a DataFrame by the values in a particular column?

↳ How can you set a specific column as the index of a DataFrame?

↳ After setting a new index, how can you reset it back to the default integer index?

↳ How do you select rows based on index labels using the `.loc` accessor?

↳ How do you select rows based on integer index using the `.iloc` accessor?

↳ What is the syntax for selecting DataFrame rows where a certain condition is met?

↳ How do you find the unique values in a DataFrame column?

↳ What command would you use to count the occurrences of each unique value in a DataFrame column?

↳ Given a Pandas Series, how would you replace all values greater than a specified threshold with a fixed value, and then calculate the mean of this modified Series?

↳ How can you quickly generate descriptive statistics for each column in a DataFrame?

↳ If you need to add a new column to a DataFrame based on values from existing columns, what approach would you take?

↳ Finally, how do you export a DataFrame to a CSV file?

What is `.json`

JavaScript Object Notation

↳ How is `json` related to `JavaScript`?

Not at all

What are some ways to get some basic information about a dataframe?

How can you see each feature in a dataset along with the number of non-null values, and their datatypes

Hard Questions

How can you return a DataFrame containing all duplicated items?

df.duplicated().drop_duplicates()

~df.isin(...) and rename()

Walk through the 3 steps in receiving json data from an API, and saving to a file in human readable `.json`

I have a list of tickets. Every ticket can have any number of ‘tags’ (e.g `is_important`). List three different ways to represent your tickets in a `pandas` DataFrame.

String of comma-separated values

If I knew the number of tags per ticket wasn’t very large, I would choose to store it as space or comma-separated string.

1. Storing Tags as Comma-Separated Strings:

    ticket_id    tags
    1           "tag1, tag2, tag3"
    2           "tag2, tag3"
    3           "tag1, tag3"

Boolean Flags for Each Tag

If I knew I was going to perform analysis on individual tags, such as aggregating or using it as input into a machine learning model, I would create a separate column, or boolean flag, for each tag. Then I would indicate the existence of a tag using True or False. This would result in a very wide DataFrame.

2. Boolean Flags for Each Tag:

    ticket_id    tag1    tag2    tag3
    1            True    True    True
    2            False   True    True
    3            True    False   True

Multi-hot Encoded Columns

This is essentially the same idea as boolean flags, but instead of True/False, use 1 and 0. This would be useful for analysis and machine learning applications further down the pipeline.

3. Multi-hot Encoded Columns:

    ticket_id    tag1    tag2    tag3
    1            1       1       1
    2            0       1       1
    3            1       0       1

Separate rows for each tag

If the ticket_id didn’t have to be uniquely identifying, then I could simply create a tag column and then use a new row for each new tag assigned to a ticket. In this case, each ticket will appear multiple times in the DataFrame, once for each of its tags.

4. Separate rows for each tag:

    ticket_id    tag
    1            tag1
    1            tag2
    1            tag3
    2            tag2
    2            tag3
    3            tag1
    3            tag3

Explain the difference between using `loc` and `iloc` to index a DataFrame.

Indexing

Simply mean searching, splicing, accessing certain rows/columns.

Success .iloc

Index based using 0, 1, 2, 3, ... like how traditional lists are indexed in Python

Success .loc

Label based using the worded name of the row/column. E.g the column 'passive_income'

Given a DataFrame `df`, how would you select the first 5 rows and all columns?

df.iloc[0:5, : ]

##### ↳ How would you select the first 3 columns and all rows?

df.iloc[ : ,0:3]

↳ How would you select the first 3 rows and first 5 columns?

df.iloc[0:5,0:3]

↳ How would you select rows named `'a'` through `'e'`, and all columns?

df.loc['a':'e', :]

↳ How would you select all columns `'col_1'` to `'col_3'` and all rows?

df.iloc[ : ,'col_1':'col_3']

↳ How would you select all rows named `'a'` through `'e'` and all columns `'col_1'` to `'col_3'`?

df.iloc['a':'e','col_1':'col_3']

⚘ DSX.com

Explorer

2.4 pandas

Easy Questions §

What is a DataFrame object? §

Seeing the actual data is always important. What are some methods to get a quick sight of your raw data? §

DataFrame Creation: How would you create a DataFrame from a CSV file in Pandas? §

The following is the first 5 rows of a dataset alex-boba.csv. Alex visits various boba tea shops, sampling drinks with diverse characteristics such as price, volume, sugar level, and whether they are organic, and subsequently rates each drink on a scale of 1 to 10. §

↳ How would you load in the dataset in pandas, and then see a view of the dataset like in the table above in order to validate it? §

↳ Alex is a pretty careful guy, but just in case, you want to validate that their are no missing entries in his boba dataset. How can you do this? §

↳ Read the output above. Are there any nulls in the dataset? What is the datatype of values in the “Organic” field? §

↳ What does boba.isnull().sum() do? What would the expected output for this dataset look like, given that there are no null values? §

↳ In pandas, what is the default axis for methods such as .sum(). What even is an axis in pandas? §

Okay, now that the boba dataset is all loaded in, explain how you could replace the values in the boba rating column . It should be bad if the boba rating is less than 6, and good otherwise. §

↳ How can you confirm that the ratings column has been transformed into only ‘bad’ and ‘good’ values? How can you verify that good is ranked higher than bad? §

If you need to select a single column from a DataFrame, what syntax would you use? §

↳ How would you filter rows in a DataFrame where a specific column’s value is greater than a certain threshold? §

↳ What command would you use to check for missing values in a DataFrame? §

↳ How can you fill missing values in a DataFrame with the mean of their respective columns? §

↳ What is the command to group data in a DataFrame and calculate the sum of a specific column for each group? §

↳ Can you demonstrate how to perform an inner join between two DataFrames in Pandas? §

↳ How would you sort a DataFrame by the values in a particular column? §

↳ How can you set a specific column as the index of a DataFrame? §

↳ After setting a new index, how can you reset it back to the default integer index? §

↳ How do you select rows based on index labels using the .loc accessor? §

↳ How do you select rows based on integer index using the .iloc accessor? §

↳ What is the syntax for selecting DataFrame rows where a certain condition is met? §

↳ How do you find the unique values in a DataFrame column? §

↳ What command would you use to count the occurrences of each unique value in a DataFrame column? §

↳ Given a Pandas Series, how would you replace all values greater than a specified threshold with a fixed value, and then calculate the mean of this modified Series? §

↳ How can you quickly generate descriptive statistics for each column in a DataFrame? §

↳ If you need to add a new column to a DataFrame based on values from existing columns, what approach would you take? §

↳ Finally, how do you export a DataFrame to a CSV file? §

What is .json §

↳ How is json related to JavaScript? §

What are some ways to get some basic information about a dataframe? §

How can you see each feature in a dataset along with the number of non-null values, and their datatypes §

Hard Questions §

How can you return a DataFrame containing all duplicated items? §

Walk through the 3 steps in receiving json data from an API, and saving to a file in human readable .json §

I have a list of tickets. Every ticket can have any number of ‘tags’ (e.g is_important). List three different ways to represent your tickets in a pandas DataFrame. §

Explain the difference between using loc and iloc to index a DataFrame. §

Given a DataFrame df, how would you select the first 5 rows and all columns? §

↳ How would you select the first 3 rows and first 5 columns? §

↳ How would you select rows named 'a' through 'e', and all columns? §

↳ How would you select all columns 'col_1' to 'col_3' and all rows? §

↳ How would you select all rows named 'a' through 'e' and all columns 'col_1' to 'col_3'? §

Graph View

Table of Contents

Backlinks

Easy Questions

What is a `DataFrame` object?

Seeing the actual data is always important. What are some methods to get a quick sight of your raw data?

DataFrame Creation: How would you create a DataFrame from a CSV file in Pandas?

The following is the first 5 rows of a dataset `alex-boba.csv`. Alex visits various boba tea shops, sampling drinks with diverse characteristics such as price, volume, sugar level, and whether they are organic, and subsequently rates each drink on a scale of 1 to 10.

↳ How would you load in the dataset in `pandas`, and then see a view of the dataset like in the table above in order to validate it?

↳ Alex is a pretty careful guy, but just in case, you want to validate that their are no missing entries in his boba dataset. How can you do this?

↳ Read the output above. Are there any nulls in the dataset? What is the datatype of values in the “Organic” field?

↳ What does `boba.isnull().sum()` do? What would the expected output for this dataset look like, given that there are no null values?

↳ In `pandas`, what is the default axis for methods such as `.sum()`. What even is an axis in `pandas`?

Okay, now that the `boba` dataset is all loaded in, explain how you could replace the values in the `boba rating` column . It should be `bad` if the boba rating is less than 6, and `good` otherwise.

↳ How can you confirm that the `ratings` column has been transformed into only ‘bad’ and ‘good’ values? How can you verify that `good` is ranked higher than `bad`?

If you need to select a single column from a DataFrame, what syntax would you use?

↳ How would you filter rows in a DataFrame where a specific column’s value is greater than a certain threshold?

↳ What command would you use to check for missing values in a DataFrame?

↳ How can you fill missing values in a DataFrame with the mean of their respective columns?

↳ What is the command to group data in a DataFrame and calculate the sum of a specific column for each group?

↳ Can you demonstrate how to perform an inner join between two DataFrames in Pandas?

↳ How would you sort a DataFrame by the values in a particular column?

↳ How can you set a specific column as the index of a DataFrame?

↳ After setting a new index, how can you reset it back to the default integer index?

↳ How do you select rows based on index labels using the `.loc` accessor?

↳ How do you select rows based on integer index using the `.iloc` accessor?

↳ What is the syntax for selecting DataFrame rows where a certain condition is met?

↳ How do you find the unique values in a DataFrame column?

↳ What command would you use to count the occurrences of each unique value in a DataFrame column?

↳ Given a Pandas Series, how would you replace all values greater than a specified threshold with a fixed value, and then calculate the mean of this modified Series?

↳ How can you quickly generate descriptive statistics for each column in a DataFrame?

↳ If you need to add a new column to a DataFrame based on values from existing columns, what approach would you take?

↳ Finally, how do you export a DataFrame to a CSV file?

What is `.json`

↳ How is `json` related to `JavaScript`?

What are some ways to get some basic information about a dataframe?

How can you see each feature in a dataset along with the number of non-null values, and their datatypes

Hard Questions

How can you return a DataFrame containing all duplicated items?

Walk through the 3 steps in receiving json data from an API, and saving to a file in human readable `.json`

I have a list of tickets. Every ticket can have any number of ‘tags’ (e.g `is_important`). List three different ways to represent your tickets in a `pandas` DataFrame.

Explain the difference between using `loc` and `iloc` to index a DataFrame.

Given a DataFrame `df`, how would you select the first 5 rows and all columns?

↳ How would you select the first 3 rows and first 5 columns?

↳ How would you select rows named `'a'` through `'e'`, and all columns?

↳ How would you select all columns `'col_1'` to `'col_3'` and all rows?

↳ How would you select all rows named `'a'` through `'e'` and all columns `'col_1'` to `'col_3'`?