() What is a distribution?
() What are two ways to describe it?
  1. Location
  2. Spread
() Are there any more ways to describe a distribution?

Q2) What is sampling error?

☞ Using a sample to draw insights about a larger population is a great idea, however it comes with the possibility of introducing a sampling error. Sampling error arises due to the discrepancies between the sample and the population from which the sample is drawn.

For example, imagine you want to know the proportion of people who watch anime in your school. Obviously, you lack the time and social confidence to go around asking every single student in the school. Instead, what you do



Q2) Name some sampling methods for data analysis (match to approproiatae scenario, just ;list em)

Sampling can be particularly useful with data sets that are too large to efficiently analyze in full – for example, in big data analytics applications or surveys. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.

An important consideration, though, is the size of the required data sample and the possibility of introducing a sampling error

Random Sampling

Stratified Sampling

  • If we want to control the proportions of particular classes in our sample
  • E.g you have a population where 1/3 of the population are foxes and 2/3 is pandas, it’s possible that if you did purely random sample, you could end up with a sample that had only foxes.
  • Let’s say for the sake of your experiment, you need both foxes and pandas in your random sample, since you’re trying to compare which animal is responds better to jazz music.
  • If you want to keep the sample as random as possible individual-wise, but ensure that the proportions reflect the population for your experiment, you can first ‘stratify’ the population based on this class (category) of interest, and then just do a random sample on each new strata (group) - proportionally.
🐼🐼🐼🐼🦊🐼🐼🐼🐼
🐼🐼🦊🦊🦊🐼🐼🦊🐼
🐼🐼🦊🦊🐼🦊🐼🐼🐼
🐼🐼🐼🦊🐼🐼🦊🦊🐼
🐼🦊🦊🐼🐼🐼🦊🐼🦊
🦊🦊🦊 🐼🐼🐼🐼🐼🐼
🦊🦊🦊 🐼🐼🐼🐼🐼🐼
🦊🦊🦊 🐼🐼🐼🐼🐼🐼
🦊🦊🦊 🐼🐼🐼🐼🐼🐼
🦊🦊🦊 🐼🐼🐼🐼🐼🐼

Cluster Sampling

Q: A State vehicle licensing authority wishes to determine the percentage of cars ‘on the road’ that are unregistered. Is it clear what population is intended here - if not, discuss options for defining it? For one of your options, outline a sampling method to obtain a random sample. Comment on one actual solution used in this case - which was to sample cars parked at a major suburban shopping centre

A: One ambiguity here is what is meant by ’on the road’ and a proxy might be ’in active use’ though that would need more careful definition and would be hard to check without talking to the driver. If usage across the whole state is of interest it might make sense to deal with metropolitan and country usage separately. One sampling option would be to choose a number of intersections across the relevant region (these could be chosen weighted by traffic volumes if known) and then sample cars that pass by at random (could be done systematically choosing every nth car or at random through a random number generator). This would even out the chances for ‘in active use’ cars to be chosen as opposed to sampling the carpark at a suburban shopping centre which will only reflect the catchment area for that centre and most likely has other biases - for example those making use of an unregistered vehicle may be circumspect about its use in a very public forum of that sort. Choosing intersections is a form of so-called ‘cluster sampling’ which is often used for practical reasons - if you need to collect a reasonable amount of data with limited collectors. Of course technology may offer attractive options - say if road traffic cameras could id registration plates automatically - subject to ethical issues.



Statistics below this: (move to)


What is the difference between a point estimate and a confidence interval?

A point estimate is a single number - your ‘best guess’ of some population parameter like the mean, given the data.

A confidence interval is built around this point estimate and gives you



QX.1) What is the margin of error for a confidence interval?

The margin of error is simply half the width of the confidence interval

e.g If your 95% confidence interval tells you the mean stock price is (\100, $200)\pm $50$

By definition, the confidence interval is:

QX.2) What about a one-sided or asymmetric confidence interval?

QX.3) How do you decrease the margin of error?

Larger sample size, since the


QX.4) Why does the margin of error go down when you increase the sample size?


5 Number Summary and Boxplot being simply a visual representation of it


QX.5) There is an upcoming election, and it is highly competitive between candidate A and candidate B. What size margin of error would you require?

**NB)** In a highly competitive (close) election, a margin of error of 0.5% (or 0.005) may very well be what is required to obtain a clear confidence interval which foreshadows the upcoming result. I.e, an interval that does not contain $p=1/2$; if the interval contains 0.5, that implies that we don't really have much confidence in who has the majority vote since 'it's a draw' is a feasible vote proportion.

For example, House of Representative electorates contain on average 113,500 voters, so the 38,416 sample size required for it to be useful is like holding a mini election, and costly to implement with a genuine random sample.



Logarithms

#logarithms


Statistical Models vs Machine Learning Models

Roughly speaking, machine learning models are just statistical models which require computers to execute due to the large volume of calculations required.


What is the difference between ‘inference’

Fundamentally they are abstractions of the same concept

https://www.mlstack.cafe/blog/probability-interview-questions, https://www.mygreatlearning.com/blog/statistics-interview-questions/



  1. What is population mean and sample mean?
  2. What is population standard deviation and sample standard deviation?
  3. Why population s.d. has N degrees of freedom while sample s.d. has N-1 degrees of freedom? In other words, why 1/N inside root for pop. s.d. and 1/(N-1) inside root for sample s.d.? (Here)
  4. What is the formula for calculating the s.d. of the sample mean?
  5. What is confidence interval?
  6. What is standard error?