+ 3

Shouldn't my results average around 66.6?

I am trying to wrap my old head around statistics - standard deviation. So I set up this code where I create random numerical data and try to figure out how many of the values are within one standard deviation of the mean. To my fuzzy understanding, that should be about 2/3 of the values... But somehow my results turn out a lot lower, occasionally even below 50. Have I misunderstood how this works, or is there an error in my code? https://code.sololearn.com/cqZ6PS390gf0/?ref=app

15th Mar 2021, 5:39 PM
HonFu
21 Answers
+ 6
Two thirds is roughly the share of values you'd expect within one standard deviation of the mean in a normal distribution, not in a uniformly random distribution like the one you are creating. An example of a normal distribution would be something like the height or IQ of 100 people, where most people are fairly close to the average. A uniform distribution would be one where someone being 1m tall is just as likely as someone being 2m tall.
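To make the difference concrete, here is a minimal sketch that draws one uniform and one normal sample and counts how many values fall within one standard deviation of the mean. The sample size, the seed, the ranges and the helper name `fraction_within_one_sd` are all assumptions for illustration, not taken from the original code. For a uniform distribution the theoretical share is 1/sqrt(3), roughly 58%, while for a normal distribution it is roughly 68%.

```python
import math
import random


def fraction_within_one_sd(values):
    """Share of values whose distance from the mean is at most one
    (population) standard deviation. Hypothetical helper."""
    n = len(values)
    mean = math.fsum(values) / n
    sd = math.sqrt(math.fsum((v - mean) ** 2 for v in values) / n)
    return sum(1 for v in values if abs(v - mean) <= sd) / n


# Fixed seed so that repeated runs give the same numbers (assumption).
rng = random.Random(42)
n = 20_000

uniform_sample = [rng.uniform(0, 100) for _ in range(n)]  # flat distribution
normal_sample = [rng.gauss(50, 15) for _ in range(n)]     # bell-shaped distribution

frac_uniform = fraction_within_one_sd(uniform_sample)
frac_normal = fraction_within_one_sd(normal_sample)

print(f"uniform: {frac_uniform:.3f}  (theory: {1 / math.sqrt(3):.3f})")
print(f"normal:  {frac_normal:.3f}  (theory: about 0.683)")
```

With small samples the shares can drift noticeably from these theoretical values in either direction.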
15th Mar 2021, 8:46 PM
Russ
+ 3
From my understanding your code is correct. The mean should be clear. Then you calculate |mean - value|^2 for each value, sum these up and divide by the number of values. That gives you the variance. To get the standard deviation you take the square root of the variance. I tried a Java version: https://code.sololearn.com/cscPUj6PtO2N/?ref=app I came to the same result as Coder Kitten mentioned.
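The steps above could be sketched in Python roughly like this. The data set is a made-up example, and dividing by the number of values gives the population variance (a sample variance would divide by n - 1 instead).

```python
import math

# Hypothetical example data, chosen so that the numbers come out round.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Squared distance of each value from the mean.
squared_deviations = [(value - mean) ** 2 for value in data]

# Population variance: sum of squared deviations divided by the number of values.
variance = sum(squared_deviations) / n

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(mean, variance, sd)
```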
15th Mar 2021, 8:04 PM
Denise Roßberg
+ 3
It feels like you are trying to filter some data out, and that's not really what statistics are about. All the data are as meaningful as each other, so just showing which data lie within a certain range isn't really done. As others have said, the SD is just a value that represents how widely spread a set of data is. Pointing out which datapoints lie within 1SD of the mean doesn't really help analysis of that data in any way.
17th Mar 2021, 9:28 AM
Russ
+ 3
I think the main point is the difference between a normal distribution and a random distribution, as stated by Russ. There you will find your 2/3... "The shape of the normal distribution is a vertical cross-section through a bell. It is continuous and symmetrical, with its peak at the mean of the distribution. It has two points of inflexion, one on each side of the mean at a distance sigma. The ordinate f(z) at any given value of z is the probability density at z. The total area under the curve is 1, the total probability of the distribution. The area under any portion of the curve, say between z_1 and z_2, represents the proportion of the distribution lying in that range. For instance, slightly more than two-thirds of the distribution lies within one standard deviation of the mean, i.e. between µ-sigma and µ+sigma; about 95% lies in the range µ-2*sigma to µ+2*sigma; and 99.73% lies within three standard deviations of the mean." -(Webster, R. & Oliver, M. A., Geostatistics for Environmental Scientists, 2007)
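The percentages in the quote can be checked without any simulation. For a normal distribution, the share lying within k standard deviations of the mean equals erf(k / sqrt(2)). The following is a small sketch using only the standard library; the function name `normal_share_within` is made up for this example.

```python
import math


def normal_share_within(k):
    """Share of a normal distribution lying within k standard
    deviations of the mean (hypothetical helper name)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {normal_share_within(k) * 100:.2f}%")
```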
17th Mar 2021, 11:13 AM
Brian M.
+ 3
visph, it would be reasonable to assume that the specifications are... well, specific. Unfortunately there were many cases in the past where the behavior of certain task test protocols was erratic. You would pass with solutions that clearly shouldn't have worked, and fail with solutions everyone agreed on as correct.
19th Mar 2021, 12:06 PM
HonFu
+ 2
your code seems correct... standard deviation is the measure of data dispersion around the mean, so the range may contain any number of values, not necessarily 2/3 of the initial length... it depends on the data set provided (if I understand it correctly: I'm not at all a statistician ;P) however, I don't know if "within one standard deviation" means bounds included or not (low < v < high or low <= v <= high) ^^
15th Mar 2021, 6:42 PM
visph
+ 2
Coder Kitten, Denise Roßberg, visph, okay, so it was really only a problem of me not understanding what results to expect. Thank you all - I guess I'll play around with some real data set and see what I get. :)
15th Mar 2021, 8:40 PM
HonFu
+ 2
The way you have previously worded your question, querying whether < or <= should be used in a range lower _ value _ higher, is possibly the cause of confusion. Consider a small dataset {4, 4, 6, 6}. Here the mean is 5 and the sd is 1. So, depending on whether you choose < or <=, either none of the data is in the range, or all of it is. It seems strange to declare that none of the data is typical, so using <= might be more appropriate. But what does that tell you if it is represented like that? The data within 1sd is {4,4,6,6}, but that would also be true if the original dataset were {0,0,4,4,6,6,10,10}. Imagine how many other datasets would reproduce the same subset. Doesn't seem to tell you a lot when analysing it this way. I think that, while you were asking *technically* the same question as "If my child's height were exactly 1sd below the mean, is it normal?", the way it was put to us was slightly misleading as to your intent. Hope that helps.
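Both points can be tried out with a short sketch. The function name `within_one_sd` and its `inclusive` flag are made up for this illustration, and the population standard deviation (`statistics.pstdev`) is assumed, since that is what gives sd = 1 for {4, 4, 6, 6}.

```python
from statistics import mean, pstdev


def within_one_sd(data, inclusive=True):
    """Return the values lying within one population standard deviation
    of the mean; `inclusive` decides whether the bounds count."""
    m = mean(data)
    sd = pstdev(data)
    low, high = m - sd, m + sd
    if inclusive:
        return [v for v in data if low <= v <= high]
    return [v for v in data if low < v < high]


small = [4, 4, 6, 6]
large = [0, 0, 4, 4, 6, 6, 10, 10]

# All-or-nothing: every value of `small` sits exactly on a bound.
print(within_one_sd(small, inclusive=True))
print(within_one_sd(small, inclusive=False))

# A quite different data set reproduces the same subset.
print(within_one_sd(large, inclusive=True))
```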
19th Mar 2021, 10:24 AM
Russ
+ 2
Russ, sorry if I have been misleading you guys... The topic is still new and confusing for me, so it's hard to precisely know myself what I'm asking. 😅 But well, yeah, it's actually purely technical. Would a child of such and such a height - being exactly on the threshold - be counted as 'normal' or not? Or: If I'm standing on the threshold of a front door, am I in the house or outside? In these statistical matters, it's probably a range anyway, right? There's no clear cut between regular and irregular, just because the standard deviation draws an arbitrary line there... However, for the mother of the child, this little distinction - within or without - might have quite an emotional impact. 😉 Anyway, the data science tutorials make you do stuff like that, determining which values are within one standard deviation. And since they hide the tests, you're sort of left to speculation and trial and error, figuring out how exactly it is determined.
19th Mar 2021, 11:37 AM
HonFu
+ 2
I just remembered the YouTube channel StatQuest by Josh Starmer. He covers a huge amount of different statistical topics. In my opinion he does a great job explaining these topics in an easy, understandable way. Even though this thread is a few days old, I’m still going to add a link to his video: “Statistics Fundamentals: The Mean, Variance and Standard Deviation." https://m.youtube.com/watch?v=SzZ6GpcfoQY&t=624s
24th Mar 2021, 4:07 PM
Brian M.
+ 2
Thanks, Brian M., gonna check that out!
24th Mar 2021, 4:08 PM
HonFu
+ 1
In case anyone is still reading along: So how is it, limits included or not? Let's assume this data: 1, 2, 3, 4, 5, 6, 7. The mean is 4 and the standard deviation would be 2. So it's the range from 2 to 6. Now do 2 or 6 still belong to the 'middle' or not?
17th Mar 2021, 12:17 AM
HonFu
+ 1
HonFu that was my implicit question...
17th Mar 2021, 5:00 AM
visph
+ 1
Coder Kitten, seems I'm not really grasping the meaning of any of this yet. 😓 So there isn't really a standard to decide which values of a data set are within a standard deviation? I'm basically just wondering if you correctly would write ... lower <= value <= upper ... or ... lower < value < upper ... or if this is just not defined.
17th Mar 2021, 8:37 AM
HonFu
+ 1
Hm, okay, so standard deviation is not really used to determine which values are in a certain range. Yet I feel like I've often read statements like Brian M. quoted: 'XY lies within a standard deviation of Z' or whatever. So, if there's no general definition about if the 'standard deviation line' belongs to 'within' or not, how can we make those statements to begin with? Okay, if there are like 10,000 values and the stddev is some multiple places float, then who cares, one might say. Yet it seems sloppy to me not to be clear about it, yet make statements like the aforementioned. 🤔
19th Mar 2021, 12:07 AM
HonFu
+ 1
Data that lie "within one standard deviation of the mean" can be thought of as being "typical" or "normal". Data that lie outside of it might be considered "atypical" or "unusual". Consider an example where you think your child may have some sort of growth defect because he seems quite small for his age. You take him to the doctor and your doctor measures him and declares him to be, although on the short side, within 1sd of the mean for his age, and thus quite normal. That would probably come as a relief to you. That would be an example of where making the statement of "within 1sd of the mean" would carry some weight.
19th Mar 2021, 10:20 AM
Russ
+ 1
HonFu however, I doubt any sololearn task will give input with such edge cases (at least for free accounts: I don't know about 'pro' ones, but it would be surprising if it weren't specified in such a case)... I guess the task should specify whether or not to include the bounds if the data set provided contains such edge case(s)... otherwise, as you said, we must test whether including or excluding the bounds works ^^
19th Mar 2021, 12:03 PM
visph
+ 1
yes sure: sololearn is far from perfect... and I've encountered some such tasks where, for example, simply changing the order of operations would result in a slightly different float result ^^ that's why I specified that if such a case occurs "as you said, we must test if with or without include bounds works" ;)
19th Mar 2021, 12:14 PM
visph
+ 1
HonFu Context could have been quite useful here 😉 but at least we've had some good discussion from this. It seems it is simply testing you to make sure you can calculate mean and sd correctly. It should clarify whether it wants you to include data lying exactly on the limit or not. But this is not something that is done as part of any analysis on any set of data. As you have already discovered, SL Code Coaches are not always precise enough in their descriptions. It would simply be a matter of trial and error to work out what it wants from you 😊
19th Mar 2021, 12:45 PM
Russ
+ 1
Here, btw, is one task from the tutorial 'Python for Data Science': 'You are given an array that represents house prices. Calculate and output the percentage of houses that are within one standard deviation from the mean.' Would be nice to know what 'within' precisely means in this case, right? Actually, this time, it doesn't even matter - both work. 😉
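A task like this could be approached roughly as follows. This is only a sketch: the prices are invented, the function name `percent_within_one_sd` is hypothetical, and it assumes the population standard deviation, which may or may not be what the hidden tests expect. The `inclusive` flag makes it easy to try both readings of 'within'.

```python
from statistics import mean, pstdev


def percent_within_one_sd(prices, inclusive=True):
    """Percentage of prices within one population standard deviation
    of the mean; `inclusive` decides whether the bounds count."""
    m = mean(prices)
    sd = pstdev(prices)
    low, high = m - sd, m + sd
    if inclusive:
        hits = sum(1 for p in prices if low <= p <= high)
    else:
        hits = sum(1 for p in prices if low < p < high)
    return hits / len(prices) * 100


# Invented example prices (not the data from the actual task).
prices = [100, 150, 200, 250, 300, 350, 900]

pct_inclusive = percent_within_one_sd(prices, inclusive=True)
pct_exclusive = percent_within_one_sd(prices, inclusive=False)
print(pct_inclusive, pct_exclusive)
```

As long as no price sits exactly on one of the two bounds, as in this invented data, both variants give the same percentage.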
20th Mar 2021, 5:49 PM
HonFu