Sure. Looking at your original picture, I see a huge variance between the spike values and the normal values:
I interpreted your question as wanting the minimum and maximum "normal" values, so my idea was to simply set a minimum and maximum allowed value [what could be called a normal range] and use it to filter the set of numbers. In your case we have [2,829,077, 2,829,078, 20, 2,829,080, 2,829,081, 2,829,084, 100,000,000], with the extraneous values being obvious: 20 and 100,000,000.
I imagined that there would be an expected range of values, and that you simply wanted the largest and smallest values that were normal. Since the spikes were so far from everything else, it seemed simple to arbitrarily set a range that filters out the unwanted numbers.
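To make that first idea concrete, here is a minimal sketch in plain Python using the numbers from your picture. The bounds `NORMAL_MIN` and `NORMAL_MAX` are placeholders I made up for illustration; you would pick values that make sense for your process.

```python
# Sample readings from the question; 20 and 100,000,000 are the spikes.
readings = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]

# Hypothetical bounds for the "normal range" (assumed, not from the original post).
NORMAL_MIN = 2000000
NORMAL_MAX = 3000000

# Keep only the readings that fall inside the normal range.
normal_readings = [r for r in readings if NORMAL_MIN <= r <= NORMAL_MAX]

# The smallest and largest readings that are still considered normal.
lowest_normal = min(normal_readings)
highest_normal = max(normal_readings)
```

The weakness of this approach is that the bounds are fixed, so it only works if you already know roughly where the normal values sit.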
Then you said this:
Since the variance was so high [20 vs 2 million and 100 million vs 2 million], I imagined the possibility of an average that was far from the normal values, so taking the average [aka mean] seemed like a bad idea to me. Therefore, I suggested sorting the values and taking the middle one [aka median]. Sorting the values ensures that the extraneous low values are at one end of the list and the extraneous high values are at the other end. In this way, the middle value in the list will almost certainly be within the normal acceptable range.
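As a quick illustration of why I avoided the mean, this small plain-Python sketch compares the two on the same seven numbers. The variable names are mine and only for illustration.

```python
readings = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]

# The mean gets dragged far away from the normal values by the 100,000,000 spike;
# here it comes out at roughly 16.3 million.
mean_value = sum(readings) / float(len(readings))

# The median is the middle element of the sorted list. There are 7 readings
# (an odd count), so the middle one sits at index 3.
ordered = sorted(readings)
median_value = ordered[len(ordered) // 2]
```

Here the median lands on 2,829,080, which is one of the normal readings, while the mean is not close to any value in the list.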
Quick Note: In an ideal world, we would fix whatever it is that causes the extraneous values, so that none of this work is necessary, but unfortunately, I don't know enough about your implementation to guide you in how to go about this.
It's always possible that I've completely misunderstood a question, and consequently, my answer doesn't work or is wrong, but with the preliminary explanation of my reasoning out of the way, I'll break down the code with notes to try to make it clear:
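from java.util import Collections
#Collections is a Java class, so it has to be imported before it can be used in the script
data = dataset.getColumnAsList(dataset.getColumnIndex('value'))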
#This gets the data from the value column in the form of a list: [1, 4, 100000, 8, -100000, 12]
Collections.sort(data)
#This puts the list in order, so any extraneous values will be at the beginning or end of the list
#[-100000, 1, 4, 8, 12, 100000]
medianIndex = int(float(len(data)) * 0.5)
#This takes the length of the list and halves it.
#In this case there are 6 elements, so the result is exactly 3.0,
#but if the number of elements is odd, half of it is fractional [7 elements gives 3.5].
#Since len returns an integer [a non-fractional number], I convert the int to a float first and multiply
#by 0.5. The conversion isn't strictly necessary, because multiplying an int by 0.5 already produces a float, but I'm in the habit of doing it.
#You could also divide by two here, but in coding, I'm also in the habit of fractional multiplication instead of division because it rules out an accidental division by zero.
#The result of `float(len(data)) * 0.5` is a float, which can't be used as an index to get a specific element from a list, so I convert it back to an int.
#int() drops the fractional part rather than rounding, so with an odd number of elements [7 gives 3.5, which becomes 3] I get the index of the exact middle element, because indexes start at 0.
#With an even number of elements [6 gives 3], I get the index just to the right of the exact middle, so the two middle elements are at medianIndex - 1 and medianIndex.
medianValue = (float(data[medianIndex - 1]) + float(data[medianIndex])) * 0.5 if len(data) % 2 == 0 else data[medianIndex]
#To be technically correct as a statistical median value, if there is an even number of terms, then the average of the middle two terms becomes the median, even if that resultant term doesn't exist in the list.
#In the example list, the middle two terms are 4 and 8, so the median is 6.
#Otherwise, the exact middle term is the median value.
#This could also be written like this:
if len(data) % 2 == 0: #if there is an even number of terms in the list
    medianValue = (float(data[medianIndex - 1]) + float(data[medianIndex])) * 0.5 #Take the average of the two middle terms
else:
    medianValue = data[medianIndex] #just take the middle term
#Remember that the idea here is to find the exact middle term in the list of terms because it is almost guaranteed to not be extraneous
valueTolerance = 500
minimumAllowedValue = medianValue - valueTolerance
maximumAllowedValue = medianValue + valueTolerance
#I have no way of knowing what the expected range of a given list would be, so I've arbitrarily picked 500,
#but if I have a known good value that is relatively close to center in the list that I am trying to filter extraneous values out of,
#I imagine that I can do this by creating a minimum acceptable value and a maximum acceptable value that is the middle(ish) value plus or minus some number.
filteredValues = [value for value in data if minimumAllowedValue <= value <= maximumAllowedValue]
#I usually use comprehension because I've learned that it is more efficient than looping, but this line of code could be written in this way as a loop:
filteredValues = []
for value in data:
    if minimumAllowedValue <= value and value <= maximumAllowedValue:
        filteredValues.append(value)
#Either way, this produces a new list that only has values that are in the expected range [a list with no extraneous values]
minValue = min(filteredValues)
maxValue = max(filteredValues)
#Finally, we simply take what I assumed we were after, the minimum and maximum values that are not extraneous.
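
For reference, here is how the whole idea might look as one self-contained piece of plain Python that can be run outside of your project. This is only a sketch under my assumptions: the function names `median` and `normal_min_max` are hypothetical, it uses the built-in `sorted` instead of `Collections.sort`, and the default tolerance of 500 is the same arbitrary number as above.

```python
def median(values):
    """Return the statistical median of a non-empty list of numbers."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("median() needs at least one value")
    mid = count // 2
    if count % 2 == 0:
        # Even count: average the two middle terms.
        return (ordered[mid - 1] + ordered[mid]) * 0.5
    # Odd count: the exact middle term.
    return ordered[mid]


def normal_min_max(values, tolerance=500):
    """Return (min, max) of the values within +/- tolerance of the median.

    Returns None if no value falls inside that window, which can happen
    with an even count when the two middle terms are far apart.
    """
    center = median(values)
    kept = [v for v in values if center - tolerance <= v <= center + tolerance]
    if not kept:
        return None
    return min(kept), max(kept)


readings = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]
result = normal_min_max(readings)
```

With the numbers from your picture, `result` is the pair (2829077, 2829084), so both spikes are ignored.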