Handle spike or overflow values in a dataset

Hi guys,

I have a dataset, and I want to take the minimum and maximum values from it.

The issue is that spike values or abnormally low values sometimes appear in the dataset in between the normal readings.

I don't want those values to be taken as the maximum or minimum.

Can anyone help me? How do I ignore the spike and low values and take the true maximum and minimum from the dataset?

You could probably just filter out the extraneous values by setting a minimum allowed value and a maximum allowed value in your list comprehension.

Example:

data = dataset.getColumnAsList(dataset.getColumnIndex('value'))
minimumAllowedValue = 1500000
maximumAllowedValue = 3500000
minValue = min(value for value in data if minimumAllowedValue <= value <= maximumAllowedValue)
maxValue = max(value for value in data if minimumAllowedValue <= value <= maximumAllowedValue)
print minValue, maxValue

Can I get the most repeated range of values (for example, 200 to 300 counted as one value)?

Then I can set the minimum and maximum based on that range.

Would taking the average of the dataset give that value, or is there another method?

That would take a long time, as you'd need to set up those ranges and then spend time working out how many values need to fall in a range for it to count as your typical range.

Do you know what process is creating these values? Is there a range you're expecting to see?

You could take the mean and add +/- some value to that mean as your filtered range.
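A minimal sketch of that idea (the dataset values here are made up, and the 50% tolerance is an arbitrary assumption; as the next reply points out, a large enough high spike can drag the mean so far off that this approach breaks down):

```python
# Hypothetical sketch: filter around the mean, then take min/max.
# The data and the 50% tolerance are made-up example values.
data = [2829077, 2829078, 20, 2829080, 2829081, 2829084]

mean = sum(data) / float(len(data))
tolerance = 0.5 * mean  # keep values within +/- 50% of the mean

filtered = [v for v in data if abs(v - mean) <= tolerance]
minValue = min(filtered)  # 2829077
maxValue = max(filtered)  # 2829084
```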

Looking at the size of those spikes, I don't believe the mean is a good idea. He needs the median or the mode.


Yeah, probably right, the mean is not a great option with this set of data. It will depend on whether you're expecting this to change much over time, or whether it's supposed to remain about the same.
The median or mode might not be best if it's a measurement that's expected to ramp up rapidly when a process starts, etc.

This is looking like a totalizer value. I'd say the median would be the one to use here. Using statsutils' quantiles, you should be able to filter out spikes in both directions.

Statistics Extensions for Ignition

data = dataset.getColumnAsList(dataset.getColumnIndex('value'))

s = shared.statsutils.Stats(data)

print s.get_quantile(.25)
print s.get_quantile(.75)
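If shared.statsutils isn't available, the same quartile idea can be sketched in plain Python with sorted-list indexing (a rough percentile rather than statsutils' exact interpolation; the data and the 1.5x IQR fence factor are assumptions):

```python
# Hypothetical sketch of quartile-based (IQR) spike filtering in plain Python.
data = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]

ordered = sorted(data)
q1 = ordered[len(ordered) // 4]        # rough 25th percentile
q3 = ordered[(3 * len(ordered)) // 4]  # rough 75th percentile
iqr = q3 - q1
lowFence = q1 - 1.5 * iqr              # anything below this is a low spike
highFence = q3 + 1.5 * iqr             # anything above this is a high spike

filtered = [v for v in data if lowFence <= v <= highFence]
minValue = min(filtered)  # 2829077
maxValue = max(filtered)  # 2829084
```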

Hi Jordan,

You are correct, it is a totalizer tag.

My problem is that the meters are resetting, so I need to take the minimum and maximum for each reset period and then sum them up.

How do I do that?

The meter is resetting in this image; the value suddenly drops.

I need to take the minimum and maximum before the reset and the minimum and maximum after the reset, and then sum them up.

So first: how do I detect in the dataset that the meter has reset (multiple resets in a single day are also possible; how do I handle that in a script)?
Then, how do I get the minimum and maximum for each reset?

Can you please help me?

Will this script solve my issue?

data = dataset.getColumnAsList(dataset.getColumnIndex('value'))

s = shared.statsutils.Stats(data)

print s.get_quantile(.25)
print s.get_quantile(.75)

It's for a meter consumption report based on the selected time frame.

You can also get the median value by using Collections to sort the list and then simply taking the middle value, using half the length of the list as the index of the middle element.

Example:

from java.util import Collections
data = dataset.getColumnAsList(dataset.getColumnIndex('value'))
Collections.sort(data)
medianIndex = int(float(len(data)) * 0.5)
medianValue = (float(data[medianIndex - 1]) + float(data[medianIndex])) * 0.5 if len(data) % 2 == 0 else data[medianIndex]
valueTolerance = 500
minimumAllowedValue = medianValue - valueTolerance
maximumAllowedValue = medianValue + valueTolerance
filteredValues = [value for value in data if minimumAllowedValue <= value <= maximumAllowedValue]
minValue = min(filteredValues)
maxValue = max(filteredValues)
print minValue, maxValue

valueTolerance = 500

Is this value fixed for any dataset?

Calculate that however you want. Perhaps just do valueTolerance = .25*float(medianValue) to do plus or minus 25% of whatever the median is. I just arbitrarily threw 500 in there as an example.
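A quick sketch of that, with a made-up dataset (the 25% factor is arbitrary, as noted):

```python
# Hypothetical sketch: derive the tolerance from the data instead of hard-coding 500.
data = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]

ordered = sorted(data)
medianValue = ordered[len(ordered) // 2]    # 2829080 for this 7-element list
valueTolerance = 0.25 * float(medianValue)  # +/- 25% of the median
minimumAllowedValue = medianValue - valueTolerance
maximumAllowedValue = medianValue + valueTolerance
filteredValues = [v for v in data if minimumAllowedValue <= v <= maximumAllowedValue]
```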


I didn't understand how the script works. If possible, can you explain it to me?
Why are we taking the median value first, and what is the purpose of the value tolerance?

Sure. Looking at your original picture, I see a huge variance between the spike values and the normal values:
[image]

I interpreted your question as wanting the minimum and maximum "normal" values, so my idea was to simply set a minimum and maximum allowed value [what could be called a normal range] and use it to filter the set of numbers. In your case we have [2,829,077, 2,829,078, 20, 2,829,080, 2,829,081, 2,829,084, 100,000,000] with the extraneous values being obvious: [20 and 100,000,000]

I imagined that there would be an expected range of values, and you simply wanted the largest and smallest ones that were normal, and since the change in value was so large, it seemed simple to arbitrarily set a range that filters out the unwanted numbers.

Then you said this:

Since the variance was so high [20 vs 2 million, and 100 million vs 2 million], I imagined the possibility of an average that was far from the normal values, so taking the average [aka mean] seemed like a bad idea to me. Therefore, I suggested sorting the values and taking the middle one [aka median]. Sorting the values ensures that the extraneous low values are at one end of the list, and the extraneous high values are at the other end of the list. In this way, the middle value in the list will almost certainly be within the normal acceptable range.
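To see that difference concretely, using the example values from earlier in the thread:

```python
# Mean vs. median on the example list: the spikes drag the mean far outside
# the normal range, while sorting pushes them to the ends of the list.
data = [2829077, 2829078, 20, 2829080, 2829081, 2829084, 100000000]

mean = sum(data) / float(len(data))  # ~16.3 million: nowhere near a normal value
ordered = sorted(data)               # [20, 2829077, ..., 2829084, 100000000]
median = ordered[len(ordered) // 2]  # 2829080: a typical value
```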

Quick Note: In an ideal world, we would fix whatever it is that causes the extraneous values, so that none of this work is necessary, but unfortunately, I don't know enough about your implementation to guide you in how to go about this.

It's always possible that I've completely misunderstood a question, and consequently, my answer doesn't work or is wrong, but with the preliminary explanation of my reasoning out of the way, I'll break down the code with notes to try to make it clear:

data = dataset.getColumnAsList(dataset.getColumnIndex('value'))
#This gets the data from the value column in the form of a list: [1, 4, 100000, 8, -100000, 12]

Collections.sort(data)
#This puts the list in order, so any extraneous values will be at the beginning or end of the list
#[-100000, 1, 4, 8, 12, 100000]

medianIndex = int(float(len(data)) * 0.5)
#This takes the length of the list and divides it by two.
#In this case there are 6 elements, and dividing by two produces an even value of 3,
#but with an odd number of elements the result would be fractional,
#so since len returns an integer [a non-fractional number], I convert it to a float first and multiply
#by 0.5.
#You could also divide by two here, but in coding, I'm in the habit of fractional multiplication instead of division because it eliminates the possibility of undefined values.
#Obviously the result of `float(len(data)) * 0.5` is going to be a float, which is useless as an index value for getting specific elements from a list, so I convert it back to an int.
#int() drops the fractional part, so with an odd number of elements [e.g. 7 * 0.5 = 3.5 -> 3], I get exactly the median index.
#However, with an even number of elements, I get the index just to the right of the exact middle [e.g. 6 * 0.5 = 3, and the two middle indexes are 2 and 3]

medianValue = (float(data[medianIndex - 1]) + float(data[medianIndex])) * 0.5 if len(data) % 2 == 0 else data[medianIndex]
#To be technically correct as a statistical median value, if there is an even number of terms, then the average of the middle two terms becomes the median, even if that resultant term doesn't exist in the list.
#Otherwise, the exact middle term is the median value.
#This could also be written like this:
if len(data) % 2 == 0: #if there are an even number of terms in the list
     medianValue = (float(data[medianIndex - 1]) + float(data[medianIndex])) * 0.5 #Take the average of the two middle terms
else:
     medianValue = data[medianIndex] #just take the middle term
#Remember that the idea here is to find the exact middle term in the list of terms because it is almost guaranteed to not be extraneous

valueTolerance = 500
minimumAllowedValue = medianValue - valueTolerance
maximumAllowedValue = medianValue + valueTolerance
#I have no way of knowing what the expected range of a given list would be, so I've arbitrarily picked 500,
#but if I have a known good value that is relatively close to center in the list that I am trying to filter extraneous values out of,
#I imagine that I can do this by creating a minimum acceptable value and a maximum acceptable value that is the middle(ish) value plus or minus some number.

filteredValues = [value for value in data if minimumAllowedValue <= value <= maximumAllowedValue]
#I usually use comprehension because I've learned that it is more efficient than looping, but this line of code could be written in this way as a loop:
filteredValues = []
for value in data:
     if minimumAllowedValue <= value and value <= maximumAllowedValue:
          filteredValues.append(value)
#Either way, this produces a new list that only has values that are in the expected range [a list with no extraneous values]
 
minValue = min(filteredValues)
maxValue = max(filteredValues)
#Finally, we simply take what I assumed we were after, the minimum and maximum values that are not extraneous.

Thank you so much for the clear explanation.


I have tested the same script with one tag's dataset, and I am getting some variation in the values.

What I get through the script is:

[image]

But here is what I get through manual calculation.

Note: sometimes multiple rollovers also occur.

Here is the dataset I used:

data_test.xlsx (92.7 KB)

It's easy enough to develop an algorithm that takes rollovers into account.

Example (using the value column from your spreadsheet):

data = [9998424, [...] ,2630]
startingValue = data[0]
rolloverOccursAt = 10000000
rollOvers = 0
totalKWH = 0
for index in range(1, len(data)):
	deviation = data[index] - data[index-1]
	if deviation < 0:
		rollOvers += 1
	totalKWH = (data[index] + (rollOvers * rolloverOccursAt)) - startingValue
print totalKWH

Output: 4206

...but if you throw one of those extraneous values into the mix, it won't work, so it looks like you need a way to ignore them.

It looks like the typical increment in your dataset is relatively predictable, so if we are iterating through, and we encounter an increment that doesn't qualify as a rollover and is an obvious anomaly, we can reasonably ignore it.

Example:

data = [
	9998424, 9998425, 9998426, 9998427, 9998428, 9998429, 9998430,
	1000000000, 9998431, 9998432, 9998433, 920, 998434, 9998435,
	9998436, [...], 2630, 2630
]
startingValue = data[0] 
rolloverOccursAt = 10000000
rollOvers = 0
totalKWH = 0
maxAllowedIncrement = 500 #any non-rollover increment greater than this will be ignored
for index in range(1, len(data)):
	change = data[index] - data[index-1]
	if (change < 0 and abs(rolloverOccursAt - data[index-1]) < maxAllowedIncrement): #check if it's a legit rollover
		rollOvers += 1
		totalKWH = (data[index] + (rollOvers * rolloverOccursAt)) - startingValue
	elif (change >= 0 and abs(change) < maxAllowedIncrement): # verify incrementation 
		totalKWH = (data[index] + (rollOvers * rolloverOccursAt)) - startingValue
print totalKWH

Output: 4206


One doubt:
You gave rolloverOccursAt = 10000000, but the rollover won't occur at 10000000 every time; sometimes it will roll over from the 300000 range down to 200, for example.

You also gave maxAllowedIncrement = 500 statically.

How do I choose the rollover point and maxAllowedIncrement dynamically?

Is the rollover point a system parameter that you could get? How is that normally determined?

What are you wanting for maxAllowedIncrement? Is there such a huge variance between one system and another that it can't be generalized? Looking at the extraneous values in your example, it looks like you have a lot of wiggle room with that number. The extraneous increments were off by millions.
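One way to pick maxAllowedIncrement dynamically (a sketch under the assumption that normal increments are small and fairly regular; the padding factor of 100 is arbitrary):

```python
# Hypothetical sketch: estimate maxAllowedIncrement from the data itself,
# using the median of the positive step sizes. The median ignores the one
# huge spike step, where an average would not.
data = [9998424, 9998425, 9998427, 9998428, 1000000000, 9998429, 9998431]

steps = sorted(data[i] - data[i - 1] for i in range(1, len(data)) if data[i] >= data[i - 1])
typicalStep = steps[len(steps) // 2]     # median positive increment
maxAllowedIncrement = 100 * typicalStep  # generous padding factor (assumption)
```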

A sudden drop will occur.

Here is a manual calculation example in the image.

I need to implement the same thing through a script.

If multiple rollovers occur in the selected time frame's dataset:
[image]

I need to do this calculation for each rollover.

That's why I'm confused about how to do it.
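One possible approach for the per-reset calculation (a sketch with made-up values; the drop threshold is an assumption that would need tuning, and spike filtering as discussed earlier would still be needed first):

```python
# Hypothetical sketch: split the dataset at each reset (a large drop), then
# sum the per-segment consumption (segment max - segment min). This avoids
# needing a fixed rollover point, since each reset just starts a new segment.
data = [300000, 300150, 300320, 200, 450, 900, 1200, 50, 210, 600]
dropThreshold = 1000  # any drop bigger than this counts as a reset (assumption)

segments = [[data[0]]]
for i in range(1, len(data)):
    if data[i - 1] - data[i] > dropThreshold:  # meter reset detected
        segments.append([data[i]])             # start a new segment
    else:
        segments[-1].append(data[i])

totalConsumption = sum(max(seg) - min(seg) for seg in segments)  # 320 + 1000 + 550
```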