[Perspective] onFileReceived - CSV File UTF-8 vs ASCII

Davide3i · March 28, 2022, 1:28pm

Hello.
I'm trying to import a csv file in a Perspective table using the onFileReceived event of an Upload Button. This is the code I'm using:

import csv
from StringIO import StringIO
	
try:
	reader = csv.reader(StringIO(event.file.getString("UTF-8")), delimiter=";")
	header = reader.next()
	converted_data = [[column.replace(',', '.') for column in row] for row in reader]
	self.view.custom.table.data = system.dataset.toDataSet(header, converted_data)
except:
	import traceback
	logger = system.util.getLogger("import_recipe_parameters")
	logger.error("Error: %s" % traceback.format_exc())

Still, I'm having some issue with csv files saved with the CSV UTF-8 format by Excel, while the ones simply saved as CSV files are correctly read.

Using chardec, I can see that the file which is correctly imported is encoded like this:
ok_file.csv: ascii with confidence 1.0
The one giving me issues is encoded like this:
nok_file.csv: UTF-8-SIG with confidence 1.0

Are there any ways to make my code compatible with both encoding formats?

reader = csv.reader(StringIO(event.file.getString("UTF-8-SIG")), delimiter=";")

doesn't seem to do the trick.

From Perspective documentation:

event.file.getString()
Fetches the incoming file data and attempts to parse it as a string via UTF-8 (Eight-bit UCS Transformation Format) encoding. Defaults to UTF-8 (super common), but can use other character sets. Passed as a string, for example getString("UTF_16BE).

Thank you.

kcollins1 · March 28, 2022, 3:42pm

I took a look at this for you, and from what I can discern (in attempting to reproduce a similar file with Excel), Excel is producing a UTF-8 with BOM (byte order mark). Thankfully, you should be able to detect this and fairly easily remove it (below is your code above with modifications):

	import csv
	from StringIO import StringIO
	
	logger = system.util.getLogger("import_recipe_parameters")
	file_bytes = event.file.getBytes()
	
	# Check to see if this is UTF-8 with BOM
	if bytearray.fromhex("ef bb bf") == bytearray(file_bytes[0:3]):
		# Strip first three bytes
		file_bytes = file_bytes[3:]
		
	try:
		# Read in from file_bytes.tostring() now...
		reader = csv.reader(StringIO(file_bytes.tostring()), delimiter=";")
		header = reader.next()
		converted_data = [[column.replace(',', '.') for column in row] for row in reader]
		self.view.custom.table.data = system.dataset.toDataSet(header, converted_data)
	except:
		import traceback
		logger.error("Error: %s" % traceback.format_exc())

Davide3i · March 29, 2022, 6:11am

Thank you, @kcollins1! That did the trick.
I didn’t know about UTF-8 files with BOM, but I’ll leave some references for other people to read.
StackOverflow
Wikipedia

Samg23 · June 30, 2022, 3:06pm

this code works like a charm for me as well, but I don’t fully grasp how the formatting in this:

converted_data = [[column.replace(',', '.') for column in row] for row in reader]

section works. Is it possible to add a piece of code somewhere in there that checks if the column’s value is a float/decimal then rounds it if it is? just replacing column.replace with round(column) doesn’t work for me since some of the columns contain strings so I’m trying to add an if or try clause within it but just can’t get the formatting down if it even allows that.

Davide3i · June 30, 2022, 3:22pm

Hello @Samg23.
If I correctly understood what you’re trying to achieve, I think you can change that line of code with something like this (untested):

converted_data = [[round(column) for column in row if isinstance(column, float)] for row in reader]

Samg23 · June 30, 2022, 3:43pm

Yes, something like that however I want to keep the strings as well, that code returns an error saying the header is mismatched with the converted_data, I’m assuming this is because it skips everything that isn’t a float, is there a way to add an else statement to that just returning the exact value of the column if it isn’t a float?

lrose · June 30, 2022, 3:43pm

Comprehensions are great, but they aren’t always the right tool for the job. ~~They don’t allow for branching logic.~~ In this case it is best to just use a traditional for loop.

converted_data = []
for row in reader:
    converted_row = []
    for column in row:
        if isinstance(column,float):
            converted_row.append(round(column))
        else:
            converted_row.append(column.replace(',','.'))
    converted_data.append(converted_row)

Of course other conditions can be added.

Davide3i · June 30, 2022, 3:44pm

You can add an else clause, like shown here.

Samg23 · June 30, 2022, 3:53pm

I figured that, but I was always getting an error with mismatched headers when trying to reformat the code, I figured it just meant that format couldn't be used in that scenario, but yours works so it looks like that was down to my user error.

Thank you for that, worked like a charm.

Both solutions work but don't round which most likely just means the number I'm getting from the excel sheet isn't a float so I'll have to play around after lunch to find what the general formatting defaulted it to, still new to all of this haha.

Thank you both!

Davide3i · June 30, 2022, 6:47pm

Try to check the output of type(column). If it’s a string, it doesn’t make sense to use isinstance(), but you should cast your column as a float/int.

converted_data = [[round(float(column)) for column in row if **something** else **something else**] for row in reader]

Samg23 · June 30, 2022, 7:42pm

I ended up going with the traditional for loop and changing the if/then to a try/except:

converted_data = []
for row in reader:
	converted_row = []
	for column in row:
		try:
			new = float(column)
			converted_row.append(round(new,2))
		except:
			converted_row.append(column.replace('.','.'))
	converted_data.append(converted_row)

I left the column.replace in doing nothing because I’m lazy

lrose · June 30, 2022, 8:00pm

Honestly, you should catch the specific fault, that way if a general fault occurs your code doesn’t fail.

To each their own. Glad you got it sorted.

Samg23 · June 30, 2022, 8:01pm

Good call, will do.