[Perspective] onFileReceived - CSV File UTF-8 vs ASCII

Hello.
I’m trying to import a CSV file into a Perspective table using the onFileReceived event of an Upload Button. This is the code I’m using:

import csv
from StringIO import StringIO
	
try:
	reader = csv.reader(StringIO(event.file.getString("UTF-8")), delimiter=";")
	header = reader.next()
	converted_data = [[column.replace(',', '.') for column in row] for row in reader]
	self.view.custom.table.data = system.dataset.toDataSet(header, converted_data)
except:
	import traceback
	logger = system.util.getLogger("import_recipe_parameters")
	logger.error("Error: %s" % traceback.format_exc())

However, I’m having issues with CSV files that Excel saves in its "CSV UTF-8" format, while files saved as plain CSV are read correctly.

Using chardet, I can see that the file which is imported correctly is encoded like this:
ok_file.csv: ascii with confidence 1.0
The one giving me issues is encoded like this:
nok_file.csv: UTF-8-SIG with confidence 1.0

Is there a way to make my code work with both encodings? Changing the call to

reader = csv.reader(StringIO(event.file.getString("UTF-8-SIG")), delimiter=";")

doesn’t seem to do the trick.

From Perspective documentation:

event.file.getString()
Fetches the incoming file data and attempts to parse it as a string via UTF-8 (Eight-bit UCS Transformation Format) encoding. Defaults to UTF-8 (super common), but can use other character sets. Passed as a string, for example getString("UTF_16BE").

Thank you.

I took a look at this for you, and from what I can discern (in attempting to reproduce a similar file with Excel), Excel is producing UTF-8 with a BOM (byte order mark). Thankfully, you should be able to detect the BOM and remove it fairly easily. Here is your code with the modifications:

	import csv
	from StringIO import StringIO
	
	logger = system.util.getLogger("import_recipe_parameters")
	file_bytes = event.file.getBytes()
	
	# Check to see if this is UTF-8 with BOM
	if bytearray.fromhex("ef bb bf") == bytearray(file_bytes[0:3]):
		# Strip first three bytes
		file_bytes = file_bytes[3:]
		
	try:
		# Read in from file_bytes.tostring() now...
		reader = csv.reader(StringIO(file_bytes.tostring()), delimiter=";")
		header = reader.next()
		converted_data = [[column.replace(',', '.') for column in row] for row in reader]
		self.view.custom.table.data = system.dataset.toDataSet(header, converted_data)
	except:
		import traceback
		logger.error("Error: %s" % traceback.format_exc())
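
A variation on the same check, offered only as a minimal sketch: the standard `codecs` module exposes the three BOM bytes as `codecs.BOM_UTF8`, so the comparison can be written without a hex literal. The helper name `strip_utf8_bom` and the sample data are hypothetical and not part of the thread; the function only assumes it is handed a plain byte string.

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 byte order mark (EF BB BF), if one is present.

    Hypothetical helper for illustration; expects a byte string.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop exactly the three BOM bytes and keep everything else untouched
        return data[len(codecs.BOM_UTF8):]
    return data

# Made-up sample content in the same semicolon-delimited style as the thread
without_bom = b"name;value\r\nspeed;1,5\r\n"
with_bom = codecs.BOM_UTF8 + without_bom

# Both inputs end up identical, so the same parsing code can handle either file
print(strip_utf8_bom(with_bom) == without_bom)
print(strip_utf8_bom(without_bom) == without_bom)
```

In the event script this would presumably be applied to the byte string before it is wrapped in `StringIO`, for example `StringIO(strip_utf8_bom(file_bytes.tostring()))`, assuming `tostring()` returns a byte string as it does in the code above.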

Thank you, @kcollins1! That did the trick.
I didn’t know about UTF-8 files with BOM, but I’ll leave some references for other people to read.
StackOverflow
Wikipedia
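
For readers new to the BOM, here is a small sketch in plain Python 3, run outside Ignition, of what the mark does to a CSV header when it is not removed. The sample bytes and the `first_header` helper are made up for illustration, and the thread does not say that this is the exact symptom the original script hit.

```python
import csv
import io

# Made-up file content: UTF-8 BOM (EF BB BF) followed by semicolon-delimited rows
raw = b"\xef\xbb\xbfname;value\r\nspeed;1,5\r\n"

def first_header(raw_bytes, encoding):
    """Decode the bytes with the given codec and return the first header cell."""
    text = raw_bytes.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return next(reader)[0]

# Decoded as plain UTF-8, the BOM survives as U+FEFF in front of the first header
print(repr(first_header(raw, "utf-8")))

# Python's "utf-8-sig" codec drops a leading BOM and otherwise behaves like "utf-8"
print(repr(first_header(raw, "utf-8-sig")))
```

Note that "utf-8-sig" is the name of a Python codec; whether a given Java-side API accepts it as a charset name is a separate question.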