How to clean a string from invisible characters?

Hi!

I have a script that reads from an Excel file. That file contains a list of names and configuration to import as tags.

However, sometimes Excel writes non-printable characters, and those are not valid in tag names.
Is there any way to clean a string so that it only contains valid tag name characters? The idea is to clean it so I only get the intended name.

Here is my problem: the WJ (word joiner) invisible character...

Thx in advance!

Spaces are allowed in tag names, even if some people severely dislike this.

However, in your example, and in a quick test I did (8.1.35), Ignition is ignoring the WJ anyway?

My problem is creating the tags.

[Error_Configuration("The name '⁠NonAvailableStatus' is not a valid tag name")]
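
For anyone debugging a name like this, it can help to list the code points so the invisible character shows up. A minimal sketch in plain Python (describe_chars is a hypothetical helper name, not part of Ignition, and the sample string assumes the stray character is U+2060 WORD JOINER as mentioned above):

```python
import unicodedata

def describe_chars(text):
    """Return (code point, Unicode category, character name) for each character."""
    return [
        (u'U+%04X' % ord(c), unicodedata.category(c), unicodedata.name(c, u'<no name>'))
        for c in text
    ]

# Sample input: a word joiner in front of an otherwise normal name.
for row in describe_chars(u'\u2060Non'):
    print(row)
```

The word joiner shows up with category Cf (format), which is what the later filters in this thread key on.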

Try translate

>>> import string
>>> s = "foo\r\nbar"
>>> s.translate(None, string.whitespace)
'foobar'
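
One caveat: the two-argument translate shown here works on Python 2 byte strings but not on unicode strings, and string.whitespace does not contain the word joiner. A hedged alternative is to drop characters by Unicode category instead. In this sketch strip_invisible is a hypothetical name, and the choice of categories (Cc for control, Cf for format) is an assumption about what counts as "invisible" here:

```python
import unicodedata

# Categories treated as invisible in this sketch: Cc (control), Cf (format).
INVISIBLE_CATEGORIES = ('Cc', 'Cf')

def strip_invisible(text):
    """Remove control and format characters; ordinary spaces are kept."""
    return u''.join(
        c for c in text
        if unicodedata.category(c) not in INVISIBLE_CATEGORIES
    )

print(repr(strip_invisible(u'foo\r\nbar')))
print(repr(strip_invisible(u'\u2060Non Available')))
```

Unlike the translate call, this keeps the regular space, since spaces are allowed in tag names.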

Or a small regexp works:

import re
# \u2060 is the invisible word joiner sitting inside the name
''.join(re.findall(r'\w+', u'Non\u2060Available'))

My Integration Toolkit has mungeColumnName() functions (expression and script) that will do this for you. Originally created to make dataset column names into jython-acceptable variable names for the view() expression function.


I think I'm going to do this:

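import unicodedata

# Keep digits, a few punctuation characters, and anything Unicode classifies as a letter
''.join(
	c for c in text
	if c in set("0123456789_ '-:()") or unicodedata.category(c)[0] == 'L'
)

Note: I haven't checked whether this performs badly on large amounts of data.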
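
On the performance question: in the one-liner above the set(...) call is evaluated again for every character. A sketch that builds it once and wraps the filter in a function (clean_tag_name and ALLOWED_EXTRA are hypothetical names, and the allowed punctuation is simply copied from the post above, not taken from Ignition's documentation):

```python
import unicodedata

# Built once at import time instead of once per character.
ALLOWED_EXTRA = frozenset(u"0123456789_ '-:()")

def clean_tag_name(text):
    """Keep listed characters plus any Unicode letter (categories L*)."""
    return u''.join(
        c for c in text
        if c in ALLOWED_EXTRA or unicodedata.category(c).startswith('L')
    )

print(repr(clean_tag_name(u'\u2060NonAvailableStatus')))
```

Accented letters such as ó are in category Ll, so they survive this filter.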

It does work for me, but it removes the accents (I'm Spanish).

Things like Compresión turn into Compresin.

Then, instead of \w, use [0-9A-Za-zÀ-ÿ].
It should cover your needs.

''.join(re.findall("[0-9A-Za-zÀ-ÿ]+" ,'Compresión'))

Edit: forget that, it does not work for a tag name.
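
For context on why the accents disappeared: in Python 2, \w only matches ASCII letters, digits and underscore unless the re.UNICODE flag is passed, and as far as I know Jython behaves the same way. A sketch with the flag set (keep_word_chars is a hypothetical name; the input is assumed to be a unicode string with precomposed accented letters):

```python
import re

# re.UNICODE makes \w follow the Unicode database; it is already the default for str in Python 3.
WORD_CHARS = re.compile(r'\w+', re.UNICODE)

def keep_word_chars(text):
    """Join every run of word characters, dropping everything else."""
    return u''.join(WORD_CHARS.findall(text))

print(repr(keep_word_chars(u'Non\u2060Available')))
print(repr(keep_word_chars(u'Compresi\u00f3n')))
```

This still removes spaces and punctuation, so it may be too aggressive for tag names that legitimately contain them.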


Apache StringUtils

from org.apache.commons.lang3 import StringUtils

stringList = ['Compresión', 'Crème brûlée', 'über', 'garçon', 'Señor']

for s in stringList:
	print StringUtils.stripAccents(unicode(s))

output:

Compresion
Creme brulee
uber
garcon
Senor
>>> 
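
If Commons Lang is not available (for example when testing outside Ignition), something roughly similar can be sketched with the standard library: decompose the string and drop the combining marks. strip_accents is a hypothetical name here, and StringUtils.stripAccents may handle a few special cases that this sketch does not:

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

for s in [u'Compresi\u00f3n', u'Cr\u00e8me br\u00fbl\u00e9e', u'Se\u00f1or']:
    print(strip_accents(s))
```

Keep in mind this is lossy by design, so it only fits if the tag names are allowed to lose their accents.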

Had some time to sit with this a bit more. If you're working with unicode, it's usually wise to import unicode_literals.

StringUtils seems to be able to do everything you need:

from __future__ import unicode_literals
from org.apache.commons.lang3 import StringUtils

def normalize(stringIn, replaceDict={'':''}):
	return StringUtils.stripAccents(StringUtils.replaceEach(StringUtils.normalizeSpace(stringIn), replaceDict.keys(), replaceDict.values()))

# Dictionary of any odd values you want to filter out.
# e.g. Word Joiner is '\u2060'; it has to be listed here because it is not considered whitespace.
replaceDict = {'\u2060' : ''}


stringList = ['  Compres\u2060ión', 'Crème\r\nbrûlée', 'über', 'garçon', 'Señor']

for s in stringList:
	repr(normalize(s, replaceDict))

output:

"u'Compresion'"
"u'Creme brulee'"
"u'uber'"
"u'garcon'"
"u'Senor'"
>>>
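
Since the original question was about keeping the accents, here is a hedged pure-Python variant of the same idea that leaves out the stripAccents step. normalize_name is a hypothetical name, and the whitespace handling is only meant to be similar in spirit to normalizeSpace, not an exact port:

```python
def normalize_name(text, replacements=None):
    """Collapse whitespace, then apply explicit replacements; accents are kept."""
    # split() with no arguments splits on any run of whitespace and trims the ends.
    result = u' '.join(text.split())
    # Characters such as WORD JOINER are not whitespace, so they are listed explicitly.
    for old, new in (replacements or {}).items():
        if old:  # skip empty search strings
            result = result.replace(old, new)
    return result

replace_map = {u'\u2060': u''}
for s in [u'  Compres\u2060i\u00f3n', u'Cr\u00e8me\r\nbr\u00fbl\u00e9e']:
    print(repr(normalize_name(s, replace_map)))
```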