How to find the most similar string from left to right?

I have a dict like this:

{
folder1/tag1: True
folder1/tag2: False
folder2/*: True  # This means that for every item in folder2 the value is True
}

However, I search by the complete path. Something like this:

folder1/tag1
folder1/tag2
folder2/tag1
folder2/tag2

I want to search by folder2/tag1 and get folder2/* configuration.

Any idea on how to do that in some efficient way?

Thx in advance!

Note: We are using this to load/unload a provider. One of our projects currently has ~30k tags so I don't want to have a loop for each entry... 30k tags with a loop of 6 levels is a bit too much!

Maybe something like...

CONFIG = {
	'folder1/tag1': True,
	'folder1/tag2': False,
	'folder2/*': True 
}

def getConfig(path):
	if CONFIG.has_key(path):
		return CONFIG[path]
	try:
		wildcardPath = path.split('/')[0] + '/*'
		if CONFIG.has_key(wildcardPath):
			return CONFIG[wildcardPath]
	except:
		pass
	return None

That works for 1 level.
My current solution is something similar but with a loop.

I'm trying to avoid that.

Not sure it's possible without some kind of loop, but you could convert the wildcard paths to a Regular Expression and test against that:

import re 

CONFIG = {
	'folder1/tag1': True,
	'folder1/tag2': False,
	'folder2/*': True 
}

def getConfig(path):
	if CONFIG.has_key(path):
		return CONFIG[path]
	for c in CONFIG.keys():
		if '*' in c:
			if re.match(c.replace('*', '.*'), path):
				return CONFIG[c]
	return None

This is a good candidate for LLM code generation, though it's not a one-shot success (at least Gemini really likes Python 3 syntax). This might look intimidating, but the basic insight is to swap your linear dictionary for a tree-like structure - basically turn:

customer_config = {
    'folder1/tag1': True,
    'folder1/tag2': False,
    'folder2/*': True,
    'folder3/folder4/*': False,
    'folder3/folder4/tag1': True,
    '*': False  # Global wildcard
}

Into

--- Generated Config Tree ---
{
  "*": false,
  "folder1": {
    "tag1": true,
    "tag2": false
  },
  "folder2": {"*": true},
  "folder3": {"folder4": {
    "*": false,
    "tag1": true
  }}
}

And then you can split each input path along the separator and make a series of lookups.

LLM output
def build_config_tree(config_dict):
    """
    Parses a dictionary of path patterns into a nested dictionary tree for
    efficient lookup.

    Args:
        config_dict (dict): A dictionary where keys are path strings
                            (e.g., 'folder/tag' or 'folder/*') and values
                            are booleans.

    Returns:
        dict: The root of the generated configuration tree.
    """
    root = {}
    if not config_dict:
        return root

    for path, value in config_dict.items():
        parts = path.split('/')
        current_node = root
        # Iterate through path parts to build the nested dictionary structure
        for i, part in enumerate(parts):
            # If it's the last part, assign the boolean value directly
            if i == len(parts) - 1:
                current_node[part] = value
            else:
                # For intermediate parts, ensure a dictionary exists and move down
                current_node = current_node.setdefault(part, {})
    return root

def find_config_value(root, search_path):
    """
    Finds a configuration value for a given path by traversing the dictionary tree.

    This function finds the most specific match for the path. If an exact
    match isn't found at any level, it uses the value from the nearest
    ancestor's wildcard ('*') entry.

    Args:
        root (dict): The root of the configuration dictionary tree.
        search_path (str): The complete path to look up (e.g., 'folder2/tag1').

    Returns:
        bool or None: The matched boolean value, or None if no rule is found.
    """
    parts = search_path.split('/')
    current_node = root
    best_match_value = None

    # Check for a global wildcard '*' at the root level first.
    if '*' in root and isinstance(root['*'], bool):
        best_match_value = root['*']

    for part in parts:
        # Before descending, check if a wildcard at the current level provides a
        # new, more specific fallback value.
        if isinstance(current_node, dict) and '*' in current_node and isinstance(current_node['*'], bool):
            best_match_value = current_node['*']

        # Try to follow the specific path part.
        if isinstance(current_node, dict) and part in current_node:
            current_node = current_node[part]
        else:
            # If the specific path part doesn't exist, we stop and return
            # the last wildcard value we found.
            return best_match_value

    # If we finished the loop, the full path was traversed.
    # If the final node holds a boolean, it's the most specific setting.
    if isinstance(current_node, bool):
        return current_node
    # If the final node is a dictionary (e.g., path is 'folder1' but only
    # 'folder1/tag1' is defined), check for a wildcard at this level.
    elif isinstance(current_node, dict) and '*' in current_node and isinstance(current_node['*'], bool):
        return current_node['*']
    else:
        # Otherwise, we fall back to the best match found along the path.
        return best_match_value

# ===========================================================================
# Example Usage
# ===========================================================================

# 1. Your configuration dictionary
customer_config = {
    'folder1/tag1': True,
    'folder1/tag2': False,
    'folder2/*': True,
    'folder3/folder4/*': False,
    'folder3/folder4/tag1': True,
    '*': False  # Global wildcard
}

# 2. Build the tree once from your configuration
config_tree_root = build_config_tree(customer_config)

# 3. Define the paths you want to search for
paths_to_check = [
    'folder1/tag1',          # Should find the specific value: True
    'folder1/tag2',          # Should find the specific value: False
    'folder2/tag1',          # Should match 'folder2/*' and return: True
    'folder2/any_other_tag', # Should also match 'folder2/*' and return: True
    'folder1/tag3',          # Should match global '*' and return: False
    'folder3/folder4/tag1',  # Should find specific value: True
    'folder3/folder4/tag2',  # Should match 'folder3/folder4/*' and return: False
    'another_folder/tag',    # Should match global '*' and return: False
]

# 4. Perform lookups using the find_config_value function
print "--- Configuration Lookups ---"
for path in paths_to_check:
    value = find_config_value(config_tree_root, path)
    print "Searching for '{path}' -> Found: {value}".format(path=path, value=value)

print "--- Generated Config Tree ---"

print system.util.jsonEncode(config_tree_root, 2)
3 Likes

Certainly more efficient than my solution, especially if the config dict is very large.

I think that's the best idea I currently have.
I was a bit worried that looking inside a dictionary inside a dictionary inside a dicitionary inside a... was gonna generate too much computing cost.

I've been doing some test and it has really not much impact.

So I think I'm going to go with the tree search.

  1. Search most left level in dict
  2. Check if that level is the configuration
  3. If not, go to step 1, if it is, return configuration

This is bread and butter stuff for computers. Until/unless the number of folders multiplied by the number of tag paths you're searching for gets into the millions, I wouldn't even worry about it.

Dictionary lookup is ~constant time - the key to be checked is hashed and then the lookup happens ~immediately, so it's quite efficient CPU wise compared to any "linear" scan where the number of elements in the list directly correlates to the time it takes to iterate. O(1) vs O(n). The tradeoff is higher memory consumption to "prebuild" the list, but Java is very good at allocating small objects for a short period of time.

1 Like

Yep! However I was worried that being a Jython implementation worked different or something.
As you may know, python is not known for its efficiency..

I only hope that memory consumption doesn't gets to much when getting into deep paths or configurations.

Thx a lot, I really appreciate your help!

1 Like