MQTT and Ignition data questions

I will preface this by saying that I am just the data/infrastructure person trying to get a real-time feed of tag data for this start-up. The goal is to use ML for predictive control inputs and to improve regulatory compliance, among other things. I did not install or configure the Ignition system initially; I only provided the AWS infrastructure requested by the folks who did.

Phase 1 is one site; Phase 2 will have at least two different sites, so we need a way to partition the data in the lake by site.

Goals -
Near-real-time data delivery (all tags batched every 60 seconds)
ML model retrieves data and outputs a 2-minute-ahead prediction for specific points
Dynamic, automatic per-site data partitioning and ingestion into the S3 lake and the ML models

Implementation -
A Lambda pulls from the Aurora Postgres historian cross-account into the S3 lake on a 1-minute schedule using a SQL query.
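
For reference, the pull is roughly shaped like the sketch below. The table name, bucket, site key, and connection details are placeholders, and pg8000/boto3 are just what I happen to use; the point is the 60-second window and the join to sqlth_te so rows land in the lake keyed by tagpath.

```python
import json
import time

import boto3
import pg8000.native  # any Postgres driver works; pg8000 is illustrative

S3_BUCKET = "my-data-lake"        # placeholder
TABLE = "sqlt_data_1_2024_05"     # placeholder: current monthly partition table

def handler(event, context):
    conn = pg8000.native.Connection(
        user="historian_ro", password="...",  # from Secrets Manager in reality
        host="aurora-cluster-endpoint", database="ignition",
    )
    # Last 60 seconds of samples; t_stamp in the Ignition historian is epoch millis.
    cutoff = int(time.time() * 1000) - 60_000
    rows = conn.run(
        f"""
        SELECT te.tagpath, d.tagid, d.floatvalue, d.intvalue, d.t_stamp
        FROM {TABLE} d
        JOIN sqlth_te te ON te.id = d.tagid
        WHERE d.t_stamp >= :cutoff
        """,
        cutoff=cutoff,
    )
    cols = ("tagpath", "tagid", "floatvalue", "intvalue", "t_stamp")
    body = "\n".join(json.dumps(dict(zip(cols, r))) for r in rows)
    # site=... in the key is how the Phase 2 per-site split would land in the lake.
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=f"site=site1/dt={time.strftime('%Y-%m-%d')}/{int(time.time())}.jsonl",
        Body=body.encode(),
    )
    conn.close()
```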

Issues -
Costs: I/O from Aurora is not inexpensive.
Multiple sqlt_data_#_yyyy_mm tables: for some reason two different tables are receiving data, and each uses a different tagid for the same tagpath. This is not insurmountable, but differing tagids for the same tagpath are problematic on the analytics side (see the query sketch after this list).
Real-time data in the future: while technically feasible, the extra table feed (which seems to start randomly) and the added latency make this option less desirable.
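
To make the second issue concrete: since the same tagpath can carry different tagids across the two partition tables, the only safe canonical key on the analytics side is tagpath via sqlth_te. The workaround query looks roughly like this (partition table names are examples from my instance):

```python
# Union the live partition tables and resolve tagpath via sqlth_te, so the
# differing tagids stop mattering downstream.
HISTORY_SQL = """
SELECT te.tagpath, d.floatvalue, d.intvalue, d.t_stamp
FROM (
    SELECT tagid, floatvalue, intvalue, t_stamp FROM sqlt_data_1_2024_05
    UNION ALL
    SELECT tagid, floatvalue, intvalue, t_stamp FROM sqlt_data_2_2024_05
) d
JOIN sqlth_te te ON te.id = d.tagid
WHERE d.t_stamp >= :cutoff
"""
```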

So, with those issues, minor as they are, I went looking for another option that looked less expensive per site and that would also eventually provide that real-time feed. Enter the MQTT feed. There is an Ignition Edge gateway with the MQTT Transmission module (Cirrus Link, v4.0.27 I believe) publishing to IoT Core in AWS, along with an EC2 instance running an Ignition Gateway with the MQTT Engine module.
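
For anyone unfamiliar with the feed shape: Transmission publishes Sparkplug B, so what arrives at IoT Core is protobuf payloads on topics like the following (the group/edge/device names here are hypothetical, whatever was configured on the Edge gateway):

```python
# Sparkplug B topic namespace used by MQTT Transmission:
#   spBv1.0/<group_id>/<message_type>/<edge_node_id>[/<device_id>]
# message_type: NBIRTH, NDEATH, DBIRTH, DDEATH, NDATA, DDATA, NCMD, DCMD
EXAMPLE_TOPIC = "spBv1.0/site1/DDATA/edge1/plc1"  # a device data message
```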

I created a Lambda fed by an IoT Rule to grab the data off the IoT Core feed, and then I blew through $2,500 in about two days because I apparently built the IoT Rule without enough filtering, so that initial attempt was bad. I did my best to create a better rule, and I think I have a good filter now (sketched below).
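
The current rule is along these lines; the rule name, ARN, and the site1 group are placeholders. The important change from my first attempt is scoping the topic filter to DDATA messages from one group rather than something like spBv1.0/#, which forwarded every BIRTH/DEATH/command message too:

```python
import boto3

iot = boto3.client("iot")
iot.create_topic_rule(
    ruleName="site1_ddata_to_lambda",  # placeholder
    topicRulePayload={
        # Sparkplug payloads are binary protobuf, so base64-encode them for the
        # Lambda; the filter only matches device data from the site1 group.
        "sql": "SELECT encode(*, 'base64') AS payload, topic() AS topic "
               "FROM 'spBv1.0/site1/DDATA/#'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{"lambda": {"functionArn": "arn:aws:lambda:..."}}],  # placeholder
        "ruleDisabled": False,
    },
)
```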

The issue I am seeing now is tagid: None for all tag paths. I could create a mapping table in S3 and join against it while writing to the lake, but reading from S3 adds non-trivial latency to the process (a caching sketch follows below).
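
If a mapping table turns out to be the answer, my plan to dodge the per-invocation S3 read is to cache the map in the Lambda's global scope, so only cold starts (or a TTL expiry) pay the round trip. A minimal sketch, with the bucket/key as placeholders:

```python
import json
import time

import boto3

_s3 = boto3.client("s3")
_cache = {"map": None, "loaded": 0.0}
TTL_SECONDS = 300  # refresh the tagpath -> tagid map every 5 minutes

def tag_map():
    # Lambda execution environments are reused, so module-level state survives
    # across warm invocations; cold starts or a stale TTL trigger the S3 read.
    if _cache["map"] is None or time.time() - _cache["loaded"] > TTL_SECONDS:
        obj = _s3.get_object(Bucket="my-data-lake", Key="meta/tag_map.json")
        _cache["map"] = json.loads(obj["Body"].read())
        _cache["loaded"] = time.time()
    return _cache["map"]
```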

Questions:
Is tagid supposed to be in the message? If yes, where should I go to figure out why it isn't? If no, is there a way to send the tagid → tagpath mapping via the feed so I can be sure I have the latest mappings?

Is there something I should do to stop the historian from feeding multiple tables? Having two different tagids for the same tagpath in the historian is not ideal for referential integrity.

Is there a better choice than an AWS Lambda for the MQTT → S3 lake path for this data? I looked at Kinesis, but it seems like that will be more expensive. I looked at IoT bridge → SiteWise, and that is about $825 a month just for those two (which may still be cheaper than any other option).
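
One middle-ground variant I have been weighing (not tested): an IoT Rule Firehose action that batches straight into S3 with no Lambda in the hot path. Firehose is priced per GB ingested rather than per shard-hour, so it may not inherit plain Kinesis's cost problem. Role ARN and stream name are placeholders:

```python
import boto3

boto3.client("iot").create_topic_rule(
    ruleName="site1_ddata_to_firehose",  # placeholder
    topicRulePayload={
        "sql": "SELECT encode(*, 'base64') AS payload, topic() AS topic "
               "FROM 'spBv1.0/site1/DDATA/#'",
        "actions": [{
            "firehose": {
                "roleArn": "arn:aws:iam::...:role/iot-to-firehose",  # placeholder
                "deliveryStreamName": "site1-tag-data",              # placeholder
                "separator": "\n",  # newline-delimited records in the S3 objects
            }
        }],
        "ruleDisabled": False,
    },
)
```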

Is there potential to use something like SQLite on the EC2 instance to provide the necessary historian? It could hold a more limited set of historical data because we have full history in S3. (I am thinking 90 days on the instance and everything else in S3.)
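
I have not confirmed that Ignition can target SQLite as a historian datasource, so treat this as the retention half of the idea only: a hypothetical local table tag_history keyed by an epoch-millis t_stamp, pruned to 90 days since everything older already lives in S3.

```python
import sqlite3
import time

NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000

def prune(db_path="/data/historian.db"):  # path is a placeholder
    cutoff = int(time.time() * 1000) - NINETY_DAYS_MS
    # The context manager commits the DELETE on exit.
    with sqlite3.connect(db_path) as conn:
        conn.execute("DELETE FROM tag_history WHERE t_stamp < ?", (cutoff,))
    # VACUUM reclaims the freed pages; it must run outside a transaction.
    conn2 = sqlite3.connect(db_path)
    conn2.execute("VACUUM")
    conn2.close()
```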

I did manage to set up automatic failover on the Edge gateway for a loss of connection so that we hopefully don't lose data, but I am not sure it will work that way. Happy to read any documents or other articles. I have done a lot of internet/forum searching, but any help is appreciated; I suspect I am just not doing a good job with search terms.