There's a lot to unpack here, and without seeing your whole system holistically it's really hard to answer in more than vague terms, but I'll give it a shot.
First off, in the general case, you can assume that network operations are going to be dramatically slower than operations on the same machine, which are themselves going to be dramatically slower than operations in the same process.
This is true to such a degree that if you're doing multiple network operations, in a sense it doesn't matter how efficient your in-process code is. This is where multithreading can make sense - if you must make multiple network operations, doing them in parallel instead of in series is naturally going to reduce wall clock time.
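To make that concrete, here's a minimal sketch in plain Java (not Ignition-specific API; `fakeNetworkCall` is just a hypothetical stand-in for one network round trip) of issuing three calls in parallel instead of back to back:

```java
// Minimal sketch, plain Java (not Ignition API). fakeNetworkCall is a
// hypothetical stand-in for a single network round trip.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelCalls {
    static String fakeNetworkCall(String name) {
        try {
            Thread.sleep(1000); // pretend this is a 1 second network round trip
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response for " + name;
    }

    public static void main(String[] args) {
        List<CompletableFuture<String>> futures = List.of("a", "b", "c").stream()
                .map(name -> CompletableFuture.supplyAsync(() -> fakeNetworkCall(name)))
                .collect(Collectors.toList());
        // Total wall clock time is roughly the slowest single call (~1s),
        // not the sum of all three (~3s) as it would be in series.
        futures.forEach(f -> System.out.println(f.join()));
    }
}
```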
Now, speaking of 8.1 (and prior) versions specifically, there's also memory efficiency as an axis to target.
As you describe, any Ignition operation involving the database is actually being done by the gateway, even if it's on behalf of a particular client.
Thus, there is always time inefficiency built in; the request flows from the client through the gateway to the database and back. This is unavoidable.
Client -> Gateway -> Database -> Gateway -> Client.
The only benefit of StreamingDatasetWriter is to improve memory efficiency.
Because the gateway is purely serving as a 'proxy' in this case, there's no need for it to buffer a complete response in memory only to immediately discard it once the write down to the client has drained; it can forward rows as they arrive.
The queryHistory API is designed around this idea, hence StreamingDatasetWriter and the concrete implementation intended for use in RPC, com.inductiveautomation.ignition.gateway.servlets.gateway.GatewayResponseStreamingDatasetWriter, which is unfortunately not public API in 8.1.
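For illustration only, here's a rough sketch of that streaming pattern using plain JDBC; the class and method names below are hypothetical and this is not the actual StreamingDatasetWriter interface, just the general "forward each row instead of buffering the whole result" idea:

```java
// Hypothetical proxy sketch (plain JDBC), not the real StreamingDatasetWriter API.
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingProxySketch {
    // Forward each row from the database straight to the client stream.
    // Memory use stays bounded by roughly one row instead of the whole result set.
    static void proxyRows(Connection db, DataOutputStream toClient, String sql) throws Exception {
        try (Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    toClient.writeUTF(String.valueOf(rs.getObject(i)));
                }
            }
            toClient.flush();
        }
    }
}
```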
However, all of this tangential discussion, while interesting, is almost completely moot.
For one, the way RPC works changed entirely in 8.3, making it much easier to immediately start streaming results of any kind down to the client (improving time efficiency and not just memory), thus somewhat circumventing the limited nature of streaming datasets.
But the most important thing is:
That's not that much data.
That it takes 40 seconds indicates a pathological issue in your database/historian. I've seen historians with millions of tags return hundreds of thousands of data points per query much faster than tens of seconds. If the query itself is slow, then the thing to focus on is your database/historian; all the other optimization in the world is moot.
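One quick way to confirm that, assuming you can connect to the database directly: time the raw query outside Ignition entirely. A rough sketch with plain JDBC - the JDBC URL, credentials, table, and tag id below are placeholders, substitute your own:

```java
// Rough sketch: time a raw historian query outside Ignition. The JDBC URL,
// credentials, table name, and tag id are placeholders, not real values.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeRawQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://historian-host/history", "user", "password");
             Statement stmt = conn.createStatement()) {
            long start = System.nanoTime();
            int rows = 0;
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT t_stamp, floatvalue FROM sqlt_data_1_2024_01 WHERE tagid = 1234")) {
                while (rs.next()) {
                    rows++;
                }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(rows + " rows in " + elapsedMs + " ms");
        }
    }
}
```

If the raw query is already taking tens of seconds on its own, the fix belongs in the database/historian itself, not in how the results are shipped back to the client.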