The present invention includes the following steps: loading a master policy, a plurality of sub-policies, and environment data, wherein the sub-policies have different inference costs; selecting one of the sub-policies as a selected sub-policy by using the master policy; generating at least one action signal according to the selected sub-policy; applying the at least one action signal to an action executing unit; detecting at least one reward signal from a detecting module; and training the master policy using at least one real inference cost of the at least one reward signal and an expected inference cost of the selected sub-policy, so as to minimize inference cost. The present invention trains the master policy using Hierarchical Reinforcement Learning with an asymmetrical policy architecture, thus allowing the master policy to reduce inference cost while maintaining satisfactory performance of a deep neural network model.
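The following is a minimal sketch, not the patented implementation, of how such a cost-aware master policy could be trained: a master policy samples one of several sub-policies of differing capacity, the chosen sub-policy produces the action signal, and the master policy is updated with a reward penalized by the measured (real) inference cost relative to the sub-policy's expected inference cost. The class names (MasterPolicy, SubPolicy), the cost_weight trade-off term, the REINFORCE-style update, and the dummy environment are all assumptions introduced for illustration.

```python
# Illustrative sketch only; names and the cost-penalized reward are assumptions.
import time
import torch
import torch.nn as nn

class SubPolicy(nn.Module):
    """A sub-policy; `hidden` controls capacity and hence inference cost."""
    def __init__(self, obs_dim, act_dim, hidden, expected_cost):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )
        self.expected_cost = expected_cost  # prior estimate of inference cost

    def forward(self, obs):
        return self.net(obs)

class MasterPolicy(nn.Module):
    """Outputs a categorical distribution over the available sub-policies."""
    def __init__(self, obs_dim, num_sub):
        super().__init__()
        self.net = nn.Linear(obs_dim, num_sub)

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

obs_dim, act_dim = 8, 2
subs = [SubPolicy(obs_dim, act_dim, hidden=h, expected_cost=c)
        for h, c in [(16, 0.2), (64, 0.5), (256, 1.0)]]   # cheap -> expensive
master = MasterPolicy(obs_dim, len(subs))
optim = torch.optim.Adam(master.parameters(), lr=1e-3)
cost_weight = 0.1  # assumed trade-off between task reward and inference cost

def env_step(action):
    """Stand-in for the action executing unit plus detecting module."""
    return float(-action.pow(2).sum())  # dummy reward signal

for step in range(100):
    obs = torch.randn(obs_dim)                      # environment data
    dist = master(obs)
    idx = dist.sample()                             # select a sub-policy
    sub = subs[idx]

    t0 = time.perf_counter()
    action = sub(obs)                               # generate action signal
    real_cost = time.perf_counter() - t0            # measured inference cost

    reward = env_step(action)                       # detected reward signal
    # Cost-aware return: penalize by real cost relative to expected cost,
    # so the master learns to pick cheaper sub-policies when they suffice.
    adjusted = reward - cost_weight * (real_cost / sub.expected_cost)

    loss = -dist.log_prob(idx) * adjusted           # REINFORCE-style update
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In this sketch only the master policy is trained, which mirrors the asymmetrical architecture described above: lightweight and heavyweight sub-policies are treated as fixed alternatives, and the master policy learns when a cheaper sub-policy already yields a satisfactory reward.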