We present a methodology for increasing the throughput of semantic video segmentation on an embedded cluster composed of multiple embedded processing elements (ePEs). The methodology adopts a scalable master-slave hierarchy. The master ePE divides each video frame into frame regions and dynamically distributes these regions to the slave ePEs. Each slave ePE executes either a segmentation path or a flow path: the former is more accurate but slower, while the latter is faster but less accurate. A lightweight decision network determines the execution path for each slave ePE. A global and local key management scheme governs the allocation of video frames to the ePEs and coordinates the execution of the cluster, such that the average processing latency of each frame is significantly reduced. We evaluate the methodology on a real embedded cluster in terms of accuracy and frame rate, validate its effectiveness and efficiency across various ePE configurations, and provide a detailed latency analysis for these configurations.
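The master-side flow described above (divide a frame into regions, choose a path per region, hand regions to slave ePEs) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`split_into_regions`, `choose_path`, `dispatch`, `PATH_COST`) is hypothetical, a scalar score compared against a threshold stands in for the decision network, the relative path costs are invented for illustration, and a greedy least-loaded assignment stands in for the dynamic distribution and key management scheme, whose details are not given in this abstract.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (top, left, height, width)

# Hypothetical relative costs: the segmentation path is slower than the flow path.
PATH_COST = {"segmentation": 4.0, "flow": 1.0}


def split_into_regions(height: int, width: int, rows: int, cols: int) -> List[Region]:
    """Divide a frame into a rows x cols grid of regions.

    Boundaries are rounded down, so region sizes differ by at most one pixel.
    """
    regions = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            regions.append((top, left, bottom - top, right - left))
    return regions


def choose_path(score: float, threshold: float) -> str:
    """Stand-in for the decision network (assumption): a high score selects
    the faster flow path, a low score selects the more accurate segmentation path."""
    return "flow" if score >= threshold else "segmentation"


def dispatch(
    regions: Sequence[Region],
    scores: Sequence[float],
    num_workers: int,
    threshold: float = 0.8,
) -> List[Tuple[int, str, Region]]:
    """Assign each region to the currently least-loaded worker (assumed policy)."""
    load = [0.0] * num_workers
    plan = []
    for region, score in zip(regions, scores):
        path = choose_path(score, threshold)
        worker = min(range(num_workers), key=load.__getitem__)  # ties go to the lowest index
        load[worker] += PATH_COST[path]
        plan.append((worker, path, region))
    return plan


if __name__ == "__main__":
    regions = split_into_regions(512, 1024, rows=2, cols=2)
    for worker, path, region in dispatch(regions, [0.9, 0.5, 0.95, 0.7], num_workers=2):
        print(f"worker {worker}: {path:12s} region {region}")
```

In this sketch, regions routed to the cheaper flow path leave a worker free sooner, so that worker tends to receive the next region. A real cluster would additionally need to transfer region data to each slave ePE and merge the results back into a full frame.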