here: the input_ids and running score product take an order of magnitude less memory to store than it takes to actually run the beams, so you can cache a huge number of them and only run the highest-probability ones. it's quite efficient [they also share prefix sequences, so the cached copies overlap a lot]
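rough sketch of what that could look like, not a real implementation: a heap holds the cheap cached state (score plus token ids) and only the top few entries get run each step. `toy_next_logprobs`, `best_first_decode` and `run_per_step` are made-up names, the toy function stands in for an actual model forward pass, and the score product is kept as a sum of log-probs to avoid underflow.

```python
import heapq
import math


def toy_next_logprobs(input_ids):
    """Hypothetical stand-in for a model forward pass.

    Returns (token, logprob) pairs over a 3-token vocabulary, with the
    distribution rotated by the last token so it depends on the prefix.
    """
    probs = [0.6, 0.3, 0.1]
    shift = input_ids[-1] % len(probs)
    rotated = probs[shift:] + probs[:shift]
    return [(tok, math.log(p)) for tok, p in enumerate(rotated)]


def best_first_decode(start_ids, max_len, run_per_step=2):
    # Each cache entry is only (negated log-score, token ids): cheap to store,
    # so many more hypotheses can be cached than can be run at once.
    cache = [(0.0, tuple(start_ids))]
    while cache:
        # Scores can only drop as tokens are appended, so once a complete
        # hypothesis is at the top of the heap nothing cached can beat it.
        neg_score, ids = cache[0]
        if len(ids) >= max_len:
            return list(ids), -neg_score

        # Only the highest-probability cached hypotheses are actually run.
        n = min(run_per_step, len(cache))
        batch = [heapq.heappop(cache) for _ in range(n)]
        for neg_score, ids in batch:
            if len(ids) >= max_len:
                # Already complete: put it back in the cache untouched.
                heapq.heappush(cache, (neg_score, ids))
                continue
            for tok, logp in toy_next_logprobs(ids):
                heapq.heappush(cache, (neg_score - logp, ids + (tok,)))
    return None


if __name__ == "__main__":
    ids, score = best_first_decode([0], max_len=3)
    print(ids, math.exp(score))
```

this stores full id tuples for simplicity. since the hypotheses share prefixes, a trie or parent-pointer layout would presumably cut the cache size further, but that's left out here.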