SigLIP 2: A better multilingual vision language encoder

I tried running the zero-shot classification example and got:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).

transformers version: 4.49.0.dev0

I tried adding both padding=True and truncation=True to no avail. I also tried padding="max_length".
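For reference, here is a minimal sketch of what I'm running (the SigLIP 2 checkpoint name and image URL are just examples, not the exact ones I used):

from transformers import pipeline

# assuming a SigLIP 2 checkpoint on the Hub; any siglip-type model should hit the same code path
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"

# candidate labels that tokenize to different lengths trigger the ValueError;
# labels that all tokenize to the same length work fine (see the EDIT below)
outputs = classifier(image_url, candidate_labels=["a cat", "a very fluffy golden retriever"])
print(outputs)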

EDIT:
It seems to work if my labels are all the same length. Doing some debugging, I see that in zero_shot_image_classification.py, the padding provided to the tokenizer is forced to "max_length" anyway (L148-149):

padding = "max_length" if self.model.config.model_type == "siglip" else True
text_inputs = self.tokenizer(sequences, return_tensors=self.framework, padding=padding, **tokenizer_kwargs)

And yet, if my labels have variable lengths, the tokenizer outputs are not all the same length, so calling torch.tensor on them ultimately fails. I did spot this warning in my terminal as well:

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
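So, putting it together: the pipeline forces padding="max_length", but no explicit max_length is passed and the tokenizer apparently has no model_max_length defined, so that padding strategy silently falls back to no padding, leaving ragged input_ids that can't be stacked into a tensor. A quick tokenizer-only sketch of the behaviour (assuming a SigLIP 2 checkpoint whose tokenizer doesn't define model_max_length, as the warning above suggests; the checkpoint name is again just an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")
print(tok.model_max_length)  # a huge sentinel value here means "no predefined maximum length"

labels = ["a cat", "a very fluffy golden retriever"]

# without an explicit max_length, padding="max_length" degrades to no padding,
# so the sequences keep their natural (ragged) lengths
enc = tok(labels, padding="max_length")
print([len(ids) for ids in enc["input_ids"]])

# passing max_length explicitly restores fixed-length output
enc = tok(labels, padding="max_length", max_length=64)
print([len(ids) for ids in enc["input_ids"]])  # all 64

So one workaround is to make sure an explicit max_length reaches the tokenizer (64 is, as far as I know, the text sequence length SigLIP models are trained with); the cleaner fix would be for the tokenizer config to set model_max_length so the pipeline's forced "max_length" padding actually pads.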
