POLYT5: Turning Polymer Design Into a Text-to-Text Problem

Most polymer-design workflows still follow a familiar path: first build property models, then screen a library of candidates generated by heuristics, rules, or expert intuition.

That workflow has been useful, but it has a built-in bottleneck. The model can only evaluate what the enumeration step gives it. If the candidate library covers a tiny part of chemical space, or if the enumeration rules carry user-imposed biases, the downstream screening workflow inherits those limits.

In our POLYT5 work, we tried to flip that workflow.

Instead of treating polymer generation as a separate pre-modeling step, we asked whether polymer design could be framed directly as a text-to-text learning problem. If a model can learn the grammar of polymer repeat units at scale, then the same foundation model can transfer to two coupled tasks: predicting properties and generating new polymer structures conditioned on desired targets.

Why a polymer-native language model?

General-purpose language models are powerful, but polymers have their own syntax, constraints, and recurring structural motifs. A repeat unit is not just a sentence in ordinary language. It encodes chemistry and connectivity, it must be chemically valid and plausibly synthesizable, and it carries long-range dependencies that matter.

POLYT5 is based on the T5 encoder-decoder architecture and was trained on more than 100 million polymer structures represented using SELFIES. The encoder-decoder design is useful here because the model can both read polymer strings bidirectionally and generate output sequences for downstream tasks.
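
To make "polymer strings" a little more concrete, here is a minimal sketch of how a SELFIES string breaks into bracketed symbols that a language model can treat as tokens. This is not the POLYT5 tokenizer, which this post does not describe. The helper names are hypothetical, the example string is benzene (a small molecule, not a polymer repeat unit), and how repeat-unit end points are encoded is deliberately left out. The open-source selfies package ships its own helpers for this kind of splitting.

```python
import re

# Benzene in SELFIES, used only as a small, well-known example string.
# It is NOT a polymer repeat unit and is not taken from any POLYT5 data.
BENZENE_SELFIES = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"


def split_selfies_symbols(selfies):
    """Split a SELFIES string into its bracketed symbols (hypothetical helper)."""
    symbols = re.findall(r"\[[^\[\]]+\]", selfies)
    # Reject input that contains anything outside of bracketed symbols.
    if "".join(symbols) != selfies:
        raise ValueError(f"not a plain sequence of SELFIES symbols: {selfies!r}")
    return symbols


def build_vocabulary(corpus):
    """Give each distinct symbol an integer id, in first-seen order."""
    vocab = {}
    for selfies in corpus:
        for symbol in split_selfies_symbols(selfies):
            vocab.setdefault(symbol, len(vocab))
    return vocab


symbols = split_selfies_symbols(BENZENE_SELFIES)
vocab = build_vocabulary([BENZENE_SELFIES])
token_ids = [vocab[s] for s in symbols]
```

Because every symbol is a self-contained bracketed unit, the vocabulary stays small and the decoder emits one chemically meaningful symbol per step.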

The practical idea is simple:

Pre-train the model on polymer structures so it learns a polymer-specific language.
Fine-tune it for property prediction across thermal, electronic, and solubility-related tasks.
Fine-tune it again for conditional generation, where a target property value becomes the prompt and the model outputs hypothetical polymer structures.

This makes prediction and generation feel like two sides of the same representation problem rather than disconnected pieces of a pipeline.
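
One way to picture the text-to-text framing is as pairs of source and target strings, where the same record yields a prediction example in one direction and a generation example in the other. The sketch below is an illustration under assumptions: the prompt wording, the `TextPair` class, and the helper names are hypothetical, and the polymer string and property value are placeholders rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPair:
    source: str  # text the encoder reads
    target: str  # text the decoder is trained to emit


def prediction_pair(polymer, prop, value, unit):
    """Property prediction: polymer string in, property value out (as text)."""
    return TextPair(source=f"predict {prop}: {polymer}",
                    target=f"{value:.1f} {unit}")


def generation_pair(polymer, prop, value, unit):
    """Conditional generation: property target in, polymer string out."""
    return TextPair(source=f"generate {prop} = {value:.1f} {unit}",
                    target=polymer)


# Placeholder record: the string and the value are made up for illustration.
record = {"polymer": "[C][C]", "prop": "Tg", "value": 450.0, "unit": "K"}

pred = prediction_pair(**record)
gen = generation_pair(**record)
```

The two pairs contain the same information with source and target swapped, which is the sense in which prediction and generation share one representation.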

The dielectric polymer design test

We used high-temperature dielectric polymer design as an end-to-end test case. The target was not one property in isolation. The goal was to find candidates satisfying several constraints at once:

dielectric constant greater than 3
bandgap greater than 4 eV
glass transition temperature above 400 K
melt-processability constraints
solubility in practical, lower-impact solvents such as water or ethanol

POLYT5 generated a large candidate pool conditioned on a target glass-transition temperature. We then applied a sequence of fine-tuned property models and screening filters to that pool. From millions of generated structures, the workflow narrowed the space to more than 18,000 promising dielectric polymer candidates.
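
The screening funnel described above can be sketched as a chain of filters applied to predicted properties. The numeric thresholds come from the list of constraints in this post. Everything else is an assumption for illustration: the `Candidate` fields, the boolean stand-in for melt-processability, the solvent set, and the five made-up candidates, whose values stand in for model predictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dielectric_constant: float
    bandgap_ev: float
    tg_kelvin: float
    melt_processable: bool   # assumed boolean stand-in for the real criteria
    soluble_in: frozenset    # assumed set of solvent names


# Assumed list of acceptable solvents, based on "water or ethanol" in the text.
PREFERRED_SOLVENTS = frozenset({"water", "ethanol"})

# Filters are applied in order; each one keeps the candidates that pass.
FILTERS = [
    ("dielectric constant > 3", lambda c: c.dielectric_constant > 3.0),
    ("bandgap > 4 eV", lambda c: c.bandgap_ev > 4.0),
    ("Tg > 400 K", lambda c: c.tg_kelvin > 400.0),
    ("melt-processable", lambda c: c.melt_processable),
    ("soluble in preferred solvent",
     lambda c: bool(c.soluble_in & PREFERRED_SOLVENTS)),
]


def screen(candidates):
    """Return the survivors plus the pool size after each filter."""
    survivors = list(candidates)
    funnel = [("generated", len(survivors))]
    for label, keep in FILTERS:
        survivors = [c for c in survivors if keep(c)]
        funnel.append((label, len(survivors)))
    return survivors, funnel


# Made-up pool; none of these values come from the paper.
pool = [
    Candidate("A", 3.5, 4.6, 450.0, True, frozenset({"ethanol"})),
    Candidate("B", 2.8, 5.0, 480.0, True, frozenset({"water"})),
    Candidate("C", 3.2, 3.5, 420.0, True, frozenset({"ethanol"})),
    Candidate("D", 3.1, 4.2, 410.0, False, frozenset({"water"})),
    Candidate("E", 3.3, 4.4, 430.0, True, frozenset({"chloroform"})),
]

survivors, funnel = screen(pool)
for label, count in funnel:
    print(f"{label:32s} {count}")
```

Recording the pool size after each stage shows which constraint removes the most candidates, which is useful when deciding whether a threshold is too strict.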

One representative top candidate was selected for experimental synthesis and validation. The measured properties showed strong agreement with the model predictions, within the expected model uncertainty. That is the part I find most satisfying: the workflow did not stop at a generated string or a virtual-screening table. It reached an experimentally checked material.

What POLYT5 contributes

The main contribution is not just that a transformer can predict polymer properties. The more interesting point is that a polymer-native foundation model can support a full design loop:

learn polymer syntax from large-scale pre-training
predict thermal, electronic, and solubility properties after fine-tuning
generate chemically valid hypothetical polymers conditioned on property targets
screen candidates using multi-objective design criteria
connect the computational workflow to synthesis and characterization

We also wrapped the workflow in an agentic interface that connects POLYT5 with a general-purpose language model. This lets a user query the pipeline in natural language for property prediction or generative design, lowering the barrier for people who may not want to interact directly with model scripts.

That interface is still only one layer around the core scientific model, but I think it points in a useful direction: materials-design tools should become easier to query, not only easier to run.

What I want to write about next

This paper opens several threads that are worth unpacking in future posts:

Why SELFIES is useful for polymer generation.
What T5 adds compared with decoder-only polymer language models.
How conditional generation differs from virtual screening.
How to think about validity, novelty, and synthesizability in generated polymers.
Why closing the loop with experiment matters, even for one candidate.
How agentic interfaces can make materials informatics workflows more usable without hiding the scientific assumptions.

For me, POLYT5 is a step toward a more direct design loop: describe the material target, generate candidates in a chemically meaningful representation, screen them with linked predictors, and then move the most promising ideas toward experiment.

Read the paper: POLYT5: an encoder-decoder foundation chemical language model for generative polymer design.