
Case Study: NLG for Medical Health Education

Author: Adrian · September 24, 2025

Dingxiang Doctor receives thousands of health questions every day in its comment sections and backend. To help users resolve these health concerns, the platform provides content in several formats. For high-frequency, common questions, hundreds of physicians and experts compile materials and produce professionally written, easy-to-understand FAQ-style entries in a "health encyclopedia". There are also many in-depth expert popular-science articles that explain the reasoning behind medical knowledge.

These two content types are well structured: they have relatively tidy headlines, assigned departments, and classifications by medical entity or health topic, so users can reach them conveniently via search. However, medicine is complex, and many conditions vary greatly between individuals. To borrow an analogy from the recent wave of widespread COVID infections:

A pony wants to cross a river and asks a squirrel, "Is the water deep?" The squirrel replies, "Very deep! My friend drowned crossing it!" The pony then asks a dog, which says, "It's pretty deep; I barely swam across." The pony finally asks a cow. The cow smiles and asks, "Do you need ibuprofen?"

Proactive popular-science content cannot exhaustively cover every detail or possibility. If a user has a personalized question that an article does not answer, they can choose a paid online consultation and ask a doctor directly. After a consultation, the record is kept confidential by default. If the user agrees, the consultation can be made public: sensitive information is anonymized and the record enters the search index, becoming discoverable through search in the "public questions" section.

Anyone who has worked on information retrieval knows that improving search effectiveness requires both better semantic matching and more structured source data. Keen readers will have noticed that public consultation records are automatically given titles. For long-text retrieval, the title is an important indexing field: it captures the core intent, helps semantic matching between document and query, and improves user reading experience. Titles are typically written by editors, who can abstract and summarize complex material and clarify main logical threads. In our scenario, we want the model to do something similar: extract the user's chief complaint and generate a fluent interrogative sentence.

Three years ago, when RNN-based Seq2Seq architectures were mainstream, the team experimented extensively with summarization. Within that framework, we focused on structural tricks to improve long-text encoding, copy mechanisms to better identify key information, and the integration of external knowledge entities. Constrained by encoder capacity, however, the results were often unsatisfactory: there were occasional promising cases, but output stability was insufficient for production use.

Over the following three years, pretrained models advanced rapidly, and NLG performance improved substantially. For our task, Google’s T5 became the base framework. After several rounds of fine-tuning on labeled data, the baseline reached a passable, readable state. Before deployment, however, several issues remained: disfluent output caused by the low-resource setting, factual inconsistency in the generated text, and overly long input texts.
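For context, here is a minimal sketch of what such a title generator could look like with Hugging Face Transformers. The model name, sequence lengths, and decoding parameters are illustrative assumptions, not our production configuration.

```python
# Minimal sketch: fine-tuning a T5-style model to turn a consultation record
# into a short interrogative title. Model name and hyperparameters are
# illustrative assumptions, not the production setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-small"  # assumption: any multilingual T5 variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def training_step(record: str, reference_title: str) -> torch.Tensor:
    """One MLE training step: consultation record -> reference title."""
    inputs = tokenizer(record, truncation=True, max_length=512, return_tensors="pt")
    labels = tokenizer(reference_title, truncation=True, max_length=64,
                       return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    return loss  # caller backpropagates and steps the optimizer

def generate_title(record: str) -> str:
    """Generate a fluent interrogative title for a consultation record."""
    inputs = tokenizer(record, truncation=True, max_length=512, return_tensors="pt")
    ids = model.generate(**inputs, num_beams=4, max_new_tokens=32)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```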

In recent years, abstractive summarization has been an active NLP research area. The remainder of this article reviews relevant academic work and discusses strategies to mitigate the issues above.

1. Combining Summarization with Multi-Task or Multi-Objective Training

Since the rise of pretrained models, multi-task methods have been a common solution. In summarization, for example, one can add a factual consistency task to improve faithfulness or a fluency task to improve readability. One influential multi-task approach in summarization is Google’s PEGASUS from 2019.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

This paper introduced a novel pretraining task called Gap Sentences Generation (GSG) and achieved state-of-the-art results on 12 datasets. For a text paragraph, GSG masks out certain sentences in the encoder input and uses the masked sentences as the decoder target during training.

The task is simple but effective. To make the masked sentences more like a "summary," the paper selects sentences for masking by computing Rouge scores between each sentence and the rest of the paragraph, then masking the top-N sentences. Rouge can be replaced by other metrics; for example, FactPEGASUS augments the selection metric with FactCC to improve factual consistency.
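As a concrete illustration, the sentence-selection step might look like the sketch below, which scores each sentence by Rouge against the rest of the document using the rouge-score package. The choice of Rouge-1 F1 and independent (rather than sequential) selection are assumptions about the GSG details.

```python
# Sketch: pick the top-N "gap sentences" to mask, scoring each sentence by
# Rouge against the remainder of the document, in the spirit of PEGASUS's GSG.
# Using Rouge-1 F1 and scoring sentences independently are assumptions.
from rouge_score import rouge_scorer

def select_gap_sentences(sentences: list[str], top_n: int = 3) -> list[int]:
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scored = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scored.append((scorer.score(rest, sent)["rouge1"].fmeasure, i))
    # The highest-scoring sentences act as the pseudo-summary (decoder target);
    # they are replaced by a mask token in the encoder input.
    return [i for _, i in sorted(scored, reverse=True)[:top_n]]
```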

BRIO: Bringing Order to Abstractive Summarization

This paper ranked highly on several benchmarks. It addresses two weaknesses of autoregressive models:

(1) Autoregressive models suffer from exposure bias: during training the decoder conditions on gold prefixes, but at inference it conditions on its own previously generated tokens, so errors can accumulate.

(2) Training data for generative summarization typically maps a document to a single reference summary, so the model learns a point-to-point distribution, which is suboptimal.

BRIO frames training in two phases. The first phase trains with MLE loss. In the second phase, the model is trained with a combination of ranking loss and MLE loss; the two-phase process is repeated until convergence.

The ranking task uses beam search to generate multiple candidate summaries and ranks them by Rouge against the reference summary. The ranking loss is driven by candidate probabilities and their ranking positions, encouraging the model to assign higher probability to higher-quality (higher-Rouge) candidates. Compared with MLE loss, the ranking loss learns a distribution over candidates rather than a single point. Because it operates on the difference between log probabilities of two candidates, it also mitigates word-frequency bias.
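A compact sketch of what such a ranking loss can look like is given below; the margin value and the length normalization are assumptions about the details rather than BRIO's exact hyperparameters.

```python
# Sketch of a BRIO-style pairwise ranking loss. Candidates are pre-sorted by
# Rouge against the reference (best first); the model's length-normalized
# log-probability of a better candidate should exceed that of a worse one by
# a rank-dependent margin. Margin size and normalization are assumptions.
import torch

def ranking_loss(log_probs: torch.Tensor, lengths: torch.Tensor,
                 margin: float = 0.001) -> torch.Tensor:
    """log_probs: (num_candidates,) summed token log-probs per candidate,
    sorted so index 0 has the highest Rouge; lengths: token counts."""
    scores = log_probs / lengths              # length-normalized candidate scores
    loss = torch.zeros((), dtype=scores.dtype)
    n = scores.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # candidate i outranks candidate j: enforce a margin of (j - i) * margin
            loss = loss + torch.clamp(scores[j] - scores[i] + (j - i) * margin, min=0)
    return loss
```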

Calibrating Sequence Likelihood Improves Conditional Language Generation

From the same group as PEGASUS, this is a current state-of-the-art method. After fine-tuning, the authors introduce a calibration stage: a two-stage multi-task training scheme. Although it differs from BRIO in training schedule, the approach is conceptually similar.

The paper reports extensive experiments and draws several conclusions:

(1) For multi-task loss choice, a simple rank loss often performs best.

(2) For ranking generated sentences, metrics like BERTScore, decoder scores, or Rouge yield similar results.

(3) Candidate generation: beam search outperforms diverse beam search and nucleus sampling.

(4) Alternative multi-task losses such as KL-divergence and cross-entropy perform similarly.

(5) For entering the second training stage, using perplexity to select the best checkpoint works well.

The paper also shows that combining multi-task training with a two-stage schedule reduces reliance on beam-search tricks during prediction.
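To make the calibration stage concrete, the sketch below generates candidates with plain beam search and ranks them by Rouge, in line with conclusions (2) and (3). The model and tokenizer are passed in as placeholders, and the decoding parameters are assumptions.

```python
# Sketch: candidate generation and ranking for a calibration stage.
# `model` and `tokenizer` are any seq2seq model/tokenizer pair (placeholders);
# beam width and length limits are illustrative assumptions.
from rouge_score import rouge_scorer

def generate_candidates(model, tokenizer, record: str, num_candidates: int = 8) -> list[str]:
    """Plain beam search (conclusion 3): return num_candidates beams."""
    inputs = tokenizer(record, truncation=True, max_length=512, return_tensors="pt")
    ids = model.generate(**inputs, num_beams=num_candidates,
                         num_return_sequences=num_candidates, max_new_tokens=32)
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in ids]

def rank_candidates(candidates: list[str], reference: str) -> list[str]:
    """Rank by Rouge-L against the reference; other metrics behave similarly (conclusion 2)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return sorted(candidates,
                  key=lambda c: scorer.score(reference, c)["rougeL"].fmeasure,
                  reverse=True)
```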

FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness

This EMNLP 2022 paper from Baidu improves factual consistency by introducing adversarial-attack-style training. The authors add a perturbation h to the encoder outputs before decoding and require that, under perturbation, the probability of the correct factual span remains higher than that of other plausible entity spans drawn from an adversarial candidate set. Factual spans are predefined error-prone spans, and adversarial candidates are constructed from the input text. Ablations show that both the perturbation and the adversarial training contribute to improved factual consistency.
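Schematically, the constraint can be written as a margin loss over span probabilities under a perturbation of the encoder states. In the sketch below, random noise stands in for FRSUM's adversarially constructed perturbation, so treat it only as an outline of the idea.

```python
# Sketch of the factual-robustness idea: perturb the encoder states, then
# require the gold factual span to stay more probable than every adversarial
# candidate span. Random noise replaces the adversarial perturbation here;
# this is a simplification, not FRSUM's exact objective.
import torch

def perturb_encoder_states(encoder_hidden: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """Add a small perturbation h to the encoder outputs before decoding."""
    return encoder_hidden + epsilon * torch.randn_like(encoder_hidden)

def factual_robustness_loss(gold_logprob: torch.Tensor,
                            distractor_logprobs: torch.Tensor) -> torch.Tensor:
    """gold_logprob: log P(gold factual span | perturbed states), scalar;
    distractor_logprobs: log P of each adversarial candidate span, shape (K,)."""
    margins = distractor_logprobs - gold_logprob
    return torch.clamp(margins, min=0).sum()  # penalize any distractor overtaking the gold span
```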

2. Long-Text Summarization Strategies

Current approaches for long-text summarization follow two directions: use sparse-attention architectures like BigBird or Longformer to extend input length, or reduce input length through truncation and greedy selection.

How Far are We from Robust Long Abstractive Summarization?

This EMNLP 2022 paper conducts an information-density analysis for long texts. For English articles averaging around 6k tokens, the most informative region lies in the 1k–2k token range, followed by 0–1k. Naively truncating the input to the first 512 tokens can therefore miss the most informative parts and produce suboptimal summaries.

The paper experiments with two methods. Method A does not restrict input length and uses sparse attention (Longformer's local attention). Method B limits input length and applies greedy selection (reduce-then-summarize), choosing sentences greedily by Rouge. With input length constraints of 1k, 4k, and 8k tokens, Method A shows slight gains from increased input length even when switching from full to sparse attention. For Method B, limiting to 1k tokens with full attention yields the best results; increasing length or using sparse attention does not help. Using a 1k-length greedy selection aligns better with the model's pretraining setting, avoids modifying attention or position embeddings, removes redundant information, and reduces the risk of the generative model drifting.
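A sketch of the greedy reduce-then-summarize step is shown below: sentences are added one at a time whenever they improve Rouge against the reference summary, until a roughly 1k-token budget is reached. The Rouge variant and the word-count budget are assumptions.

```python
# Sketch: greedy "reduce-then-summarize" input construction. Keep adding the
# sentence that most improves Rouge against the reference summary until no
# sentence helps or the token budget is exhausted. Rouge-1 and a word-count
# budget are simplifying assumptions.
from rouge_score import rouge_scorer

def greedy_select(sentences: list[str], reference: str, budget_tokens: int = 1000) -> list[str]:
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    selected, remaining = [], list(range(len(sentences)))
    best_score, used = 0.0, 0
    while remaining:
        gains = []
        for i in remaining:
            cand = " ".join(sentences[j] for j in sorted(selected + [i]))
            gains.append((scorer.score(reference, cand)["rouge1"].fmeasure, i))
        score, idx = max(gains)
        if score <= best_score:
            break                                    # no remaining sentence improves Rouge
        if used + len(sentences[idx].split()) > budget_tokens:
            break                                    # budget exhausted
        selected.append(idx)
        remaining.remove(idx)
        best_score, used = score, used + len(sentences[idx].split())
    return [sentences[i] for i in sorted(selected)]  # keep original document order
```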

Investigating Efficiently Extending Transformers for Long Input Summarization

Google's Pegasus-X uses sparse attention to support up to 16k tokens and is available on Hugging Face. The paper introduces a staggered block-local Transformer, where different layers use different local-attention ranges. This stacked design is analogous to CNN receptive fields: higher layers, via staggered attention, can reach a very large receptive field. Experiments show that staggered block-local Transformers can improve performance even when global attention is present.
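The staggering idea itself is simple, as the following sketch of per-layer block assignment shows; the block size and the half-block shift on alternate layers are illustrative choices, not Pegasus-X's exact configuration.

```python
# Sketch of staggered block-local attention: alternate layers shift their
# local-attention blocks by half a block, so information crosses block
# boundaries as layers stack. Block size and the half-block shift are
# illustrative, not the exact Pegasus-X setup.
def block_assignment(num_tokens: int, block_size: int, layer_idx: int) -> list[int]:
    """Return the block id each token position belongs to at a given layer;
    tokens sharing a block id attend to each other in that layer."""
    offset = block_size // 2 if layer_idx % 2 == 1 else 0  # stagger odd layers
    return [(t + offset) // block_size for t in range(num_tokens)]

# Example: with block_size=4, tokens 3 and 4 sit in different blocks at layer 0
# but in the same block at layer 1, so their information mixes after two layers.
```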

A Multi-Stage Summarization Framework for Long Input Dialogues and Documents

This ACL paper proposes a split-then-summarize pipeline and achieves state-of-the-art results on some long-text datasets. The process decomposes long-document summarization into N coarse summarization stages followed by one fine summarization stage, using different models for each stage (N+1 models total). The authors note that sharing a model across stages degrades performance.

The coarse-stage training data is created by greedily matching paragraphs from the original document to the target summary to maximize Rouge. The final fine summarizer receives inputs in the 1k–2k token range rather than forcing compression within a fixed maximum K. The paper argues that compressing to K more aggressively tends to produce overly short coarse summaries; concatenating those short summaries can introduce noise.
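The overall pipeline might be sketched as follows, with summarize_coarse and summarize_fine standing in for the separately trained stage models; the segment size, fine-stage budget, and stage count are assumptions.

```python
# Sketch of a split-then-summarize pipeline: split the long document into
# segments, summarize each with a coarse model, concatenate, and repeat until
# the text fits the fine summarizer's 1k-2k token range. summarize_coarse and
# summarize_fine are placeholders for the separately trained stage models.
def multi_stage_summarize(document: str, summarize_coarse, summarize_fine,
                          segment_tokens: int = 1024, fine_budget: int = 2000,
                          max_coarse_stages: int = 3) -> str:
    def split(text: str) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + segment_tokens])
                for i in range(0, len(words), segment_tokens)]

    text = document
    for _ in range(max_coarse_stages):               # coarse stages
        if len(text.split()) <= fine_budget:
            break
        text = " ".join(summarize_coarse(seg) for seg in split(text))
    return summarize_fine(text)                      # final fine stage
```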

3. Leveraging Graph Structures

Documents contain structural cues such as paragraph membership, coreference relations, and keyword presence. Representing these cues with graphs can improve summarization.

HEGEL: Hypergraph Transformer for Long Document Summarization

This paper enhances extractive summarization using document-structure information. Although extractive in nature, its graph-construction approach is instructive: sentences, their paragraphs, topics, and keywords are nodes in a hypergraph. Topics and keywords are extracted beforehand. Sentences are encoded, then two layers of hypergraph attention aggregate node and edge information to obtain better sentence representations for extraction.
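For intuition, the hypergraph can be represented by a sentence-by-hyperedge incidence matrix, as in the sketch below; the hyperedge types (paragraphs, keywords, topics) follow the description above, while the concrete data structures are assumptions.

```python
# Sketch: build a sentence-node / hyperedge incidence matrix in the HEGEL
# spirit. Each paragraph, keyword, and topic defines one hyperedge connecting
# the sentences it covers; a hypergraph attention layer would aggregate over
# this matrix. Keyword and topic extraction are assumed to be done beforehand.
import numpy as np

def build_incidence(num_sentences: int,
                    paragraphs: list[list[int]],          # sentence ids per paragraph
                    keyword_hits: dict[str, list[int]],   # keyword -> sentence ids containing it
                    topic_hits: dict[str, list[int]]) -> np.ndarray:
    hyperedges = list(paragraphs) + list(keyword_hits.values()) + list(topic_hits.values())
    H = np.zeros((num_sentences, len(hyperedges)), dtype=np.float32)
    for e, members in enumerate(hyperedges):
        for s in members:
            H[s, e] = 1.0  # sentence s participates in hyperedge e
    return H
```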

Abstractive Summarization Guided by Latent Hierarchical Document Structure

This EMNLP paper proposes HierGNN, a hierarchical graph network that learns latent dependencies between sentences and uses them to guide abstractive summarization. Experiments show improvements for both pretrained and non-pretrained models.

Summary

Pretrained models have improved summarization significantly, but production systems still face issues such as disfluent sentences and factual inconsistency, which are especially unacceptable in serious medical popular-science content. Beyond the generation-side methods discussed above, we recommend strengthening post-generation evaluation. In addition to business rules, we perform secondary scoring on generated summaries using metrics such as perplexity, Rouge, domain-classification results, and counts of entities contained in the summary, and use these scores to filter or re-rank outputs.
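A minimal sketch of such a post-generation gate is shown below; the scorers are passed in as placeholders for whatever perplexity, Rouge, department-classification, and entity-extraction components a system already has, and the thresholds are assumptions.

```python
# Sketch of a post-generation filter: score each generated title on several
# axes and reject low-quality outputs. The scorer callables and thresholds
# are placeholders/assumptions, not our production rules.
def accept_title(title: str, source: str, expected_department: str,
                 perplexity, rouge_vs_source, predict_department, count_shared_entities,
                 ppl_max: float = 80.0, rouge_min: float = 0.2) -> bool:
    if perplexity(title) > ppl_max:                       # likely disfluent or degenerate
        return False
    if rouge_vs_source(title, source) < rouge_min:        # too little overlap with the record
        return False
    if predict_department(title) != expected_department:  # off-topic for the assigned department
        return False
    return count_shared_entities(title, source) > 0       # must retain at least one source entity
```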