
Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics


They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections by topic. Scientists at the HSE Campus in St Petersburg compared five topic models to determine which perform best. Two models made the fewest errors, one of them being GLDAW, developed by the Laboratory for Social and Cognitive Informatics at the HSE Campus in St Petersburg. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysis of financial news can help accurately predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. At the output, each document is assessed for its degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’
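The input-to-output flow described above can be sketched in a few lines of code. This is a deliberately naive toy, not any of the five models studied: the topic vocabularies below are invented for illustration, whereas a real topic model learns word-topic associations from the data itself.

```python
from collections import Counter

# Invented topic vocabularies for illustration only --
# a real topic model learns these associations from a text collection.
TOPIC_WORDS = {
    "science": {"scientists", "laboratory", "analysis", "investigated", "algorithms"},
    "finance": {"stock", "trading", "market", "exchange", "volumes"},
}

def topic_scores(document: str) -> dict[str, float]:
    """Return each topic's share of the document's topic-word matches."""
    words = Counter(document.lower().split())
    raw = {
        topic: sum(count for word, count in words.items() if word in vocab)
        for topic, vocab in TOPIC_WORDS.items()
    }
    total = sum(raw.values()) or 1
    return {topic: score / total for topic, score in raw.items()}

scores = topic_scores("scientists in the laboratory investigated new algorithms")
```

For this sentence the 'science' share dominates, mirroring how a real model would assign the document a high degree of belonging to that topic.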

However, many words can appear in texts covering various topics. The word ‘work,’ for example, is often used in texts about industrial production or the labour market, yet in the phrase ‘scientific work’ it points towards ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts. 

Any phrase or text, such as this news item, can be represented as a sequence of numbers—a vector in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that measuring distances and detecting similarities in this space becomes easy, allowing comparisons between two or more texts. If the embeddings describing two texts are sufficiently similar, the texts likely belong to the same category or cluster—a specific topic.
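The similarity comparison described above is typically done with cosine similarity. Here is a minimal sketch: the four-dimensional vectors are made up for illustration, whereas real embeddings have hundreds of dimensions learned from millions of texts.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two fixed-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings for illustration only.
doc_science_1 = [0.9, 0.1, 0.8, 0.2]
doc_science_2 = [0.8, 0.2, 0.9, 0.1]
doc_sports = [0.1, 0.9, 0.2, 0.8]

same_topic = cosine_similarity(doc_science_1, doc_science_2)
cross_topic = cosine_similarity(doc_science_1, doc_sports)
```

Texts on the same topic sit closer together in the vector space, so `same_topic` comes out much larger than `cross_topic`, which is the signal a clustering step relies on.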

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models—ETM, GLDAW, GSM, WTM-GMM and W-LDA, which are based on different mathematical principles:

  • ETM is a model proposed by David M. Blei, a prominent statistician and computer scientist who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
  • Two models—GSM and WTM-GMM—are neural topic models.
  • W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
  • GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring a certain amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model.

The researchers evaluated the models for stability (the number of errors), coherence (the strength of connections between words within a topic), and Renyi entropy (the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These sources were chosen because all their texts come with pre-assigned tags, allowing evaluation of how well the algorithms identified the topics.

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.
