Dataset title: Topic Models from GAFAM Discourse

Dataset author(s): Martin Mölder (University of Tartu, ORCID 0000-0002-9701-1771), Edoardo Mollona (University of Bologna), Alessio Diana (University of Bologna)

Dataset contact person: Martin Mölder (University of Tartu), ORCID 0000-0002-9701-1771, martin.molder@ut.ee

Dataset license: This dataset is distributed under CC-BY 4.0

Date of publication: 21.03.2025

Project information: INCA (Increase Corporate Political Responsibility and Accountability), funded by the European Union Horizon Europe Programme, Grant Agreement no. 101061653; https://inca-project.eu/

Dataset files
=============

- articles_metadata.csv
  Metadata about all the documents that were used for the topic modeling analysis.

- document_topic_matrix.csv
  A file where each row corresponds to a document in the corpus, each column corresponds to a topic determined by the model, and the cell values indicate the estimated proportion of that topic in that particular document.

- topic_associations.csv
  A summary file with information about the topics, their role in the corpus, and their associations with companies and sentiment.

- topic_words_100.csv
  A file that contains the 100 most characteristic words for each topic together with their probabilities.

- word_topic_matrix.csv
  A file that contains the topic probabilities of each word in the corpus that was used for the analysis.

Dataset documentation
=====================

Dataset summary
---------------

This dataset contains the results of an LDA topic model with 200 topics that was applied to a combined corpus of newspaper articles about the GAFAM companies (Google, Amazon, Facebook, Apple, Microsoft), 113 123 texts in total. The objective of this analysis was to examine which discourses arise in public in relation to the GAFAM companies and how company discourse relates to public media discourse about GAFAM in Europe.
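As a hedged illustration of how document_topic_matrix.csv can be read, the sketch below uses made-up values (the real file has one column per topic, 200 in total; the document identifiers and three-topic layout here are purely hypothetical): each row is a probability distribution over topics, so the proportions in a row sum to approximately 1 and the largest value marks the document's dominant topic.

```python
# Sketch with hypothetical values: interpreting rows of a
# document-topic matrix. Real rows have 200 topic columns.

doc_topic_rows = [
    # (document id, [proportion of topic 0, topic 1, topic 2])
    ("doc_001", [0.70, 0.20, 0.10]),
    ("doc_002", [0.05, 0.15, 0.80]),
]

for doc_id, props in doc_topic_rows:
    # Topic proportions for one document sum to (approximately) 1.
    assert abs(sum(props) - 1.0) < 1e-9
    # The dominant topic is the one with the highest estimated proportion.
    dominant = max(range(len(props)), key=lambda k: props[k])
    print(f"{doc_id}: dominant topic {dominant} ({props[dominant]:.2f})")
```

The same logic applies to word_topic_matrix.csv, read column-wise: each word has a probability under each topic.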
For an initial summary of this model and the data that is part of this dataset, see: https://inca-project.eu/highlights/frequency_map.php The information below reproduces what has been published on that website.

Disclaimer: The data published here does not contain the original texts that were used for this analysis, because they belong to their respective authors. More detailed information about the sources of the GAFAM texts used for this analysis is available here: https://datadoi.ee/handle/33/570

Data sources
------------

The data for the following countries was obtained from Dow Jones Factiva: UK, Germany, France, Ireland, Spain, Italy, Switzerland, Portugal, Austria, Poland. Relevant articles from the most prominent newspapers were identified through the company tags provided by Factiva. Media data for Estonia and Finland as well as the GAFAM texts was gathered by the project team. Media data for the Netherlands was obtained from Lexis Nexis through one of the project partners.

Model description
-----------------

Topic modeling

Topic modeling, most commonly implemented as Latent Dirichlet Allocation (LDA), introduced by Blei et al. (2003; see also Blei 2012 for a short overview), has gained considerable ground over the last two decades as a tool for exploring and classifying the content of large text corpora. It is one of the most common methods for analysing extensive text collections. Topic modeling is an unsupervised, inductive method that automatically detects topics in a corpus, though more complex variants allow the model to be guided towards specific discourses that can to some extent be pre-defined (see Eshima et al. 2023). The model outputs consist of probabilities for each text in the corpus to belong to any of the detected topics and probabilities for each word in the corpus to belong to each of the topics. Regarded as a statistical language model (DiMaggio et al.
2013), it estimates these probabilities by, in a sense, reverse engineering the way a text is produced in natural language. It assumes that texts can be made up of various topics and that the same words can be part of different topics. The model iteratively adjusts the probabilities of topics for texts and of words for topics until these probabilities reproduce the original word frequencies in the texts as closely as possible. The model does not consider the sequence of words in a document, only their overall frequency: it is a so-called bag-of-words model. Despite this, it reflects the idea that meaning is relational (Mohr and Bogdanov 2013), because it groups words into topics in such a way that some words have a higher probability of occurring together in texts than others. It is therefore especially relevant for the analysis of concepts like framing, polysemy, and heteroglossia (DiMaggio et al. 2013).

A topic, in the context of this method, is a probability distribution over the vocabulary of a corpus, which helps identify words that are likely to co-occur in text. Such co-occurring words usually share a common theme or discourse, often interpreted as a frame that presents a specific viewpoint (DiMaggio et al. 2013; Heidenreich et al. 2019; Gilardi et al. 2020; Ylä-Anttila et al. 2021). The method's ability to capture the relationality of meaning also aligns it with various strands of discourse analysis, from critical discourse analysis to post-structuralist theories of discourse (Aranda et al. 2021; Jacobs and Tschötschel 2019).

There are various statistical implementations of topic models. In our analysis we use the "tomotopy" library (Lee 2022) in Python, because of its speed of estimation as well as its functionality. The package implements the basic LDA model (Blei et al. 2003) as well as various subsequent developments of that model. For the analysis reported here, we used the basic model, because of the exploratory nature of the task.
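The bag-of-words representation described above can be made concrete with a minimal sketch (pure Python, illustrative only; the example sentences are hypothetical): word order is discarded and only per-document word frequencies are kept as input to the model.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word frequencies; word order is discarded."""
    return Counter(text.lower().split())

# Two texts with the same words in a different order produce the same
# bag-of-words representation, so the model cannot distinguish them.
a = bag_of_words("Amazon acquires a streaming company")
b = bag_of_words("a streaming company acquires Amazon")
print(a == b)  # -> True
```

What the model then learns from these counts is which words tend to co-occur across documents, which is what gives the topics their relational character.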
Estimating a complicated topic model can be computationally very demanding, especially for a large corpus of text, while fitting the basic model is relatively fast.

Sentiment analysis

There is no gold standard for sentiment analysis, i.e. the detection of emotional content in textual data: various dictionary-based as well as machine learning and language model based approaches have been suggested and validated over recent years. A dictionary-based approach uses a sentiment dictionary, a pre-defined set of, for example, positive and negative words, to count emotional words in a text. Such counts then characterise the overall emotional content of the text. In recent years such approaches have been supplemented by language models that have been trained to classify emotions on the basis of annotated texts whose level of emotionality is known.

As a first step in our sentiment analysis of the combined GAFAM and media text corpus, we used a selection of methods to determine the emotionality of texts:

- The Lexicoder Sentiment Dictionary (Young and Soroka 2012) as implemented in the "quanteda" package (Benoit et al. 2018) in R
- The Flair sentiment classifier (Akbik et al. 2018)
- The VADER sentiment analysis tool (Hutto and Gilbert 2014)
- The Twitter-roBERTa-base for Sentiment Analysis model (Camacho-Collados et al. 2022; Loureiro et al. 2022)

We estimated a sentiment score according to each of these methods for each of our texts and then used principal component analysis to aggregate the estimates from the different methods. The first principal component accounts for 69.9% of the variance in the separate sentiment scores derived from the various methods and is thus a very good summary of all of them. We scaled the principal component so that higher scores indicate more positive emotions and use it as the sentiment measure for the texts in our corpus.

References
----------

Tang, Y. et al.
(2020) 'Multilingual Translation with Extensible Multilingual Pretraining and Finetuning,' arXiv [Preprint]. https://doi.org/10.48550/arxiv.2008.00401.

Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M. and Joulin, A. (2020) 'Beyond English-Centric Multilingual Machine Translation,' arXiv preprint arXiv:2010.11125.

Costa-jussà, M.R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K. and NLLB Team (2022) 'No Language Left Behind: Scaling Human-Centered Machine Translation,' arXiv preprint arXiv:2207.04672.

Reber, U. (2019) 'Overcoming Language Barriers: Assessing the Potential of Machine Translation and Topic Modeling for the Comparative Analysis of Multilingual Text Corpora,' Communication Methods and Measures, 13(2), pp. 102-125. https://doi.org/10.1080/19312458.2018.1555798.

Young, L. and Soroka, S. (2012) 'Affective news: the automated coding of sentiment in political texts,' Political Communication, 29(2), pp. 205-231. https://doi.org/10.1080/10584609.2012.671234.

Akbik, A., Blythe, D. and Vollgraf, R. (2018) 'Contextual String Embeddings for Sequence Labeling,' Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638-1649. https://aclanthology.info/papers/C18-1139/c18-1139.

Hutto, C. and Gilbert, E. (2014) 'VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text,' Proceedings of the International AAAI Conference on Web and Social Media, 8(1), pp. 216-225.

Camacho-Collados, J., Rezaee, K., Riahi, T., Ushio, A., Loureiro, D., Antypas, D., Boisson, J., Espinosa Anke, L., Liu, F., Martínez Cámara, E. et al. (2022) 'TweetNLP: Cutting-Edge Natural Language Processing for Social Media,' Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Abu Dhabi, UAE: Association for Computational Linguistics, pp.
38-49. Available at: https://aclanthology.org/2022.emnlp-demos.5.

Loureiro, D. et al. (2022) 'TimeLMs: Diachronic Language Models from Twitter,' arXiv [Preprint]. https://doi.org/10.48550/arxiv.2202.03829.

Blei, D.M. (2012) 'Probabilistic topic models,' Communications of the ACM, 55(4), pp. 77-84. https://doi.org/10.1145/2133806.2133826.

Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003) 'Latent Dirichlet Allocation,' Journal of Machine Learning Research, 3, pp. 993-1022. https://doi.org/10.5555/944919.944937.

Eshima, S., Imai, K. and Sasaki, T. (2023) 'Keyword-Assisted Topic Models,' American Journal of Political Science. https://doi.org/10.1111/ajps.12779.

DiMaggio, P., Nag, M. and Blei, D.M. (2013) 'Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding,' Poetics, 41(6), pp. 570-606. https://doi.org/10.1016/j.poetic.2013.08.004.

Mohr, J.W. and Bogdanov, P. (2013) 'Introduction - Topic models: What they are and why they matter,' Poetics, 41(6), pp. 545-569. https://doi.org/10.1016/j.poetic.2013.10.001.

Heidenreich, T. et al. (2019) 'Media Framing Dynamics of the "European Refugee Crisis": A Comparative Topic Modelling Approach,' Journal of Refugee Studies, 32(Special Issue 1), pp. i172-i182. https://doi.org/10.1093/jrs/fez025.

Gilardi, F., Shipan, C.R. and Wüest, B. (2020) 'Policy Diffusion: The Issue-Definition Stage,' American Journal of Political Science, 65(1), pp. 21-35. https://doi.org/10.1111/ajps.12521.

Ylä-Anttila, T., Eranti, V. and Kukkonen, A.K. (2021) 'Topic modeling for frame analysis: A study of media debates on climate change in India and USA,' Global Media and Communication, 18(1), pp. 91-112. https://doi.org/10.1177/17427665211023984.

Aranda, A.M. et al. (2021) 'From Big Data to Rich Theory: Integrating Critical Discourse Analysis with Structural Topic Modeling,' European Management Review, 18(3), pp. 197-214.
https://doi.org/10.1111/emre.12452.

Jacobs, T. and Tschötschel, R. (2019) 'Topic models meet discourse analysis: a quantitative tool for a qualitative approach,' International Journal of Social Research Methodology, 22(5), pp. 469-485. https://doi.org/10.1080/13645579.2019.1576317.

Lee, M. (2022) bab2min/tomotopy: 0.12.3 [software]. Version v0.12.3. Zenodo. https://doi.org/10.5281/zenodo.6868418.

Codebook
--------

articles_metadata.csv

- company. Name of the company that the article is about.
- country. Source country of the text.
- date. Date of the text.
- domain. Whether it is a media text or a GAFAM text.
- lang. Original language of the text.
- sen_flair_neg. Flair sentiment classifier, negative score for the article text.
- sen_flair_pos. Flair sentiment classifier, positive score for the article text.
- sen_lsd_polarity. Lexicoder Sentiment Dictionary, polarity score.
- sen_pca_score. Principal component analysis score for the sentiments.
- sen_textblob_polarity. TextBlob (https://textblob.readthedocs.io/en/dev/) polarity score.
- sen_textblob_subjectivity. TextBlob (https://textblob.readthedocs.io/en/dev/) subjectivity score.
- sen_trans_neg. Twitter-roBERTa-base negative score.
- sen_trans_neu. Twitter-roBERTa-base neutral score.
- sen_trans_pos. Twitter-roBERTa-base positive score.
- sen_vader_compound. VADER sentiment analysis tool, compound score.
- sen_vader_neg. VADER sentiment analysis tool, negative score.
- sen_vader_neu. VADER sentiment analysis tool, neutral score.
- sen_vader_pos. VADER sentiment analysis tool, positive score.
- source. Source of the text.
- title. Title of the text (original language).
- title_trans. Title of the text, translated.

topic_associations.csv

- number. Topic number from the LDA model.
- label. Initial label assigned to the topic (provided by ChatGPT).
- label_new. Revised label for the topic.
- words. The 10 most characteristic words for the topic.
- topic_prop_overall.
Overall proportion of the topic in the corpus.
- topic_rank_overall. Overall rank of the topic (in terms of prevalence) in the corpus.
- topic_prop_media. Proportion of the topic in the media part of the corpus.
- topic_rank_media. Rank of the topic in the media part of the corpus.
- topic_prop_companies. Proportion of the topic in the GAFAM part of the corpus.
- topic_rank_companies. Rank of the topic in the GAFAM part of the corpus.
- sentiment_corr. Topic-sentiment correlation (using the sentiment PCA score).
- sentiment_label. Sentiment label for the topic (absolute r = 0.05 as threshold).
- domain_association. Whether the topic is more prevalent in the media or the GAFAM part of the corpus.
- company_association. Whether the topic is more prevalent in relation to one particular company.

Version notes
=============

Version 1.0

The initial version of the data set.