Corpus approaches to social media

Convenors: Sofia Rüdiger (University of Bayreuth) and Daria Dayter (University of Basel)

Language-centered research on online interactions has been gaining momentum since the days of early Web 2.0. What started as a predominantly qualitative endeavor (Herring 1996, Androutsopoulos 2006) has steadily evolved towards large datasets and big data, especially in studies concerned with social media. This trend is exemplified by the work of Zappavigna (2012, 2015), the What’s up, Switzerland? project (Ueberwasser and Stark 2017), and sentiment analysis studies in which computer scientists venture into the territory of language analysis with varying degrees of success (see, e.g., Taboada et al. 2011, Mostafa 2013, Ordenes et al. 2017). Corpus creation in this field, however, poses a set of new challenges that pre-internet or even Web 1.0 researchers did not have to reckon with.

This workshop will therefore focus on the collection, analysis, and processing of corpora of single- and multi-modal, synchronous and asynchronous communication on different social media platforms and channels and the challenges connected to these research endeavors. We invite submissions from scholars working on a range of social media, such as Facebook, Twitter, LinkedIn, SMS and WhatsApp, Snapchat, Instagram, gaming chats, blog comments sections, wiki discussions, and YouTube comments. The contributions will describe various aspects of data collection, annotation, processing, and exploitation of machine-readable corpora for research in the humanities. The workshop thus brings together language-centered research on interactive social media in linguistics, communication studies, media studies, and social sciences with research questions from the fields of corpus and computational linguistics, language technology, and text analytics.

We intend to create a forum for corpus linguists to address the following challenges of corpus-based linguistic studies in the realm of social media:

  • Ethical issues of accessing and harvesting data and making it available as part of “open data” initiatives, especially in multimodal analysis, where removing an image impoverishes the analysis
  • Legal issues of accessing and harvesting data, and the question of whether our social responsibility as scientists outweighs legal concerns (cf. the case of FiveThirtyEight sharing a corpus of Russian trolls’ tweets)
  • Difficulty of obtaining data that is often very rich in personal information, which makes subjects reluctant to donate their WhatsApp chats or Facebook conversations
  • Technical challenges of collecting and storing corpora (including how-to talks, sharing experiences in using available tools such as Trendalyzer, Tweet Visualiser, twXplorer, DiscoverText, Twitter StreamGraph, WebAnno)
  • Annotation of social media corpora: inter-coder reliability; reconciling the need for tailor-made annotation with the standardization drive
  • Lemmatization, POS tagging, syntactic parsing, and named entity recognition

Call for Papers

We welcome contributions which address the issues mentioned above as standalone subjects, but also invite presentations approaching these matters within the framework of concrete corpus-linguistic studies of social media, for example, in the realms of sociolinguistics, discourse analysis, translanguaging and code-switching, applied linguistics, and multimodality, as well as descriptions of social media registers. Abstracts of max. 500 words (including references) should be submitted online. The deadline for abstract submission is 15 December 2018.

Notification of acceptance will be sent out by 10 January 2019.

References

Androutsopoulos, Jannis, ed. 2006. Sociolinguistics and Computer-Mediated Communication. Special issue of the Journal of Sociolinguistics 10(4).

Herring, Susan C., ed. 1996. Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam: John Benjamins.

Mostafa, Mohamed M. 2013. “More than Words: Social Networks’ Text Mining for Consumer Brand Sentiments.” Expert Systems with Applications 40(10): 4241–4251.

Ordenes, Francisco Villarroel, Stephan Ludwig, Ko de Ruyter, Dhruv Grewal and Martin Wetzels. 2017. “Unveiling What Is Written in the Stars: Analyzing Explicit, Implicit, and Discourse Patterns of Sentiment in Social Media.” Journal of Consumer Research 43(6): 875–894.

Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll and Manfred Stede. 2011. “Lexicon-Based Methods for Sentiment Analysis.” Computational Linguistics 37(2): 267–307.

Ueberwasser, Simone and Elisabeth Stark. 2017. “What’s up, Switzerland? A Corpus-Based Research Project in a Multilingual Country.” Linguistik Online 84(5): n.p. Last accessed on 09.10.2018.

What’s Up, Switzerland? Last accessed on 07.10.2018.

Zappavigna, Michele. 2012. Discourse of Twitter and Social Media. London: Continuum.

Zappavigna, Michele. 2015. “Searchable Talk: The Linguistic Functions of Hashtags.” Social Semiotics 25(3).