2. Data & Sources

Contributors to this chapter: Johan Miörner, Xiao-Shan Yap, Djamila Lesch & Jonas Heiberg

In this chapter, we outline the data requirements and give two concrete examples of how different types of documents (newspaper articles and project reports) have been identified and coded in actual STCA applications.

Data requirements and potential sources

One of the most challenging aspects of doing STCA is to identify and access a suitable source of textual data. As with all empirical work, this is a crucial step of the research process, which sets limitations on how you can design the research project.

Specific requirements for STCA include:

  • Representativeness: STCA requires a source of data that can be argued to have a representative coverage of patterns in a particular field. This makes it important how you define the boundaries of the field of study on the one hand and how well your data source corresponds to these boundaries on the other. For example, if using newspaper articles, you should be able to argue why the selection of newspapers is likely to represent what is going on in the particular field. Similarly, if using interview data, you need to think about how the interviewees may represent or inform about (or not) the field studied.
  • Code-ability: the source data needs to be available in a text format that is possible to code in NVivo. Generally, all textual data sources fulfil this requirement. If using interviews as a data source, these will thus have to be transcribed before being available for coding. The same goes for videos – STCA has been used with transcribed videos as one of several data sources.
  • Temporal consistency: the requirement to obtain temporal consistency in terms of data depends on the research question(s) and the case selected. If the ambition is to make a dynamic analysis of field-level configurations of a sector over time with more or less mature developments, the source data coverage should ideally be consistent throughout the study periods. However, if the research question is interested in the dynamic development of a sector that is emerging or that only gains salience during specific periods or over the course of the entire study period, the source data coverage may grow or vary throughout the periods.
  • Homogeneity: there may be substantial differences in how actors frame concepts in different document stocks, for example newspaper articles versus governmental committee reports, which may limit the comparability of concept networks across different actors or periods. The homogeneity requirement does not mean that only one document stock can be used, but that the researcher should be aware of the potential problems arising from actors framing concepts differently in different stocks.

Finding data sources that fulfil these four criteria is no easy task. Early applications of STCA used newspaper articles as the primary data source and focused on public and/or political debates around novel technologies, business models and policies. By zooming in on critical phases of expected re-configurations, e.g. during crises, natural disasters, political turmoil, etc., it was possible to further delineate the field and time period, and to identify relevant newspaper sources. Another way to ensure representativeness has been to focus on articles in trade journals or newsletters that have an explicit aim of covering ongoing debates in a particular field.

STCA has also been applied with data from interviews. This is suitable when the analysis focuses on delineated parts of the broader field, for example urban/regional innovation systems, specific policy circles, groups of niche actors or grassroots movements. A key enabling factor for doing STCA with interviews as the data source is to be able to reach the representativeness of “all” actor types in a population. Care needs to be taken when selecting interview partners so that the selection of interviewees represents the part of the field you are focusing on.

In addition, STCA has been used with bibliographic data, more specifically the abstracts of articles published in academic journals, for the analysis of scientific fields. Here, the representativeness and temporal consistency can be argued to be high. A key consideration is to define what (sub-)field can be meaningfully captured by the chosen data source. Other data sources that might fulfil these criteria and ought to be explored in the future include company reports, project reports, policy documents, press releases and comments in social media. Steps have been taken to include the latter in particular applications of STCA, by analyzing the reactions to different storylines found through an analysis of discursive newspaper data.

Example of STCA using newspaper articles from NexisUni

One major source of data for conducting STCA is NexisUni – an English newspaper repository that contains news articles, magazine articles, government and legal documents, etc. These document stocks are offered over long periods with wide geographical distribution, which enables the reconstruction of longer-term socio-technical developments across places.

Finding useful data in large repositories like NexisUni may be a challenge, given the vast number of documents they contain. In general, researchers should first try out some search strings in the search engine, based on relevant keywords identified beforehand, and select or unselect the different categories (such as types of documents, regions, and periods of analysis) accordingly.

A detailed guide to using NexisUni, along with technical strategies for search strings, can be found on the NexisUni website.

In the following, we share additional tips for finding useful data with NexisUni based on an exemplary case study drawn from Bandau (2021). In this case, the research questions were more or less clear from the beginning. The researcher had already identified the volume of data s/he was looking for, based on the feasibility of the study in terms of time and the kinds of documents involved (e.g. government documents are often much longer than news articles and therefore more time-consuming to code). A feasible final dataset was estimated at between 150 and 200 news articles. However, a preliminary review of the articles on the sector in the database indicated that about 40% to 50% of them were irrelevant due to duplication, content, etc. Therefore, the search string should identify about 350 to 400 articles to allow for subsequent manual filtering.
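The sizing logic in the paragraph above can be made explicit with a short calculation. The following Python sketch is an illustration only and not part of the original study; the function name `required_hits` is hypothetical. It computes how many raw search hits are needed so that a target number of articles remains after a given share has been discarded.

```python
def required_hits(target_final: int, discard_pct: int) -> int:
    """Raw hits needed so that `target_final` articles remain
    after `discard_pct` percent of the hits are discarded.

    Hypothetical helper for illustration; integer arithmetic is used
    so that the result is rounded up without floating-point surprises.
    """
    if not 0 <= discard_pct < 100:
        raise ValueError("discard_pct must be in the range 0..99")
    kept_pct = 100 - discard_pct
    # Ceiling division: smallest integer n with n * kept_pct / 100 >= target_final
    return -(-target_final * 100 // kept_pct)


# Targets and discard shares taken from the text: 150 to 200 articles, 40% to 50% irrelevant
for target in (150, 200):
    for discard in (40, 50):
        print(target, discard, required_hits(target, discard))
```

Under these assumptions, the required raw volume lies between 250 and 400 hits. The 350 to 400 range targeted in the study sits at the cautious upper end of that interval.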

The case study selected was the development of the global satellite navigation sector from 2000 to 2020. Based on a background study, the researcher had already identified the research problem and questions, and was interested in finding out more about the dynamics of international conflict or cooperation among the four global navigation satellite systems (GNSS): the US-owned Global Positioning System (GPS), the Chinese-owned BeiDou, the Russian-owned GLONASS, and the EU’s Galileo. In terms of geographical distribution, the dataset was expected to lean towards US sources, given the country's globally leading status in the GNSS sector. In addition, the study aimed for a mixture of US and ‘international’ sources, with the latter including the European Union, Russia, and other countries as appropriate.

To ease our explanations, Table 1 shows a list of selected search string trials conducted for this study. The search process can be divided into four major steps, although this will not necessarily hold for every research project. Note that the tips provided here draw exclusively on experiences from one case study. More experiences will have to be collected to form a more generalizable search protocol.

Refer to the NexisUni guide to learn how to use term connectors and proximity connectors when formulating search strings.

Based on a background study (e.g. literature review, desk research), the researcher had already identified several keywords beforehand, and began with search string trials. As shown in Table 1, Step 1 involved trying out different search strings (no. 1, no. 2, and no. 3) based on the general keywords identified, to obtain an overview of the sector’s historical development over a long time range. All three search strings returned more articles than was feasible to code. The focus from here on was to reduce the number of articles while ruling out other potential alternatives. In Step 2, search strings no. 4 and no. 5 show examples of trying out certain modifications (e.g. increasing the minimum occurrence of certain words, using ‘navigation systems’ instead of ‘satellite navigation’, adding ‘geopolitic’). No. 4 led to an even larger number of articles, whereas no. 5 led to an overall reduced number of articles but with a downward trend over time. Note that, given its research objective, this study was interested in an upward trend, which indicates that the generated dataset represents a growing discursive theme in the selected sense-making platform. In the current case, the results from no. 4 and no. 5 therefore showed that those search strings were either too broad or too narrow, and that the search strings identified in Step 1 were sufficiently good. This helped rule out other potential alternatives and suggested that a better strategy would be to build on the search strings already identified in Step 1.

Table 1: Search process at NexisUni

Source: Adapted from Bandau (2021). Note: The actual search process for each of the four steps may involve more iterations and variants of trials.

To reduce the number of articles while maximizing relevant content, Step 3 is about narrowing the search strings from Step 1. Based on the preliminary review, it seemed that the relevant articles tend to mention one of the four GNSS. Therefore, the keywords GNSS, GPS, BeiDou, GLONASS, and Galileo were included in the search string. Comparing the results of Step 1 and Step 3 indicates that adding the names of those navigation systems indeed helped reduce the numbers. All of them, however, led to downward trends in the number of articles over time. This indicated that those search strings might not have captured the most relevant and salient discursive topics in the news platform during those periods.
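Whether a search string yields an upward or downward trend can be judged by eye from the yearly hit counts, but a simple numeric indicator may help when comparing many trials. The sketch below is an assumption on our part: the source does not state how trends were assessed, and a least-squares slope is only one possible choice. The function name `trend_slope` and the yearly counts are hypothetical and are not taken from Bandau (2021).

```python
from typing import Mapping


def trend_slope(counts_by_year: Mapping[int, int]) -> float:
    """Ordinary least-squares slope of article counts over years.

    A positive value suggests an upward trend, a negative value a
    downward one. Hypothetical helper for illustration only.
    """
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        raise ValueError("need counts for at least two years")
    mean_year = sum(years) / n
    mean_count = sum(counts_by_year[y] for y in years) / n
    covariance = sum((y - mean_year) * (counts_by_year[y] - mean_count) for y in years)
    variance = sum((y - mean_year) ** 2 for y in years)
    return covariance / variance


# Invented yearly hit counts for two imaginary search strings (NOT study data)
string_a = {2016: 30, 2017: 26, 2018: 21, 2019: 15, 2020: 12}
string_b = {2016: 10, 2017: 14, 2018: 19, 2019: 25, 2020: 22}

for name, counts in (("A", string_a), ("B", string_b)):
    slope = trend_slope(counts)
    direction = "upward" if slope > 0 else "downward or flat"
    print(name, direction)
```

A single slope hides short-term movements such as a decline in the last few years of an otherwise rising series, so the yearly counts should still be inspected directly.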

Reducing the required number of keyword occurrences and modifying the keywords of no. 5 therefore seemed a promising next strategy. Step 4 focuses on obtaining an upward trend while reducing the number of articles to a feasible level. After different trials, search string no. 8, as an example, reduced the required number of occurrences (in the hope of finding an upward trend) while specifying the appearance of ‘satellite navigation’ to avoid getting an even larger set of data. It represents a potential search string to be used, given a generally upward trend except in the last few years, and a number of articles that comes close to feasibility but may be too small after manual filtering.

Finally, search string no. 9 built on search string no. 8 by broadening the keywords to also include ‘navigation satellite’, ‘navigation system’, and ‘tension’, while maintaining the required number of occurrences. It yielded a suitable number of articles (375 before manual filtering), showed a generally upward trend, and had an appropriate geographical distribution of sources. The slight decline in the last few years was justifiable, given the general shift of focus in the space sector towards growing satellite constellations and space debris, and the swift turn of the international public media towards the coronavirus pandemic in 2020. The dataset from search string no. 9 was manually filtered to exclude duplicates and irrelevant news. The final dataset used for coding comprised 177 articles, which is in line with the aim and feasibility of the study.
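The filtering in the study was done manually. As a possible pre-step, exact or near-exact duplicate headlines could be flagged automatically before reading the articles. The Python sketch below illustrates this idea under our own assumptions: the helper names and the sample headlines are hypothetical, and matching on normalised titles is a deliberately crude criterion that will miss reworded duplicates. The last line computes the retention share from the two figures reported above (375 retrieved, 177 kept).

```python
import re


def normalise(title: str) -> str:
    """Lower-case a headline and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def drop_duplicates(articles):
    """Keep the first article for each normalised headline.

    `articles` is a list of dicts with a 'title' key (hypothetical format).
    """
    seen = set()
    unique = []
    for article in articles:
        key = normalise(article["title"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Invented headlines for illustration (NOT from the study's dataset)
sample = [
    {"title": "Galileo launch delayed"},
    {"title": "GALILEO launch delayed!"},
    {"title": "BeiDou goes global"},
]
print(len(drop_duplicates(sample)))

# Retention share implied by the figures reported in the text
print(f"{177 / 375:.1%}")
```

The reported figures imply that roughly 47% of the retrieved articles were kept, i.e. slightly more than half were discarded, which is a little above the 40% to 50% estimated in the preliminary review.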

Example of STCA using project documents

Substantive approaches to STCA, based on the identification of concrete activities and/or institutional changes that actors bring about in a given area (see Chapter 3), open up for the use of a wide range of documents beyond newspaper articles and other ‘discursive’ sources of data. Examples of activities include investments, promotional activities, and new market development. Potential documents which report on these include investment reports, annual reports, strategy documents, or project documents. Since the focus is on changes in the activities of specific actors, the approach presented here depends on zooming in on one or a subset of actors that can be considered representative of a socio-technical system. For this purpose, their relevance in a field should be determined by, for example, describing their business or project activities in detail.

The following example, based on a study of changes in the socio-technical configuration promoted by the World Bank in the water and sanitation sector (see Lesch, 2022), illustrates what the identification of appropriate documents could look like.

As the analysis was based on World Bank project activities, the first step was to identify the projects that were to be studied in detail. For this purpose, the project database available from the World Bank website was used as the core data source, and criteria for filtering the project database were defined according to the research questions. The search was limited to a specific time period, a minimum investment volume, and the economic sectors mentioned above. Depending on the research needs, one can adjust the criteria until a suitable number of projects is reached. In this particular analysis, we included about 100 projects, which turned out to be sufficient to provide meaningful network results. For meaningful representations of, for example, subsets of projects, we would have needed a higher number of individual projects in our dataset.
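The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in Lesch (2022): the `Project` fields, the threshold values and the example records are all hypothetical and do not reflect the structure of the actual World Bank project database.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical fields; the real database uses its own schema
    project_id: str
    approval_year: int
    commitment_musd: float  # investment volume in million USD
    sector: str


def select_projects(projects, first_year, last_year, min_commitment_musd, sectors):
    """Return projects within the period, above the volume threshold, in the chosen sectors."""
    return [
        p for p in projects
        if first_year <= p.approval_year <= last_year
        and p.commitment_musd >= min_commitment_musd
        and p.sector in sectors
    ]


# Invented records for illustration only
database = [
    Project("P001", 1995, 120.0, "Water supply"),
    Project("P002", 2005, 15.0, "Sanitation"),
    Project("P003", 2010, 80.0, "Sanitation"),
    Project("P004", 2012, 200.0, "Energy"),
]

selected = select_projects(
    database,
    first_year=2000,
    last_year=2020,
    min_commitment_musd=50.0,
    sectors={"Water supply", "Sanitation"},
)
print([p.project_id for p in selected])
```

Loosening or tightening `min_commitment_musd` or the period and re-running the selection mirrors the iterative adjustment of criteria until a suitable number of projects is reached.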

Since STCA is based on coding textual data, the next step is to identify a type of project document that provides information about both technologies (e.g., those to be funded) and associated institutions, such as the goals to be achieved by the funded technology or the values motivating the implementation of particular projects. In this particular case, ensuring temporal consistency and homogeneity over time is critical. The selected WB project documents included at least one section that followed the same structure over time and across projects, which was used as a basis for the analysis. Restricting the analysis to a section of the documents also helps to address two additional caveats. First, it reduces the number of pages to be coded, since project documents are often much longer than, for example, newspaper articles. Second, it reduces the risk of coding “buzzwords” that are often included in project documents to highlight the social relevance of a project but are not project-specific. Coding buzzwords leads to the creation of network links that do not reflect the socio-technical configuration per se, but rather the societal focus in a given time period.

In the World Bank example, the coding was limited to the ‘project description’ section of the project appraisal document, the document that is published once all project details have been negotiated and accepted by governments and the World Bank. In this way, it is possible to relate the identified socio-technical configurations to actual (and future) developments on the ground. However, it is also conceivable to select documents at another stage of project development, such as final assessment reports.
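When many documents share the same section structure, isolating the section to be coded can be partly automated before importing the text into NVivo. The sketch below assumes plain-text documents in which section headings stand alone on a line. The function name, the heading that ends the section and the sample text are hypothetical; real project appraisal documents may number or format their headings differently.

```python
def extract_section(text: str, start_heading: str, end_heading: str) -> str:
    """Return the lines between two headings (case-insensitive, exact line match).

    Returns an empty string if the start heading is not found.
    Hypothetical helper for illustration only.
    """
    start = start_heading.strip().lower()
    end = end_heading.strip().lower()
    collecting = False
    collected = []
    for line in text.splitlines():
        marker = line.strip().lower()
        if not collecting:
            if marker == start:
                collecting = True
            continue
        if marker == end:
            break
        collected.append(line)
    return "\n".join(collected).strip()


# Invented document text (NOT from an actual appraisal document)
document = """Strategic Context
Background on the sector.
Project Description
The project funds piped water connections.
It also supports utility reform.
Implementation Arrangements
Details on implementation."""

print(extract_section(document, "Project Description", "Implementation Arrangements"))
```

The extracted text should still be checked manually, since a heading that is missing or spelled differently leads to an empty or overly long result.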

Finally, if deciding to focus on more than one actor in the analysis, in this case extending the analysis beyond the World Bank, special care should be taken to ensure temporal consistency and homogeneity between project documents from different actors, which are likely to vary substantially in structure and content.


Bandau, S. (2021) Emerging Institutions in the Global Space Sector: An Institutional Logics Approach. Utrecht University.

Lesch, D. (2022) The Role of Global Actors in Sustainability Transitions – Exploring a Change Vehicle Trajectory in the Water and Sanitation Sector. MSc, Lund University, Lund.
