Journal of Artificial Intelligence Research 79 (2024) 971-1000. Submitted 04/2023; published 03/2024.
© 2024 The Authors. Published by AI Access Foundation under Creative Commons Attribution License CC BY 4.0.

Cultural Bias in Explainable AI Research: A Systematic Analysis

Uwe Peters (U.PETERS@UU.NL), Department of Philosophy, Utrecht University, 3512 BL Utrecht, The Netherlands
Mary Carman (MARY.CARMAN@WITS.AC.ZA), Department of Philosophy, University of the Witwatersrand, 2050 Johannesburg, South Africa

Abstract

For synergistic interactions between humans and artificial intelligence (AI) systems, AI outputs often need to be explainable to people. Explainable AI (XAI) systems are commonly tested in human user studies. However, whether XAI researchers consider potential cultural differences in human explanatory needs remains unexplored. We highlight psychological research that found significant differences in human explanations between many people from Western, commonly individualist countries and people from non-Western, often collectivist countries. We argue that XAI research currently overlooks these variations and that many popular XAI designs implicitly and problematically assume that Western explanatory needs are shared cross-culturally. Additionally, we systematically reviewed over 200 XAI user studies and found that most studies did not consider relevant cultural variations, sampled only Western populations, but drew conclusions about human-XAI interactions more generally. We also analyzed over 30 literature reviews of XAI studies. Most reviews did not mention cultural differences in explanatory needs or flag overly broad cross-cultural extrapolations of XAI user study results. Combined, our analyses provide evidence of a cultural bias toward Western populations in XAI research, highlighting an important knowledge gap regarding how culturally diverse users may respond to widely used XAI systems that future work can and should address.

1. Introduction

To combine the strengths and mitigate the limitations of human intelligence and AI, increasingly more hybrid human-AI (HHAI) systems (e.g., AI-assisted human experts) are being developed and used (e.g., for clinical decision-making) (Chen et al., 2020). Successful HHAI systems involve and depend on close proactive collaborations, trust, and mutual understandability between humans and AI systems (Bansal et al., 2019). Yet, many of the AI systems that are now frequently used in high-stakes decision-making domains are opaque, i.e., they operate in ways too computationally complex even for AI developers to fully understand (Burrell, 2016). This opacity raises questions about these systems' trustworthiness and can undermine successful HHAI collaborations: If the humans that are part of human-AI hybrid systems cannot understand why an AI produces the output it does, they may lack meaningful control over it in their collaboration with the model (Akata et al., 2020). AI explainability is thus vital for human-AI interactions.

One main approach to dealing with this challenge is to equip AI systems with XAI models developed to make opaque systems' outputs understandable to humans (Arrieta et al., 2020). However, the ability to understand the explanations that XAI systems produce may differ between individuals (Wang & Yin, 2021), and it has been noted that XAI designers frequently adopt a one-size-fits-all approach, suiting primarily only AI experts (Ehsan et al., 2021).
This may result in XAI systems that leave many AI users’ explanatory needs in human-AI interactions unaddressed and that potentially operate in ways at odds with, for instance, the EU’s General Data Protection Regulation (Casey et al., 2019). Apart from interpersonal differences in expertise, culture, i.e., the set of attitudes, values, beliefs, and behaviors shared by a group of people and communicated from one generation to the next (Matsumoto, 1996), may also significantly influence what explanations people expect or prefer from AI systems thus affecting human-AI collaborations. The importance of cultural differences has been noted in several areas of AI research including AI ethics (e.g., people’s responding to moral dilemmas faced by autonomous vehicles; Awad et al., 2018) and calls for greater cultural inclusivity in AI developments and applications are increasing (Carman & Rosman, 2021; Linxen et al. 2021; Okolo et al., 2022). However, it remains unclear to what extent there are XAI-relevant cultural variations in explanatory needs. Several AI review papers have drawn XAI researchers’ attention to psychological findings on human explanations (Abdul et al., 2018; Miller, 2019). But they have not considered empirical work on cultural differences in human explanatory needs, leaving it unclear whether there are XAI-relevant cultural variations and what they would be. A related concern is that studies in the behavioral sciences, including the fields of human- computer interaction (HCI) and human-robot interaction (HRI), found that many researchers predominantly tested only individuals from Western, educated, industrialized, rich, and democratic (WEIRD) countries, even though WEIRD people comprise only 12% of the world population (Rad et al., 2018; Linxen et al., 2021; Seaborn et al., 2023). The field of XAI might have taken countermeasures and be less affected by WEIRD sampling. However, it has not been investigated whether that is so. In a recent systematic review focusing on XAI research in the Global South, Okolo et al. (2022) found only three XAI papers that engaged with or involved people from communities in the Global South. But the authors did not examine to what extent XAI studies outside the Global South may nevertheless be culturally diverse. More importantly, while several recent studies report that WEIRD sampling may severely limit the generalizability of HCI or HRI studies (Linxen et al., 2021; Seaborn et al., 2023), these studies do not yet control for the point that studies sampling only individuals from one kind of population can be unproblematic even if there are relevant cultural differences. After all, researchers may tailor their conclusions to their specific sample or study population, making clear that other populations remain to be explored. WEIRD sampling may only become questionable when findings are presented as if they apply beyond these populations and researchers produce ‘hasty generalizations’, i.e., conclusions whose scope is broader than warranted by the evidence and justification provided by the researchers (Peters & Lemeire, 2023). Relatedly, we recently found that the scope of generalizations in many XAI user studies was only poorly correlated with the size of the studies’ samples suggesting that hasty, overly broad extrapolations may have been common (Peters & Carman, 2023). 
However, no prior corpus analysis has investigated how broadly results are generalized across cultures in XAI user studies, leaving a significant gap in the previous work that highlights problems related to WEIRD sampling. Hasty generalizations of study results may obscure cultural variations in people’s XAI needs and increase the risk that large parts of the world population are overlooked in the development of XAI and HHAI systems. Analyzing XAI user studies for hasty generalizations is therefore vital. CULTURAL BIAS IN EXPLAINABLE AI 973 Here, we aim to fill the research lacunas just outlined. We offer three main contributions. First, by drawing on existing psychological studies, we argue that many popular XAI models are likely better aligned with the explanatory needs and preferences that were found in people from typically individualist, commonly WEIRD cultures than with those that were found in people from typically collectivist, commonly non-WEIRD cultures. We outline a range of cultural differences that may affect many people’s perception of XAI outputs, making them relevant for research on human-AI collaborations. Second, we analyzed an extensive corpus of over 200 XAI user studies to examine whether they indicate awareness of cultural variations in explanations, have diverse samples, or avoid overgeneralizing their results (e.g., to non-WEIRD populations that were not tested). We found that most of these studies failed on all three counts. Finally, to see whether these problems have been noticed in XAI user research, we also systematically analyzed more than 30 literature reviews of XAI user studies. Most reviews, too, did not indicate any awareness of relevant cultural variations in people’s explanatory needs. Nor did they mention the problems of WEIRD sampling and hasty, overly broad generalizations of results to non-WEIRD individuals in XAI user studies. Our analyses therefore provide evidence of both a significant cultural bias toward WEIRD populations and a knowledge gap on whether popular XAI models’ outputs are satisfactory across cultures. We end with a set of recommendations to culturally diversify XAI user studies. 2. Explainable AI Focusing on Internal Factors Risks Overlooking Collectivist Cultures Two broad categories of XAI techniques are often distinguished: transparent models, which are strictly interpretable because of their relatively simple structure (e.g., linear and logistic regression models, short decision trees), and post-hoc systems, which may either directly access or infer factors causally contributing to an opaque model’s decisions after its training (Arrieta et al., 2020). Post-hoc models currently dominate XAI designs for lay-users (Taylor & Taylor, 2021). Their outputs may be visual (e.g., saliency maps), numerical (e.g., importance scores), or textual (e.g., feature reports), and generally cite factors that are internal to an opaque model and determinative of its decision (Arrieta et al., 2020). In that sense, post-hoc XAI outputs are often thought to be analogous to the human way of explaining decisions in terms of internal mental states (belief, desires, etc.) (Adadi & Berrada, 2018; Zerilli, 2022) and frequently contain mentalistic notions (‘being confident’, ‘think’, ‘know’). Table 1 presents examples. 
Table 1: Four examples of internalist XAI outputs from XAI user studies

(1) XAI: "I am C(x) confident that y will be correct based on |S| past cases deemed similar to x." (Waa et al., 2020, p. 4)
(2) XAI: "Here is why the classifier thinks so [presentation of (e.g.) a decision tree]." (Yang et al., 2020, p. 1)
(3) XAI: "Why this exercise? Wiski thinks your current level matches that of this exercise!" (Ooge et al., 2022, p. 3)
(4) XAI: "ShapeBot knows this is a [AI output] because ShapeBot realizes [decision factors]." (Zhang et al., 2022, p. 10)

Post-hoc XAI systems producing such internalist explanations have been criticized for being "algorithm-centered", as they tend to ignore the social context in which AI systems operate (Ehsan et al., 2021). Yet, many AI researchers now hold that for lay-users, XAI systems should provide explanations that cite internal states that are viewed as analogues to human beliefs or desires because they are shorter, easier to understand, and people expect such explanations (De Graaf & Malle, 2017; Zerilli et al., 2019). Correspondingly, a "significant body of work in XAI aims to explain ML [machine learning] systems by reducing their operations to a form that is amenable to belief-desire representation" and so "intentional stance" interpretations (Zerilli, 2022, p. 2). Hence, many currently popular XAI designs for lay-users rest on the assumption that people in general prefer internalist explanations of behavior, i.e., explanations invoking an agent's intentional, inner states. However, none of the just cited papers that argue that XAI systems should provide such (intentional stance) explanations have so far reflected on whether this kind of explanation is equally used and accepted across all cultures. This is problematic, as the explanations that people prefer for a given decision or action are unlikely to be uniform cross-culturally.

To illustrate this point, we will focus on internalist explanations and variations between individualist cultures, where a person's self is often viewed as a discrete entity independent of others, and collectivist cultures, where a person's self is often viewed as interdependent with others (Hampton & Varnum, 2020). While differences between these two cultures are not limited to particular regions (Fatehi et al., 2020), and people within a country are usually highly heterogeneous, preventing a clear demarcation of cultures by countries (Oyserman et al., 2002), several recent studies found that WEIRD countries (e.g., the USA) were predominantly individualist whereas non-WEIRD countries (e.g., China) were predominantly collectivist cultures (Klein et al., 2018; Pelham et al., 2022). Figure 1 visualizes this evidence on the link between cultures and countries.

Figure 1: World map displaying the geographical distribution of collectivist and individualist cultures; horizontal stripes indicate WEIRD countries (map was self-created using MapChart).
The overlap between individualist cultures and WEIRD countries and collectivist cultures and non-WEIRD countries, respectively, is important here because psychological studies on human explanations consistently found that while participants from individualist, typically WEIRD cultures such as the USA, did tend to explain behavior primarily in terms of an agent’s internal mental features (e.g., attitudes, character, or beliefs), participants from collectivist, commonly non-WEIRD cultures such as India, Korea, Saudi Arabia, and China, instead preferentially explained behavior in terms of external factors including social norms, task difficulty, or economic circumstances (Miller, 1984; Cha & Nam, 1985; Al-Zahrani & Kaplowitz, 1993; CULTURAL BIAS IN EXPLAINABLE AI 975 Lillard, 1998). To illustrate the difference, suppose an observer sees, for instance, a nurse assisting a patient in trouble, or a man robbing a bank. If the observer has an externalist focus in their behavior explanation, they may hold that the nurse acts that way because she has the social role to look after patients, and the man committed the crime because of economic hardship, respectively, rather than internal factors such as beliefs or desires. Studies exploring such differences in externalist vs. internalist explanations found that many people from non-Western populations (i.e., Asian-Australian, Chinese-Malaysian, Filipino, Japanese, Mexican) more strongly endorsed ideas that suggested that internal traits did “not describe a person as well as roles or duties do, and that trait-related behavior changes from situation to situation” (Henrich et al., 2010, p. 12). Correspondingly, in Pacific societies, many people were found to be under the expectation to “refrain from speculating (at least publicly) about what others may be thinking” (Robbins & Rumsey, 2008, p. 407), and in some collectivist societies, “explanations of behavior seem to require an analysis of social roles, obligations, and situational factors” (Fiske et al., 1998, p. 915). These well-documented cultural differences (Lillard, 1998; Lavelle, 2021) matter for XAI development. We do not challenge that AI programmers or other expert AI users need to have insights into a system’s internals to debug it and so may prefer internalist XAI outputs (Bhatt et al., 2020). However, the findings just outlined cast doubts on the common view in XAI research that internalist explanations are analogous to how people in general, including lay users, preferentially explain behavior. The findings raise the possibility that potentially many individuals from collectivist cultures (which form 70% of the world population; Triandis, 1995) may often prefer or even require externalist explanations, i.e., explanations with more reference to context, social functions, norms, or others’ behavior than to internal states. If XAI systems produce predominantly only internalist explanations and do not sufficiently cite external factors, people in collectivist cultures may find them unsatisfactory and less trustworthy. To make the difference between internalist and externalist explanations with respect to XAI outputs more concrete, Table 2 provides four examples of potential externalist counterparts to the internalist XAI outputs from Table 1. 
Table 2: Pairs of internalist and potential externalist XAI outputs

(1) Internalist XAI: "I am C(x) confident that y will be correct based on |S| past cases deemed similar to x."
    Externalist XAI: "Y will be correct because my task is to find the most likely result based on |S| past cases deemed similar to x."
(2) Internalist XAI: "Here is why the classifier thinks so [presentation of (e.g.) a decision tree]."
    Externalist XAI: "The classifier produced this output because classification rules specify that given x, y is the case."
(3) Internalist XAI: "Why this exercise? Wiski thinks your current level matches that of this exercise!"
    Externalist XAI: "Why this exercise? In most Wiski users with your current level, this level matched that exercise!"
(4) Internalist XAI: "ShapeBot knows this is a [AI output] because ShapeBot realizes [decision factors]."
    Externalist XAI: "ShapeBot classifies this as a [AI output] because ShapeBot's task is to do so when presented with [decision factors]."

We currently lack the data to tell whether people from collectivist or individualist cultures will indeed respond differently to such XAI outputs because, to the best of our knowledge (and based on the corpus analysis we report below), this has not yet been investigated. Our point here is that, given the evidence that we have from previous psychological studies, there is reason to believe that differential reactions to these two kinds of XAI outputs are likely to occur in many individuals of the relevant cultures. It would therefore be valuable if XAI researchers experimentally tested and compared users' responses to the outlined internalist and externalist outputs (a sketch of how matched outputs of both kinds could be generated for such a comparison is given below). Given the current absence of explicit testing for or reflection on these potential differences, the popular use of "algorithm-centered" internalist post-hoc explanations in XAI developments (e.g., De Graaf & Malle, 2017; Zerilli et al., 2019; Zerilli, 2022; Ehsan et al., 2021) suggests that many XAI designs implicitly and problematically assume that Western explanatory needs and preferences are shared cross-culturally, revealing a cultural bias.

There are other cultural differences in human explanations and related cognitive processes than the individualist/internalist and collectivist/externalist variation. To draw XAI researchers' attention to them, in Table A1 in the Appendix, we present a range of psychological studies and reviews that we have not yet mentioned here and that strike us as especially relevant for XAI research. For instance, in experimental settings, participants from East Asia preferred more detailed explanations (Klein et al., 2014), more indirect, contextualized communication (Wang et al., 2010), and more similarity-based object categorization than Western participants did (Nisbett et al., 2001). All of that said, classifying cultures as individualist and collectivist or as WEIRD and non-WEIRD may not be the best way to account for cultural variations in explanation because this approach risks homogenizing and stereotyping users from the related countries. To mitigate this, XAI researchers may refrain from these dichotomies and instead investigate more broadly whether users from different cultural backgrounds are satisfied with one type of explanation, in which cases they may require or prefer internalist versus externalist outputs, or whether their choice is application dependent.
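To make such an experimental comparison concrete, the following minimal Python sketch shows one way an experimenter might render one and the same underlying model output with an internalist and an externalist template, for example as two conditions of a user study. It is a hypothetical illustration, not taken from any of the cited systems; the Explanation fields and the exact wordings are our own assumptions.

```python
# Hypothetical sketch: generating matched internalist and externalist phrasings
# of the same model output (e.g., two between-subjects conditions in a user study).
# The data structure and wordings are illustrative, not drawn from the cited systems.

from dataclasses import dataclass

@dataclass
class Explanation:
    prediction: str        # e.g., "loan approved"
    confidence: float      # model confidence in [0, 1]
    n_similar_cases: int   # number of similar past cases the model relied on

def internalist(e: Explanation) -> str:
    # Mentalistic framing: the system "is confident" / "considers" (cf. Table 1).
    return (f"I am {e.confidence:.0%} confident that '{e.prediction}' is correct "
            f"based on {e.n_similar_cases} past cases I consider similar to yours.")

def externalist(e: Explanation) -> str:
    # External framing: task, rules, and what held in comparable cases (cf. Table 2).
    # The frequency wording is deliberately loose and only illustrative.
    return (f"'{e.prediction}' was returned because the system's task is to report "
            f"the most likely outcome; in {e.n_similar_cases} comparable past cases, "
            f"this outcome applied about {e.confidence:.0%} of the time.")

if __name__ == "__main__":
    e = Explanation(prediction="loan approved", confidence=0.87, n_similar_cases=120)
    print(internalist(e))
    print(externalist(e))
```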
We do not intend the individualist/collectivist and WEIRD/non-WEIRD categories to be definitive of a culture (e.g., WEIRD and non-WEIRD groups are heterogeneous, not always clearly distinct, and should not be reified; Ghai, 2021). We only employ these categories here because they offer interpretative tools for examining cross-cultural differences that have already been used insightfully in other AI-related research (e.g., differences in algorithmic aversion; Liu et al., 2023) and do capture reliable (but fluid) cultural differences in human explanations between some members of WEIRD and non-WEIRD populations, making them relevant for XAI and HHAI research. To what extent are XAI researchers aware of the outlined cultural variations? To find out, we systematically reviewed XAI user studies.

3. A Systematic Analysis of XAI User Studies

Adapting key components from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework (Moher et al., 2009) and following a protocol used in previous work (Peters & Carman, 2023; Peters & Lemeire, 2023), we reviewed XAI user studies to answer four research questions (RQ):

RQ1. Do researchers that conduct XAI user studies indicate awareness that cultural variations may affect the generalizability of their results?

RQ2. What is the cultural background of the samples that XAI user researchers test?

RQ3. Do XAI researchers restrict their user study conclusions to their participants or study population, or generalize beyond them?

RQ4. Is the scope of researchers' conclusions related to the cultural diversity of their samples such that studies with broader conclusions are associated with more diverse samples?

3.1 Methodology

To identify relevant papers, in July 2022, we searched three major databases covering computer science and AI literature, i.e., Scopus, Web of Science, and arXiv, using a query containing 15 variants of key words related to XAI and end-users (for details, see Table A2, Appendix). The results were 2523 papers. After removing duplicates (n = 535), 1988 papers remained. Their titles and abstracts were scanned to find papers that met our selection criteria.

Selection criteria. We included any primary study (article, conference paper, chapter) that surveyed people on AI-based explanations of AI decisions and was published between January 2012 and July 2022. We excluded reviews, surveys, theoretical (incl. philosophical) papers, unpublished drafts (vs., e.g., arXiv preprints), guidelines, position papers, tutorials, technical, or applied papers (e.g., only introducing new XAI models), studies or surveys on other AI features than specific XAI outputs (e.g., 'algorithmic aversion'), and small-scale stakeholder or user studies with ≤ 5 participants, which is too small a sample to ensure robust generalizations (Cazañas et al., 2017). We also excluded non-English papers. Of the 1988 papers, 192 remained for further screening, during which forward snowballing produced 14 more papers, resulting in 206 articles for full-text analysis (Figure 2).

Figure 2: PRISMA flowchart of the systematic review. Papers identified from Scopus (n = 1554), Web of Science (n = 426), and arXiv (n = 543); total n = 2523. Papers removed before screening (duplicates): n = 535. Papers screened: n = 1988. Papers excluded: n = 1796 (technical, applied papers: n = 1270; theoretical papers, guidelines, perspectives, tutorials: n = 446; reviews, overviews: n = 76; small (≤ 5) user studies: n = 4). Papers sought for retrieval: n = 192. Additional papers identified via snowballing: n = 14. Papers included in review: n = 206.

Data extraction. During full-text analysis, we (two researchers) independently classified papers by using pre-specified criteria (and a binary label, 0 = no; 1 = yes) to extract the following information. Apart from publication year and participant recruitment practice (e.g., conventional sampling or Amazon Mechanical Turk (MTurk) crowdsourcing), we extracted XAI output type, classifying papers as 'internalist' when they tested XAI explanations purporting to capture models' internal decision parameters (e.g., local feature importance), or as 'externalist' when they tested XAI explanations citing external factors (e.g., context, cultural norms, social situation) or involved XAI-user interaction (e.g., follow-up questions). Additionally, we classified papers on whether authors indicated awareness that culture can influence people's responses to XAI outputs in ways affecting the study's generalizability. We also extracted participants' cultural background, operationalizing it as participants' country or region (e.g., Europe) (Sawaya et al., 2017). Nationality or region is not always coextensive with cultural background (Taras et al., 2016). But it was typically the only clue of cultural belonging in the papers, and analyses found that alternative social aggregates (e.g., ethnicity) contributed only negligible explained variance to that already captured by nations (Akaliyski et al., 2021). Depending on the sample's country or region, we also labelled a paper as 'WEIRD', 'non-WEIRD', or 'mixed' based on previous studies' geographical categorizations (Klein et al., 2018; Yilmaz & Alper, 2019).

Finally, we identified an article's scope of conclusion based on the population to which results were generalized. Scientists commonly distinguish three types of populations: the target population, i.e., people to whom results are intended to be applied in real-world contexts (e.g., all users of a system X); the study population, i.e., users who are available and eligible for the study (e.g., US users meeting specific inclusion criteria); and the study sample, i.e., participants drawn from the study population (Banerjee & Chaudhury, 2010). We coded articles as 'restricted' if, throughout their text, authors did not extrapolate their findings beyond their sample or study population but instead used qualifiers (e.g., 'our participants'), quantifiers (e.g., 'some European users'), or past tense to limit their claims or recommendations to these populations, or otherwise indicated that they are study, sample, context, or culture specific. Authors may in contrast also describe results by using generics, i.e., unquantified generalizations that refer to whole categories of people rather than specific, explicitly quantified sub-sets of them (e.g., 'Users prefer X' vs. 'Many (US, 75%, Western, etc.) users prefer X'). Or they may use other expressions that suggest that the results apply, for instance, to all non-experts, users, people, contexts, time, or cultures (for examples, see Table 3).
If a paper had at least one such broad results claim in the mentioned sections, it was classified as 'unrestricted'. Papers with both restricted and unrestricted claims were also labelled 'unrestricted' because manuscripts are typically revised multiple times. If authors do not qualify their broader claims in the revisions, there is reason to believe they consider their broader generalizations warranted.

Reliability. For each classification, inter-rater agreement was calculated (Cohen's κ). It was consistently substantial (between κ = .71 and .90). We additionally asked two project-naïve researchers to independently classify the scope of conclusion variable for 25% of the data using our criteria. Inter-rater agreement between their and our classifications was also substantial (κ = .66 and .74, respectively), providing an additional reliability control for this variable. All remaining disagreements were resolved by discussion before the data were analyzed. All our data are publicly accessible on an OSF platform here.

3.2 Results

Most of the 206 XAI studies (94.7%, n = 195) in our sample were published between 2019 and 2022. Several studies used multiple recruitment practices; 45.2% (n = 93) of all papers reported conventional sampling, followed by crowdsourcing via websites. The two most common websites were MTurk (29.1%, n = 60) and Prolific (9.2%, n = 19). While some papers tested multiple kinds of XAI outputs, 88.8% (n = 183) of the papers focused on internalist explanations, and only 14.6% (n = 30) mentioned external factors (incl. XAI-user interaction) as relevant for users' perception of XAI outputs. Moreover, just 3.4% (n = 7) of the papers considered explanations that invoked external factors that may appear in collectivist explanations, such as social rules, contexts, or social functions. None of the 206 papers explored the potential differences in people's responses to internalist versus externalist XAI outputs that we outlined above.

RQ1. Do researchers that conduct XAI user studies indicate awareness that cultural variations may affect the generalizability of their results? 93.7% (n = 193) of the papers did not display any awareness (e.g., in discussion, limitation, or conclusion sections) that there may be cultural differences in how people perceive XAI outputs that can undermine broad extrapolations of results. Relatedly, these papers did not provide support (e.g., arguments or evidence) for the assumption that human explanatory needs are invariant across cultures.

RQ2. What is the cultural background of the samples that XAI user researchers test? 48.1% (n = 99) of the papers did not report cultural information about their samples. Across the remaining 107 papers, 32 countries or regions were mentioned. The three most frequent ones were the US (n = 53), the UK (n = 13), and Germany (n = 12) (for details, see Table A3, Appendix). Moreover, from the 107 papers, 81.3% (n = 87) had only WEIRD samples, exceeding the numbers of papers with mixed samples (10.3%, n = 11) and with only non-WEIRD samples (8.4%, n = 9).

RQ3. Do XAI researchers restrict their user study conclusions to their participants or study population, or generalize beyond them? Since 99 papers did not provide cultural information, they could have involved diverse samples. Broad generalizations may in this case be unproblematic. Since we could not determine cultural background in these papers, we analyzed only the remaining ones with this information (n = 107).
70.1% (n = 75) of them contained unrestricted conclusions, i.e., claims that suggested that the study results applied to all (e.g.) non-experts, people, users, consumers, humans, contexts, or time. Table 3 below presents examples (a full list of all unrestricted claims that we found in the papers can be found here).

Table 3: Examples of unrestricted conclusions

(1) "Our user study shows that non-experts can analyze our explanations and identify a rich set of concepts within images that are relevant (or irrelevant) to the classification process." (Schneider & Vlachos, 2023, p. 4196)
(2) "Our pilot study revealed that users are more interested in solutions to errors than they are in just why the error happened." (Hald et al., 2021, p. 218)
(3) "Our findings demonstrate that humans often fail to trust an AI when they should, but also that humans follow an AI when they should not." (Schmidt et al., 2020, p. 272)
(4) "We also found that users understand explanations referring to categorical features more readily than those referring to continuous features." (Warren et al., 2022, p. 1)
(5) "Both experiments show that when people are given case-based explanations, from an implemented ANN-CBR twin system, they perceive miss-classifications to be more correct." (Ford et al., 2020, p. 1)
(6) "Results indicate that human users tend to favor explanations about policy rather than about single actions." (Waa et al., 2018, p. 1)
(7) "Our findings suggest that people do not fully trust algorithms for various reasons, even when they have a better idea of how the algorithm works." (Cheng et al., 2019, p. 10)

RQ4. Is the scope of XAI researchers' conclusions related to the cultural diversity of their samples such that studies with broader conclusions are associated with more diverse samples? To address this question, focusing only on the papers with cultural information (n = 107), we first analyzed the scope of conclusions in the papers with only WEIRD, only non-WEIRD, and mixed samples. If studies with broader conclusions have more diverse samples, then one would predict that papers with unrestricted conclusions tend to have mixed samples, i.e., not either only WEIRD or only non-WEIRD samples. Table 4 presents the comparisons. Unlike predicted, 90.7% (n = 68) of the 75 papers with unrestricted claims had in fact only WEIRD (84%, n = 63) or only non-WEIRD (6.7%, n = 5) samples. Moreover, if papers with broader conclusions had more diverse samples, then papers with unrestricted claims should include a higher proportion of papers with mixed samples compared to papers with restricted claims. However, a χ2 test showed that there was no evidence of a statistically significant difference of this kind (p = 0.51).

Table 4: Distribution of papers with unrestricted and restricted conclusions by sample composition

Sample background   Restricted papers   Unrestricted papers   Total
Only non-WEIRD      4                   5                     9
Mixed               4                   7                     11
Only WEIRD          24                  63                    87
Total               32                  75                    107

Furthermore, when relating the number of countries or regions sampled in each paper (which ranged from 1 to 19) to the scope of conclusion variable, we found that 82 papers mentioned only one country or region; these single-country papers nonetheless made up 74.7% (n = 56) of all papers with unrestricted conclusions (n = 75) (Table A4, Appendix). Finally, to statistically analyze whether papers with unrestricted conclusions had more diverse samples than papers with restricted conclusions, we also conducted a Mann-Whitney U test (as our data were not normally distributed) with the number of countries/regions as our dependent scale variable and scope of conclusions as the categorical independent variable.
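For transparency, the following minimal Python sketch (an illustration, not our original analysis script) shows how the two tests just described can be re-run with SciPy on the published summary counts: the contingency table comes from Table 4, and the per-paper counts of sampled countries/regions are reconstructed from the frequency distribution in Table A4 (Appendix); details such as tie handling may differ from the software we used.

```python
# Illustrative re-analysis sketch (not the original analysis script):
# the chi-squared and Mann-Whitney U tests reported in the text, run with SciPy
# on the summary counts published in Table 4 and Table A4 (Appendix).

import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Chi-squared test: sample composition (rows) vs. scope of conclusion
# (columns: restricted, unrestricted), using the Table 4 counts.
table4 = np.array([
    [4, 5],    # only non-WEIRD samples
    [4, 7],    # mixed samples
    [24, 63],  # only WEIRD samples
])
chi2, p_chi2, dof, _ = chi2_contingency(table4)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.2f}")  # p is approx. 0.51

# Mann-Whitney U test: number of countries/regions per paper by scope of
# conclusion, reconstructed from Table A4 (e.g., 26 restricted papers sampled
# exactly one country or region). The text reports p = 0.59 for this comparison.
restricted = np.repeat([1, 2, 3, 6], [26, 2, 2, 2])
unrestricted = np.repeat([1, 2, 3, 4, 5, 6, 8, 19], [56, 5, 5, 3, 1, 3, 1, 1])
u_stat, p_u = mannwhitneyu(restricted, unrestricted, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_u:.2f}")
```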
We found no evidence that papers with unrestricted conclusions had a statistically significantly higher number of countries or regions in their samples than papers with restricted conclusions (p = 0.59). Our findings indicate significant shortcomings in many XAI user studies. But before interpreting the results, it is worth exploring whether researchers who have conducted literature reviews of XAI user studies have noticed any of the issues that we have just reported, i.e., a lack of awareness of relevant cultural variations, pervasive WEIRD sampling, or broad generalizations of XAI user study results from WEIRD to non-WEIRD populations. We therefore extended our systematic review to recent literature reviews of XAI user studies themselves.

4. A Meta-review of Reviews about XAI User Studies

Following the same procedure as before, we explored four questions:

RQ1. Do literature reviews about XAI user studies indicate that there may be cultural variations in explanatory needs that can affect the generalizability of study results?

RQ2. Do these reviews comment on WEIRD population sampling in XAI user studies?

RQ3. Do they comment on potential hasty generalizations in these studies?

RQ4. Do the authors of reviews about XAI user studies restrict their own conclusions from these studies to particular samples or study populations or generalize beyond them?

4.1 Methodology

To find reviews to analyze, in September 2022, we used the same three databases (Scopus, Web of Science, arXiv) and search strings as before but now added the specific restrictor "review" (for details, see Table A4, Appendix). The search results were 130 papers. 10 duplicates were removed. Titles and abstracts of the remaining 120 papers were scanned for articles meeting our inclusion criteria. We included any literature review of XAI user studies that was published between January 2012 and September 2022. We excluded any theoretical paper about AI principles or XAI guidelines, and any review paper about AI or XAI that did not focus on XAI user studies. Non-English publications were also excluded. 24 articles remained for further screening, during which forward snowballing produced 10 more papers, yielding 34 articles for full-text analysis (see Figure 4).

Figure 4: PRISMA flow diagram for the meta-review. Papers identified from Scopus (n = 62), Web of Science (n = 19), and arXiv (n = 49); total n = 130. Papers removed before screening (duplicates): n = 10. Papers screened: n = 120. Papers excluded: n = 96 (theoretical papers, not reviews: n = 62; reviews of AI/XAI in general, not XAI user studies: n = 34). Papers retrieved: n = 24. Additional papers identified via snowballing: n = 10. Papers included in review: n = 34.

During full-text analysis, we independently classified papers according to the following information (in addition to publication year). We applied a binary label (0 = no or 1 = yes) to a review if it (1) indicated that there might be cultural, contextual, or social variations in people's perceptions of XAI outputs that are relevant for user research, (2) contained comments on WEIRD sampling in XAI studies, (3) noted the relevance of stating cultural, national, or regional background in user studies, or (4) commented on potential hasty generalizations of results in these studies.
We also classified reviews according to the scope of conclusions that they drew from XAI user studies, employing the same restricted versus unrestricted distinction as previously. To ensure reliability for the classifications, we again calculated inter-rater agreement. It was consistently substantial (between κ = .78 and κ = .94).

4.2 Results

In our sample of 34 literature reviews, most reviews (91.2%, n = 31) were published between 2019 and 2022 (Table A5, Appendix).

RQ1. Do literature reviews about XAI user studies indicate that there may be cultural variations in explanatory needs that can affect the generalizability of study results? In 82.4% (n = 28) of the reviews, this did not happen. Moreover, from the 6 reviews that briefly referred to culture, just 2 mentioned variations relevant for XAI, noting that there are "clear cultural differences in preference for simple versus complex explanations" (Mueller et al., 2019, p. 77) and "differences in the preference for personalized explanations depending on their cultural background" (Sperrle et al., 2020, p. 5). However, neither paper elaborated on or offered an overview of XAI-relevant cultural differences.

RQ2. Do reviews about XAI user studies comment on WEIRD population sampling in these studies? 94.1% (n = 32) of the reviews did not do so. Only 2 reviews (Sperrle et al., 2020; Laato et al., 2022) displayed sensitivity to the relevance of being explicit about XAI users' cultural, national, or regional background in XAI user studies. However, these papers did not specifically review XAI user studies to explore the extent of WEIRD sampling, nor did they offer quantitative data on it.

RQ3. Do reviews of XAI user studies comment on potential hasty generalizations in these studies? Just 1 of 34 reviews mentioned unwarranted extrapolations, writing that "perhaps the greatest challenge in the study of HAI [human-AI] teams […] is simply resisting the urge to overgeneralize experimental results" (Zerilli et al., 2022, p. 7). However, the authors did not provide quantitative evidence of the extent of hasty, overly broad generalizations in XAI papers and did not consider XAI-relevant cultural differences or WEIRD sampling. They also themselves overgeneralized some study results when writing, for instance, that "in very simple automated tasks involving a single person, people tend to distrust automated aids whose errors they witness, unless an explanation is provided" (ibid). This brings us to RQ4.

RQ4. Do the authors of reviews about XAI user studies restrict their own conclusions from these studies to particular samples or study populations or generalize beyond them? From all 34 reviews, 82.4% (n = 28) involved generalizations beyond any particular sample or (e.g., national) study population. Table 5 presents examples.

Table 5: Seven examples of unrestricted conclusions in XAI reviews

(1) "Users who are expert or self-confident in tasks that have been delegated to automation tend to ignore machine advice […]." (Zerilli et al., 2022, p. 4)
(2) "It has been recognised in the literature that counterfactuals tend to help humans make causal judgments." (Chou et al., 2022, p. 42)
(3) "Users tend to anthropomorphize AI and may benefit from humanlike explanations." (Laato et al., 2022, p. 14)
(4) "The textual explanations generated with GRACE were revealed to be more understandable by humans in synthetic and real experiments." (Islam et al., 2022, p. 20)
(5) "Humans lose trust in the explanation when soundness is low." (Gerlings et al., 2021, p. 5)
(6) "People tend to prefer complex over simple explanations if they can see and compare both forms." (Mueller et al., 2019, p. 77)
(7) "People rarely expect an explanation that consists of an actual and complete cause of an event." (Miller, 2019, p. 3)

5. General Discussion and Recommendations

Our analyses reveal significant methodological limitations in much of the currently available XAI user research. We briefly revisit the three main findings of our two reviews and introduce mitigation strategies for the problems that our results highlight.

(1) Lack of sensitivity to cultural variations in explanatory needs. In the first analysis, we found that almost 90% of the XAI studies that we reviewed focused only on internalist explanations.
As argued in Section 2, these explanations may better align with WEIRD individuals' explanatory needs than with those of non-WEIRD people in collectivist cultures, who may prefer externalist explanations (Henrich et al., 2010; Lavelle, 2021). Externalist explanations involving factors often highlighted in collectivist cultures were only explored in less than 4% of all studies. Moreover, in about 90% of the papers, authors did not display any awareness of cultural differences such as those discussed in Section 2 and outlined in Table A1. Our meta-review of literature reviews about XAI user studies additionally revealed that the vast majority of these reviews (> 80%) were not sensitive to cultural variations in people's explanatory needs either. These findings suggest that XAI researchers routinely overlooked potentially relevant cultural differences that can affect human-AI interactions.

To tackle these problems, we recommend that AI journals increase the cultural diversity of their reviewer pool to ensure viewpoint variation during manuscript evaluation (Linxen et al., 2021). Conference organisers, in turn, can use platforms such as OpenReview, which makes reviewer reports public, thereby allowing for an extra level of accountability (Wang et al., 2021). We also recommend that the increasing emphasis in reviews of XAI work on relating psychological findings to XAI developments (e.g., Miller, 2019; Rong et al., 2022) be extended to include data on the cultural differences summarized in Section 2 and Table A1.

(2) WEIRD sampling. We found that non-WEIRD populations were rarely sampled in XAI user research. This finding matches results from studies that explored sampling in HCI and HRI and report that 73-75% of papers tested only WEIRD populations (Linxen et al., 2021; Seaborn et al., 2023). However, our results suggest that the problem may be worse in XAI research, as more than 80% of XAI papers with relevant information involved only WEIRD participants. The findings of our meta-review add further weight to this problem because almost all (94%) of the reviews in our sample overlooked the predominantly WEIRD sampling in XAI user studies.

There may be explanations for why WEIRD populations are over-represented. WEIRD countries may be a key market for XAI and hybrid human-AI designs. However, even if that is so, diverse sampling can still be advantageous because a significant number of people in WEIRD countries have diverse, non-WEIRD backgrounds (e.g., people from China, India, South America living in the USA) (Budiman & Ruiz, 2021).
XAI products tested on more diverse users may thus ultimately be more profitable even in WEIRD countries, as they can appeal to a wider market. Another reason for the pervasive WEIRD sampling may be that XAI user studies are predominantly conducted in WEIRD countries and geographically diverse sampling can be complicated. However, experiments through online platforms (e.g., MTurk or Prolific) are often feasible and have wider reach, enabling more diverse sampling. 29.1% (n = 60) of the XAI studies we reviewed already used MTurk. Even then, though, caution is warranted, as research suggests that most MTurkers (80%) come from the USA (Keith & Harms, 2016). It is worth noting that comparative studies found that Prolific has more diverse, less dishonest, more attentive, and more reliable participants, providing higher quality data than MTurk (Peer et al., 2017). Yet, we found that only 9.2% of the reviewed XAI papers used Prolific, suggesting that many XAI researchers may be unaware of these differences. Thus, while conventional sampling should not be abandoned (e.g., it may be needed to study people without computer access), we recommend that XAI researchers recruit via Prolific, or LabintheWild, a crowdsourcing platform specifically developed to tackle the WEIRD sampling problem (Reinecke & Gajos, 2015), rather than MTurk to increase cultural diversity in user studies. We acknowledge, however, that testing culturally diverse groups can also come with conceptual challenges because culture itself can and has been defined in multiple ways (Baldwin et al., 2006), where different ways of operationalizing culture (e.g., ethnicity, values, collective traits, country of residence, citizenship, heritage, shared language; Taras et al., 2009, 2016) can make comparisons between XAI studies and assessments of appropriate levels of generalizations difficult. Many existing measures of culture draw on Hofstede’s (1980) methodology and his self- report questionnaire containing items about individualism/collectivism, power distance, uncertainty avoidance, and masculinity/femininity (Taras et al., 2009). Since Hofstede’s original questionnaire is perhaps too long for inclusion into XAI user studies (it contained 126 questions), for XAI studies investigating, for instance, individualist/collectivist differences, we recommend that researchers adapt the related items from this questionnaire, as it is validated. That said, Hofstede’s theory and methodology have also been criticized for being overgeneralizing (McSweeney, 2002), leading some technology researchers to use nationality as a proxy for culture instead (Ur & Wang, 2013; Sawaya et al., 2017). To capture that culture is a multidimensional construct, XAI researchers may therefore refrain from any single definition of culture and instead individually measure (via self-report items) users’ nationality, racial/ethnic background, country of residence, home language and the relevant aspect of Hofstede’s construct and then conduct regression analyses to identify and report the strongest predictor of responses to XAI outputs. This can enable insights into culture-related variations and may allow for comparisons and extrapolations across social groups and XAI user studies without invoking simplistic characterizations of culture. Finally, it is important to note that as many as 48.1% (n = 99) of the reviewed studies did not report cultural (country/region) information about their samples. 
While these studies may have involved diverse population samples, the absence of reporting suggests that this information was not considered relevant for replication or generalizability. Not reporting on cultural information may be justified within a given study. However, it could also reflect implicit assumptions and biases about whether findings from particular populations are more generalizable than findings CULTURAL BIAS IN EXPLAINABLE AI 985 from other populations (Cheon et al., 2020). We therefore recommend that researchers either report information about participants’ cultural backgrounds in ways discussed above or provide “constraints on generality” statements, specifying the study population and the basis for believing that the sample is representative of it or broader populations (for guidance on these statements, see Simons et al., 2017; Linxen et al., 2021). (3) Hasty generalizations of XAI study results. Most of the XAI studies we analyzed contained conclusions that presented findings as if they held for whole categories of people (e.g., experts, users, humans) even when they had only tested WEIRD populations or a single country. Generalizations from WEIRD to non-WEIRD populations need not be unwarranted. Researchers who produced such extrapolations might have had good grounds to assume that this particular dimension of demographic variation was irrelevant for their study. So, it does not follow from the evidence that XAI user researchers drew conclusions about populations much wider than their study population that these generalizations were unwarranted. However, if all the unrestricted conclusions we found had been based on researchers’ reflection on relevant or irrelevant demographic differences, there should have been an indication of this reflection in their papers. This is because to fully establish a study conclusion and make the study reproducible, all underlying assumptions that justify the conclusions (including the potential assumption that people’s explanatory needs or preferences are cross-culturally invariant) need to be made explicit. Yet, as noted, we found that more than 90% of the reviewed XAI user studies did not contain any evidence of reflection on XAI-relevant cultural differences or invariance. Furthermore, we could not find any evidence that XAI user studies with broader claims had or were associated with more diverse samples. Hence, the unrestricted conclusions in most of the reviewed XAI papers were hasty generalizations, i.e., claims whose scope was broader than warranted by the evidence and justification provided by the researchers. Since psychological research suggests that explanatory needs likely differ between WEIRD and non-WEIRD populations, as discussed in Section 2, the pervasive insufficiently supported extrapolations that we found from WEIRD samples to other populations may indicate a cultural “generalization bias” (Peters et al., 2022) toward WEIRD populations in many currently available XAI user studies. Our findings fill a significant gap in previous studies in HCI and HRI that reported generalizability problems related to WEIRD sampling in these fields (Linxen et al., 2021; Seaborn et al., 2023). This is because these studies did not measure the scope of the generalizations that researchers produced thus leaving it unclear whether methodological shortcomings were involved. 
Indeed, while hasty generalizations across cultures have been reported in other fields (e.g., Peters & Lemeire, 2023), until now it has remained unknown whether they also occur in XAI, allowing XAI researchers to ignore or deny their presence in the field. Our results block this potential move, which is important, as encountering overgeneralizations in XAI user studies is particularly disconcerting. XAI user study results can directly feed into the production of XAI that a wide range of people later interact with (Ehsan et al., 2021; Ding et al., 2022; Okolo et al., 2022). These results can affect the way human-AI hybrid systems are developed by influencing which XAI models are included in HHAI designs. Generalizing results to cultural groups to whom they do not apply can hide that certain XAI and human-AI hybrid systems may only meet the explanatory needs of individuals with a particular cultural background, raising ethical concerns about both explainability and inclusivity. We thus recommend that XAI studies be conducted in collaborations with researchers or participants from different cultures. To further mitigate hasty generalizations, XAI researchers should consider restricting their user study conclusions by using quantifiers ('US users', 'our participants', 'many users', etc.), qualifiers (e.g., 'may', 'can'), or past tense (Peters et al., 2022). Table A7 in the Appendix presents examples of restricted versions of the unrestricted conclusions from Table 3.

6. Limitations

There are several constraints on the generalizability of our own analysis results. First, our literature search was limited to three major databases of scientific literature covering XAI studies. Second, there may also be XAI user studies that do not use our specific search terms and that we may have overlooked. Third, we focused only on English publications on XAI user research and may have overlooked, for instance, recent research from non-Western institutions that was not published in English. Indeed, the current boom of AI research, particularly in China, may significantly counteract the cultural bias in XAI user studies that we reported (Min et al., 2023). Future research analyzing the cultural diversity and models employed in Chinese-language XAI user studies is therefore desirable. However, since most scientific studies are now published in English (Ramírez-Castañeda, 2020), our findings remain important because of the size and influence of English within the scientific community. Another limitation of our analyses is that we used country or region as proxies for culture, as it was typically the only culture-related information in the reviewed papers. Using this proxy ignores expatriates, mixed national demographics, and shared, technology-facilitated experience. We therefore welcome future XAI research reviews that explore other proxies for cultural background in XAI user studies.

7. Conclusion

XAI systems play an increasingly significant role in many human-AI interactions because they can make opaque AI models more trustworthy to people, facilitating human control over these models. XAI developers are thus doing important work that is directly relevant for hybrid human-AI (HHAI) systems. Here, we examined whether currently popular XAI systems for lay-users are equally suitable for people from different cultural backgrounds.
We argued that XAI systems that produce internalist explanations (referring to mental states, e.g., beliefs) are currently popular but may cater primarily to the explanatory needs of people from individualist, typically WEIRD cultures. Psychological studies found that while most people from individualist cultures preferred human internalist explanations, people from collectivist, commonly non-WEIRD cultures tended to favor externalist explanations (referring to social roles, context, etc.). To help raise XAI and HHAI developers’ awareness of these and other cultural variations relevant for XAI design and human-AI interactions, we provided a table offering an empirically informed overview of them (see Table A1 in the Appendix). To support our claim that these variations are currently overlooked in XAI research, we analyzed 206 XAI user studies. Most of them contained no evidence that the researchers were aware of cultural variations in explanatory needs. Most studies also tested only WEIRD populations but researchers routinely generalized results beyond them. When we additionally analyzed 34 reviews of XAI user studies, we found that these problems went largely unnoticed even by most reviewers of these studies. In offering evidence of XAI-relevant cultural variations, of a widespread oversight of them in the field of XAI, and of pervasive WEIRD sampling paired with extrapolations to non-WEIRD populations, this paper uncovers both a cultural bias toward WEIRD populations and an important knowledge gap in the field of XAI regarding how culturally diverse users may respond to widely used XAI systems. If human-AI hybrids include XAI systems of the kind tested in most of the user studies we reviewed then these hybrids may inherit the mentioned cultural bias and be less inclusive than they appear and could be. We hope that our analyses help stimulate cross- and CULTURAL BIAS IN EXPLAINABLE AI 987 multi-cultural XAI user studies and improve the vital work that XAI and HHAI developers are doing in making AI systems more explainable and useful for all stakeholders no matter their cultural background. Acknowledgements We would like to thank Apolline Taillandier and Charlotte Gauvry for cross-checking the classifications of some of our key data. We are also very grateful for helpful comments from Caroline Gevaert, Benjamin Rosman, Alex Krauss, and three reviewers of this journal. We do not have funding to declare, and we grant JAIR the permission to publish this paper. UP conceived and designed the study, collected the data, did the data analysis, developed the argumentation, wrote the first draft, and did the editing. MC assisted with the collection of the data, data classification, revising, and editing of the paper. PETERS & CARMAN 988 Table A1: Select overview of psychological and HCI research with relevance for XAI design Author(s) Country/region Cultural variation Relation to XAI Howell (1981), Wierzbicka (1992), Lebra (1993) Incl. Peru, Japan, US Some non-Western cultures lack concepts comparable to Western psychological concepts ‘think’, ‘belief’, or ‘desire’ Use of mentalistic XAI framing (e.g., ‘AI thinks _’) Miller (1984), Al- Zahrani and Kaplowitz (1993), Morris and Peng (1994), Lee et al. (1996), Choi and Nisbett (1998) Incl. India, US, Korea, China, Japan, Saudi Arabia Non-Western (vs. 
Miller (1984), Al-Zahrani and Kaplowitz (1993), Morris and Peng (1994), Lee et al. (1996), Choi and Nisbett (1998) | Incl. India, US, Korea, China, Japan, Saudi Arabia | Non-Western (vs. Western) study participants referred to situations/social rules to explain behavior and were less susceptible to mistakenly explaining it through agents’ internal states when situational causes were available | Preferences for internalist vs. externalist outputs
Choi et al. (2003) | Korea, US | When explaining behavior, Korean participants preferred more contextual information than their US counterparts | Preferences for XAI output scope
Klein et al. (2014) | Malaysia, US | Malaysian participants preferred more detailed explanations for indeterminate situations; US participants favored simpler ones | Preferences for XAI output complexity
Hall and Hall (1990), Sanchez-Burks et al. (2003), Wurtz (2005), Rau et al. (2009), Wang et al. (2010), Lee and Sabanović (2014) | High-context (e.g., Japan, China, South Korea) and low-context (e.g., US, Germany) cultures | Participants from Western, low-context cultures (communication with low use of non-verbal cues) preferred direct, explicit communication (e.g., “Drink water.”); participants from East Asian, high-context cultures preferred indirect, implicit communication (e.g., “Drinking water may alleviate headaches.”), e.g., in robot recommendations | Preferences for XAI communication style
Nisbett et al. (2001), Norenzayan et al. (2002), Varnum et al. (2010), Henrich et al. (2010), Klein et al. (2018) | Western, East-Asian countries | Western participants displayed more analytic thinking (i.e., rule-based object categorization, context-independent understanding of objects, formal logic in reasoning); East-Asian participants displayed more holistic thinking (i.e., similarity-based object categorization, focus on context, intuition in reasoning) | Preferences for XAI output content (e.g., rule-based vs. example-based)
Otterbring et al. (2022) | East-Asian, US | East-Asian participants preferred abstract figures representing conformity; US participants favored objects representing uniqueness | Preferences for XAI output content and format
Reinecke and Gajos (2014), Alexander et al. (2021) | Incl. Russia, Macedonia, Australia, China, Saudi Arabia | Regarding websites’ visual complexity/design attributes (layout, navigation, etc.), Australian users focused on textual items, whereas Chinese users scanned the whole page | Preferences for XAI output complexity and format
Baughan et al. (2021) | US, Japan | Visual attention differences affected website search: Japanese participants remembered more contextual website information and found it faster than their US counterparts | Preferences for XAI output complexity and format
Van Brummelen et al. (2022) | US, Singapore, Canada, NZ, Indonesia, Iran, Japan, India | Non-WEIRD participants’ perspectives emphasized virtual agent artificiality; WEIRD perspectives emphasized human-likeness | Social embedding can influence perceptions of AI

Table A2: Systematic literature review search strings

Scopus (searched July 2022): TITLE-ABS-KEY ("XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability") AND ("end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer") AND (LIMIT-TO (DOCTYPE,"cp") OR LIMIT-TO (DOCTYPE,"ar") OR LIMIT-TO (DOCTYPE,"ch")) AND (LIMIT-TO (PUBYEAR,2022) OR LIMIT-TO (PUBYEAR,2021) OR LIMIT-TO (PUBYEAR,2020) OR LIMIT-TO (PUBYEAR,2019) OR LIMIT-TO (PUBYEAR,2018) OR LIMIT-TO (PUBYEAR,2017) OR LIMIT-TO (PUBYEAR,2016) OR LIMIT-TO (PUBYEAR,2012)) AND (LIMIT-TO (LANGUAGE,"English"))

Web of Science (searched July 2022): (ALL=("XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability")) AND ALL=("end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer") and Article or Proceedings Papers or Early Access or Book Chapters (Document Types) and English (Languages), refined by all ‘Publication Years’ (2012-01-01 to 2022-12-31)

ArXiv (searched July 2022): Query: order: -announced_date_first; size: 50; date_range: from 2012-01-01 to 2022-12-31; classification: Computer Science (cs); include_cross_list: True; terms: AND all="XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability"; AND all="end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer"

Table A3: Frequency of nationalities/regions in the reviewed XAI studies (country/region: number of studies)

No details: 99; US: 53; UK: 13; Germany: 12; Canada: 9; Europe: 9; Ireland: 7; North America: 6; China: 5; Italy: 5; Australia: 5; South/Latin America: 5; Sweden: 4; India: 4; Netherlands: 4; Asia: 3; Switzerland: 3; Belgium: 2; Finland: 2; France: 2; Brazil: 2; Japan: 2; Norway: 1; Denmark: 1; Iceland: 1; South Korea: 1; New Zealand: 1; Portugal: 1; Africa (unspecified): 1; Americas (unspecified): 1; Rest of the world: 1; Russia: 1; Costa Rica: 1.

Table A4: Numbers of countries/regions in papers with restricted and unrestricted conclusions

Countries or regions | Scope of conclusion: Restricted | Scope of conclusion: Unrestricted | Total
1 | 26 | 56 | 82
2 | 2 | 5 | 7
3 | 2 | 5 | 7
4 | 0 | 3 | 3
5 | 0 | 1 | 1
6 | 2 | 3 | 5
8 | 0 | 1 | 1
19 | 0 | 1 | 1
Total | 32 | 75 | 107

Table A5: Reviews of XAI user study papers per year

Year | Number
2012-2016 | 0
2017 | 1
2018 | 2
2019 | 3
2020 | 6
2021 | 12
Sept 2022 | 10
Total | 34
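As a minimal sketch (not part of the original analysis pipeline), the following Python snippet recomputes a few headline proportions directly from the values reported in Tables A3 and A4 above; the only assumption beyond the tables is using the 206 reviewed user studies mentioned in the Conclusion as the denominator for the "no details" share.

```python
# Recompute headline proportions from Tables A3 and A4 above.
# Table A4: papers by number of countries/regions sampled and conclusion scope.
# Each tuple: (countries_or_regions, restricted, unrestricted)
table_a4 = [
    (1, 26, 56),
    (2, 2, 5),
    (3, 2, 5),
    (4, 0, 3),
    (5, 0, 1),
    (6, 2, 3),
    (8, 0, 1),
    (19, 0, 1),
]

restricted = sum(r for _, r, _ in table_a4)    # 32
unrestricted = sum(u for _, _, u in table_a4)  # 75
total = restricted + unrestricted              # 107

single_country = next(row for row in table_a4 if row[0] == 1)
print(f"Papers with unrestricted conclusions: {unrestricted}/{total} "
      f"({unrestricted / total:.0%})")
print(f"Single-country papers with unrestricted conclusions: "
      f"{single_country[2]}/{single_country[1] + single_country[2]} "
      f"({single_country[2] / (single_country[1] + single_country[2]):.0%})")

# Table A3 (partial): studies reporting no participant nationality/region,
# assuming the 206 reviewed user studies as the denominator.
no_details, n_studies = 99, 206
print(f"Studies without nationality/region details: {no_details}/{n_studies} "
      f"({no_details / n_studies:.0%})")
```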
Table A6: Systematic meta-review search strings

Scopus (searched September 2022): TITLE-ABS-KEY ("XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability") AND ("end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer") AND (LIMIT-TO (PUBYEAR, 2022) OR LIMIT-TO (PUBYEAR, 2021) OR LIMIT-TO (PUBYEAR, 2020) OR LIMIT-TO (PUBYEAR, 2019) OR LIMIT-TO (PUBYEAR, 2018) OR LIMIT-TO (PUBYEAR, 2017) OR LIMIT-TO (PUBYEAR, 2016) OR LIMIT-TO (PUBYEAR, 2012)) AND (LIMIT-TO (DOCTYPE, "re")) AND (LIMIT-TO (LANGUAGE, "English"))

Web of Science (searched September 2022): (ALL=("XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability")) AND ALL=("end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer") and Review Article (Document Types) and English (Languages), refined by all ‘Publication Years’ (2012-01-01 to 2022-12-31)

ArXiv (searched September 2022): Query: order: -announced_date_first; size: 200; date_range: from 2012-01-01 to 2022-12-31; classification: Computer Science (cs); include_cross_list: True; terms: AND all="XAI" OR "Explainable AI" OR "transparent AI" OR "interpretable AI" OR "accountable AI" OR "AI explainability" OR "AI transparency" OR "AI accountability" OR "AI interpretability" OR "model explainability" OR "explainable artificial intelligence" OR "explainable ML" OR "explainable machine learning" OR "algorithmic explicability" OR "algorithmic explainability"; AND all="end user" OR "end-user" OR "audience" OR "consumer" OR "user" OR "user study" OR "user survey" OR "developer"; AND all="review"
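The searches documented in Tables A2 and A6 were run through the databases’ own web interfaces. Purely as an illustration of how part of the arXiv query could be approximated programmatically, the following hedged sketch uses arXiv’s public Atom API; the two phrases in the example cover only a small subset of the full term lists above, and the helper name is our own.

```python
# Illustrative sketch: approximate a small part of the arXiv search via the
# public arXiv Atom API (http://export.arxiv.org/api/query).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def search_arxiv(query: str, max_results: int = 20):
    """Run a search_query against the arXiv API and return entry titles."""
    params = urllib.parse.urlencode({
        "search_query": query,
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
        feed = ET.fromstring(resp.read())
    return [
        entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        for entry in feed.findall("atom:entry", ATOM_NS)
    ]

# Example: papers mentioning "explainable AI" together with "user study".
for title in search_arxiv('all:"explainable AI" AND all:"user study"'):
    print(title)
```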
Table A7: Restricted versions of unrestricted conclusions. The parts in bold (in the original table) are the restricting components.

(1) “Our user study shows that non-experts can analyze our explanations and identify a rich set of concepts within images that are relevant (or irrelevant) to the classification process.” (Schneider & Vlachos, 2023, p. 4196)
Restricted: “Our user study shows that non-expert participants could analyze our explanations and identify a rich set of concepts within images that were relevant (or irrelevant) to the classification process.”

(2) “Our pilot study revealed that users are more interested in solutions to errors than they are in just why the error happened.” (Hald et al., 2021, p. 218)
Restricted: “Our pilot study revealed that users were more interested in solutions to errors than they were in just why the error happened.”

(3) “Our findings demonstrate that humans often fail to trust an AI when they should, but also that humans follow an AI when they should not.” (Schmidt et al., 2020, p. 272)
Restricted: “Our findings demonstrate that participants often failed to trust an AI when they should have trusted it, but also that they followed an AI when they should not have done so.”

(4) “We also found that users understand explanations referring to categorical features more readily than those referring to continuous features.” (Warren et al., 2022, p. 1)
Restricted: “We also found that users understood explanations referring to categorical features more readily than those referring to continuous features.”

(5) “Both experiments show that when people are given case-based explanations, from an implemented ANN-CBR twin system, they perceive miss-classifications to be more correct.” (Ford et al., 2020, p. 1)
Restricted: “Both experiments show that when participants were given case-based explanations, from an implemented ANN-CBR twin system, they perceived miss-classifications to be more correct.”

(6) “Results indicate that human users tend to favor explanations about policy rather than about single actions.” (Waa et al., 2018, p. 1)
Restricted: “Results indicate that participants tended to favor explanations about policy rather than about single actions.”

(7) “Our findings suggest that people do not fully trust algorithms for various reasons, even when they have a better idea of how the algorithm works.” (Cheng et al., 2019, p. 10)
Restricted: “Our findings suggest that people did not fully trust algorithms for various reasons, even when they had a better idea of how the algorithm works.”

References

Abdul, A., Vermeulen, J., Wang, D., Lim, B. Y., & Kankanhalli, M. (2018). Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–18.
Adadi, A., & Berrada, M. (2018). Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.
Akaliyski, P., Welzel, C., Bond, M. H., & Minkov, M. (2021). On “nationology”: The gravitational field of national culture. Journal of Cross-Cultural Psychology, 52(8-9), 771–793.
Akata, Z., Balliet, D., de Rijke, M., Dignum, F., Dignum, V., Eiben, G., Fokkens, A., Grossi, D., Hindriks, K., Hoos, H., Hung, H., Jonker, C., Monz, C., Neerincx, M., Oliehoek, F., Prakken, H., Schlobach, S., van der Gaag, L., van Harmelen, F., & Welling, M. (2020). A research agenda for Hybrid Intelligence: Augmenting human intellect with collaborative, adaptive, responsible, and explainable Artificial Intelligence. Computer, 53, 18–28. https://doi.org/10.1109/MC.2020.2996587
Alexander, R., Thompson, N., McGill, T., & Murray, D. (2021). The influence of user culture on website usability. International Journal of Human-Computer Studies, 154, 102688. https://doi.org/10.1016/j.ijhcs.2021.102688
Al-Zahrani, S., & Kaplowitz, S. (1993). Attributional biases in individualistic and collectivistic cultures. Journal of Personality and Social Psychology, 47, 793–804.
Arrieta, A. B., Díaz-Rodríguez, N., del Ser, J., Bennetot, A., Tabik, S., et al. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J. F., & Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64. https://doi.org/10.1038/s41586-018-0637-6
Baldwin, J. R., Faulkner, S. L., Hecht, M. L., & Lindsley, S. L. (Eds.). (2006). Redefining Culture: Perspectives across the disciplines. New Jersey: Lawrence Erlbaum Associates.
Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. (2019). Beyond accuracy: The role of mental models in human-AI team performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7(1), 2–11. https://doi.org/10.1609/hcomp.v7i1.5285
Banerjee, A., & Chaudhury, S. (2010). Statistics without tears: Populations and samples. Industrial Psychiatry Journal, 19(1), 60–65. https://doi.org/10.4103/0972-6748.77642
Baughan, A., Oliveira, N., August, T., Yamashita, N., & Reinecke, K. (2021). Do cross-cultural differences in visual attention patterns affect search efficiency on websites? Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 362, 1–12.
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J. M., & Eckersley, P. (2019). Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657. https://doi.org/10.1145/3351095.3375624
Budiman, A., & Ruiz, N. G. (2021). Key facts about Asian origin groups in the U.S. Pew Research Center. https://www.pewresearch.org/fact-tank/2021/04/29/key-facts-about-asian-origin-groups-in-the-u-s/
Burrell, J. (2016). How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data & Society. https://doi.org/10.1177/2053951715622512
Carman, M., & Rosman, B. (2021). Applying a principle of explicability to AI research in Africa: Should we do it? Ethics and Information Technology, 23(2), 107–117.
Casey, B., Farhangi, A., & Vogl, R. (2019). Rethinking explainable machines: The GDPR’s ‘Right to Explanation’ debate and the rise of algorithmic audits in enterprise. Berkeley Technology Law Journal, 34(1), 143–188.
Cazañas, A., de San Miguel, A., & Parra, E. (2017). Estimating sample size for usability testing. Enfoque UTE, 8(1), 172–185.
Cha, J.-H., & Nam, K. D. (1985). A test of Kelley’s cube theory of attribution: A cross-cultural replication of McArthur’s study. Korean Social Science Journal, 12, 151–180.
Cheon, B. K., Melani, I., & Hong, Y. (2020). How USA-centric is psychology? An archival study of implicit assumptions of generalizability of findings to human nature based on origins of study samples. Social Psychological and Personality Science, 11(2), 928–937.
Chen, L., Ning, H., Nugent, C., & Yu, Z. (2020). Hybrid Human-Artificial Intelligence. IEEE Computer, 53(8), 14–17.
Cheng, H.-F., Wang, R., Zhang, Z., O’Connell, F., Gray, T., Harper, F., & Zhu, H. (2019). Explaining decision-making algorithms through UI: Strategies to help non-expert stakeholders. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3290605.3300789
Chou, Y., Moreira, C., Bruza, P., Ouyang, C., & Jorge, J. (2022). Counterfactuals and causability in explainable Artificial Intelligence: Theory, algorithms, and applications. Information Fusion, 81, 59–83.
Choi, I., & Nisbett, R. E. (1998). Situational salience and cultural differences in the correspondence bias and actor-observer bias. Personality and Social Psychology Bulletin, 24, 949–960.
Choi, I., Dalal, R., Kim-Prieto, C., & Park, H. (2003). Culture and judgment of causal relevance. Journal of Personality and Social Psychology, 84(1), 46–59.
De Graaf, M. M., & Malle, B. F. (2017). How people explain action (and autonomous intelligent systems should too). AAAI Fall Symposium Series, 19–26.
Ding, W., Abdel-Basset, M., Hawash, H., & Ali, A. (2022). Explainability of Artificial Intelligence methods, applications and challenges: A comprehensive survey. Information Sciences, 615, 238–292.
Ehsan, U., Liao, Q. V., Muller, M., Riedl, M. O., & Weisz, J. D. (2021). Expanding explainability: Towards social transparency in AI systems. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 82, 1–19.
Fatehi, K., Priestley, J. L., & Taasoobshirazi, G. (2020). The expanded view of individualism and collectivism: One, two, or four dimensions? International Journal of Cross Cultural Management, 20(1), 7–24. https://doi.org/10.1177/1470595820913077
Fiske, A. P., Kitayama, S., Markus, H. R., & Nisbett, R. E. (1998). The cultural matrix of social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The Handbook of Social Psychology (pp. 915–981). McGraw-Hill.
Ford, C., Kenny, E. M., & Keane, M. T. (2020). Play MNIST for me! User studies on the effects of post-hoc, example-based explanations & error rates on debugging a deep learning, black-box classifier. arXiv, abs/2009.06349.
Gerlings, J., Shollo, A., & Constantiou, I. D. (2020). Reviewing the need for Explainable Artificial Intelligence (XAI). Proceedings of the 54th Hawaii International Conference on System Sciences. https://arxiv.org/pdf/2012.01007.pdf
Ghai, S. (2021). It’s time to reimagine sample diversity and retire the WEIRD dichotomy. Nature Human Behaviour, 5(8), 971–972. https://doi.org/10.1038/s41562-021-01175-9
Hald, K., Weitz, K., André, E., & Rehm, M. (2021). “An error occurred!” - Trust repair with virtual robot using levels of mistake explanation. Proceedings of the 9th International Conference on Human-Agent Interaction (HAI '21), 218–226. https://doi.org/10.1145/3472307.3484170
Hall, E. T., & Hall, M. R. (1990). Understanding Cultural Differences. Yarmouth, ME: Intercultural Press Inc.
Hampton, R. S., & Varnum, M. E. W. (2020). Individualism-Collectivism. In V. Zeigler-Hill & T. K. Shackelford (Eds.), Encyclopedia of Personality and Individual Differences. Springer, Cham.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–135.
Hofstede, G. (1980). Culture and organizations. International Studies of Management & Organization, 10(4), 15–41. https://doi.org/10.1080/00208825.1980.11656300
Howell, S. (1981). Rules not words. In P. Heelas & A. Lock (Eds.), Indigenous Psychologies. London: Academic Press.
Islam, M. R., Ahmed, M. U., Barua, S., & Begum, S. (2022). A systematic review of Explainable Artificial Intelligence in terms of different application domains and tasks. Applied Sciences, 12(3), 1353. http://dx.doi.org/10.3390/app12031353
Keith, M., & Harms, P. (2016). Is Mechanical Turk the answer to our sampling woes? Industrial and Organizational Psychology, 9(1), 162–167.
Klein, G., Rasmussen, L., Lin, M. H., Hoffman, R. R., & Case, J. (2014). Influencing preferences for different types of causal explanation of complex events. Human Factors, 56(8), 1380–1400.
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., . . . Nosek, B. A. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1, 443–490.
Laato, S., Tiainen, M., Islam, A. K. M. N., & Mäntymäki, M. (2022). How to explain AI systems to end users: A systematic literature review and research agenda. Internet Research. https://doi.org/10.1108/INTR-08-2021-0600
Lavelle, J. (2021). The impact of culture on mindreading. Synthese, 198. https://doi.org/10.1007/s11229-019-02466-5
Lebra, T. S. (1993). Culture, self, and communication in Japan and the United States. In W. Gudykunst (Ed.), Communication in Japan and the United States (pp. 51–87). Albany, NY: State University of New York Press.
Lee, F., Hallahan, M., & Herzog, T. (1996). Explaining real-life events: How culture and domain shape attributions. Personality and Social Psychology Bulletin, 22, 732–741.
Lee, H. R., & Sabanović, S. (2014). Culturally variable preferences for robot design and use in South Korea, Turkey, and the United States. Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction (HRI '14), 17–24. https://doi.org/10.1145/2559636.2559676
Lillard, A. (1998). Ethnopsychologies: Cultural variations in theories of mind. Psychological Bulletin, 123(1), 3–32.
Linxen, S., Sturm, C., Brühlmann, F., Cassau, V., Opwis, K., & Reinecke, K. (2021). How WEIRD is CHI? Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 143, 1–14.
Liu, N., Kirshner, S., & Lim, E. (2023). Is algorithm aversion WEIRD? A cross-country comparison of individual-differences and algorithm aversion. Journal of Retailing and Consumer Services, 72, 103259. https://doi.org/10.1016/j.jretconser.2023.103259
Masuda, T., & Nisbett, R. E. (2001). Attending holistically versus analytically: Comparing the context sensitivity of Japanese and Americans. Journal of Personality and Social Psychology, 81(5), 922–934.
Matsumoto, D. (1996). Culture and Psychology. Pacific Grove, CA: Brooks/Cole.
McSweeney, B. (2002). Hofstede’s model of national cultural differences and their consequences: A triumph of faith - a failure of analysis. Human Relations, 55(1), 89–118. https://doi.org/10.1177/0018726702551004
Miller, J. (1984). Culture and the development of everyday social explanation. Journal of Personality and Social Psychology, 46(5), 961–978.
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38.
Min, C., Zhao, Y., Bu, Y., Ding, Y., & Wagner, C. S. (2023). Has China caught up to the US in AI research? An exploration of mimetic isomorphism as a model for late industrializers. arXiv, abs/2307.10198.
Mueller, S. T., Hoffman, R. R., Clancey, W. J., Emrey, A., & Klein, G. (2019). Explanation in human-AI systems: A literature meta-review, synopsis of key ideas and publications, and bibliography for explainable AI. arXiv preprint arXiv:1902.01876.
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & PRISMA Group (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine, 6(7), e1000097. https://doi.org/10.1371/journal.pmed.1000097
Morris, M. W., & Peng, K. (1994). Culture and cause: American and Chinese attributions for social and physical events. Journal of Personality and Social Psychology, 67, 949–971.
Nisbett, R. E., Peng, K., Choi, I., & Norenzayan, A. (2001). Culture and systems of thought: Holistic versus analytic cognition. Psychological Review, 108(2), 291–310.
Norenzayan, A., Smith, E. E., Kim, B. J., & Nisbett, R. E. (2002). Cultural preferences for formal versus intuitive reasoning. Cognitive Science, 26, 653–684.
Okolo, C., Dell, N., & Vashistha, A. (2022). Making AI explainable in the Global South: A systematic review. Proceedings of the 5th ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS '22), 439–452. https://doi.org/10.1145/3530190.3534802
Ooge, J., Kato, S., & Verbert, K. (2022). Explaining recommendations in E-Learning: Effects on adolescents’ trust. 27th International Conference on Intelligent User Interfaces (IUI '22), 93–105.
Otterbring, T., Bhatnagar, R., & Folwarczny, M. (2022). Selecting the special or choosing the common? A high-powered conceptual replication of Kim and Markus’ (1999) pen study. The Journal of Social Psychology, 1–7. Advance online publication. https://doi.org/10.1080/00224545.2022.2036670
Oyserman, D., Coon, H. M., & Kemmelmeier, M. (2002). Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin, 128(1), 3–72.
Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond the Turk: Alternative platforms for crowdsourcing behavioral research. Journal of Experimental Social Psychology, 70, 153–163.
Pelham, B., Hardin, C., Murray, D., Shimizu, M., & Vandello, J. (2022). A truly global, non-WEIRD examination of collectivism: The Global Collectivism Index (GCI). Current Research in Ecological and Social Psychology, 3. https://doi.org/10.1016/j.cresp.2021.100030
Peters, U., Krauss, A., & Braganza, O. (2022). Generalization bias in science. Cognitive Science, 46, e13188. https://doi.org/10.1111/cogs.13188
Peters, U., & Lemeire, O. (2023). Hasty generalizations are pervasive in experimental philosophy: A systematic analysis. Philosophy of Science, 1–29. https://doi.org/10.1017/psa.2023.109
Peters, U., & Carman, M. (2023). Unjustified sample sizes and generalizations in Explainable AI research: Principles for more inclusive user studies. IEEE Intelligent Systems, 38(6), 52–60. https://arxiv.org/pdf/2305.09477.pdf
Rad, M. S., Martingano, A. J., & Ginges, J. (2018). Toward a psychology of homo sapiens: Making psychological science more representative of the human population. Proceedings of the National Academy of Sciences of the United States of America, 115(45), 11401–11405.
Ramírez-Castañeda, V. (2020). Disadvantages in preparing and publishing scientific papers caused by the dominance of the English language in science: The case of Colombian researchers in biological sciences. PloS One, 15(9), e0238372. https://doi.org/10.1371/journal.pone.0238372
Rau, P., Li, Y., & Li, D. (2009). Effects of communication style and culture on ability to accept recommendations from robots. Computers in Human Behavior, 25(2), 587–595.
Reinecke, K., & Gajos, K. Z. (2014). Quantifying visual preferences around the world. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '14, 11–20.
Reinecke, K., & Gajos, K. Z. (2015). LabintheWild: Conducting large-scale online experiments with uncompensated samples. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 1364–1378.
Robbins, J., & Rumsey, A. (2008). Introduction: Cultural and linguistic anthropology and the opacity of other minds. Anthropological Quarterly, 81(2), 407–420.
Rong, Y., Leemann, T., Nguyen, T.-T., Fiedler, L., Seidel, T., Kasneci, G., & Kasneci, E. (2022). Towards human-centered Explainable AI: User studies for model explanations. arXiv, arXiv:2210.11584.
Sanchez-Burks, J., Lee, F., Choi, I., Nisbett, R., Zhao, S., & Koo, J. (2003). Conversing across cultures: East-West communication styles in work and nonwork contexts. Journal of Personality and Social Psychology, 85(2), 363–372.
Sawaya, Y., Sharif, M., Christin, N., Kubota, A., Nakarai, A., & Yamada, A. (2017). Self-confidence trumps knowledge: A cross-cultural study of security behavior. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2202–2214.
Schmidt, P., Biessmann, F., & Teubner, T. (2020). Transparency and trust in Artificial Intelligence systems. Journal of Decision Systems, 29(4), 260–278. https://doi.org/10.1080/12460125.2020.1819094
Schneider, J., & Vlachos, M. (2023). Explaining classifiers by constructing familiar concepts. Machine Learning, 112, 4167–4200. https://doi.org/10.1007/s10994-022-06157-0
Seaborn, K., Barbareschi, G., & Chandra, S. (2023). Not only WEIRD but “uncanny”? A systematic review of diversity in Human-Robot Interaction research. International Journal of Social Robotics, 1–30. https://doi.org/10.1007/s12369-023-00968-4
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123–1128.
Sperrle, F., El-Assady, M., Guo, G., Chau, D., Endert, A., & Keim, D. A. (2020). Should we trust (X)AI? Design dimensions for structured experimental evaluations. arXiv, abs/2009.06433.
Taras, V., Rowney, J., & Steel, P. (2009). Half a century of measuring culture: Review of approaches, challenges, and limitations based on the analysis of 121 instruments for quantifying culture. Journal of International Management, 15, 357–373. https://doi.org/10.1016/j.intman.2008.08.005
Taras, V., Steel, P., & Kirkman, B. L. (2016). Does country equate with culture? Beyond geography in the search for cultural boundaries. Management International Review, 56, 455–487.
Taylor, J., & Taylor, G. W. (2021). Artificial cognition: How experimental psychology can help generate explainable artificial intelligence. Psychonomic Bulletin & Review, 28(2), 454–475.
Triandis, H. C. (1995). Individualism and Collectivism. Boulder, CO: Westview Press.
Ur, B., & Wang, Y. (2013). A cross-cultural framework for protecting user privacy in online social media. Proceedings of the 22nd International Conference on World Wide Web, 755–762. https://doi.org/10.1145/2487788.2488037
Van Brummelen, J., Kelleher, M., Tian, M. C., & Nguyen, N. H. (2022). What do WEIRD and non-WEIRD conversational agent users want and perceive? Towards transparent, trustworthy, democratized agents. arXiv, arXiv:2209.07862.
Varnum, M. E., Grossmann, I., Kitayama, S., & Nisbett, R. E. (2010). The origin of cultural differences in cognition: Evidence for the social orientation hypothesis. Current Directions in Psychological Science, 19(1), 9–13.
Waa, J. V., Diggelen, J. V., Bosch, K. V., & Neerincx, M. A. (2018). Contrastive explanations for reinforcement learning in terms of expected consequences. arXiv, abs/1807.08706.
Waa, J. v. d., Schoonderwoerd, T., van Diggelen, J., & Neerincx, M. (2020). Interpretable confidence measures for decision support systems. International Journal of Human-Computer Studies, 144. https://doi.org/10.1016/j.ijhcs.2020.102493
Wang, L., Rau, P. P., Evers, V., Robinson, B. K., & Hinds, P. (2010). When in Rome: The role of culture and context in adherence to robot recommendations. Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI '10), 359–366.
Wang, X., & Yin, M. (2021). Are explanations helpful? A comparative study of the effects of explanations in AI-assisted decision-making. 26th International Conference on Intelligent User Interfaces (IUI '21), 318–328. https://doi.org/10.1145/3397481.3450650
Wang, G., Qi, P., Zhang, Y., & Zhang, M. (2021). What have we learned from OpenReview? arXiv. https://doi.org/10.48550/arxiv.2103.05885
Warren, G., Keane, M. T., & Byrne, R. M. (2022). Features of explainability: How users understand counterfactual and causal explanations for categorical and continuous features in XAI. arXiv, abs/2204.10152.
Wierzbicka, A. (1992). Semantics, Culture, and Cognition. Oxford: Oxford University Press.
Wurtz, E. (2005). Intercultural communication on web sites: A cross-cultural analysis of web sites from high-context cultures and low-context cultures. Journal of Computer-Mediated Communication, 11, 274–299.
Yang, F., Huang, Z., Scholtz, J., & Arendt, D. L. (2020). How do visual explanations foster end users’ appropriate trust in machine learning? Proceedings of the 25th International Conference on Intelligent User Interfaces (IUI '20), 189–201.
Yilmaz, O., & Alper, S. (2019). The link between intuitive thinking and social conservatism is stronger in WEIRD societies. Judgment and Decision Making, 14(2), 156–169. https://doi.org/10.1017/S1930297500003399
Zerilli, J., Knott, A., Maclaurin, J., & Gavaghan, C. (2019). Transparency in algorithmic and human decision-making: Is there a double standard? Philosophy & Technology, 661–683. https://doi.org/10.1007/s13347-018-0330-6
Zerilli, J. (2022). Explaining machine learning decisions. Philosophy of Science, 89, 1–19.
Zerilli, J., Bhatt, U., & Weller, A. (2022). How transparency modulates trust in artificial intelligence. Patterns, 3(4), 100455. https://doi.org/10.1016/j.patter.2022.100455
Zhang, Q., Lee, M. L., & Carter, S. (2022). You complete me: Human-AI teams and complementary expertise. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 114, 1–28.