
Partnership with GIX Brings Society + Technology Affiliates to Bellevue

Society + Technology at UW has partnered with the University of Washington’s Global Innovation Exchange (GIX) to offer a master’s-level course in the Master of Science in Technology Innovation program (MSTI 522): The History and Future of Technology: Responsible Technological Innovation.

GIX is a joint initiative between the UW College of Engineering and the Foster School of Business. The 18-month graduate program in technology innovation emphasizes practical, challenge-based learning across engineering, business, and design, and is housed in UW’s outpost in Bellevue.

The Instructor of Record, Monika Sengul-Jones, Ph.D., Director of Strategy and Operations for Society + Technology at UW, centers the interdisciplinary field of Science and Technology Studies (STS) to explore the historical, social, and cultural dimensions of technological innovation. STS approaches consider technology and innovation as socio-technical and cultural accomplishments that are both informed by and inform social structures of power. Throughout the course, students will cultivate responsible sensibilities as stewards of the social and societal impacts of emerging technologies.

As part of the collaboration with Society + Technology at UW, GIX students will also learn from the tri-campus, cutting-edge network of scholars affiliated with the initiative. Five experts will deliver guest lectures on the relationship between technology and society to this cohort of master’s students. The speaker series also includes a seasoned GIX guest lecturer who works at SAP and will offer insights on responsibility and inclusiveness in enterprise software products. Dina Chawla, a graduate student in the Department of Human Centered Design and Engineering at UW Seattle, supports the class as a Reader/Grader.

Since 2017, this course has been developed and led by Linda Wagner, David Ribes, and Amanda Menking. In its current iteration, The History and Future of Technology is not only a keystone learning experience for GIX students to explore the historical, philosophical, and cultural foundations of innovation and technology, but also an unparalleled opportunity to learn from UW’s extensive network of scholars working at this vital intersection.

Guest Speakers

(In order of appearance)

Muhammad Aurangzeb Ahmad

Topic: AI Agents and Responsibilities
Title: Second Voice, First Person: AI Surrogates and Digital Doppelgangers

Muhammad Aurangzeb Ahmad is a Research Scientist at the University of Washington’s Harborview Medical Center and an Affiliate Assistant Professor in the Division of Computing and Software Systems at the University of Washington, Bothell. He earned his Ph.D. in computer science from the University of Minnesota. His research focuses on artificial intelligence, algorithmic nudging (using algorithms to change human behavior), and personality emulation (software that can act like humans). Ahmad thinks extensively about the social, cultural, and ethical impact of AI and machine learning. His research has been covered by PBS and Discover Magazine, and he has spoken on a panel for the United Nations, among other venues.

Ellie Kemery

Topic: Inclusion and Responsibility in Enterprise Software UX Research

Ellie Kemery is Principal AI User Research Lead for SAP Business AI and is a frequent guest speaker at GIX. Her work seeks to establish a culture of ethical research and design practices across SAP, proactively informing how teams build intelligent product experiences that all people love. She has worked with or for companies and organizations including Microsoft, IxDA, Design in Public, and Brooks Running. She has a degree in Human Behavior and Entrepreneurship from the UW’s Foster School of Business.

Katy E. Pearce

Topic: Privacy, Technology, and Governance

Katy E. Pearce is an Associate Professor in the Department of Communication at the University of Washington and holds affiliations with the Ellison Center for Russian, East European, and Central Asian Studies and the Center for an Informed Public. She is an expert in social and political uses of technologies and digital content in the transitioning democracies and semi-authoritarian states of the South Caucasus and Central Asia, primarily Armenia and Azerbaijan. The main focus of her research is the adoption and use of information and communication technologies (ICTs) in diverse cultural, economic, and political contexts, mainly authoritarian post-Soviet states. On the adoption side, Pearce looks at barriers to use, which are often socioeconomic but sometimes political or cultural. On the use side, she studies outcomes such as decreasing or increasing inequality due to ICTs, cosmopolitanism, capital enhancement, civic engagement, demand for democracy, and social activism. Methodologically, most of her earlier work is quantitative modeling, while much of her more recent work is qualitative or mixed methods.

Alexis Hiniker

Topic: Habits by Design: Research and Ethics in Human-Computer Interaction

Alexis Hiniker is an Associate Professor in the Information School at the University of Washington and Director of the User Empowerment Lab. Through her work in human-computer interaction and ubiquitous computing, she investigates the ways in which everyday technologies make life worse for their users. Hiniker combines user-centered design methods with theory from a variety of disciplines to design, implement and evaluate new technical systems. Her current projects focus on compulsive technology use, dark patterns, voice interfaces, and arguments online. She has a Ph.D. in Human Centered Design and Engineering from the University of Washington, an M.A. in Learning, Design and Technology from Stanford University, and an A.B. in Computer Science from Harvard University.

Anissa Tanweer

Topic: Data Science and Ethics in Action

Anissa Tanweer is a Senior Social Scientist at the eScience Institute, an Affiliate Faculty member in the Department of Communication, and a sociotechnical expert for the Scientific Software Engineering Center (SSEC). She conducts ethnographic research on the practice and culture of computationally mediated science and applies a sociotechnical lens to the design and implementation of training programs in data-intensive academic research. Tanweer directs the UW Data Science for Social Good summer internship and ran the Data Science Studies Special Interest Group at UW from 2018-2021. Tanweer earned a Ph.D. in Communication from the University of Washington. She has published her research on topics such as ethics and data science in journals such as Social Studies of Science, Big Data & Society, and Harvard Data Science Review.

[Conversations] What Does Consent Mean in the Age of Large Language Models (LLMs)?


Transcript

Monika Sengul-Jones

As a computational linguist, why are you reluctant to have the audio recording of our conversation available or streamed on the Internet?

Angelina McMillan-Major

I’m concerned about my data being out there on the [open] internet, available to crawlers. Large language models (LLMs), as well as other generative or machine learning models, are trained using data scraped from the internet. Oftentimes, it’s collected using automated systems that crawl domains such as Wikipedia[’s corpus] going from link to link.

My data, my voice data, is called PII, personally identifiable information. It’s [among] the high-risk types of data because it’s uniquely identifying. 

I’m concerned about having my PII out in the wild, where automated systems can gather my PII and throw it into a model and use it as they will.

It’s also that personal data is pervasively undervalued. From the industry perspective, ‘data goes in’ and the product is the model, the output. So I’m concerned about our individual data rights and what can be learned about us, as people, through [our] personal data.

Monika Sengul-Jones

It’s funny that the word “data” can be used to describe something so personally unique—the sound of your voice.

Angelina McMillan-Major

Yeah, your voice is conceptualized as a pattern, [as data] it becomes frequencies. What’s important, or desirable, isn’t just the content of what’s spoken—it’s your voice frequencies and what sort of words you use.

Monika Sengul-Jones

Is it accurate to say, from a privacy perspective, that you’re concerned about your sensory—vocal, in this case—fingerprints? That we need protection for something that is unintentionally created and possessed, and therefore given away without our realizing or consenting?

Angelina McMillan-Major

Yes.

Monika Sengul-Jones

Let’s talk more about your work as a computational linguist. You’ve presented research on the history of computation and language, and how the same word—artificial intelligence—is used to describe different technologies. For instance, we have the ELIZA chatbot (an early natural language processing computer program developed from 1964 to 1967 at MIT) in the mid-century, which was cutting-edge AI. Today, ELIZA is pretty basic. Tell us more about why this history is important to know.

Angelina McMillan-Major

It’s a good question. Well, chatbots like ELIZA used shallow processing. It was N-gram language modeling.

Monika Sengul-Jones

Can you explain an N-gram?

Angelina McMillan-Major

They work by making a statistical prediction of what text will come next—sort of like an ‘auto-complete’ that isn’t very good.

“N” refers to the number of grams, of consecutive words or tokens. So “the cat” is a bi-gram. “The cat meows” is a tri-gram. The more words you add, the higher the n, and the less frequent that exact sequence is. The phrase “the cat meows in the tree,” that’s not going to happen often [in some given text data].

You look at the probability of what word might come next—that was the state-of-the-art AI. But at a certain point, there’s a limit to how natural an N-gram will sound. 

Then neural networks became popular; they sounded more natural, and the probability space was more fluid.
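
The N-gram prediction she describes can be sketched in a few lines of Python (an illustrative toy, not code from the interview, with a made-up corpus): count how often each word follows another, then predict the most frequent continuation.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each word, how often every other word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Predict the most frequent continuation of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

corpus = "the cat meows and the cat sleeps and the dog barks".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # → cat ("the cat" appears twice, "the dog" once)
```

A real N-gram model would smooth these counts and handle longer histories, but the limitation Angelina notes is visible even here: the model can only reproduce sequences it has counted.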

Monika Sengul-Jones

How are neural networks different from N-grams?

Angelina McMillan-Major

A neural network is fundamentally based on an algorithm called the perceptron. This is a specific mathematical formula based on linear algebra that models language as a network [of nodes]. So [with neural networks] you go from the probability-statistics space to linear algebra. That shifts what sort of things you can do to smooth low probabilities [in language prediction], as well as create randomization to allow for more fluid, unique patterns that aren’t necessarily directly in the training data.
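
As an illustration of the algorithm she names (a minimal sketch on toy data, not the formulation used in modern language models), the classic perceptron learns a linear decision boundary by nudging its weights toward every example it misclassifies:

```python
def perceptron_predict(w, b, x):
    """Classify x as +1 or -1 using the learned weights and bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def perceptron_train(samples, epochs=10, lr=1.0):
    """Adjust the weights on each misclassified example until the boundary fits."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in samples:
            if perceptron_predict(w, b, x) != y:  # wrong side of the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy linearly separable data: label is +1 only when both inputs are 1 (AND).
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = perceptron_train(data)
```

Stacking many such units and training them jointly is, loosely, what turns this 1940s-era idea into the neural networks of the 2010s.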

Angelina McMillan-Major, PhD, is a computational linguist in the UW’s Language Learning Center where she focuses on methodologies for language documentation and reclamation, specifically endangered languages. Photo credit: Russell Hugo, 2024

Monika Sengul-Jones

I have to mention, just the word ‘perceptron’ sounds cool. Were these developed around the same time as the N-gram? Or did one follow the other?

Angelina McMillan-Major

An early perceptron version of a neural network was also developed back in the 1940s.

Monika Sengul-Jones

Before ELIZA.

Angelina McMillan-Major

Yes, however, in the ’40s, computational linguistics had multiple theories, but it wasn’t until we had the personal computer and then the internet with enough data and hardware that we could actually implement these theories. So there were versions of neural networks in the ’90s, but they didn’t take off until the 2010s.

Monika Sengul-Jones

That was our “big data” moment. So, in this brief history of artificial intelligence as it pertains to language, where do large language models (LLMs) come in?

Angelina McMillan-Major

At the end of the neural network period (in 2017). Most people are familiar with LLMs that use a particular type of architecture, the transformer model. This is what ChatGPT is based on. Compared to other neural networks, LLMs using a transformer [architecture] are extremely data-intensive, using billions of tokens.

Monika Sengul-Jones

Let’s go back to my first question: what is at stake for people when we call all these different technologies “artificial intelligence”?

Angelina McMillan-Major

We’re seeing models used for decision-making, like determining credit scores, and we know these outputs are biased, but it’s not transparent within the model itself. We don’t have the opportunity to see—“Oh, my credit score was decided because this model output a .6 or something”—and what that means internally.

Monika Sengul-Jones

I know this black boxing causes real harm to people. We deserve transparency on how decisions are made. But also, if people use these models for decision-making, if people are relieved of decision fatigue, are you worried people are going to get stupider?

Angelina McMillan-Major

I hope not.

Monika Sengul-Jones

That’s a relief!

Angelina McMillan-Major

I’m less concerned about the loss of critical thinking skills and more about people willingly giving up rights to our personally identifiable information (PII) in exchange for ease. 

Monika Sengul-Jones

In exchange for ease, yeah. And then your PII could be used against you, I suppose.

Angelina McMillan-Major

I worry about the normalization of this exchange in society. I want society to be aware that the exchange is the centralization of power into a small number of big companies.

Monika Sengul-Jones

Big in reach, small in number.

Angelina McMillan-Major

It doesn’t necessarily have to be that way.

Monika Sengul-Jones

Let’s talk about how else it could be. In your research, you’ve been developing best practices for research with communities, such as those who speak endangered languages. In North America, Indigenous communities, for instance. For anyone concerned about privacy, about the integrity of their personally identifiable data, who wants to document their language and to protect their data, what’s your approach?

Angelina McMillan-Major advocates for a consent-based model of technology, drawing from the bodily consent literature. She recommends checking out the Consentful Tech Project to learn more. Image: Screengrab from Consentful Tech Project, 2024.

Angelina McMillan-Major

Collection, maintenance, and controlling access—these are huge priorities.

Most people are familiar with participation in data-gathering as something you can opt in or out of. When the opt-out model is used [as the default], it’s not consent, since people may not be aware that removing themselves is an option.

When you’re working with a community, the process is [and should be] different. There are archives that will hold this data. And usually, there are intimate processes. You go to a specific family, for example, whose ancestor has recorded something. You get permission from that family, you specifically ask to use their recording in research. You explain the forms you’ll be using it in, what will be shared, what the outcomes will be, and how you’ll be giving back and reciprocating with the community.

Monika Sengul-Jones

So you’re thinking about computational linguistics, in this process, as co-created partnerships of reciprocity.

Angelina McMillan-Major

Yes. Additionally, the person asking for consent carries the burden of providing as much information as possible. They need to ensure there’s some sort of understanding on the other end. This is distinct from the way that most of us just go through the terms agreement and click accept.

Monika Sengul-Jones

I just do what I need to do to move on. Those modal interruptions are the worst.

Angelina McMillan-Major

Yeah. So that’s not informed consent. That’s as-quickly-as-possible consent.

Monika Sengul-Jones

You have an acronym you use to understand consent in your work: freely given, reversible, informed, enthusiastic, and specific. FRIES consent. That’s really nice.

Angelina McMillan-Major

Yeah, that’s drawing from the bodily consent literature.

Monika Sengul-Jones

Right, and it brings us back to the beginning of our conversation, thinking about our personally identifiable information (PII) as intimate data, as an important part of us and deserving of protection. Our PII body.

Angelina McMillan-Major

Yeah. However, one of the concepts that we don’t have a technical analogy for yet is “reversible.” Once you give your agreement, you can’t take back your data. That’s not necessarily the case in Europe, with the General Data Protection Regulation (GDPR). But that’s a problem with our current LLMs. It’s hard to take out data because it’s built into the model.

Monika Sengul-Jones

Right. I like to think about how reversal might work with, for example, the Authors Guild class action lawsuit against OpenAI. Let’s say the authors win. How could the books be removed from OpenAI’s GPT models to, for instance, prevent the generation of works that closely resemble the copyrighted works that should be withdrawn? The litigation raises an important question for copyright law: the books are not copied or saved on the servers or directly used to generate responses to queries; rather, there are cases of overfitting. We’ll see how the courts rule, but in the event the authors win, how will whatever those books helped create be removed?

Angelina McMillan-Major

Well [the books as] data, sort of, are the weights. The actual numbers that are calculated from them form the body of the model. How do you tie a specific data instance to the weights that are spread across a giant billion-parameter model? That’s hard to do.

Monika Sengul-Jones

When I hear things like this, it reminds me of people saying, ‘You can’t put the genie back in the bottle.’ But is it impossible? It seems more of a political and labor question.

Angelina McMillan-Major

I think people are trying. I’ll say that. I’m not convinced.

Monika Sengul-Jones

You’re not convinced?

Angelina McMillan-Major

I mean, I just don’t know how you would do it, from a theoretical perspective.

Monika Sengul-Jones

But if people didn’t give consent to have their data used, and yet it was, and it became the foundation of the model, then won’t we need to figure out how to remove parts?

Angelina McMillan-Major

Well, there’s the remove-the-whole-thing option. It’s the remove parts that people are trying their best to work on.

Monika Sengul-Jones

Before we end, I want to ask you about another intervention you’ve made in your work with the Tech Policy Lab: the concept of “data statements,” metadata attached to datasets. Tell us about data statements. What do you want people concerned about data and privacy to know?

Cover of “A Guide for Creating and Documenting Language Datasets with Data Statements, Schema Version 3” (2024) by Angelina McMillan-Major and Emily M. Bender.
Data Statements Guide by Angelina McMillan-Major & Emily Bender (2024); report design by Elias Greendorfer.

Angelina McMillan-Major

[Data Statements] was started by Batya Friedman and Emily Bender, who were asking, ‘How can we help people make more informed decisions about selecting data for the models they are going to use, and for the systems those models are embedded in?’ Data statements help make sure a dataset is appropriate for the use case. The behavior of a model is so tied to the data it’s trained on that you don’t want to use, for example, a model trained only on English data for some other language, something as simple as that. Data statements are guides.
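
The idea can be pictured with a small sketch. This is entirely illustrative: the field names loosely paraphrase headings from the published schema, the example values are invented, and the `fits_use_case` helper is hypothetical, not part of any released tool. A data statement travels with a dataset as structured metadata, and a practitioner (today, a human reader) checks it against the intended use:

```python
# Hypothetical data statement for an invented dataset, as structured metadata.
data_statement = {
    "curation_rationale": "Transcribed radio interviews collected for speech research",
    "language_varieties": ["en-US"],
    "speaker_demographics": "Adult volunteers; reported in aggregate",
    "speech_situation": "Scripted studio interviews, 2020-2023",
    "text_characteristics": "Conversational register, broadcast domain",
}

def fits_use_case(statement, required_varieties):
    """Return True only if every language variety the use case needs is documented."""
    documented = set(statement.get("language_varieties", []))
    return set(required_varieties) <= documented

print(fits_use_case(data_statement, ["en-US"]))  # True: documented variety matches
print(fits_use_case(data_statement, ["fr-FR"]))  # False: wrong language for this data
```

As the exchange below notes, current data statements are written for human decision-making; a machine-checkable version like this remains a possible direction, not an existing standard.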

Monika Sengul-Jones

I started to think of our conversation about crawlers on the internet just going, eat, eat, eat, like a little Pac-Man. Then they run into something like a data statement and it’s like, “nope!” can’t pass, it’s not right for what I need! I don’t know [laughter] I just…I liked that visual for my understanding of data statements. Is that an accurate description?

Angelina McMillan-Major

I hope so someday! [Laughter] The existing versions of data statements are designed for human decision making, but maybe further research will result in machine-readable versions.

Transcription by Mollie Chehab
Editing by Monika Sengul-Jones
Graphic of Data Statements Guide by Elias Greendorfer
Image Credit: Portrait of Angelina McMillan-Major (2024) by Russell Hugo of the Language Learning Center

Related Links

Consentful Tech Project

Tech Policy Lab’s Data Statements Project

McMillan-Major, A., Bender, E. M., & Friedman, B. (2024). Data Statements: From Technical Concept to Community Practice. ACM Journal on Responsible Computing, 1(1), 1–17. https://doi.org/10.1145/3594737

McMillan-Major, A., et al. (2024). Documenting Geographically and Contextually Diverse Language Data Sources. Northern European Journal of Language Technology, 10(1). https://doi.org/10.3384/nejlt.2000-1533.2024.5217