Behavioral researchers have begun to explore whether large language models (LLMs) such as OpenAI’s GPT (generative pre-trained transformer) can be used to create “synthetic” research participants—artificial agents that respond to surveys in a manner similar to humans. Studies have found that such synthetic participants can indeed mimic human decisions and respond much like their human counterparts, even replicating previous research findings. This raises the question: Could artificial intelligence (AI) models replace humans in testing behavioral policy interventions?
To date, research has focused primarily on Western countries, with limited participation from the Middle East and North Africa (MENA) region. To study the accuracy of synthetic participants across contexts, we examined the similarity between human and synthetic participants from samples in three countries—Saudi Arabia, the United Arab Emirates (UAE), and the U.S.—in three policy domains: sustainability, financial literacy, and female labor force participation. Across these domains, we assessed attitudes about policies and measured the impact of several interventions on self-reported behaviors from both human and synthetic participants.
Our study proceeded as follows:

1. Design a questionnaire with behavioral and attitudinal questions across three policy areas: labor market, financial literacy, and sustainability.
2. Recruit participants in Saudi Arabia, the United Arab Emirates, and the U.S., and run the survey.
3. Collate personal characteristics of the human participants (such as demographics and attitudes) and use them to generate synthetic participants with similar characteristics using AI (see the sketch after this list).
4. Run the exact same survey on the synthetic (AI-generated) participants.
5. Analyze the data and compare results:
• Can synthetic participants help predict human answers?
• Where do discrepancies occur?
• How can AI models be improved for future studies?
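To illustrate step 3, the sketch below shows one way to prompt an LLM with a human respondent’s collated characteristics so it answers a survey item “in persona.” It is a minimal, hypothetical example, not the study’s actual protocol: it assumes the OpenAI Python SDK, and the model name, persona fields, and survey item are all illustrative.

```python
# Minimal sketch of persona-conditioned prompting (illustrative, not the study's exact method).
# Assumes the OpenAI Python SDK with an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def synthetic_response(persona: dict, survey_item: str) -> str:
    """Ask the model to answer one survey item as a participant with the given traits."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; the report does not name a specific model version
        messages=[
            {"role": "system",
             "content": f"You are a survey respondent with this profile: {profile}. "
                        "Answer as that person would, giving only the requested scale value."},
            {"role": "user", "content": survey_item},
        ],
    )
    return completion.choices[0].message.content

# Example: one synthetic participant mirroring a human respondent's characteristics.
persona = {"country": "Saudi Arabia", "age": 34, "gender": "female", "education": "bachelor's degree"}
item = "On a scale of 1 (strongly oppose) to 5 (strongly support), how do you rate a plastic-bag fee?"
print(synthetic_response(persona, item))
```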
In summary, we found that the synthetic participants created by GPT produced responses similar to those of their human counterparts across the three policy domains we assessed. However, the effects of the behavioral interventions we tested varied between human and synthetic participants. We also observed two primary differences between responses from Saudi Arabia and the UAE and those from the United States. First, the correlations were stronger for U.S. participants: when human responses in the U.S. increased or decreased, synthetic responses tracked them more closely. Second, GPT exhibited a positive bias for the U.S. (overestimating human participants’ support for various policy proposals) and a negative bias for Saudi Arabia and the UAE (underestimating participants’ support). This report highlights the main policy implications of these findings and makes practical recommendations for researchers.
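To make the correlation and bias measures concrete, here is a minimal analysis sketch. It is an assumed approach rather than the report’s actual code: for each country, it correlates item-level mean responses of human and synthetic participants, and measures signed bias as the mean synthetic-minus-human difference (positive means GPT overestimates support; negative means it underestimates). The numbers are hypothetical.

```python
# Illustrative comparison of human vs. synthetic responses (assumed approach, hypothetical data).
import numpy as np
from scipy.stats import pearsonr

def compare(human_means: np.ndarray, synthetic_means: np.ndarray) -> tuple[float, float]:
    r, _ = pearsonr(human_means, synthetic_means)         # similarity of response patterns
    bias = float(np.mean(synthetic_means - human_means))  # signed over/underestimation
    return r, bias

# Hypothetical item-level mean support ratings (1-5 scale) for one country.
human = np.array([3.8, 2.9, 4.1, 3.2, 3.6])
synthetic = np.array([4.0, 3.3, 4.4, 3.5, 3.9])
r, bias = compare(human, synthetic)
print(f"r = {r:.2f}, bias = {bias:+.2f}")  # a positive bias here would mirror the U.S. pattern
```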
Generative AI has sparked the imagination of policymakers and the research community for its potential to dramatically accelerate public policy research. Our research found that GPT could be useful in gauging the public’s reaction to prospective policies, but it is still premature to consider using it in more advanced stages of policy development or in the testing of behavioral interventions. Further advances are needed to precisely estimate human responses and remove biases against Gulf Cooperation Council (GCC) populations. Understanding generative AI’s promise and limitations will be crucial to unlocking its full power in the future.