Summer Psychometric Internship

Providing students with the opportunity to gain hands-on experience and mentorship.

How to Apply


To receive consideration for an NBME internship, a candidate must meet the following requirements:

  • Active enrollment in a doctoral program in measurement, statistics, cognitive science, medical education, or a related field, with two or more years of graduate coursework completed.
  • Experience or coursework in one or more of the following: test development, IRT, CTT, statistics, research design, and cognitive science. Advanced knowledge of topics such as equating, generalizability theory, or Bayesian methodology is helpful, as is skill in writing and presenting research. Working knowledge of statistical software (e.g., Winsteps, BILOG, SPSS, SAS, or R).
  • Interns will be assigned to one or more mentors, but must be able to work independently.
  • Must be authorized to work in the US for any employer. If selected, F-1 visa holders will need to apply for Curricular Practical Training authorization through their school’s international student office and have a Social Security number for payroll purposes.

Research Projects

Interns will help define a research problem, review related studies, conduct data analyses (real and/or simulated data), and write a summary report suitable for presentation. 

Applicants should identify, by number, the two projects they would prefer to work on.

1. Application of Natural Language Processing (NLP) in the Field of Assessment

The application of natural language processing (NLP) to assessment has led to innovations in how testing organizations design and score tests.

Possible projects will investigate novel NLP applications, using real or simulated data, for various processes relevant in an operational testing program (e.g., test construction, key validation, standard setting). Results would be informative for possible improvements to current best practices.

2. Modeling Answer-Change Strategy in a High-Stakes MCQ Examination

In this project, we explore the use of the Rasch Poisson Count model (Rasch, 1960/1980) to extend the hierarchical speed-accuracy model (van der Linden, 2007) to item-revisit and answer-change behavior patterns in high-stakes examination data collected in an experimental setting.

We propose to connect the elements of process data available from a computer-based test (correctness, response time, number of revisits to an item, the outcome of each revisit, IRT examinee ability, and IRT item characteristics) in a hierarchical latent trait model that explains an examinee’s decision to change the initial response to an item.

The relationships among working speed, ability, the number of visits, and the number of answer changes can be modeled using a multidimensional model that conceptualizes them as latent variables. The model should help us better understand the answer-change behavior and cognitive behavior of examinees in a timed high-stakes examination.
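As a concrete (and deliberately simplified) illustration of the kind of hierarchical structure described above, the sketch below simulates correctness, response times, and revisit counts from a 2PL accuracy model, a lognormal response-time model, and a Poisson revisit count. All parameter values, distributions, and the link between revisit propensity and working speed are invented for illustration; this is not the project's actual model.

```python
import math
import random

random.seed(7)

N, J = 500, 20  # examinees and items (illustrative sizes)

# Hypothetical item parameters: discrimination a, difficulty b, time intensity beta.
items = [(random.uniform(0.8, 1.6), random.gauss(0, 1), random.gauss(3.5, 0.3))
         for _ in range(J)]

def poisson(lam):
    # Knuth's method for sampling a Poisson count with a small rate lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def simulate_examinee():
    # Correlated person parameters: ability (theta), speed (tau), and a
    # revisit propensity (eta) loosely tied to slower working speed.
    theta = random.gauss(0, 1)
    tau = random.gauss(0, 0.4)
    eta = -0.5 * tau + random.gauss(0, 0.3)
    record = []
    for a, b, beta in items:
        p_correct = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # 2PL accuracy
        correct = random.random() < p_correct
        rt = math.exp(random.gauss(beta - tau, 0.4))          # lognormal response time
        revisits = poisson(math.exp(eta - 1.0))               # Poisson revisit count
        record.append((correct, rt, revisits))
    return record

data = [simulate_examinee() for _ in range(N)]
mean_revisits = sum(r for examinee in data for _, _, r in examinee) / (N * J)
print(round(mean_revisits, 2))
```

Simulated data of this shape (one correctness/time/revisit triple per person-item pair) is the minimal input the hierarchical latent trait model described above would be fit to.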

3. Performance Assessments

The intern will pursue research related to improving the precision and accuracy of a performance test involving physician interactions with standardized patients.

Possible projects include designing an enhanced process for flagging aberrant ratings by trained raters and supporting research on standardized patients in a high-stakes exam.

4. Measurement Instrument Revision and Development

This project will involve revising a commonly used measurement instrument so that appropriate inferences can be made about medical students.

Duties will include the following:

  • Working with subject-matter experts to revise the existing items
  • Conducting think-alouds with medical students
  • Developing a pilot measure of potential items
  • Exploratory and confirmatory factor analysis of initial pilot results to gather structural validity evidence
  • Developing a larger survey to gather concurrent and discriminant validity evidence for the revised measure
  • Administration and evaluation of the larger survey
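The structural-validity step above can be illustrated with a toy one-factor check. Everything below is invented for illustration (simulated pilot responses with assumed loadings); a real analysis would use dedicated EFA/CFA software rather than this hand-rolled eigenvalue check.

```python
import math
import random

random.seed(1)

# Invented pilot data: 200 respondents to 6 items driven by one latent factor,
# standing in for the kind of data an exploratory factor analysis starts from.
n, k = 200, 6
loadings = [0.8, 0.7, 0.75, 0.6, 0.65, 0.7]
data = []
for _ in range(n):
    f = random.gauss(0, 1)
    data.append([l * f + math.sqrt(1 - l * l) * random.gauss(0, 1)
                 for l in loadings])

def corr(x, y):
    # Pearson correlation between two item-score columns.
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

cols = list(zip(*data))
R = [[corr(cols[i], cols[j]) for j in range(k)] for i in range(k)]

# Largest eigenvalue of R via power iteration; its share of the total
# variance (trace(R) = k) is a crude signal of a dominant first factor.
v = [1.0] * k
for _ in range(100):
    w = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
    norm = math.sqrt(sum(c * c for c in w))
    v = [c / norm for c in w]
lam = sum(v[i] * sum(R[i][j] * v[j] for j in range(k)) for i in range(k))
print(round(lam / k, 2))  # proportion of variance on the first factor
```

A first-factor variance share well above 1/k, as here, is the sort of evidence that motivates a follow-up confirmatory model.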

5. Characterizing (and Visualizing) Item Pool Health 

The health of an item pool can be defined in a number of ways. Our current test development practices use have/need reports broken down by content area, and many content outlines are hierarchical in nature, with several layers of content coding and metadata. The problem is that have/need ratios are, for the most part, one-dimensional; the details within the “have” portion of these ratios, however, represent multidimensional information that could be used to improve multiple aspects of test development, including form construction, test security, pool management/maintenance, and the targeting of item-writing assignments.

The aims of this project are two-fold:

  • Develop helpful, easily interpretable metrics for assessing item pool health
  • Employ a sophisticated visualization method for item pool health (e.g., via R Shiny, D3.js, or .NET languages/libraries) to assist in improving one or more aspects of test development.
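As a minimal sketch of the first aim, the snippet below computes per-cell have/need ratios for a tiny invented item bank with hierarchical content codes, then rolls them up so that the weakest cell drives each top-level area. The content codes, blueprint targets, and bottleneck metric are all hypothetical; a real pool would carry far more metadata (item statistics, exposure, enemy lists).

```python
from collections import Counter

# Hypothetical hierarchical content codes ("area.topic") for a tiny item bank.
pool = ["cardio.anatomy", "cardio.anatomy", "cardio.pharm",
        "renal.anatomy", "renal.pharm", "renal.pharm", "renal.pharm"]

# Hypothetical blueprint targets: how many items each content cell needs.
need = {"cardio.anatomy": 3, "cardio.pharm": 2,
        "renal.anatomy": 2, "renal.pharm": 2}

have = Counter(pool)

# One simple health metric: the have/need ratio per content cell.
ratios = {cell: have[cell] / n for cell, n in need.items()}

# Roll up hierarchically: the weakest cell drives its top-level area,
# since form construction fails wherever any required cell runs dry.
area_health = {}
for cell, r in ratios.items():
    area = cell.split(".")[0]
    area_health.setdefault(area, []).append(r)
area_health = {area: min(rs) for area, rs in area_health.items()}

print(ratios)       # per-cell have/need
print(area_health)  # bottleneck view per area
```

Even this toy rollup shows how a one-dimensional ratio hides structure: renal looks oversupplied in aggregate (four items against a need of four) while its anatomy cell is at half strength.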

6. Item Tagging and Mapping with Natural Language Processing 

Test content outlines and specifications often change rapidly within cutting-edge domains. In response to these changes, test development teams must “map” pre-existing content onto the new content domains. Such a task is trivial when the new and old content outlines have equivalent domains. In practice, however, this direct mapping rarely occurs, leaving items to be mapped manually, a time-intensive task that is prone to human error and to differences in subjective interpretation across reviewers.

This project seeks to utilize and integrate natural language processing (NLP), machine learning (ML), and data visualization to:

  • Assist subject-matter experts with creating new content outlines
  • Help map items to new content domains
  • Review manual item mappings for accuracy as a quality control measure
  • Visually represent the content distribution within a group of items (e.g., a test form or item bank).

A component of this project will be the use of sophisticated data visualization methods that allow subject-matter experts and test development staff to more easily examine items in multiple contexts. Strong candidates for this position will have knowledge of Python or a similar language and of common libraries used in NLP (e.g., Keras, TensorFlow, PyTorch).
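As a toy sketch of the mapping step, the snippet below assigns invented item stems to invented content-domain descriptions by bag-of-words cosine similarity. A production system would use the richer NLP/ML toolkits named above, and similarity scores would only rank candidates for subject-matter-expert review, not decide mappings on their own.

```python
import math
from collections import Counter

# Invented content-domain descriptions and item stems (illustrative only).
domains = {
    "pharmacology": "drug mechanism dose adverse effect interaction",
    "physiology":   "organ function regulation homeostasis pressure",
}
items = [
    "adverse effect of drug interaction at high dose",
    "regulation of blood pressure and organ function",
]

def vec(text):
    # Bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

dvecs = {name: vec(desc) for name, desc in domains.items()}

# Map each item stem to its most similar content domain.
mapping = {stem: max(dvecs, key=lambda d: cosine(vec(stem), dvecs[d]))
           for stem in items}
print(mapping)
```

The same similarity matrix supports the quality-control use case above: a manual mapping whose score is far below the best automated candidate is worth a second look.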

7. Computer-Assisted Scoring of Constructed Response Test Items 

Recently, the NBME has developed a computer-assisted scoring program that utilizes natural language processing (NLP). The two main components of the program are:

  • Ensuring that the information in the constructed response is correctly identified and represented
  • Building a scoring model based on these concept representations

Current areas of research surrounding this project include (but are not limited to):

  • Refining quality control steps to be taken prior to an item being used in computer-assisted scoring
  • Linking and equating computer-assisted scores with human rater scores
  • Evaluating a scoring method based on using orthogonal arrays
  • Developing metrics that assess item quality and test reliability when computer-assisted scores and human scores are used to make classification decisions.
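As a deliberately simple sketch of the first component, the snippet below checks which key concepts a constructed response mentions via synonym matching. The concepts, synonyms, and matching rule are invented for illustration; the NBME's actual program builds far richer concept representations before fitting a scoring model.

```python
# Hypothetical key concepts for one constructed-response item, each with a
# set of surface forms that should count as a mention (invented examples).
key_concepts = {
    "hypertension": {"hypertension", "high blood pressure", "elevated bp"},
    "diuretic":     {"diuretic", "water pill"},
}

def score(response):
    # Count the distinct key concepts whose synonyms appear in the response.
    text = response.lower()
    found = {concept for concept, synonyms in key_concepts.items()
             if any(s in text for s in synonyms)}
    return len(found), found

n, found = score("Start a diuretic to manage the patient's high blood pressure.")
print(n, sorted(found))
```

The count and the set of matched concepts are the kind of intermediate output the quality-control and linking research areas above would audit against human rater scores.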

The final project will be determined based on a combination of intern interest and project importance.


The application window is currently closed; please check back for updated application deadlines.