Apropos #openai refusing to disclose any information about the training data for #GPT4 and #Google being similarly cagey about #Bard...

From the Stochastic Parrots paper, written in late 2020 and published in March 2021:

@timnitGebru @meg
Screencap: "When we rely on ever larger datasets we risk incurring documentation debt, 18 i.e. putting ourselves in a situation where the datasets are both undocumented and too large to document post hoc. While documentation allows for potential accountability [13, 52, 86], undocumented training data perpetuates harm without recourse. Without documentation, one cannot try to understand training data characteristics in order to mitigate some of these attested issues or even unknown ones. The solution, we propose, is to budget for" and footnote 18 "On the notion of documentation debt as applied to code, rather than data, see [154]." Screencap "As a part of careful data collection practices, researchers must adopt frameworks such as [13, 52, 86] to describe the uses for which their models are suited and benchmark evaluations for a variety of conditions. This involves providing thorough documentation on the data used in model building, including the motivations underlying data selection and collection processes. This documentation should reflect and indicate researchers’ goals, values, and motivations in assembling data and creating a given model. It should also make note of potential users and stakeholders, particularly those that stand to be negatively impacted by model errors or misuse. We note that just because a model might have many different applications doesn’t mean that its developers don’t need to consider stakeholders. An exploration of stakeholders for likely use cases can still be informative around potential risks, even when there is no way to guarantee that all use cases can be explored." 2nd-4th sentences highlighted in blue.
Screen cap: Running header "Bender and Gebru, et al." and text "documentation as part of the planned costs of dataset creation, and only collect as much data as can be thoroughly documented within that budget." Screencap: "In summary, we advocate for research that centers the people who stand to be adversely affected by the resulting technology, with a broad view on the possible ways that technology can affect people. This, in turn, means making time in the research process for considering environmental impacts, for doing careful data curation and documentation, for engaging with stakeholders early in the design process, for exploring multiple possible paths towards longterm goals, for keeping alert to dual-use scenarios, and finally for allocating research effort to harm mitigation in such cases." Highlighted in blue: "for doing careful data curation and documentation"