Where does data come from? 

What are the ethical and governance problems that arise from ‘found’ data?

 

Data Ethics and Provenance


Fordyce, R. and Jethani, S., 2021. Critical data provenance as a methodology for studying how language conceals data ethics. Continuum, 35(5), pp.775-787.

DOI: https://doi.org/10.1080/10304312.2021.1983259

I wrote “Critical data provenance as a methodology for studying how language conceals data ethics” with my good friend Suneel Jethani back in 2020, to address questions that we felt some existing models for data forensics left unanswered. In the article, we argue that more work is needed to ensure that any dataset has a record of its origins built into it. Datasets can be bought and sold, obtained, released, captured, aggregated, triangulated, re-identified, stumbled upon, leaked, illicitly obtained, and so on. We think that data can be more ethical if the dataset or database itself keeps track of how the data was created. This means that people can know what the appropriate uses of the data are and what was consented to, and, if the data was obtained illicitly or obliquely, there is a possibility that some history of the data’s carriage and transactions is retained in its records.

We see this as contributing to the ethics of data provenance. Data provenance is an idea that we develop from the work of Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. These authors present ‘where-provenance’ as a question of correspondence between a datum and the thing that it references. This is essentially a question of accuracy, and they point out the importance of technical measures for ensuring the validity and reliability of data. We think that data ethics can add to this idea of ‘where-provenance’ by incorporating data that explains the origins of the data not in terms of validity or reliability (though these are important) but in terms of how its capture was justified ethically, legally, or discursively. To put it simply, we think that data gathered within a dataset should have transparent, human-readable information about the clauses or processes that led to its creation.
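To make the idea a little more concrete, here is a minimal sketch of what a datum carrying its own human-readable provenance might look like. This is purely illustrative and is not taken from the article: the class names (`Datum`, `ProvenanceRecord`, `Transfer`), the field names, and the example values are all assumptions invented for this sketch.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class Transfer:
    """One step in the data's carriage (hypothetical structure)."""
    from_party: str
    to_party: str
    mechanism: str  # e.g. "shared", "sold", "leaked", "aggregated"


@dataclass
class ProvenanceRecord:
    """Human-readable account of how a datum came to exist (hypothetical)."""
    capture_method: str   # e.g. "survey", "tracking pixel"
    justification: str    # the clause or process relied on at capture
    consented_uses: List[str] = field(default_factory=list)
    transfers: List[Transfer] = field(default_factory=list)


@dataclass
class Datum:
    """A value stored together with its provenance record."""
    value: str
    provenance: ProvenanceRecord

    def permits(self, use: str) -> bool:
        # A use is treated as appropriate only if it was explicitly consented to.
        return use in self.provenance.consented_uses


# Invented example values, for illustration only.
datum = Datum(
    value="postcode: 3000",
    provenance=ProvenanceRecord(
        capture_method="survey",
        justification="Plain language statement, clause 4 (example)",
        consented_uses=["academic research"],
        transfers=[Transfer("Research team A", "Research team B", "shared")],
    ),
)

# The provenance travels with the datum and can be read by a person.
print(json.dumps(asdict(datum), indent=2))
```

In this sketch, `datum.permits("advertising")` would be false because that use was never consented to, while the `transfers` list keeps some history of how the data has moved, whatever the mechanism.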

There are all sorts of implicit and explicit justifications and mechanisms at play that lead people to provide data to someone. Sometimes these are clearly laid out at the moment of capture, such as in the plain language statements of some university research ethics clearances. Some data capture is less clearly justified, such as through the terms and conditions of commercial websites. Some is even less clear: when someone ‘posts’ or ‘likes’ material on social media, it may not be apparent to them how their post or like will influence the creation of an advertising profile or information about their susceptibility to political influence campaigns. Finally, there are dark patterns and illicit, illegal, or incidental capture, where data is created without any clear communication to the user, for example through zombie cookies, Facebook’s pixel tracking methods, digital fingerprinting, or other methods.