CROWDSS: A crowdsourced, ecologically-valid dialogue dataset for German

Frommherz, Yannick; Zarcone, Alessandra

Jun-2021 Textual Data https://fordatis.fraunhofer.de/handle/fordatis/198
http://dx.doi.org/10.24406/fordatis/124

CROWDSS: A crowdsourced, ecologically-valid dialogue dataset for German

Frommherz, Yannick (Fraunhofer-Institut für Integrierte Schaltungen IIS); Zarcone, Alessandra

IIS Fraunhofer-Institut für Integrierte Schaltungen

Files in This Item:

File	Description	Size	Format
Annotation guidelines.pdf	Guidelines for the dialogue act annotation	222,15 kB	Adobe PDF	Preview Download/Open
CROWDSS.json	CROWDSS dataset	458,87 kB	Unknown	Download/Open

Abstract

The CROWDSS dataset (Crowdsourced Wizard of Oz Dialogue dataset based on Situated Scenarios) contains 113 German dialogues collected in a Wizard-of-Oz fashion (i.e., simulating human-machine interaction). To refer to CROWDSS in any publication, please cite the following paper: Frommherz, Y. and Zarcone, A. (2021). Crowdsourcing ecologically-valid dialogue data for German. In Frontiers in Computer Science, Vol 3, doi: 10.3389/fcomp.2021.686050

Technical Information

The dataset is structured as follows: Each dialogue is saved as a dictionary (with the dialogue id as key) containing 1) the scenario which was used for eliciting the corresponding dialogue and 2) the log. The log is a list of turns made by user and assistant, where each turn again is a dictionary containing the actual turn ("text"), who uttered it ("role") as well as the corresponding dialogue act annotations, following the scheme in Pareti and Lando (2019) but with some modifications (see annotation guidelines). The dialogue acts are saved as a list with the label as well as the start and end indices in the text. The dialogues were collected on a turn-by-turn basis and using a one-to-many ratio (see paper). The dialogue ids consist of numbers separated by dots. The first number corresponds to the the 30 dialogue beginnings that where collected in batch 1 (see paper). Since we assigned each of these dialogues to multiple participants in batch 2, dialogues sharing the first number in their id share both the same scenario and the first turn, etc.

Classification

400 Sprache
000 Informatik, Informationswissenschaft, allgemeine Werke

Keywords

dialogue data
voice assistants
crowdsourcing
Wizard-of-Oz
German
ecological validity
situated knowledge

Relationships

Is part of
10.3389/fcomp.2021.686050

Funder

Bundesministerium fur Wirtschaft und Energie BMWi (Deutschland)

Show full item record

This item is licensed under a Creative Commons License