Abstract
The CROWDSS dataset (Crowdsourced Wizard of Oz Dialogue dataset based on Situated Scenarios) contains 113 German dialogues collected in a Wizard-of-Oz fashion (i.e., simulating human-machine interaction).
To refer to CROWDSS in any publication, please cite the following paper:
Frommherz, Y. and Zarcone, A. (2021). Crowdsourcing ecologically-valid dialogue data for German. In Frontiers in Computer Science, Vol 3, doi: 10.3389/fcomp.2021.686050
Technical Information
The dataset is structured as follows: Each dialogue is saved as a dictionary (with the dialogue id as key) containing 1) the scenario which was used for eliciting the corresponding dialogue and 2) the log.
The log is a list of turns made by user and assistant, where each turn again is a dictionary containing the actual turn ("text"), who uttered it ("role") as well as the corresponding dialogue act annotations, following the scheme in Pareti and Lando (2019) but with some modifications (see annotation guidelines). The dialogue acts are saved as a list with the label as well as the start and end indices in the text.
The dialogues were collected on a turn-by-turn basis and using a one-to-many ratio (see paper). The dialogue ids consist of numbers separated by dots. The first number corresponds to the the 30 dialogue beginnings that where collected in batch 1 (see paper). Since we assigned each of these dialogues to multiple participants in batch 2, dialogues sharing the first number in their id share both the same scenario and the first turn, etc.