Characteristics

The characteristics section of a datacard describes a dataset's features so that practitioners can judge how well it fits a specific AI task. It covers the dataset's composition, format, and thematic scope, which together indicate what a model could learn from the data during training or fine-tuning.

Characteristics pydantic type:

from typing import List, Optional, Union

from pydantic import BaseModel

class InferredMetadata(BaseModel):
    text_topics: Optional[List[str]]
    hugging_face: Optional[HuggingFaceInfo] = None
    github: Optional[GitHubInfo] = None
    pwc: Optional[PwcInfo] = None
    s2: Optional[S2Info] = None

class DatasetCharacteristics(BaseModel):
    languages: Optional[List[str]] = None
    task_categories: Optional[List[str]] = None
    format: Optional[List[str]] = None
    # Optional accepts a single type, so the alternatives are wrapped in Union.
    dataset_metrics: Optional[Union[TextMetrics, VideoMetrics, AudioMetrics, TimeSeriesMetrics]] = None
    dataset_filter_ids: Optional[List[str]] = None
    inferred_metadata: Optional[InferredMetadata] = None

Characteristics Fields

  • languages: Lists the languages present in the dataset. Understanding the linguistic diversity is crucial for deploying AI models in multilingual contexts and ensuring inclusivity.

  • task_categories: Describes the AI tasks the dataset is designed for, such as classification, translation, or sentiment analysis. This guides practitioners in selecting datasets aligned with their model's objectives.

  • format: Details the dataset's format (e.g., JSON, CSV, TXT), which is important for data parsing and integration into AI workflows.

  • dataset_metrics: Provides quantitative measures to assess the quality, performance, and suitability of datasets from different modalities. Refer to the types below for modality-specific metric documentation.
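
As a rough illustration of how these fields fit together, the sketch below fills in a characteristics record. It uses a standard-library dataclass as a stand-in for the pydantic model so that it runs without extra dependencies. The class name CharacteristicsSketch and all values are hypothetical and do not describe a real dataset.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, List, Optional


@dataclass
class CharacteristicsSketch:
    """Hypothetical stand-in for DatasetCharacteristics, not the real model."""

    languages: Optional[List[str]] = None
    task_categories: Optional[List[str]] = None
    format: Optional[List[str]] = None
    dataset_metrics: Optional[Dict[str, Any]] = None
    dataset_filter_ids: Optional[List[str]] = None


# Illustrative values only.
example = CharacteristicsSketch(
    languages=["en", "fr"],
    task_categories=["translation"],
    format=["JSON"],
)

# Every field is optional, so anything left unset is stored as None.
record = asdict(example)
print(record)
```

Because every field defaults to None, a datacard can be filled in gradually as more is learned about the dataset.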


Text Metrics (new):

  • name: The file and column name of the CSV or JSON where the data is stored. This identifier helps locate the dataset within your data repository.

  • dataType: The modality of the dataset metric ("Text_Metrics").

  • numberofSamples: Represents the total number of samples in the dataset.

  • languageScoreMean: The average language score across the dataset, reflecting the overall quality and fluency of the text. A higher mean score suggests a generally high-quality dataset.

  • languageScoreMax: The maximum language score found in the dataset.

  • languageScoreMin: The minimum language score within the dataset.

  • languageScoreBins: An array representing the histogram distribution of language scores across different bins between the minimum and maximum scores.

  • lengthMean: The average token length of text samples in the dataset.

  • lengthMin: The shortest token length of text samples in the dataset.

  • lengthMax: The longest token length of text samples in the dataset.

  • lengthBins: An array representing the histogram distribution of token lengths across different bins between the minimum and maximum lengths.

  • languagePie: An object that presents a pie chart distribution of languages present within the dataset.
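
To make the length fields concrete, here is a minimal sketch of how lengthMean, lengthMin, lengthMax, and lengthBins could be derived from a list of token lengths. It assumes the token lengths are already computed and that the histogram uses equal-width bins. The bin count and the helper names are assumptions for illustration, not the platform's documented behavior.

```python
from statistics import mean
from typing import Dict, List


def histogram(values: List[float], num_bins: int) -> List[int]:
    """Count values in equal-width bins spanning [min(values), max(values)]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so everything falls in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for value in values:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


def length_metrics(token_lengths: List[int], num_bins: int = 4) -> Dict[str, object]:
    """Summarize token lengths using the field names from the list above."""
    return {
        "numberofSamples": len(token_lengths),
        "lengthMean": mean(token_lengths),
        "lengthMin": min(token_lengths),
        "lengthMax": max(token_lengths),
        "lengthBins": histogram(token_lengths, num_bins),
    }


# Hypothetical token lengths for six text samples.
metrics = length_metrics([3, 5, 8, 12, 20, 21])
print(metrics)
```

The same pattern would apply to the languageScore fields, with per-sample language scores in place of token lengths.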

Text Metrics (old):

  • dataType: The modality of the dataset metric ("TextMetrics").

  • num_dialogs: The number of dialogues, relevant for conversational AI models.

  • mean_inputs_length and mean_targets_length: Average lengths of inputs and targets, indicating the dataset's complexity and verbosity.

  • max_inputs_length and max_targets_length: Maximum lengths, highlighting the potential need for truncation or segmentation in AI models.

  • min_inputs_length and min_targets_length: Minimum lengths, showing the conciseness or brevity present in the data.

  • min_dialog_turns, max_dialog_turns, mean_dialog_turns: Dialogue metrics, essential for models dealing with conversational dynamics.

  • dataset_filter_ids: Identifiers used for filtering or categorizing the dataset, which can facilitate dataset discovery and selection.

Image Metrics:

  • name: The name of the folder where the data is stored.

  • dataType: The modality of the dataset metric ("Image_Metrics").

  • numberofSamples: The total number of samples in the dataset.

  • qualityScoreMean: The average quality score across the dataset, reflecting the objective overall quality and clarity of the images. This score typically considers factors such as resolution, clarity, and noise levels. A higher mean score suggests a generally high-quality dataset.

  • qualityScoreMax: The maximum quality score found in the dataset.

  • qualityScoreMin: The minimum quality score within the dataset.

  • qualityScoreBins: An array representing the histogram distribution of quality scores across different bins between the minimum and maximum scores.

  • lengthMean: The average height (in pixels) of images in the dataset.

  • lengthMin: The minimum height (in pixels) of images in the dataset.

  • lengthMax: The maximum height (in pixels) of images in the dataset.

  • lengthBins: An array representing the histogram distribution of image heights across different bins between the minimum and maximum heights.

  • widthMean: The average width (in pixels) of images in the dataset.

  • widthMin: The minimum width (in pixels) of images in the dataset.

  • widthMax: The maximum width (in pixels) of images in the dataset.

  • widthBins: An array representing the histogram distribution of image widths across different bins between the minimum and maximum widths.

  • formatPie: An object that presents a pie chart distribution of image formats (e.g., JPEG, PNG) present within the dataset.

  • modePie: An object that presents a pie chart distribution of image color modes (e.g., RGB, grayscale) within the dataset.

  • oddAspectRatio: The percentage of images within the dataset with odd aspect ratios.

  • oddSizes: The percentage of images with odd sizes.
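
The pie and percentage fields can be sketched in a few lines. The example below assumes a pie object is a mapping from label to share of the dataset, and that an "odd" aspect ratio means the long side exceeds a chosen multiple of the short side. Both the representation and the max_ratio threshold are assumptions made for this sketch. The source does not define them.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def pie_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Map each label to its share of the total (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def odd_aspect_ratio_percent(
    sizes: List[Tuple[int, int]], max_ratio: float = 2.0
) -> float:
    """Percentage of (width, height) pairs whose long side is more than
    max_ratio times the short side. max_ratio is a hypothetical threshold."""
    odd = sum(1 for w, h in sizes if max(w, h) / min(w, h) > max_ratio)
    return 100.0 * odd / len(sizes)


# Hypothetical per-image values.
formats = ["JPEG", "JPEG", "PNG", "JPEG"]
sizes = [(640, 480), (1920, 1080), (3000, 500), (512, 512)]

format_pie = pie_shares(formats)
odd_percent = odd_aspect_ratio_percent(sizes)
print(format_pie, odd_percent)
```

A modePie could be built the same way by passing color modes such as "RGB" instead of file formats.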

Video Metrics:

  • name: The name of the folder where the data is stored.

  • dataType: The modality of the dataset metric ("Video_Metrics").

  • numberofSamples: The total number of video samples in the dataset.

  • qualityScoreMean: The average quality score across the dataset, reflecting the objective overall quality and clarity of the videos. This score typically considers factors such as resolution, clarity, frame rate, and noise levels. A higher mean score suggests a generally high-quality dataset.

  • qualityScoreMax: The maximum quality score found within the dataset.

  • qualityScoreMin: The minimum quality score within the dataset.

  • qualityScoreBins: An array representing the histogram distribution of quality scores across different bins between the minimum and maximum scores.

  • durationMean: The average duration of video samples in the dataset, measured in seconds.

  • durationMin: The shortest duration of video samples in the dataset.

  • durationMax: The longest duration of video samples in the dataset.

  • durationBins: An array representing the histogram distribution of video durations across different bins between the minimum and maximum durations.

  • lengthMean: The average height (in pixels) of videos in the dataset.

  • lengthMin: The minimum height (in pixels) of videos in the dataset.

  • lengthMax: The maximum height (in pixels) of videos in the dataset.

  • lengthBins: An array representing the histogram distribution of video heights across different bins between the minimum and maximum heights.

  • widthMean: The average width (in pixels) of videos in the dataset.

  • widthMin: The minimum width (in pixels) of videos in the dataset.

  • widthMax: The maximum width (in pixels) of videos in the dataset.

  • widthBins: An array representing the histogram distribution of video widths across different bins between the minimum and maximum widths.

  • formatPie: An object that presents a pie chart distribution of video formats (e.g., MP4, AVI, MKV) within the dataset.

  • codecPie: An object that presents a pie chart distribution of video codecs (e.g., H.264, VP9, HEVC) within the dataset.

  • oddAspectRatio: The percentage of videos within the dataset with odd aspect ratios.

  • oddSizes: The percentage of videos with odd sizes.
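
Since each Min, Mean, and Max triple describes the same set of samples, the mean should always lie between the minimum and the maximum. The sketch below is a simple sanity check built on that property. The function name and the sample record are hypothetical, and the record is assumed to be a flat mapping of field name to number.

```python
from typing import Dict, List


def inconsistent_prefixes(metrics: Dict[str, float]) -> List[str]:
    """Return prefixes (e.g. 'duration') whose Min/Mean/Max are out of order."""
    problems = []
    for key, value in metrics.items():
        if not key.endswith("Mean"):
            continue
        prefix = key[: -len("Mean")]
        low = metrics.get(prefix + "Min")
        high = metrics.get(prefix + "Max")
        if low is None or high is None:
            continue  # Cannot check without both bounds.
        if not (low <= value <= high):
            problems.append(prefix)
    return problems


# Hypothetical record; the width values are deliberately inconsistent.
video_metrics = {
    "durationMin": 2.0,
    "durationMean": 14.5,
    "durationMax": 60.0,
    "widthMin": 640,
    "widthMean": 2000,
    "widthMax": 1920,
}

bad = inconsistent_prefixes(video_metrics)
print(bad)
```

The check relies only on the field naming pattern, so it would work equally well on image or audio metric records.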

Audio Metrics:

  • name: The name of the folder where the data is stored.

  • dataType: The modality of the dataset metric ("Audio_Metrics").

  • numberofSamples: The total number of audio samples in the dataset.

  • qualityScoreMean: The average quality score across the dataset, reflecting the overall quality and clarity of the audio. This score typically considers objective factors such as bitrate, noise levels, and fidelity. A higher mean score suggests a generally high-quality dataset.

  • qualityScoreMax: The maximum quality score found within the dataset.

  • qualityScoreMin: The minimum quality score within the dataset.

  • qualityScoreBins: An array representing the histogram distribution of quality scores across different bins between the minimum and maximum scores.

  • durationMean: The average duration of audio samples in the dataset, measured in milliseconds.

  • durationMin: The shortest duration of audio samples in the dataset.

  • durationMax: The longest duration of audio samples in the dataset.

  • durationBins: An array representing the histogram distribution of audio durations across different bins between the minimum and maximum durations.

  • sampleRateMean: The average sample rate (in Hz) of the audio files in the dataset.

  • sampleRateMin: The minimum sample rate (in Hz) of audio files in the dataset.

  • sampleRateMax: The maximum sample rate (in Hz) of audio files in the dataset.

  • sampleRateBins: An array representing the histogram distribution of audio sample rates across different bins between the minimum and maximum sample rates.

  • formatPie: An object that presents a pie chart distribution of audio formats (e.g., MP3, WAV, FLAC) within the dataset.

  • codecPie: An object that presents a pie chart distribution of audio codecs (e.g., AAC, MP3, Opus) within the dataset.

  • oddSampleRates: The percentage of audio files within the dataset with sample rates that deviate from common standards (e.g., 44.1 kHz, 48 kHz).

  • lowBitrate: The percentage of audio files with a bitrate lower than a specified threshold, indicating potential quality issues.
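
The two percentage fields at the end of the list can be illustrated as follows. This sketch assumes that the "common standards" are exactly 44.1 kHz and 48 kHz, the two examples named above, and it uses a hypothetical low-bitrate threshold of 64 kbps. The real set of standard rates and the real threshold are not given in the source.

```python
from typing import List, Set

# Assumptions for this sketch: only the two rates named as examples count
# as standard, and the bitrate threshold is a made-up value.
STANDARD_RATES_HZ: Set[int] = {44_100, 48_000}
LOW_BITRATE_KBPS = 64


def percent(flags: List[bool]) -> float:
    """Percentage of True values in a non-empty list of flags."""
    return 100.0 * sum(flags) / len(flags)


# Hypothetical per-file values.
sample_rates_hz = [44_100, 48_000, 22_050, 44_100]
bitrates_kbps = [128, 32, 320, 48]

odd_sample_rates = percent([rate not in STANDARD_RATES_HZ for rate in sample_rates_hz])
low_bitrate = percent([rate < LOW_BITRATE_KBPS for rate in bitrates_kbps])
print(odd_sample_rates, low_bitrate)
```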


Inferred metadata:

  • text_topics: Topics covered in the dataset, providing insights into its thematic scope.

  • hugging_face, github, pwc, s2: Platform-specific metadata from Hugging Face, GitHub, Papers with Code, and Semantic Scholar, which can offer additional context, popularity metrics, and usage examples.
