API

core.analyse(script, **kwargs)[source]

Runs the R code given in the parameter script using rpy2. Python variables passed in kwargs are made available to the R code and are automatically converted to a suitable R format.

Parameters:script (str) – R code or a path to script to be run.
Kwargs:Parameters and their values with which to parameterize the R script.
Example:

core.analyse("""t <- table(df$a, df$b); print(chisq.test(t))""", df = data) ## Runs the χ²-test to compare the expected cross-tabulated frequencies of a and b with the observed frequencies in data. data is a list of dictionaries, each dictionary having the variables a and b.

core.counts(data, count_by, verbose=False)[source]

Counts the occurrences of the feature count_by in the dataset data. Returns the counts as a Counter object and prints them if verbose is True.

Parameters:
  • data (generator or list) – Data entries to be counted.
  • count_by (str) – The feature to be used for counting. Can be author or domain.
  • verbose (bool) – If True, prints the counts. Defaults to False.
Example:
  • core.counts(data, count_by = 'author') ## Counts the entries in data by author.
  • core.counts(data, count_by = 'domain', verbose = True) ## Counts the entries in data by domain and prints the counts.
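
The counting behaviour described above can be approximated with the standard library. This is a minimal sketch written for Python 3, not hybra core's implementation: it assumes each entry is a dictionary, and the helper count_feature and the key name creator are hypothetical.

```python
from collections import Counter


def count_feature(entries, key, verbose=False):
    """Count how often each value of `key` occurs in `entries`.

    Hypothetical stand-in for core.counts; assumes entries are dicts.
    """
    counts = Counter(entry[key] for entry in entries if key in entry)
    if verbose:
        # Print the most frequent values first.
        for value, n in counts.most_common():
            print(value, n)
    return counts


# Made-up entries for illustration only.
entries = [
    {"creator": "author1"},
    {"creator": "author2"},
    {"creator": "author1"},
]
author_counts = count_feature(entries, "creator")
```

Because the result is a Counter, the usual methods such as most_common() are available on it.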
core.data(source, folder='', **kwargs)[source]

Load data of type source using the parser for that data.

Parameters:
  • source (str) – Type of data loaded. Can be facebook, media, twitter.
  • folder (str) – Folder under data path that contains the data to be loaded.
Kwargs:
  • terms (list) – Used when source is facebook, news or twitter. Terms to be searched for in data filenames, given as strings.
  • data_dir (str) – Data directory to override set data path.
Example:

core.data('news', terms = ['uutiset'], folder = 'yle') ## Loads news data from files whose filename contains the term 'uutiset', from the subfolder 'yle' in your data folder.

core.data_path()[source]

Returns the existing data path.

core.data_sources()[source]

Lists possible data sources hybra core can parse.

core.describe(data)[source]

Describes the dataset data, showing the number of posts, number of authors, historical data and more detailed data sources.

Parameters:data (generator or list) – Data entries.
core.export(data, file_path)[source]

Export the dataset data in a common format to the given file. The output format is recognized from the file extension in the given file path. Accepted formats: .csv, .xlsx

Parameters:
  • data (generator or list) – Data entries to be exported.
  • file_path (str) – Path to output file.
Example:

core.export(data, 'exported_data.csv') ## Exports data in common format to file 'exported_data.csv' in current path.
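
Choosing the output format from the file extension can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper export_entries is hypothetical, entries are assumed to be dictionaries, and only the .csv branch is shown because .xlsx output would need a third-party package.

```python
import csv
import os


def export_entries(entries, file_path):
    """Write dict entries to a CSV file; hypothetical stand-in for core.export."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension != ".csv":
        raise ValueError("this sketch only handles .csv, got %r" % extension)
    entries = list(entries)  # accept generators as well as lists
    # Use the union of all keys as columns so no field is dropped.
    fieldnames = sorted({key for entry in entries for key in entry})
    with open(file_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(entries)  # missing fields are written as empty cells
```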

core.filter_by(data, filter_type, **kwargs)[source]

Filters the dataset data with the filter given in filter_type. Returns the filtered data if filter_type matches a filtering method in the module filters.

Parameters:
  • data (generator or list) – Data entries to be filtered.
  • filter_type (str) – Filter type to be used. Can be text, datetime, author or domain.
Kwargs:
  • text (list) – If filter_type is text. List of strings to use for filtering.
  • substrings (bool) – If filter_type is text. If True, will search substrings in text content for terms given in parameter text. Defaults to True.
  • inclusive (bool) – If filter_type is text. If True, returns only entries with all terms given in parameter text. Defaults to True.
  • after (str) – Date and time after which to return entries.
  • before (str) – Date and time before which to return entries.
  • authors (list) – If filter_type is author. List of authors as strings to filter by.
  • domains (list) – If filter_type is domain. List of domains as strings to filter by.
Example:
  • core.filter_by(data, 'text', text = ['research']) ## Return from dataset `data` entries which include the term 'research' in text content.
  • core.filter_by(data, 'text', text = ['research', 'science'], substrings = False, inclusive = False) ## Return from dataset `data` entries which include the term 'research' or the term 'science' in text content as full strings.
  • core.filter_by(data, 'datetime', after = '2015-2-15') ## Return from dataset `data` entries with timestamp after the date '2015-2-15'.
  • core.filter_by(data, 'datetime', after = '2017-1-1', before = '2017-6-30 18:00:00') ## Return from dataset `data` entries with timestamp after the date '2017-1-1' and before the time '2017-6-30 18:00:00'.
  • core.filter_by(data, 'author', authors = ['author1', 'author2']) ## Return from dataset `data` entries which have 'author1' or 'author2' as creator.
  • core.filter_by(data, 'domain', domains = ['domain1.com', 'domain2.net']) ## Return from dataset `data` entries which are from domains 'domain1.com' or 'domain2.net'.
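
How the substrings and inclusive options interact for the text filter can be sketched in plain Python. This is a minimal sketch, not the filters module itself: the helper filter_text is hypothetical, and it assumes that each entry is a dictionary with a text key and that matching is case-insensitive, neither of which is confirmed by this documentation.

```python
import re


def filter_text(entries, terms, substrings=True, inclusive=True):
    """Yield entries whose text matches `terms`.

    Hypothetical stand-in for core.filter_by(data, 'text', ...).
    """
    terms = [term.lower() for term in terms]
    # inclusive=True requires every term, inclusive=False any one of them.
    combine = all if inclusive else any
    for entry in entries:
        text = entry.get("text", "").lower()
        if substrings:
            haystack = text  # a term may match anywhere inside the text
        else:
            haystack = set(re.findall(r"\w+", text))  # whole words only
        if combine(term in haystack for term in terms):
            yield entry


# Made-up entries for illustration only.
entries = [
    {"text": "New research on science policy"},
    {"text": "Researchers publish results"},
    {"text": "Weather today"},
]
both = list(filter_text(entries, ["research", "science"]))
either_word = list(filter_text(entries, ["research", "science"],
                               substrings=False, inclusive=False))
substring_hits = list(filter_text(entries, ["research"]))
```

With substring matching, 'research' also matches 'Researchers'; with substrings = False it does not.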
core.network(data)[source]

Draws a network of the dataset data.

Parameters:data (generator or list) – Data entries.
core.sample(data, size, seed=100, export_file=None)[source]

Takes a random sample of the dataset data. Exports the sample to file using the core module export method if the parameter export_file is not None.

Parameters:
  • data (generator or list) – Data entries to be sampled.
  • size (int) – An integer value specifying the sample size.
  • seed (int) – Seed to use in randomization. Defaults to 100.
  • export_file (None or str) – Path to output file. Defaults to None.
Example:

core.sample(data, 100, seed = 0, export_file = 'exported_sample.csv') ## Takes a random sample of dataset `data` using the seed 0 and exports it to file 'exported_sample.csv' in current path.
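
The role of the seed can be illustrated with the standard random module. This is a sketch under assumptions, not the library's implementation: the helper take_sample is hypothetical, and sampling without replacement is assumed.

```python
import random


def take_sample(entries, size, seed=100):
    """Return `size` entries chosen at random without replacement.

    Hypothetical stand-in for core.sample.
    """
    entries = list(entries)    # materialize generators before sampling
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(entries, size)


# Made-up entries for illustration only.
data = [{"id": i} for i in range(1000)]
first = take_sample(data, 100, seed=0)
second = take_sample(data, 100, seed=0)
```

Using the same seed on the same data gives the same sample, which makes an exported sample reproducible. In this sketch, asking for more entries than the dataset holds raises a ValueError.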

core.set_data_path(path)[source]

Sets the path where the data is stored. The path is relative to where you run your Python.

Parameters:path (str) – Where the data is stored.

Example:
  • core.set_data_path('.') ## Search for data in the current folder.
  • core.set_data_path('~/Documents/data/hybra-data') ## data in folder Documents/data/hybra-data.
core.timeline(datasets=[], **kwargs)[source]

Draws a timeline of the datasets given in datasets.

Parameters:

datasets (list) – Datasets to plot. Given as generators or lists.

Kwargs:
  • colors (list) – List of css colors given as strings to be used in drawing the timeline plots.
Example:

core.timeline(datasets = [news_data, fb_data], colors = ['blue', 'red']) ## Plots the dataset `news_data` as a blue timeline and the dataset `fb_data` as a red timeline.

core.unduplicate(data)[source]

Removes all duplicates from data and returns only unique items.

Parameters:data (generator or list) – Entries of data with potential duplicates.
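
One way to remove duplicates from a list of dictionaries is sketched below. The documentation does not say how two entries are judged to be duplicates, so this is only an assumption: the hypothetical helper drop_duplicates treats entries as duplicates when all of their fields are equal, and it requires the entries to be JSON-serializable.

```python
import json


def drop_duplicates(entries):
    """Yield each distinct entry once, keeping first-seen order.

    Hypothetical stand-in for core.unduplicate.
    """
    seen = set()
    for entry in entries:
        # Dicts are not hashable, so use a key-order-independent fingerprint.
        key = json.dumps(entry, sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield entry


# Made-up entries for illustration only.
entries = [
    {"creator": "author1", "text": "hello"},
    {"text": "hello", "creator": "author1"},  # same fields, different key order
    {"creator": "author2", "text": "hello"},
]
unique = list(drop_duplicates(entries))
```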
core.wordcloud(data, **kwargs)[source]

Draws a wordcloud of the dataset data.

Parameters:

data (generator or list) – Data entries.

Kwargs:
  • stopwords (list) – Words to be ignored in generating the wordcloud. Given as strings.
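
The effect of stopwords can be shown by computing the word frequencies that a wordcloud would be drawn from. This is a rough sketch, not the wordclouds module: the helper word_frequencies is hypothetical, and it assumes each entry is a dictionary with a text key and that words are compared in lowercase.

```python
import re
from collections import Counter


def word_frequencies(entries, stopwords=()):
    """Count words across all entries, skipping the given stopwords.

    Hypothetical helper for illustration; it does not draw anything.
    """
    ignored = {word.lower() for word in stopwords}
    counts = Counter()
    for entry in entries:
        words = re.findall(r"\w+", entry.get("text", "").lower())
        counts.update(word for word in words if word not in ignored)
    return counts


# Made-up entries for illustration only.
entries = [
    {"text": "The science of the research"},
    {"text": "Research and science"},
]
frequencies = word_frequencies(entries, stopwords=["the", "of", "and"])
```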

Internals

These are internal APIs and no promises are made about their stability. Always use the core module rather than accessing these directly.

network.module_network.create_network(data)[source]
network.module_network.encode_utf8(string)[source]
timeline.module_timeline.create_axes(data)[source]
timeline.module_timeline.create_data_points(x_axis, y_axis)[source]
timeline.module_timeline.create_plots(datasets)[source]
timeline.module_timeline.create_timeline(datasets=[], colors=[])[source]
wordclouds.create_wordcloud(data, stopwords=[u'the', u'a', u'or', u'tai', u'and', u'ja', u'to', u'on', u'in', u'of', u'for', u'is', u'i', u'this', u'http', u'www', u'fi', u'com'])[source]