Format Compatibility Assessment

How each data source maps to GospeLib's JSON schema, FalkorDB graph model, and ingest pipeline.

Mapping to scripture-text v2.0.0 JSON Schema

| Source | Native Format | Conversion Difficulty | Notes |
|---|---|---|---|
| STEPBible-Data | TSV | Low | Column mapping to JSON properties is straightforward |
| scrollmapper | SQLite/JSON/CSV | Low | JSON variant is closest; SQLite requires SQL extraction |
| lxx-swete | TSV | Low | Verse-level text → scripture-text passages |
| MorphGNT | Space-separated text | Low | Fixed-column format, easily parsed |
| OpenScriptures morphhb | OSIS XML | Medium | XML namespace handling, <w> element extraction |
| viz.bible | CSV/JSON/Neo4j | Low–Medium | JSON is direct; Neo4j dump needs Cypher adaptation |
| biblicalhumanities | HTML/XML/CSV (varies) | Medium | Multiple formats across repos |
| berean.bible | USFM/xlsx/tsv | Medium | USFM needs parser; xlsx needs openpyxl |
| ebible.org | USFM | Medium | Standard format but needs USFM parsing infrastructure |
| unfoldingWord | USFM | Medium | Same USFM consideration as ebible.org |
| marvel.bible | Custom modules | High | Unknown format, likely needs reverse engineering |
| Clear-Bible MACULA | XML (lowfat/nodes/TEI) + TSV | Medium | XML namespaces; TSV variant simplifies parsing |
| ETCBC/dss | Text-Fabric | High | Requires text-fabric Python package; custom ETL |
| SEDRA IV | JSON (REST API) | Low | Standard JSON response; API consumption |
| Sefaria | JSON / MongoDB dump | Medium | Bulk export is MongoDB BSON; API is clean JSON |
| CrossWire SWORD | SWORD binary → OSIS XML | Medium | Requires mod2osis CLI; then standard XML parsing |
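
The OSIS XML rows above name namespace handling and <w> element extraction as the main conversion cost. The following is a minimal sketch of that step using only Python's standard library. The sample document is hand-written in the style of morphhb rather than copied from it, and the output field names (ref, position, text, lemma, morph) are hypothetical placeholders, not fields of the scripture-text v2.0.0 schema.

```python
import xml.etree.ElementTree as ET

# Namespace declared by OSIS documents; ElementTree needs it to match tags.
OSIS_NS = {"osis": "http://www.bibletechnologies.net/2003/OSIS/namespace"}

# Hand-written sample in the style of morphhb (assumption: real files
# carry more attributes and wrapper elements than shown here).
SAMPLE = """\
<osis xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace">
  <osisText>
    <div type="book" osisID="Gen">
      <chapter osisID="Gen.1">
        <verse osisID="Gen.1.1">
          <w lemma="b/7225" morph="HR/Ncfsa">בְּ/רֵאשִׁית</w>
          <w lemma="1254 a" morph="HVqp3ms">בָּרָא</w>
        </verse>
      </chapter>
    </div>
  </osisText>
</osis>
"""


def extract_words(xml_text):
    """Return one dict per <w> element, tagged with its verse reference."""
    root = ET.fromstring(xml_text)
    rows = []
    for verse in root.iterfind(".//osis:verse", OSIS_NS):
        ref = verse.get("osisID")
        # Only direct <w> children are read; nested markup is ignored here.
        for position, w in enumerate(verse.iterfind("osis:w", OSIS_NS), start=1):
            rows.append(
                {
                    "ref": ref,
                    "position": position,
                    "text": (w.text or "").strip(),
                    "lemma": w.get("lemma"),
                    "morph": w.get("morph"),
                }
            )
    return rows


if __name__ == "__main__":
    for row in extract_words(SAMPLE):
        print(row)
```

Without the namespace mapping, a plain `.//verse` search would match nothing, which is the usual first stumbling block with OSIS files.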

Mapping to FalkorDB Graph Model

| Data Type | Source(s) | Graph Mapping |
|---|---|---|
| Morphological enrichment | STEPBible, MorphGNT, morphhb | Properties on existing :InterlinearWord nodes |
| New translations | scrollmapper, lxx-swete, ebible, berean | New :Translation:Book:Passage subgraphs |
| Cross-references | scrollmapper | :CROSS_REFERENCES edges between :Passage nodes (with votes property) |
| People | STEPBible TIPNR, viz.bible | New :Person nodes with :MENTIONED_IN edges to :Passage |
| Places | STEPBible TIPNR, viz.bible | New :Place nodes with :MENTIONED_IN edges to :Passage, geocoding properties |
| Versification | STEPBible TVTMS | :MAPS_TO edges between :Passage nodes across traditions |
| Events | viz.bible | New :Event nodes with :PARTICIPANT, :LOCATED_AT, :REFERENCED_IN edges |
| Lexicon additions | STEPBible TFLSJ, Dodson, berean | Properties/nodes enriching existing :LexiconEntry nodes |
| Syntax trees | MACULA | :SyntaxNode tree with :CHILD_OF edges, linked to :InterlinearWord |
| DSS transcriptions | ETCBC/dss | :Manuscript:Fragment:DSSWord nodes with linguistic properties |
| Aramaic lexicon | SEDRA IV, Sefaria | :AramaicLexiconEntry nodes or combined into :LexiconEntry with language: "aramaic" |
| Commentary | SWORD | :Commentary:CommentaryEntry nodes with :COMMENTS_ON edges to :Passage |
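
To illustrate the syntax-tree mapping, the sketch below flattens a nested tree into node rows and :CHILD_OF edge rows, which is the shape a graph write would need. The input is a hypothetical nested-dict structure with made-up keys (id, category, word_id, children); it is not MACULA's actual XML or TSV layout, which would first need its own parser.

```python
def flatten_tree(root):
    """Flatten a nested tree into (nodes, edges) in pre-order.

    nodes: one dict per :SyntaxNode candidate.
    edges: one dict per :CHILD_OF candidate, pointing child -> parent.
    """
    nodes, edges = [], []
    stack = [(root, None)]  # (node, parent id)
    while stack:
        node, parent_id = stack.pop()
        nodes.append(
            {
                "id": node["id"],
                "category": node["category"],
                # Only leaves carry a word id (hypothetical key), used to
                # link the node to an :InterlinearWord.
                "word_id": node.get("word_id"),
            }
        )
        if parent_id is not None:
            edges.append({"child": node["id"], "parent": parent_id})
        # Push children in reverse so they are visited left to right.
        for child in reversed(node.get("children", [])):
            stack.append((child, node["id"]))
    return nodes, edges


# Hypothetical three-level tree for demonstration only.
EXAMPLE_TREE = {
    "id": "s1",
    "category": "S",
    "children": [
        {
            "id": "np1",
            "category": "NP",
            "children": [{"id": "n1", "category": "N", "word_id": "w1"}],
        },
        {"id": "vp1", "category": "VP", "word_id": "w2"},
    ],
}

if __name__ == "__main__":
    tree_nodes, tree_edges = flatten_tree(EXAMPLE_TREE)
    print(tree_nodes)
    print(tree_edges)
```

Keeping nodes and edges as two flat lists means each can be sent as a single batch parameter, with the edge batch written after the node batch so both endpoints already exist.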

Ingest Pipeline Integration

GospeLib's ingest pipeline is a 7-stage linear Python/Click pipeline with Pydantic validation. New data sources would extend this pipeline:

| New Stage | Sources | Description |
|---|---|---|
| Morphology enrichment | STEPBible TAHOT/TAGNT, MorphGNT, morphhb | MERGE morphological properties onto existing :InterlinearWord nodes |
| Cross-reference ingestion | scrollmapper | CREATE :CROSS_REFERENCES edges from TSK data |
| People/Places ingestion | STEPBible TIPNR, viz.bible | CREATE :Person and :Place nodes with relationship edges |
| Translation ingestion (extended) | scrollmapper, lxx-swete, ebible, berean | Existing TranslationPipeline with format-specific pre-parsers |
| Versification mapping | STEPBible TVTMS | CREATE :MAPS_TO edges between verse systems |
| Syntax tree ingestion | MACULA | CREATE :SyntaxNode tree structures linked to word-level nodes |
| DSS ingestion | ETCBC/dss | CREATE :Manuscript fragments with word-level linguistic data |
| Aramaic lexicon ingestion | SEDRA IV, Sefaria | CREATE or MERGE :LexiconEntry nodes for Aramaic vocabulary |
| Commentary ingestion | SWORD | CREATE :Commentary and :CommentaryEntry nodes from OSIS XML |

Each new stage would follow the existing pipeline pattern: validate incoming records with Pydantic models, then write them in batches using UNWIND/MERGE Cypher. Because the writes use MERGE, re-running a stage is idempotent.
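
The pattern can be sketched for the cross-reference stage as follows. This is an illustration under stated assumptions, not GospeLib's actual code: a standard-library dataclass stands in for the Pydantic model so the example has no dependencies, the `id` property on :Passage and the row field names are hypothetical, and no database call is made; the function only yields the query text and parameters a graph client would receive.

```python
from dataclasses import asdict, dataclass
from itertools import islice

# One round trip writes a whole batch: UNWIND expands $rows, MERGE makes
# the write idempotent. The Passage `id` property is an assumption.
CROSS_REF_QUERY = """
UNWIND $rows AS row
MATCH (src:Passage {id: row.source_id})
MATCH (dst:Passage {id: row.target_id})
MERGE (src)-[r:CROSS_REFERENCES]->(dst)
SET r.votes = row.votes
"""


@dataclass(frozen=True)
class CrossReference:
    """Stand-in for a Pydantic model; validates on construction."""

    source_id: str
    target_id: str
    votes: int

    def __post_init__(self):
        if not self.source_id or not self.target_id:
            raise ValueError("passage ids must be non-empty")
        if isinstance(self.votes, bool) or not isinstance(self.votes, int):
            raise ValueError("votes must be an integer")


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def build_writes(raw_rows, batch_size=1000):
    """Validate every row first, then yield (query, params) per batch."""
    validated = [CrossReference(**row) for row in raw_rows]
    for chunk in chunks(validated, batch_size):
        yield CROSS_REF_QUERY, {"rows": [asdict(ref) for ref in chunk]}


if __name__ == "__main__":
    sample = [
        {"source_id": "Gen.1.1", "target_id": "John.1.1", "votes": 3},
        {"source_id": "Gen.1.1", "target_id": "Heb.11.3", "votes": 2},
        {"source_id": "Ps.33.6", "target_id": "Gen.1.1", "votes": 1},
    ]
    for query, params in build_writes(sample, batch_size=2):
        print(len(params["rows"]), "rows in batch")
```

Validating the full input before yielding any batch means a malformed row stops the stage before anything is written. Running the same batches twice leaves the graph unchanged, since MERGE matches the existing edge and SET rewrites the same value.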