Overview
AI Data Hubs support a range of document types for building comprehensive knowledge bases. Knowing which formats are supported helps you organize and structure your data effectively and anticipate how each document type counts toward billing.
Learning Objectives:
Identify supported file formats for upload
Understand how webpages are treated as documents
Learn website import capabilities and limitations
Understand billing implications for different document types
Plan data hub organization based on document type support
Supported Document Types
Files
AI Data Hubs support the following file formats:
Text and Data Files
CSV (Comma-Separated Values) - Structured data and spreadsheets
TXT (Plain Text) - Simple text documents without formatting
MD (Markdown) - Formatted documentation with headers and styling
JSON (JavaScript Object Notation) - Structured data and API responses
Microsoft Office Documents
DOCX (Word Document) - Modern Word files (.docx)
DOC (Word Document) - Legacy Word files (.doc)
XLSX (Excel Spreadsheet) - Modern Excel files (.xlsx)
XLS (Excel Spreadsheet) - Legacy Excel files (.xls)
PPTX (PowerPoint Presentation) - Modern PowerPoint files (.pptx)
PPT (PowerPoint Presentation) - Legacy PowerPoint files (.ppt)
Web and Code Files
HTML (Hypertext Markup Language) - Web pages and formatted content
CSS (Cascading Style Sheets) - Stylesheets and design documentation
JS (JavaScript) - Code files and scripts
PDF (Portable Document Format) - Universal document format with complex layouts
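Before a bulk upload, it can help to screen filenames against the format list above. The following is a minimal sketch that assumes support is determined by file extension alone; the grouping keys and the helper names (`classify_file`, `split_by_support`) are hypothetical and are not part of any AI Data Hubs API.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

# Extensions taken from the format list above.
# The group names are illustrative labels, not product terminology.
SUPPORTED_EXTENSIONS = {
    "text_and_data": {".csv", ".txt", ".md", ".json"},
    "microsoft_office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "web_and_code": {".html", ".css", ".js"},
    "pdf": {".pdf"},
}


def classify_file(filename: str) -> Optional[str]:
    """Return the format group for a filename, or None if the extension is not listed."""
    ext = Path(filename).suffix.lower()
    for group, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return group
    return None


def split_by_support(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split filenames into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for name in filenames:
        (supported if classify_file(name) else unsupported).append(name)
    return supported, unsupported
```

For example, `split_by_support(["notes.md", "diagram.png"])` would place `notes.md` in the supported list and `diagram.png` in the unsupported list, flagging the latter for conversion before upload.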
Webpages
Webpages are treated as individual documents within data hubs:
Single webpage import - Store individual URLs as separate documents
Content extraction - System extracts text content from the page
Knowledge base creation - Archive important web resources for offline access
Search integration - Webpage content becomes searchable within your data hub
Result: Webpage content is indexed and available as a searchable document alongside your other files.
Tip: Use webpage imports to capture reference documentation, articles, and online resources that support your knowledge base.
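To give a rough idea of what "extracting text content" from a page can mean, here is a small sketch built on Python's standard `html.parser` module. It is an illustration only: the actual extraction logic used by AI Data Hubs is not described in this article and may differ substantially. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text of an HTML string as a single space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Running `extract_text` on a page gives a quick preview of which text is likely to be searchable, which can be useful when a page relies heavily on scripts to render its content.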
Websites
Websites consist of multiple webpages that can be imported:
Website mode import - Import entire website content as a single unit
Individual pages - Each webpage is stored as a separate document within the website collection
Crawling limitations - Some dynamic or protected content may not be accessible
Organization - Website content is grouped together for easy management
Result: Complete website content becomes part of your data hub structure.
Note: Website import requires the website to be publicly accessible. The import also respects the site's crawling permissions (robots.txt), so pages disallowed there are not imported.
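If you want to check in advance whether a site's robots.txt is likely to block a page, Python's standard `urllib.robotparser` module can evaluate the rules. The sketch below parses robots.txt text you have already obtained, so it makes no network request. The example rules, the URLs, and the `can_crawl` helper are assumptions for illustration; the user agent that AI Data Hubs actually crawls with is not stated in this article, so the wildcard `*` is used.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for an example site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for page in ("https://example.com/docs/intro",
                 "https://example.com/private/report"):
        print(page, can_crawl(EXAMPLE_ROBOTS, page))
```

A page that fails this check is a good candidate for downloading manually and uploading as a file instead.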
Billing Considerations
Understanding how different document types impact your billing is critical for cost management.
Document Count Billing (Files and Webpages)
These count toward your monthly document limit:
Individual files - Each uploaded file counts as one document
Webpages - Each imported URL counts as one document
High-volume impact - Large quantities of files can quickly increase document count
Example: Uploading 50 PDFs + 30 CSV files = 80 documents toward your limit
Website Count Billing (Websites)
Website mode has separate billing:
Website as single unit - Entire website import counts as one website (not multiple documents)
Doesn't affect document limit - Website imports don't add to your monthly document count
Separate website limit - Websites have their own monthly limit based on your plan
Example: Importing a 100-page website in website mode = 1 unit toward website limit, 0 documents toward document limit
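The two counting rules above can be expressed as a small estimator for planning purposes. This is a sketch of the arithmetic only, assuming the rules exactly as described in this section; the `Usage` class and `estimate_usage` function are hypothetical, and actual limits depend on your plan.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    documents: int = 0  # counts toward the monthly document limit
    websites: int = 0   # counts toward the separate website limit


def estimate_usage(files: int = 0, webpages: int = 0, website_imports: int = 0) -> Usage:
    """Estimate usage from planned imports.

    Each file and each individually imported webpage counts as one document.
    Each website-mode import counts as one website unit, regardless of how
    many pages the site contains, and adds nothing to the document count.
    """
    return Usage(documents=files + webpages, websites=website_imports)


if __name__ == "__main__":
    # 50 PDFs + 30 CSV files, as in the file example above
    print(estimate_usage(files=50 + 30))
    # One 100-page site imported in website mode
    print(estimate_usage(website_imports=1))
```

Comparing the result against your plan's limits before a large import shows whether website mode or individual webpage imports is the better fit.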
Best Practices
Use website mode for large sites: When archiving an entire documentation site or knowledge base, use website import to avoid consuming document limits.
Upload files for controlled content: For documents you edit and version, use file uploads to maintain control over content.
Import select webpages: For a few important articles, import as webpages rather than full websites to organize them alongside files.
Plan around limits: Track your document and website counts in workspace settings to avoid blocked uploads or unexpected overage charges.
Organize by document type: Create separate data hub collections for files vs. web content to maintain clean organization.
Test website imports: Before importing large websites, test with a small subsection to verify content extraction quality.
Use supported formats: Convert documents to supported formats (e.g., DOCX, PDF) before upload to ensure proper indexing.
Monitor billing dashboard: Regularly check your workspace usage statistics to understand consumption patterns.
Common Questions
Q: Are files and webpages counted the same for billing?
A: Yes. Each individual file and each individual webpage import counts as one document toward your monthly document limit.
Q: What's the difference between webpage and website imports?
A: Webpage imports add one document per URL to your document count. Website import adds one website unit (containing many pages) to a separate website limit, not your document count.
Q: Do deleted documents or websites count toward billing?
A: No. Deleted content is removed from your active count. Only currently stored documents and websites count toward monthly limits.
Q: What happens if I exceed my document limit?
A: Most plans prevent uploads beyond the limit. Contact your administrator to upgrade your plan or archive unused content.
Q: Are there file size limits for uploads?
A: Yes. Individual files have size limits (typically 50 MB to 100 MB, depending on plan). Large PDFs or presentations that exceed the limit may need to be split into smaller files.
Q: Can I import password-protected websites?
A: No. The system can only access publicly available content. For secured resources, download the content and upload as files instead.
Q: Do HTML files import differently than webpages?
A: Yes. HTML files are treated as static documents. Webpage imports actively fetch and extract content from live URLs, which may change over time.
