Documents
Learn how to interact with and manage documents using the Weav.ai platform. This guide covers uploading, retrieving, and analyzing documents, along with related operations.
Overview
The Documents section provides functionalities to upload, manage, and extract insights from documents on the Weav.ai platform. This guide walks you through key operations such as creating documents, retrieving metadata, downloading forms, generating summaries, and analyzing page-level data. These tools are designed to streamline document workflows and improve efficiency.
Prerequisite - To get started, ensure your environment is properly configured by following the Setup Guide.
Create document
Uploads a document from your local system to the copilot. The document can optionally be uploaded into a specific folder.
python3 documents/documents/create_document.py --file_path "AAPL_10Q.pdf"
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
file_path | The file path to your document on your local system | Required |
folder_id | The ID of the folder that should hold the document. If it is not provided, the file will be uploaded but will not be inside any folder. | Optional |
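When many files need to be uploaded, the command above can be assembled programmatically. The following is a minimal sketch, not part of the Weav.ai scripts: `build_create_command` is a hypothetical helper, and it assumes the optional folder is passed through a `--folder_id` flag that mirrors the parameter name in the table (this guide only shows `--file_path` on the command line).

```python
from typing import List, Optional


def build_create_command(file_path: str, folder_id: Optional[str] = None) -> List[str]:
    """Build the argument list for create_document.py (hypothetical helper)."""
    command = [
        "python3",
        "documents/documents/create_document.py",
        "--file_path",
        file_path,
    ]
    # Assumption: the optional folder is passed as --folder_id,
    # mirroring the parameter name in the table above.
    if folder_id:
        command += ["--folder_id", folder_id]
    return command


print(build_create_command("AAPL_10Q.pdf"))
```

The returned list can be handed to `subprocess.run(command, check=True)` from the repository root, which avoids shell quoting issues with file names that contain spaces.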
Response:
Upon uploading, the document response should be as follows.
{
"ai_tags":[
],
"category":"",
"created_at":datetime.datetime(2024, 10, 3, 22, 14, 10, tzinfo=TzInfo(UTC)),
"download_url":"/doc-proc-service/local_store/google-oauth2|117349365869611297391/66ff1732927ce8c0ebda42bd/66ff1732927ce8c0ebda42bd",
"file_name":"AAPL_10Q.pdf",
"form_instances":"None",
"id":"66ff1732927ce8c0ebda42bd",
"in_folders":[
],
"media_type":"application/pdf",
"pages":[
],
"redacted_summary":"",
"size":654929,
"source":"application",
"status":"NEW",
"step_status":{
"FORM_EXTRACTION":{
"error":"",
"modified_at":datetime.datetime(2024, 10, 3, 22, 14, 10, tzinfo=TzInfo(UTC)),
"response":{
},
"status":"NOT_STARTED"
}
},
"summary":"",
"summary_status":"",
"tags":[
],
"tenant_id":"",
"user_id":"google-oauth2|117349365869611297391"
}
After the document has been uploaded, it goes through the process_document_sensors workflow, which may take a while.
Once processing is completed, the Get Document API returns the same document with some of the fields above updated using information extracted from the document.
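Because processing takes a while, a caller typically has to poll until the document is ready. The sketch below shows one way to do that. It assumes that the `AI_READY` status seen in the sample Get Document response marks the end of processing, and `fetch_document` is a hypothetical callable (for example, a thin wrapper around the Get Document script) that returns the document as a dictionary.

```python
import time
from typing import Any, Callable, Dict


def wait_until_ready(
    fetch_document: Callable[[str], Dict[str, Any]],
    document_id: str,
    poll_seconds: float = 5.0,
    max_attempts: int = 60,
) -> Dict[str, Any]:
    """Poll until the document status is AI_READY, then return the document.

    fetch_document is a hypothetical callable supplied by the caller.
    Assumption: "AI_READY" means basic processing has finished.
    """
    for attempt in range(max_attempts):
        document = fetch_document(document_id)
        if document.get("status") == "AI_READY":
            return document
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(poll_seconds)
    raise TimeoutError(
        f"Document {document_id} was not ready after {max_attempts} attempts"
    )
```

Bounding the number of attempts keeps a stuck workflow from blocking the caller indefinitely.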
Get Document
Retrieves information about an uploaded document. Some fields in this response may be empty initially and are filled in once basic processing is completed.
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
document_id | The unique identifier of the document | Required |
fill_pages | If false, the pages field in the response is left empty | Optional (default: False) |
Response:
{
"id":"66ff1732927ce8c0ebda42bd",
"media_type":"application/pdf",
"download_url":"/doc-proc-service/local_store/google-oauth2|117349365869611297391/66ff1732927ce8c0ebda42bd/66ff1732927ce8c0ebda42bd",
"pages":[
],
"status":"AI_READY",
"file_name":"AAPL_10Q.pdf",
"created_at":"",
"size":654929,
"source":"application",
"category":"UNITED STATES SECURITIES AND EXCHANGE COMMISSION",
"summary":"",
"redacted_summary":"",
"summary_status":"",
"step_status":{
"FORM_EXTRACTION":{
"status":"DONE",
"modified_at":"",
"error":"",
"response":{
"name":"New",
"category":"UNITED STATES SECURITIES AND EXCHANGE COMMISSION",
"description":"A Test form",
"fields":[
{
"identifier":"4b68933c-2432-4784-8b01-37a1803b72a0",
"name":"new",
"field_type":"Number",
"description":"",
"is_array":true,
"fill_by_search":false,
"value":[
"0.00001",
"1.375",
.
.
],
"weav_page_number":[
0,
0,
0,
0,
0,
.
.
]
}
],
"is_shared":false,
"is_searchable":false,
"_id":"66ff076db1d0dfb13c99760f",
"user_id":"google-oauth2|117349365869611297391",
"created_at":"2024-10-03T21:06:53Z",
"form_id":"66ff076db1d0dfb13c99760f"
}
}
},
"in_folders":[
],
"tags":[
],
"ai_tags":[
],
"user_id":"google-oauth2|117349365869611297391",
"tenant_id":"",
"form_instances":"None"
}
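The extracted form values sit fairly deep inside the response, under `step_status.FORM_EXTRACTION.response.fields`. As a rough sketch, assuming the response has already been loaded into a Python dictionary with the shape shown above, a hypothetical helper such as `extracted_form_fields` can collect them into a simple name-to-value mapping.

```python
from typing import Any, Dict


def extracted_form_fields(document: Dict[str, Any]) -> Dict[str, Any]:
    """Map each extracted form field name to its value(s).

    Returns an empty dict unless the FORM_EXTRACTION step reports DONE.
    Assumes the response shape shown in the sample above.
    """
    step = document.get("step_status", {}).get("FORM_EXTRACTION", {})
    if step.get("status") != "DONE":
        return {}
    fields = step.get("response", {}).get("fields", [])
    return {field["name"]: field.get("value") for field in fields}
```

Checking the step status first avoids reading a partially populated response while the document is still being processed.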
Download form instance
Downloads the form extracted from a document.
python3 documents/documents/download_form_instance.py --download_format "JSON" --document_id "66ff1732927ce8c0ebda42bd"
Parameters:
Parameter | Description | Required/Optional | Allowed values |
---|---|---|---|
document_id | The unique identifier of the document | Required | |
download_format | The format in which the results are returned | Optional (default: "JSON") | JSON, CSV |
Response:
--download_format = "JSON"
{
"doc_id":"66ff1732927ce8c0ebda42bd",
"form_id":"66ff076db1d0dfb13c99760f",
"new":[
"0.00001",
"1.375",
"0.000",
.
.
.
]
}
--download_format = "CSV"
doc_id form_id new
0 66ff1732927ce8c0ebda42bd 66ff076db1d0dfb13c99760f ['0.00001', '1.375', '0.000', '0.875', '1.625'...
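The CSV output above keeps the whole list of values in a single cell. If one row per value is easier to work with downstream, the JSON form can be flattened instead. The following sketch uses only the Python standard library; `form_instance_to_csv` is a hypothetical helper, and it assumes that every key other than `doc_id` and `form_id` is a form field name, as `new` is in the sample.

```python
import csv
import io
from typing import Any, Dict


def form_instance_to_csv(instance: Dict[str, Any]) -> str:
    """Write one CSV row per extracted value: doc_id, form_id, field, value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "form_id", "field", "value"])
    for key, value in instance.items():
        # Assumption: all keys except the two identifiers are form fields.
        if key in ("doc_id", "form_id"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            writer.writerow(
                [instance.get("doc_id"), instance.get("form_id"), key, item]
            )
    return buffer.getvalue()
```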
Get document categories
Retrieves a list of all categories present on the copilot, considering categories from all documents.
python3 documents/documents/get_document_categories.py
Response:
{
"categories":[
"ANNUAL REPORT",
"SECURITIES AND EXCHANGE COMMISSION",
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION"
]
}
Get document tags
Retrieves a list of all tags present on the copilot, considering tags from all documents.
python3 documents/documents/get_document_tags.py
Response:
{'tags': [['apple']]}
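Note that the sample response nests the tags one level deep (a list of lists). A small sketch of flattening it into a sorted list of unique tags follows; `flatten_tags` is a hypothetical helper, and the assumption that each inner element is a list of strings is based only on the single sample above.

```python
from typing import Any, Dict, List


def flatten_tags(response: Dict[str, Any]) -> List[str]:
    """Flatten the nested tag lists into a sorted list of unique tags."""
    unique = set()
    for group in response.get("tags", []):
        # Accept both a nested list (as in the sample) and a bare string.
        if isinstance(group, str):
            unique.add(group)
        else:
            unique.update(group)
    return sorted(unique)
```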
Get document page level status
Retrieves the number of pages on which each workflow step has succeeded or failed.
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
document_id | The unique identifier of the document | Required |
Response:
{
"classification":{
"pages_done":25,
"pages_failed":0
},
"entity_extraction":{
"pages_done":25,
"pages_failed":0
},
"ocr":{
"pages_done":25,
"pages_failed":0
},
"vectorization":{
"pages_done":25,
"pages_failed":0
}
}
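A response like this is usually checked for failures before moving on to page-level calls. The sketch below assumes the response has been loaded into a dictionary with the shape shown above; `failed_steps` and `all_pages_done` are hypothetical helpers, and the expected page count is supplied by the caller because the response itself does not state it.

```python
from typing import Dict, List


def failed_steps(page_status: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the names of workflow steps with at least one failed page."""
    return sorted(
        step
        for step, counts in page_status.items()
        if counts.get("pages_failed", 0) > 0
    )


def all_pages_done(page_status: Dict[str, Dict[str, int]], total_pages: int) -> bool:
    """True when every step reports total_pages done and none failed."""
    return all(
        counts.get("pages_done", 0) == total_pages
        and counts.get("pages_failed", 0) == 0
        for counts in page_status.values()
    )
```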
Get page text and words
Retrieves information about words and text in a single page of the document.
python3 documents/documents/get_page_text_and_words.py --document_id 66f9ccbb927ce8c0ebda4261 --page_number 1
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
document_id | The unique identifier of the document | Required |
page_number | The page number for which the information is required | Required |
Response:
{
"page_number":1,
"media_type":"NONE",
"page_text":"9/6/23, 9:57 AM\naapl-20230701\nIndicate by check mark whether the Registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the Registrant was required to submit such files).\nYes :selected: :unselected: No\nIndicate by check mark whether the Registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of \"large accelerated filer,\" \"accelerated filer,\" \"smaller reporting company,\" and \"emerging growth company\" in Rule 12b-2 of the Exchange Act.\nLarge accelerated filer :selected:\nAccelerated filer :unselected:\nNon-accelerated filer :unselected:\nSmaller reporting company :unselected:\nEmerging growth company :unselected:\nIf an emerging growth company, indicate by check mark if the Registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. :unselected:\nIndicate by check mark whether the Registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). :unselected: Yes :selected: No\n15,634,232,000 shares of common stock were issued and outstanding as of July 21, 2023.\nhttps://www.sec.gov/Archives/edgar/data/320193/000032019323000077/aapl-20230701.htm\n2/31",
"status":"NONE",
"classification":{
"page_class":"Company Regulatory Compliance",
"page_sections":[
"Interactive Data File Submission",
"Company Classification",
"Emerging Growth Company Status",
"Shell Company Status",
"Common Stock Issuance"
],
"page_no":2
},
"extracted_entities":[
{
"entity_group":"default",
"entities":[
{
"polygon":[
],
"key":"Date",
"value":"9/6/23, 9:57 AM",
"label":"Document Date",
"is_sensitive":false
},
{
"polygon":[
],
"key":"Document ID",
"value":"aapl-20230701",
"label":"Document Identifier",
"is_sensitive":false
},
.
.
.
]
}
],
"redacted_summary":"",
"words":[
{
"content":"9/6/23,",
"polygon":[
{
"x":71.0,
"y":42.0
},
.
.
],
"span":{
"offset":0,
"length":7
},
"confidence":0.994
},
.
.
.
]
}
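In the sample above, the first word has a span with offset 0 and length 7, which matches the first seven characters of `page_text` ("9/6/23,"). Assuming spans index into `page_text` this way for every word, and that `confidence` is a recognition score between 0 and 1, the sketch below lists the words that fall under a chosen confidence threshold. `low_confidence_words` is a hypothetical helper.

```python
from typing import Any, Dict, List


def low_confidence_words(page: Dict[str, Any], threshold: float = 0.9) -> List[str]:
    """Return the text of words whose confidence is below the threshold."""
    text = page.get("page_text", "")
    flagged = []
    for word in page.get("words", []):
        if word.get("confidence", 1.0) < threshold:
            span = word["span"]
            # Assumption: span offset and length index into page_text.
            start = span["offset"]
            flagged.append(text[start:start + span["length"]])
    return flagged
```

Slicing `page_text` by span, rather than reading `content`, also serves as a quick check that the offsets line up with the text as expected.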
Get Page
Retrieves all the information for a single page of a document.
python3 documents/documents/get_page.py --document_id 66f9ccbb927ce8c0ebda4261 --page_number 1
Parameters:
Parameter | Description | Required/Optional | Allowed values |
---|---|---|---|
document_id | The unique identifier of the document | Required | |
page_number | The page number for which the information is required | Required | |
bounding_boxes | Whether to include bounding-box polygon information in the response | Optional | false, f, False, true, t, True |
Response:
{
"classification":{
"page_class":"Company Regulatory Compliance",
"page_no":2,
"page_sections":[
"Interactive Data File Submission",
"Company Classification",
"Emerging Growth Company Status",
"Shell Company Status",
"Common Stock Issuance"
]
},
"download_url":"/doc-proc-service/local_store/google-oauth2|117349365869611297391/66ff1732927ce8c0ebda42bd/1.jpg",
"extracted_entities":[
{
"entities":[
{
"is_sensitive":false,
"key":"Date",
"label":"Document Date",
"polygon":[
[
[
{
"x":71.0,
"y":42.0
},
{
"x":135.0,
"y":43.0
},
{
"x":136.0,
"y":66.0
},
{
"x":71.0,
"y":66.0
}
]
],
[
[
{
"x":140.0,
"y":43.0
},
{
"x":180.0,
"y":43.0
},
{
"x":180.0,
"y":66.0
},
{
"x":140.0,
"y":66.0
}
]
],
[
[
{
"x":185.0,
"y":43.0
},
{
"x":212.0,
"y":42.0
},
{
"x":212.0,
"y":66.0
},
{
"x":185.0,
"y":66.0
}
]
]
],
"value":"9/6/23, 9:57 AM"
},
.
.
.
],
"entity_group":"default"
}
],
"media_type":"image/jpeg",
"page_hierarchy":"None",
"page_number":1,
"page_text":"Indicate by check mark whether the Registrant has submitted electronically every Interactive Data File required to be ....",
"redacted_summary":"The document, identified as 'aapl-20230701', was ... ",
"sensitive_words":[
],
"status":"VECTORIZATION_DONE",
"step_status":{
"OCR":{
"error":"",
"modified_at":datetime.datetime(2024, 10, 3, 22, 14, 48, tzinfo=TzInfo(UTC)),
"response":{
},
"status":"DONE"
},
"classification":{
"error":"",
"modified_at":datetime.datetime(2024, 10, 3, 22, 15, 33, tzinfo=TzInfo(UTC)),
"response":{
},
"status":"DONE"
},
"entity_extraction":{
"error":"",
"modified_at":datetime.datetime(2024, 10, 3, 22, 16, 47, tzinfo=TzInfo(UTC)),
"response":{
},
"status":"DONE"
},
"vectorization":{
"error":"",
"modified_at":datetime.datetime(2024, 10, 3, 22, 17, 42, tzinfo=TzInfo(UTC)),
"response":{
},
"status":"DONE"
}
},
"summary":"The document, identified as 'aapl-20230701', was submitted on..."
}
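In the sample above, an entity's `polygon` is a nested list with one group of four corner points per word. To highlight the whole entity on the page image, a single enclosing rectangle is often enough. The following sketch computes it; `iter_points` and `enclosing_box` are hypothetical helpers, and the recursive walk is used so that both the nested shape shown here and a flat list of points are handled.

```python
from typing import Any, Iterator, Optional, Tuple


def iter_points(polygon: Any) -> Iterator[Tuple[float, float]]:
    """Yield (x, y) pairs from an arbitrarily nested polygon list."""
    if isinstance(polygon, dict):
        yield polygon["x"], polygon["y"]
    elif isinstance(polygon, list):
        for item in polygon:
            yield from iter_points(item)


def enclosing_box(polygon: Any) -> Optional[Tuple[float, float, float, float]]:
    """Return (min_x, min_y, max_x, max_y), or None if there are no points."""
    points = list(iter_points(polygon))
    if not points:
        return None
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)
```

For the "Date" entity in the sample response, this yields the rectangle from (71.0, 42.0) to (212.0, 66.0).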
Trigger document summary
Requests the copilot to generate a summary for a document. The response changes based on the state of the summarization workflow. Once the summarization is completed, the script returns the summary.
python3 documents/documents/trigger_document_summary.py --document_id 66ff1732927ce8c0ebda42bd
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
document_id | The unique identifier of the document | Required |
Response:
{
"summary_status":"PROCESSING",
"summary":"",
"redacted_summary":""
}
{
"summary_status":"READY",
"summary":"Apple Inc.'s Q3 2023 ...",
"redacted_summary":"Apple Inc.'s Q3 2023 ..."
}
Get document summary status
Once the document summary has been triggered, this script helps check the status of summarization for the document.
python3 documents/documents/get_document_summary_status.py --document_id 66fe1b65b1d0dfb13c9975f042.0
Parameters:
Parameter | Description | Required/Optional |
---|---|---|
document_id | The unique identifier of the document | Required |
Response:
If the script returns the following:
{'message': 'Summarization not triggered'}
Run the Trigger document summary script to start summarization.
Once summarization has been triggered, the same script should return:
{'redacted_summary': '', 'summary': '', 'summary_status': 'PROCESSING'}
Once processing is completed, the script should return:
{
"summary_status":"READY",
"summary":"Apple Inc.'s Q3 2023 ...",
"redacted_summary":"Apple Inc.'s Q3 2023 ..."
}
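The three responses above suggest a simple loop: trigger summarization if it has not been triggered, wait while it is `PROCESSING`, and return the summary once it is `READY`. The sketch below illustrates that flow under stated assumptions: `get_status` and `trigger` are hypothetical callables (for example, wrappers around the two scripts) that take a document ID, and `get_status` returns the dictionaries shown above.

```python
import time
from typing import Any, Callable, Dict


def wait_for_summary(
    get_status: Callable[[str], Dict[str, Any]],
    trigger: Callable[[str], Any],
    document_id: str,
    poll_seconds: float = 5.0,
    max_attempts: int = 60,
) -> str:
    """Trigger summarization if needed, then poll until the summary is READY.

    get_status and trigger are hypothetical callables supplied by the caller.
    """
    for attempt in range(max_attempts):
        status = get_status(document_id)
        if status.get("message") == "Summarization not triggered":
            trigger(document_id)
        elif status.get("summary_status") == "READY":
            return status["summary"]
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(poll_seconds)
    raise TimeoutError(
        f"Summary for {document_id} was not ready after {max_attempts} attempts"
    )
```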