You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 10 Next »

Current state: "Under Discussion"

ISSUE: #6383

PRs:

Keywords: delete

Released: 

Summary

This document describes how to support delete in Milvus. Milvus provides a new delete API to delete entities from a collection.

Motivation

In some scenarios, users want to delete some entities from a collection which will no longer be searched out. Currently, users can only manually filter out unwanted entities from search results. We hope to implement a new function that allows users to delete entities from a collection.

Public Interfaces

def delete(self, condition=None)->MutationResult:
    """
    Delete entities by primary keys.
    Example: client.delete("_id in [1,10,100]")
    
    :param condition: an expression indicates whether an entity should be deleted
    :type  condition: str
    """

Design Details

In Milvus, Proxy maintains 2 Pulsar channels for each collection:

  1. Insert channel, handle following msg types:
    1. DDL msg (CreateCollection/DropCollection/CreatePartition/DropPartition)
    2. InsertMsg
    3. DeleteMsg
  2. Search channel, handle following msg types:
    1. SearchMsg
    2. RetrieveMsg

DataNode consumes messages from Insert channel only.

QueryNode consumes messages from both Insert channel and Search channel.

To support delete, we will send DeleteMsg into Insert channel also.


Since Milvus's storage is an append-only, `delete` function is implemented using soft delete, setting a flag on entity to indicate this entity has been deleted.

This solution needs:

  • record deletion offset in Milvus
  • let algorithm library support to search with a bitset

Now the algorithm library `Knowhere` has already supported to search with a bitset which indicates whether an entity is deleted. So we discuss how to store the deleted primary keys here.

Proposal delete operation persistent

  1. DataNode subscribe the insert channel
  2. Proxy receives a delete request, split into insert channels by primary keys
  3. DataNode receives a delete request from the insert channel, save it in buffer, and write it into the delta channel
  4. ...
  5. DataNode receives a flush request, write out the deletions saved above
  6. DataNode notifies IndexNode to building indexes
  7. finish

Proposal delete operation serving search(sealed+growing)

  1. QueryNode subscribe the insert channel
  2. QueryNode load the checkpoint and recovery by the checkpoint
  3. Proxy receives a delete request, split into insert channels by primary keys
  4. QueryNode retrieves a delete request from the insert channel, judges the segment to which each deletion belongs, and updates the Inverted Delta Logs(IDL)
  5. ...
  6. QueryNode retrieves a search request, search on each segment
  7. finish

Proposal delete operation serving search(sealed only)

  1. QueryNode subscribe the delta channel
  2. QueryNode load the checkpoint and recovery by the checkpoint
  3. Proxy receives a delete request, split into insert channels
  4. DataNode filter out all delete requests, and write them into the delta channel
  5. QueryNode retrieves delete requests from the delta channel, judges the segment to which each deletion belongs, and update the Inverted Delta Logs(IDL)
  6. ...
  7. QueryNode retrieve a search request, search on each segment
  8. finish

Process delete operation in system recovery

Unaffected

SegmentFilter

SegmentFilter provide a method that can used to get the segment id which a PK possible existed in. Implement by segments statistics and bloomfilter.

DeltaLog

DeltaLog is the persistent file, recording the primary keys deleted and the delete timestamp. Each DeltaLog only belongs to a segment.

InvertedDeltaLog

InvertedDeltaLog provides a method that can be used to get deletion that meets the timestamp condition fastly.

Test Plan

Testcase1

Search a deleted entity, except not in the resultset

client.insert()
client.search()
client.delete()
client.search()

Rejected Alternatives[WIP]

  1. Without any channel changes, the delete requests will be forward into the insert channels by Proxy
    1. Since the balance policy of QueryNode, some QueryNodes will only serve the sealed segments. They should discard all insert data from this channel. It cause a lot of performance waste.
  2. Add a delta channel for all shards, and all the delete requests will be forward into this delta channel by Proxy
    1. DataNode should subscribe insert channel and delta channel, each time to consume data need to align timetick between these two channel, complicatedly
  3. Add a delta channel for each shards, and the delete requests will be forward into this channel and the origin insert channel concurrently. DataNode subscribe the insert channel only, and the QueryNode(serving sealed segments) subscribe the delta channel
    1. It's difficult to sync the insert channel and delta channel

References


  • No labels