Duplicate Record Match

The Duplicate Record Match demo resides in the L2/demos/text/dup_match directory. It implements duplicate record matching and includes modules such as Index, Predicate, Pair, Score, and Cluster.

Dataset

Input file: a randomly generated CSV file of 10,000,000 lines (about 1 GB), similar to L2/demos/text/dup_match/data/test.csv, is used as the test input.
Demo execution time: 8,215.56 s.
Baseline (Dedupe Python: https://github.com/dedupeio/dedupe) execution time: 35,030.751 s.
Acceleration ratio: 5.1X
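
As a quick cross-check, the ratio can be recomputed from the two execution times listed above. The sketch below uses only those two figures, and the variable names are illustrative. With these inputs the quotient comes out near 4.26, which is lower than the reported 5.1X, so the reported ratio may rest on a different measurement than the times shown here.

```shell
# Execution times in seconds, copied from the figures above.
DEMO_TIME=8215.56
BASELINE_TIME=35030.751

# Speedup = baseline time / accelerated time.
RATIO=$(awk -v b="$BASELINE_TIME" -v d="$DEMO_TIME" 'BEGIN { printf "%.2f", b / d }')
echo "speedup: ${RATIO}X"
```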

Note

1. The baseline version was run on an Intel(R) Xeon(R) CPU E5-2690 v4 clocked at 2.60 GHz.

2. The training result of the baseline includes self.predicate=((TfidfNGramCanopyPredicate: (0.8, Site name), TfidfTextCanopyPredicate: (0.8, Address)), (SimplePredicate: (alphaNumericPredicate, Site name), TfidfTextCanopyPredicate: (0.8, Site name)), (SimplePredicate: (wholeFieldPredicate, Site name), SimplePredicate: (wholeFieldPredicate, Zip))).

Executable Usage

Work Directory (Step 1)

The steps for library download and environment setup can be found in the Vitis Data Analytics Library documentation. To get the design, change to the demo directory:

cd L2/demos/text/dup_match

Build kernel (Step 2)

Run the following make command to build the XCLBIN and host binary for a specific device. Note that this process takes a long time, possibly a couple of hours.

make run TARGET=hw DEVICE=xilinx_u50_gen3x16_xdma_201920_3 HOST_ARCH=x86

Run kernel (Step 3)

To get the benchmark results, run the following command.

./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/host.exe -xclbin ./build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/TGP_Kernel.xclbin -in ./data/test.csv -golden ./data/golden.txt

Duplicate Record Match Input Arguments:

Usage: host.exe -xclbin <xclbin_name> -in <input data>  -golden <golden data>
       -xclbin:     the kernel name
       -in    :     input data
       -golden:     golden data
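
Because the build directory name repeats the device name, the run command can be assembled from a single variable. This is a minimal sketch: the build_dir.hw.<device> pattern is inferred from the paths shown in this document, and the variable names are illustrative. The command is printed rather than executed so the snippet is safe to try before the build has finished.

```shell
# Assumption: the same DEVICE value that was passed to make in Step 2.
DEVICE=xilinx_u50_gen3x16_xdma_201920_3

# Directory name pattern inferred from the paths used in this document.
BUILD_DIR=./build_dir.hw.${DEVICE}

# Assemble the documented command line; it is printed here, not run.
CMD="${BUILD_DIR}/host.exe -xclbin ${BUILD_DIR}/TGP_Kernel.xclbin -in ./data/test.csv -golden ./data/golden.txt"
echo "$CMD"
```

To launch the demo, run the printed line from the L2/demos/text/dup_match directory.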

Example output (Step 4)

---------------------Duplicate Record Matching Flow-----------------
DupMatch::run...
TwoGramPredicate: column map size=14
threshold=1000
tf_value_ size is 238, index count=14, term count=122, skip=0
config=15, 316
config=15, 301
Found Platform
Platform Name: Xilinx
Found Device=xilinx_u50_gen3x16_xdma_201920_3
INFO: Importing build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/TGP_Kernel.xclbin
Loading: 'build_dir.hw.xilinx_u50_gen3x16_xdma_201920_3/TGP_Kernel.xclbin'
kernel has been created
kernel start------
threshold=1000
index count=11, term count=65, skip=0
threshold=1000
index count=14, term count=36, skip=0
CompoundPredicate: pair size=30
CompoundPredicate: pair size=30
CompoundPredicate: pair size=36
duplicate sets 10
DupMatch::run End
Execution time 8.979s
Pass validation.

------------------------------------------------------------
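
A run can be checked automatically by scanning its output for the lines shown above. The sketch below is one way to do that with standard tools. It embeds a few lines of the sample output so that it runs on its own, and the variable names are illustrative. Only the presence of "Pass validation." is treated as success; the wording of a failing run is not documented here, so it is not matched.

```shell
# Sample lines copied from the example output above. In practice, capture
# the output of host.exe into LOG instead.
LOG=$(cat <<'EOF'
duplicate sets 10
DupMatch::run End
Execution time 8.979s
Pass validation.
EOF
)

# Extract the number from the "Execution time 8.979s" line.
EXEC_TIME=$(printf '%s\n' "$LOG" | sed -n 's/^Execution time \([0-9.]*\)s$/\1/p')

# Extract the number of duplicate sets reported by the run.
DUP_SETS=$(printf '%s\n' "$LOG" | sed -n 's/^duplicate sets \([0-9]*\)$/\1/p')

# Success means the "Pass validation." line is present.
if printf '%s\n' "$LOG" | grep -q '^Pass validation\.$'; then
  STATUS=pass
else
  STATUS=fail
fi

echo "status=$STATUS time=${EXEC_TIME}s duplicate_sets=$DUP_SETS"
```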

Profiling

The Duplicate Record Match design is validated on an Alveo U50 board at 270 MHz. The hardware resource utilization is listed in the following table.

Table 1 Hardware resources for duplicate record match
Name	LUT	BRAM	URAM	DSP
Platform	135778	180	0	4
TGP_Kernel	272031	50	260	506
TGP_Kernel_1	135974	25	130	253
TGP_Kernel_2	136057	25	130	253
User Budget	734238	1164	640	5936
Used Resources	272031	50	260	506
Percentage	37.05%	4.30%	40.63%	8.52%
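
The Percentage row is the Used Resources row divided by the User Budget row. The sketch below recomputes it from the table values; the array names are illustrative. The last digit can differ by one from the table where a value falls exactly on a rounding boundary, depending on how the printf implementation rounds ties.

```shell
# Values copied from the "Used Resources" and "User Budget" rows of Table 1.
PCT=$(awk 'BEGIN {
  split("LUT BRAM URAM DSP", name, " ")
  split("272031 50 260 506", used, " ")
  split("734238 1164 640 5936", budget, " ")
  # Utilization in percent = 100 * used / budget, for each resource type.
  for (i = 1; i <= 4; i++)
    printf "%s %.2f%%\n", name[i], 100 * used[i] / budget[i]
}')
echo "$PCT"
```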

The performance is as follows. The input is a randomly generated CSV file of 10,000,000 lines (about 1 GB), similar to L2/demos/text/dup_match/data/test.csv. Its execution time is 8,215.56 s, which corresponds to a throughput of about 124.64 KB/s (1,024 MB / 8,215.56 s ≈ 0.1246 MB/s).
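
The throughput figure follows from dividing the input size by the execution time. Below is a minimal sketch of that arithmetic, assuming "about 1 GB" means 1024 MB (the variable names are illustrative). With the quoted 8,215.56 s, this gives roughly 0.1246 MB/s, that is, about 124.64 KB/s.

```shell
# Figures from the text above. Assumption: "about 1 GB" is taken as 1024 MB.
SIZE_MB=1024
TIME_S=8215.56

# Throughput = input size / execution time.
MBPS=$(awk -v s="$SIZE_MB" -v t="$TIME_S" 'BEGIN { printf "%.4f", s / t }')
KBPS=$(awk -v s="$SIZE_MB" -v t="$TIME_S" 'BEGIN { printf "%.2f", 1000 * s / t }')
echo "throughput: ${MBPS} MB/s (${KBPS} KB/s)"
```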