honestly, I didn’t check what approach you are using for negacyclic convolution (i.e., polynomial multiplication mod X^N + 1). However, what I have seen so far, is to make a negacyclic extension of the polynomials and then run the trad FFT.
Let me introduce my proposal, which does not waste resources: namely, it does not calculate any zero and/or conjugated values, which occur in the trad FFT image of a negacyclic extension. This way, the transform can be calculated about 4x faster. You can find my paper at IACR ePrint https://eprint.iacr.org/2021/480 and as it will be presented at CSCML '21, you are also warmly welcome to attend my talk!