This is my favorite LLM so far. It is the largest model I have managed to get working. It seems to be around 90-95% accurate when I test it on technical topics where I already know the answers. I do not know a way to verify the following; I have the standard CS books on operating systems, but this seems too specific and too recent for them. The following is a Q&A between the LLM and me:
j4k3
- Early production Intel Alder Lake P-cores had AVX512 support because of their Xeon-based architecture. The E-cores did not support these instructions. Why was this a problem, in a practical sense, for support in OS kernels? Is it a lack of CPU scheduler support for automatic thread affinity and process pinning when instruction availability is asymmetric?
Llama2-70B-Q4_1-GGML
The issue with early production Intel Alder Lake P-cores having AVX512 support while the E-cores did not is that it can lead to performance inconsistencies in certain workloads when the operating system scheduler tries to balance threads across cores. This happens because some instructions are available on one type of core but not another, leading to potential bottlenecks and reduced efficiency.
In practical terms, this means that if a thread is scheduled onto an E-core where AVX512 support isn't present, it may have to fall back to using slower non-AVX instructions or even stall until the scheduler can move it to another core with the necessary instruction set. This could result in reduced performance and increased power consumption as threads are moved around more frequently than they would be on a homogeneous system where all cores support the same instruction sets.
In terms of OS kernel support, this asymmetry between P-cores and E-cores can make it difficult for schedulers to effectively manage thread affinity and process pinning. The scheduler needs to have knowledge about which instructions are available on each core in order to make informed decisions about where to place threads for optimal performance. Without this information, the scheduler may not be able to properly balance workloads across cores or prevent unnecessary migrations between them.
To address these issues, OS kernel developers would need to implement support for asymmetric instruction availability and update their scheduling algorithms accordingly. This could involve adding new flags or attributes to represent different core types with varying instruction sets, as well as updating the scheduler's heuristics to take this information into account when making decisions about thread placement and migration.
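The placement idea in the answer above can be illustrated with a toy model, assuming each logical CPU's feature flags are already known and each thread declares the flags it needs. This is a sketch of the concept only: the function names and the 4-CPU layout are hypothetical, and it is not modeled on the Linux or Windows schedulers.

```python
from typing import Dict, List, Optional, Set


def eligible_cpus(cpu_features: Dict[int, Set[str]], required: Set[str]) -> List[int]:
    """Return sorted IDs of CPUs whose feature set contains every required flag."""
    return sorted(cpu for cpu, feats in cpu_features.items() if required <= feats)


def place_thread(cpu_features: Dict[int, Set[str]], required: Set[str],
                 load: Dict[int, int]) -> Optional[int]:
    """Pick the least-loaded eligible CPU (ties go to the lower ID), or None."""
    candidates = eligible_cpus(cpu_features, required)
    if not candidates:
        return None  # no core can execute this thread's instructions
    return min(candidates, key=lambda cpu: (load.get(cpu, 0), cpu))


if __name__ == "__main__":
    # Hypothetical 4-CPU hybrid layout: CPUs 0-1 report avx512f, CPUs 2-3 do not.
    features = {0: {"avx2", "avx512f"}, 1: {"avx2", "avx512f"}, 2: {"avx2"}, 3: {"avx2"}}
    load = {0: 3, 1: 1, 2: 0, 3: 0}
    print(place_thread(features, {"avx512f"}, load))  # 1: only the AVX-512 CPUs qualify
    print(place_thread(features, {"avx2"}, load))     # 2: an idle core without AVX-512 wins
```

The design point is that eligibility is decided before load balancing, so a thread that needs AVX-512 is never placed on a core that cannot execute those instructions, even when that core is idle.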
j4k3
- Is there a scheduler for Windows 11 or Linux that includes support for asymmetric instruction availability?
Llama2-70B-Q4_1-GGML
Yes, both Windows 11 and recent versions of the Linux kernel have implemented support for asymmetric instruction availability in their schedulers to better handle heterogeneous architectures like Intel's Alder Lake. In particular, Microsoft has introduced a new scheduler called "Windows 10/11 Thread Scheduling" (WTS) that is designed specifically for these types of processors and can take advantage of the unique capabilities offered by each core type. Similarly, Linux kernel developers have added support for asymmetric instruction availability in their schedulers through features like "CPU microcode updates", which allow the OS to query a processor's available instruction sets at runtime and make more informed decisions about thread placement as a result.
j4k3
- Is the CPU microcode the primary way each processor's available instruction set is parsed by all kernels?
Llama2-70B-Q4_1-GGML
Yes, the CPU microcode is typically used by operating system kernels to query a processor's available instruction sets at runtime and make more informed decisions about thread placement. The microcode contains information about the specific capabilities of each core in the processor, including which instruction sets are supported on that particular core. By accessing this data through appropriate APIs or interfaces provided by the OS kernel, schedulers can then use it to optimize their decision-making process and improve overall system performance.
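One way to check answers like the one above on a real machine: on x86 Linux, the kernel determines which instruction sets each CPU supports by executing the CPUID instruction, and it publishes the result per logical CPU in the `flags` line of `/proc/cpuinfo`, where AVX-512 Foundation appears as `avx512f`. A minimal sketch, assuming a Linux system with the x86 `/proc/cpuinfo` layout; the helper names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def parse_cpuinfo(text: str) -> Dict[int, Set[str]]:
    """Map each 'processor' number in /proc/cpuinfo text to its set of flags."""
    cpus: Dict[int, Set[str]] = {}
    current: Optional[int] = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines separate the per-CPU blocks
        key = key.strip()
        if key == "processor":
            current = int(value.strip())
            cpus[current] = set()
        elif key == "flags" and current is not None:
            cpus[current] = set(value.split())
    return cpus


def cpus_with_flag(cpus: Dict[int, Set[str]], flag: str) -> List[int]:
    """Return the sorted processor numbers that report the given flag."""
    return sorted(cpu for cpu, flags in cpus.items() if flag in flags)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            cpus = parse_cpuinfo(f.read())
        print("CPUs reporting avx512f:", cpus_with_flag(cpus, "avx512f"))
    except FileNotFoundError:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

Running `grep -o avx512f /proc/cpuinfo` gives a quicker yes/no answer; the per-CPU mapping is what matters on a hybrid part.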
j4k3
- If a Linux thread is isolated to an Alder Lake P-core with AVX512 support physically present on the die, but the instruction is not present in the microcode, and the user thread calls an AVX512 instruction, how would that instruction be handled? What modifications could enable a developer to call AVX512 instructions from a thread on an experimental system under research lab conditions?
Llama2-70B-Q4_1-GGML
If a Linux thread is isolated to an Alder Lake P-core with AVX512 support physically present on the die but not enabled in the microcode, and the user thread calls the AVX512 instruction, it would likely result in an illegal instruction exception or similar error. This could potentially cause the program to crash or produce incorrect results if it doesn't handle such exceptions properly.
To enable a thread to call the AVX512 instruction on an experimental system under research lab conditions, developers may need to modify both the microcode and the operating system kernel. They would first have to update the CPU microcode with appropriate patches or updates that enable support for AVX512 instructions on the specific core in question. This could involve reverse-engineering the existing microcode or working directly with Intel's development tools if available.
Once the microcode has been updated, developers would then need to modify the Linux kernel scheduler and related components (such as CPUFreq) to properly recognize and utilize the new instruction set capabilities of that core. This could involve adding support for asymmetric instruction availability in the scheduler's heuristics or updating existing APIs/interfaces used by the OS to query a processor's available instruction sets at runtime.
It is important to note that modifying microcode and operating system kernels can be complex, time-consuming tasks requiring specialized knowledge of both hardware and software development. Additionally, such modifications may not always be possible or advisable depending on factors like the specific hardware being used, licensing restrictions, etc., so it is important to carefully consider these issues before attempting any experimental changes in a research lab setting.
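On the illegal-instruction case described above: on Linux, executing an instruction the CPU does not implement raises an invalid-opcode fault, which the kernel delivers to the process as `SIGILL`; the default action for that signal terminates the process. A minimal sketch of a guard a program could apply before taking an AVX-512 code path, assuming Linux, Python 3.3+ (for `os.sched_getaffinity`) and the x86 `/proc/cpuinfo` layout: it allows the fast path only if every CPU the process may currently run on reports `avx512f`. The function names are hypothetical.

```python
import os
from typing import Dict, Iterable, Optional, Set


def load_cpu_flags(path: str = "/proc/cpuinfo") -> Dict[int, Set[str]]:
    """Read per-processor flag sets from /proc/cpuinfo (x86 Linux layout)."""
    flags: Dict[int, Set[str]] = {}
    cpu: Optional[int] = None
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            key = key.strip()
            if key == "processor":
                cpu = int(value)  # int() ignores surrounding whitespace
            elif key == "flags" and cpu is not None:
                flags[cpu] = set(value.split())
    return flags


def avx512_safe(allowed_cpus: Iterable[int], cpu_flags: Dict[int, Set[str]]) -> bool:
    """True only if every CPU in the affinity mask reports avx512f.

    If any allowed CPU lacks the flag (or is missing from cpu_flags), the
    scheduler could migrate the thread there, where an AVX-512 instruction
    would raise SIGILL.
    """
    allowed = list(allowed_cpus)
    return bool(allowed) and all("avx512f" in cpu_flags.get(cpu, set()) for cpu in allowed)


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    mask = os.sched_getaffinity(0)  # CPUs this process may currently run on
    choice = "AVX-512" if avx512_safe(mask, load_cpu_flags()) else "AVX2/scalar fallback"
    print("selected code path:", choice)
```

The check is only meaningful while the affinity mask stays as it was when checked, so it belongs after the process has been pinned, not before.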
The code running this model is specifically designed for use with these instructions. I can run larger versions than this 4-bit quantization, but they are impractically slow. This is exactly the bottleneck the AVX512 instructions were designed to help alleviate. I have no problem setting up the cgroups isolation and CPU affinity required to use the instructions safely, with all other threads confined to the E-cores. I suspect AVX512 is fused off on my processor, but I can't confirm this yet.
https://github.com/zingaburga/alderlake_avx512/wiki#enabling-avx-512-on-alder-lake
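A minimal sketch of the affinity half of the pinning described above, assuming Linux and Python's `os.sched_setaffinity`/`os.sched_getaffinity`. It parses a kernel-style CPU list such as `0-11` (the same format used by `/sys/devices/system/cpu/online` and by the cgroup v2 `cpuset.cpus` file) and restricts the current process to those CPUs. Which logical CPU numbers belong to P-cores varies by system, so `0-11` below is a placeholder; creating the cpuset cgroup that keeps other threads on the E-cores is a separate step not shown here. The helper names are hypothetical.

```python
import os
from typing import Set


def parse_cpulist(spec: str) -> Set[int]:
    """Parse a kernel cpulist string such as '0-3,8,10-11' into CPU numbers."""
    cpus: Set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> Set[int]:
    """Restrict the caller to the CPUs in spec and return the new mask.

    The request is intersected with the current mask, so this only narrows
    affinity; it raises ValueError if none of the requested CPUs are usable.
    """
    wanted = parse_cpulist(spec)
    usable = wanted & os.sched_getaffinity(0)
    if not usable:
        raise ValueError(f"none of CPUs {sorted(wanted)} are available to this process")
    os.sched_setaffinity(0, usable)  # pid 0 means the caller
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Placeholder list: substitute the logical CPU numbers of your P-cores.
    try:
        print("now restricted to CPUs:", sorted(pin_current_process("0-11")))
    except ValueError as err:
        print("could not pin:", err)
```

Intersecting with the current mask is a deliberate choice: a process already confined by a cgroup cannot accidentally widen itself onto CPUs it was meant to avoid.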