Discussion:
[RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Jiang Liu
2014-07-11 07:37:17 UTC
Previously we posted a patch to fix a crash caused by memoryless
nodes on x86 platforms; please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425

As suggested by David Rientjes, the most suitable fix for the issue
is to use cpu_to_mem() rather than cpu_to_node() in the callers.
This patchset implements David's suggestion.

Patches 1-26 prepare for enabling memoryless nodes on x86 platforms by
replacing cpu_to_node()/numa_node_id() with cpu_to_mem()/numa_mem_id().
Patches 27-29 enable support for memoryless nodes on x86 platforms.
Patch 30 tunes the order in which a NUMA node is onlined during CPU
hot-addition.

This patchset fixes the issue mentioned by Mike Galbraith that CPUs
are associated with the wrong node after memory is added to a
memoryless node.

With memoryless node support enabled, the kernel correctly reports the
system hardware topology for nodes without memory installed.
***@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
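
To make the topology above concrete, here is a minimal userspace sketch
(not kernel code) of what "nearest node with memory" could mean on this
machine. The node sizes and distances are copied from the output above.
The helper name nearest_node_with_memory() and the lowest-node-ID
tie-break are assumptions made for illustration only; the kernel derives
its fallback from its own zonelist ordering, which may pick differently.

```c
#include <assert.h>

#define NR_NODES 4

/* Node sizes (MB) and distances copied from the numactl output above. */
static const unsigned long node_size_mb[NR_NODES] = { 15725, 15862, 0, 0 };
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 21, 21 },
	{ 21, 10, 21, 21 },
	{ 21, 21, 10, 21 },
	{ 21, 21, 21, 10 },
};

/*
 * Hypothetical helper: return the closest node that has memory, or -1
 * if no node has any. Ties go to the lowest node ID, which is an
 * assumption of this sketch and not necessarily what the kernel does.
 */
static int nearest_node_with_memory(int node)
{
	int best = -1;

	assert(node >= 0 && node < NR_NODES);
	for (int n = 0; n < NR_NODES; n++) {
		if (node_size_mb[n] == 0)
			continue;	/* memoryless, cannot satisfy allocations */
		if (best < 0 || node_distance[node][n] < node_distance[node][best])
			best = n;
	}
	return best;
}
```

Under these assumptions, nodes 0 and 1 resolve to themselves, while the
memoryless nodes 2 and 3 both fall back to node 0.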

With memoryless node support enabled, CPUs are correctly associated with
node 2 after memory hot-addition to node 2.
***@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
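
The effect of the hot-addition shown above can be modelled in a few
lines of userspace C. This is only a sketch under stated assumptions:
cpu_mem_node[] stands in for the per-CPU memory-node cache that the
patch titled "Update _mem_id_[] for every possible CPU when memory
configuration changes" refers to, the CPU numbering formula is read off
the listing above, and every model_* name is hypothetical. Because all
remote distances in the output are equal, the fallback here is simply
the lowest-numbered node with memory.

```c
#include <assert.h>

#define NR_NODES 4
#define NR_CPUS  120

/* Node sizes (MB) before the hot-addition, from the first listing. */
static unsigned long node_size_mb[NR_NODES] = { 15725, 15862, 0, 0 };

/* Stand-in for a cached "node to allocate from" value per CPU. */
static int cpu_mem_node[NR_CPUS];

/* In the listing, each node owns 15 CPUs plus 15 siblings numbered +60. */
static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu % 60) / 15;
}

/* Own node if it has memory, else the lowest-numbered node that does. */
static int model_fallback_node(int node)
{
	assert(node >= 0 && node < NR_NODES);
	if (node_size_mb[node] > 0)
		return node;
	for (int n = 0; n < NR_NODES; n++)
		if (node_size_mb[n] > 0)
			return n;
	return -1;
}

/* Refresh the cached value for every CPU, not only the online ones. */
static void model_update_all_cpus(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_mem_node[cpu] = model_fallback_node(model_cpu_to_node(cpu));
}

/* Memory arrives on a node: grow it, then refresh every CPU's mapping. */
static void model_hot_add_memory(int node, unsigned long mb)
{
	assert(node >= 0 && node < NR_NODES);
	node_size_mb[node] += mb;
	model_update_all_cpus();
}
```

Calling model_update_all_cpus() first and then
model_hot_add_memory(2, 128) moves CPUs 30-44 and 90-104 from the
fallback node back to node 2, while the CPUs of node 3 keep falling back.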

The patchset is based on the latest mainline kernel and has been
tested on a 4-socket Intel platform with CPU/memory hot-addition
capability.

Any comments are welcome!

Jiang Liu (30):
mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node
mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
x86, numa: Kill useless code to improve code readability
mm: Update _mem_id_[] for every possible CPU when memory
configuration changes
mm, x86: Enable memoryless node support to better support CPU/memory
hotplug
x86, NUMA: Online node earlier when doing CPU hot-addition

arch/x86/Kconfig | 3 ++
arch/x86/kernel/acpi/boot.c | 6 ++-
arch/x86/kernel/apic/io_apic.c | 10 ++---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
arch/x86/kernel/cpu/perf_event_amd_uncore.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 +--
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/irq_32.c | 4 +-
arch/x86/kernel/smpboot.c | 2 +
arch/x86/kvm/vmx.c | 2 +-
arch/x86/mm/numa.c | 52 +++++++++++++++++--------
arch/x86/platform/uv/tlb_uv.c | 2 +-
arch/x86/platform/uv/uv_nmi.c | 3 +-
arch/x86/platform/uv/uv_time.c | 2 +-
drivers/char/mspec.c | 2 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 4 +-
drivers/infiniband/hw/qib/qib_init.c | 2 +-
drivers/irqchip/irq-clps711x.c | 2 +-
drivers/irqchip/irq-gic.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 4 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 +-
drivers/of/base.c | 2 +-
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
drivers/scsi/fcoe/fcoe.c | 2 +-
drivers/thermal/intel_powerclamp.c | 4 +-
include/linux/gfp.h | 6 +--
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/ring_buffer.c | 2 +-
kernel/rcu/rcutorture.c | 2 +-
kernel/sched/core.c | 8 ++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/rt.c | 6 +--
kernel/smp.c | 2 +-
kernel/smpboot.c | 2 +-
kernel/taskstats.c | 2 +-
kernel/timer.c | 2 +-
kernel/trace/ring_buffer.c | 12 +++---
kernel/trace/trace_uprobe.c | 2 +-
mm/huge_memory.c | 6 +--
mm/memcontrol.c | 2 +-
mm/memory.c | 2 +-
mm/page_alloc.c | 10 ++---
mm/percpu-vm.c | 2 +-
mm/vmalloc.c | 2 +-
net/core/dev.c | 6 +--
net/core/flow.c | 2 +-
net/core/pktgen.c | 10 ++---
net/core/sysctl_net_core.c | 2 +-
net/netfilter/x_tables.c | 8 ++--
net/xfrm/xfrm_ipcomp.c | 2 +-
58 files changed, 139 insertions(+), 111 deletions(-)
--
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-hotplug" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jiang Liu
2014-07-11 07:37:18 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
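
The relationship described above can be sketched as a small userspace
model. It is not the kernel's implementation: the MODEL_* switch, the
model_* function names and the two lookup tables are all assumptions
made for illustration, with nodes 2 and 3 taken to be memoryless.

```c
#include <assert.h>

#define NR_CPUS 4

/* Set to 0 to model a build without CONFIG_HAVE_MEMORYLESS_NODES. */
#define MODEL_HAVE_MEMORYLESS_NODES 1

/* Hypothetical tables: one CPU per node, nodes 2 and 3 have no memory. */
static const int model_cpu_node[NR_CPUS]    = { 0, 1, 2, 3 };
static const int model_cpu_memnode[NR_CPUS] = { 0, 1, 0, 0 };

/* Node the CPU belongs to, whether or not that node has memory. */
static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return model_cpu_node[cpu];
}

#if MODEL_HAVE_MEMORYLESS_NODES
/* Nearest node with memory; may differ from the CPU's own node. */
static int model_cpu_to_mem(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return model_cpu_memnode[cpu];
}
#else
/* Without memoryless node support the two lookups are identical. */
static int model_cpu_to_mem(int cpu)
{
	return model_cpu_to_node(cpu);
}
#endif
```

With the switch set to 1, the CPU on node 2 reports node 2 from
model_cpu_to_node() but node 0 from model_cpu_to_mem(), which is the
value a node-aware allocation call would need.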

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
kernel/rcu/rcutorture.c | 2 +-
kernel/smp.c | 2 +-
kernel/smpboot.c | 2 +-
kernel/taskstats.c | 2 +-
kernel/timer.c | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7fa34f86e5ba..f593762d3214 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1209,7 +1209,7 @@ static int rcutorture_booster_init(int cpu)
mutex_lock(&boost_mutex);
VERBOSE_TOROUT_STRING("Creating rcu_torture_boost task");
boost_tasks[cpu] = kthread_create_on_node(rcu_torture_boost, NULL,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"rcu_torture_boost");
if (IS_ERR(boost_tasks[cpu])) {
retval = PTR_ERR(boost_tasks[cpu]);
diff --git a/kernel/smp.c b/kernel/smp.c
index 80c33f8de14f..2f3b84aef159 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -41,7 +41,7 @@ hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
- cpu_to_node(cpu)))
+ cpu_to_mem(cpu)))
return notifier_from_errno(-ENOMEM);
cfd->csd = alloc_percpu(struct call_single_data);
if (!cfd->csd) {
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index eb89e1807408..9c08e68e48a9 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -171,7 +171,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;

- td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
+ td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_mem(cpu));
if (!td)
return -ENOMEM;
td->cpu = cpu;
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 13d2f7cd65db..cf5cba1e7fbe 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -304,7 +304,7 @@ static int add_del_listener(pid_t pid, const struct cpumask *mask, int isadd)
if (isadd == REGISTER) {
for_each_cpu(cpu, mask) {
s = kmalloc_node(sizeof(struct listener),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!s) {
ret = -ENOMEM;
goto cleanup;
diff --git a/kernel/timer.c b/kernel/timer.c
index 3bb01a323b2a..5831a38b5681 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1546,7 +1546,7 @@ static int init_timers_cpu(int cpu)
* The APs use this path later in boot
*/
base = kzalloc_node(sizeof(*base), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!base)
return -ENOMEM;
--
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Paul E. McKenney
2014-07-11 15:14:05 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
For the rcutorture piece:

Acked-by: Paul E. McKenney <***@linux.vnet.ibm.com>

Or if you separate the kernel/rcu/rcutorture.c portion into a separate
patch, I will queue it separately.

Thanx, Paul
Nishanth Aravamudan
2014-07-21 17:15:27 UTC
Hi Paul,
Post by Paul E. McKenney
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
Or if you separate the kernel/rcu/rcutorture.c portion into a separate
patch, I will queue it separately.
Just FYI, based upon a separate discussion with Tejun and others, the
preference seems to be to avoid blindly proliferating cpu_to_mem()
throughout the kernel. For kthread_create_on_node(), I'm going to try
to fix the underlying issue, so you, as the caller, should still
specify the NUMA node you are running the kthread on (cpu_to_node),
not where you expect the memory to come from (cpu_to_mem).
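
As a rough illustration of that division of responsibility, here is a
userspace sketch in which the caller passes the node the thread runs on
and the helper itself decides where memory comes from. Everything named
model_* is hypothetical, the "lowest node with memory" fallback is an
assumption, and this is not a description of the actual fix to
kthread_create_on_node().

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 4

/* Assumed topology: nodes 2 and 3 are memoryless. */
static const int node_has_memory[NR_NODES] = { 1, 1, 0, 0 };

struct model_thread {
	int run_node;	/* node the caller asked the thread to run on */
	int mem_node;	/* node the helper chose for its allocations */
};

/* Own node if it has memory, else the lowest-numbered node that does. */
static int model_local_memory_node(int node)
{
	if (node_has_memory[node])
		return node;
	for (int n = 0; n < NR_NODES; n++)
		if (node_has_memory[n])
			return n;
	return -1;
}

/*
 * The caller states where the thread runs; the memory fallback is
 * handled here so that callers never need to know about it.
 */
static struct model_thread *model_thread_create_on_node(int run_node)
{
	struct model_thread *t;

	assert(run_node >= 0 && run_node < NR_NODES);
	t = malloc(sizeof(*t));
	if (!t)
		return NULL;
	t->run_node = run_node;
	t->mem_node = model_local_memory_node(run_node);
	return t;
}
```

A caller asking for node 2 gets a thread that still records node 2 as
its home, while its memory is taken from node 0.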

Thanks,
Nish

Paul E. McKenney
2014-07-21 17:33:42 UTC
Post by Nishanth Aravamudan
Hi Paul,
Post by Paul E. McKenney
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
Or if you separate the kernel/rcu/rcutorture.c portion into a separate
patch, I will queue it separately.
Just FYI, based upon a separate discussion with Tejun and others, it
seems to be preferred to avoid the proliferation of cpu_to_mem
throughout the kernel blindly. For kthread_create_on_node(), I'm going
to try and fix the underlying issue and so you, as the caller, should
still specify the NUMA node you are running the kthread on
(cpu_to_node), not where you expect the memory to come from
(cpu_to_mem).
Even better!!! ;-)

Thanx, Paul

Jens Axboe
2014-07-12 12:32:01 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
I think blk-mq requires some of the same help, as do other places in the
block layer. I'll take a look at that.

As for your smp.c bits here:

Acked-by: Jens Axboe <***@fb.com>
--
Jens Axboe

Jiang Liu
2014-07-11 07:37:19 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
kernel/sched/core.c | 8 ++++----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 6 +++---
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b494fe..27e3af246310 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5743,7 +5743,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
continue;

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));

if (!sg)
goto fail;
@@ -6397,14 +6397,14 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
struct sched_group_capacity *sgc;

sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sd)
return -ENOMEM;

*per_cpu_ptr(sdd->sd, j) = sd;

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sg)
return -ENOMEM;

@@ -6413,7 +6413,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sg, j) = sg;

sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sgc)
return -ENOMEM;

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fc4f98b1258f..95104d363a8c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1559,7 +1559,7 @@ void init_sched_dl_class(void)

for_each_possible_cpu(i)
zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
}

#endif /* CONFIG_SMP */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d3335e1f..26e75b8a52e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7611,12 +7611,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!cfs_rq)
goto err;

se = kzalloc_node(sizeof(struct sched_entity),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!se)
goto err_free_rq;

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a49083192c64..88d1315c6223 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -184,12 +184,12 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)

for_each_possible_cpu(i) {
rt_rq = kzalloc_node(sizeof(struct rt_rq),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!rt_rq)
goto err;

rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!rt_se)
goto err_free_rq;

@@ -1945,7 +1945,7 @@ void __init init_sched_rt_class(void)

for_each_possible_cpu(i) {
zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
}
}
#endif /* CONFIG_SMP */
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:20 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
net/core/dev.c | 6 +++---
net/core/flow.c | 2 +-
net/core/pktgen.c | 10 +++++-----
net/core/sysctl_net_core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 30eedf677913..e4c1e84374b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1910,7 +1910,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,

/* Need to allocate new map to store queue on this CPU's map */
new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!new_map)
return NULL;

@@ -1973,8 +1973,8 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
map->queues[map->len++] = index;
#ifdef CONFIG_NUMA
if (numa_node_id == -2)
- numa_node_id = cpu_to_node(cpu);
- else if (numa_node_id != cpu_to_node(cpu))
+ numa_node_id = cpu_to_mem(cpu);
+ else if (numa_node_id != cpu_to_mem(cpu))
numa_node_id = -1;
#endif
} else if (dev_maps) {
diff --git a/net/core/flow.c b/net/core/flow.c
index a0348fde1fdf..4139dbb50cc0 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -396,7 +396,7 @@ static int flow_cache_cpu_prepare(struct flow_cache *fc, int cpu)
size_t sz = sizeof(struct hlist_head) * flow_cache_hash_size(fc);

if (!fcp->hash_table) {
- fcp->hash_table = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
+ fcp->hash_table = kzalloc_node(sz, GFP_KERNEL, cpu_to_mem(cpu));
if (!fcp->hash_table) {
pr_err("NET: failed to allocate flow cache sz %zu\n", sz);
return -ENOMEM;
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fc17a9d309ac..45d18f88dce4 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2653,7 +2653,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
(datalen/frags) : PAGE_SIZE;
while (datalen > 0) {
if (unlikely(!pkt_dev->page)) {
- int node = numa_node_id();
+ int node = numa_mem_id();

if (pkt_dev->node >= 0 && (pkt_dev->flags & F_NODE))
node = pkt_dev->node;
@@ -2698,7 +2698,7 @@ static struct sk_buff *pktgen_alloc_skb(struct net_device *dev,
pkt_dev->pkt_overhead;

if (pkt_dev->flags & F_NODE) {
- int node = pkt_dev->node >= 0 ? pkt_dev->node : numa_node_id();
+ int node = pkt_dev->node >= 0 ? pkt_dev->node : numa_mem_id();

skb = __alloc_skb(NET_SKB_PAD + size, GFP_NOWAIT, 0, node);
if (likely(skb)) {
@@ -3533,7 +3533,7 @@ static int pktgen_add_device(struct pktgen_thread *t, const char *ifname)
{
struct pktgen_dev *pkt_dev;
int err;
- int node = cpu_to_node(t->cpu);
+ int node = cpu_to_mem(t->cpu);

/* We don't allow a device to be on several threads */

@@ -3621,7 +3621,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn)
struct task_struct *p;

t = kzalloc_node(sizeof(struct pktgen_thread), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!t) {
pr_err("ERROR: out of memory, can't create new thread\n");
return -ENOMEM;
@@ -3637,7 +3637,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn)

p = kthread_create_on_node(pktgen_thread_worker,
t,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kpktgend_%d", cpu);
if (IS_ERR(p)) {
pr_err("kernel_thread() failed for cpu %d\n", t->cpu);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..1375447b833e 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -123,7 +123,7 @@ static int flow_limit_cpu_sysctl(struct ctl_table *table, int write,
kfree(cur);
} else if (!cur && cpumask_test_cpu(i, mask)) {
cur = kzalloc_node(len, GFP_KERNEL,
- cpu_to_node(i));
+ cpu_to_mem(i));
if (!cur) {
/* not unwinding previous changes */
ret = -ENOMEM;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:22 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/ring_buffer.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 97b67df8fbfe..09f470a9262e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -77,7 +77,7 @@ static int alloc_callchain_buffers(void)

for_each_possible_cpu(cpu) {
entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!entries->cpu_entries[cpu])
goto fail;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a33d9a2bcbd7..bb1a5f326309 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7911,7 +7911,7 @@ static void perf_event_init_cpu(int cpu)
if (swhash->hlist_refcount > 0) {
struct swevent_hlist *hlist;

- hlist = kzalloc_node(sizeof(*hlist), GFP_KERNEL, cpu_to_node(cpu));
+ hlist = kzalloc_node(sizeof(*hlist), GFP_KERNEL, cpu_to_mem(cpu));
WARN_ON(!hlist);
rcu_assign_pointer(swhash->swevent_hlist, hlist);
}
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1d2..22128f58aa0b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -265,7 +265,7 @@ static void *perf_mmap_alloc_page(int cpu)
struct page *page;
int node;

- node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+ node = (cpu == -1) ? NUMA_NO_NODE : cpu_to_mem(cpu);
page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NULL;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:23 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
kernel/trace/ring_buffer.c | 12 ++++++------
kernel/trace/trace_uprobe.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7c56c3d06943..38c51583f968 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1124,13 +1124,13 @@ static int __rb_allocate_pages(int nr_pages, struct list_head *pages, int cpu)
*/
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL | __GFP_NORETRY,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!bpage)
goto free_pages;

list_add(&bpage->list, pages);

- page = alloc_pages_node(cpu_to_node(cpu),
+ page = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
if (!page)
goto free_pages;
@@ -1183,7 +1183,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
int ret;

cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!cpu_buffer)
return NULL;

@@ -1198,14 +1198,14 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
init_waitqueue_head(&cpu_buffer->irq_work.waiters);

bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!bpage)
goto fail_free_buffer;

rb_check_bpage(cpu_buffer, bpage);

cpu_buffer->reader_page = bpage;
- page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+ page = alloc_pages_node(cpu_to_mem(cpu), GFP_KERNEL, 0);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -4378,7 +4378,7 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
struct buffer_data_page *bpage;
struct page *page;

- page = alloc_pages_node(cpu_to_node(cpu),
+ page = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
if (!page)
return NULL;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 3c9b97e6b1f4..e585fb67472b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -692,7 +692,7 @@ static int uprobe_buffer_init(void)
return -ENOMEM;

for_each_possible_cpu(cpu) {
- struct page *p = alloc_pages_node(cpu_to_node(cpu),
+ struct page *p = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL, 0);
if (p == NULL) {
err_cpu = cpu;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:21 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node ID. So use cpu_to_mem()/numa_mem_id() instead to get the nearest
node with memory for the given or current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
net/netfilter/x_tables.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 227aa11e8409..6e7d4bc81422 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -692,10 +692,10 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (size <= PAGE_SIZE)
newinfo->entries[cpu] = kmalloc_node(size,
GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
else
newinfo->entries[cpu] = vmalloc_node(size,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));

if (newinfo->entries[cpu] == NULL) {
xt_free_table_info(newinfo);
@@ -801,10 +801,10 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
for_each_possible_cpu(cpu) {
if (size > PAGE_SIZE)
i->jumpstack[cpu] = vmalloc_node(size,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
else
i->jumpstack[cpu] = kmalloc_node(size,
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (i->jumpstack[cpu] == NULL)
/*
* Freeing will be done later on by the callers. The
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:24 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
include/linux/gfp.h | 6 +++---
mm/memory.c | 2 +-
mm/percpu-vm.c | 2 +-
mm/vmalloc.c | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6eb1fb37de9a..56dd2043f510 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -314,7 +314,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
{
/* Unknown node is current node */
if (nid < 0)
- nid = numa_node_id();
+ nid = numa_mem_id();

return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
@@ -340,13 +340,13 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
int node);
#else
#define alloc_pages(gfp_mask, order) \
- alloc_pages_node(numa_node_id(), gfp_mask, order)
+ alloc_pages_node(numa_mem_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
- alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+ alloc_pages_vma(gfp_mask, 0, vma, addr, numa_mem_id())
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node)

diff --git a/mm/memory.c b/mm/memory.c
index d67fd9fcf1f2..f434d2692f70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3074,7 +3074,7 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
get_page(page);

count_vm_numa_event(NUMA_HINT_FAULTS);
- if (page_nid == numa_node_id()) {
+ if (page_nid == numa_mem_id()) {
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
*flags |= TNF_FAULT_LOCAL;
}
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 3707c71ae4cd..a20b8f7d0dd0 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -115,7 +115,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];

- *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ *pagep = alloc_pages_node(cpu_to_mem(cpu), gfp, 0);
if (!*pagep) {
pcpu_free_pages(chunk, pages, populated,
page_start, page_end);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f64632b67196..c06f90641916 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -800,7 +800,7 @@ static struct vmap_block *new_vmap_block(gfp_t gfp_mask)
unsigned long vb_idx;
int node, err;

- node = numa_node_id();
+ node = numa_mem_id();

vb = kmalloc_node(sizeof(struct vmap_block),
gfp_mask & GFP_RECLAIM_MASK, node);
--
1.7.10.4
Christoph Lameter
2014-07-11 13:51:10 UTC
Permalink
Post by Jiang Liu
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
Reviewed-by: Christoph Lameter <***@linux.com>

Tejun Heo
2014-07-11 14:42:05 UTC
Permalink
Hello,
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
The patch itself looks okay to me but is this the right way to handle
this? Can't we just let the allocators fall back to the nearest node
with memory? Why do we need to impose this awareness of memory-less
node on all the users?

Thanks.
--
tejun

Christoph Lameter
2014-07-11 15:13:57 UTC
Permalink
Post by Tejun Heo
Hello,
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
The patch itself looks okay to me but is this the right way to handle
this? Can't we just let the allocators fall back to the nearest node
with memory? Why do we need to impose this awareness of memory-less
node on all the users?
Allocators typically fall back but they won't in some cases if you say
that you want memory from a particular node. A GFP_THISNODE would force a
failure of the alloc. In other cases it should fall back. I am not sure
that all allocations obey these conventions though.
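
As a rough illustration of the difference, the following userspace sketch
models a page allocation with and without a strict-node flag. It is not the
kernel's page allocator: sim_alloc_page(), the free-page counters, and the
fallback lists are all hypothetical, and the this_node_only parameter only
mimics the effect described above for GFP_THISNODE.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_NODES 4

/* Hypothetical free-page counters; nodes 2 and 3 are memoryless. */
static int sim_free_pages[SIM_NR_NODES] = { 2, 1, 0, 0 };

/* Hypothetical per-node fallback lists, preferred node first. */
static const int sim_fallback[SIM_NR_NODES][SIM_NR_NODES] = {
    { 0, 1, 2, 3 },
    { 1, 0, 2, 3 },
    { 2, 0, 1, 3 },
    { 3, 1, 0, 2 },
};

/*
 * Take one page, preferring 'node'. Returns the node the page came from,
 * or -1 on failure. With this_node_only set (the GFP_THISNODE analogue),
 * no other node is tried.
 */
static int sim_alloc_page(int node, bool this_node_only)
{
    assert(node >= 0 && node < SIM_NR_NODES);

    for (int i = 0; i < SIM_NR_NODES; i++) {
        int candidate = sim_fallback[node][i];

        if (sim_free_pages[candidate] > 0) {
            sim_free_pages[candidate]--;
            return candidate;
        }
        if (this_node_only)
            break;  /* strict placement: give up after the first node */
    }
    return -1;
}
```

In this model a strict request on the memoryless node 2 always fails, while
the same request without the flag is served from node 0, the first entry in
node 2's fallback list that still has free pages.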


Tejun Heo
2014-07-11 15:21:56 UTC
Permalink
Hello,
Post by Christoph Lameter
Allocators typically fall back but they won't in some cases if you say
that you want memory from a particular node. A GFP_THISNODE would force a
failure of the alloc. In other cases it should fall back. I am not sure
that all allocations obey these conventions though.
But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
nearest node with memory fallback. Is there any case where the user
would actually want to always fail if it's on the memless node?

Even if that's the case, there's no reason to burden everyone with
this distinction. Most users just wanna say "I'm on this node.
Please allocate considering that". There's nothing wrong with using
numa_node_id() for that.

Thanks.
--
tejun

Tejun Heo
2014-07-11 15:33:02 UTC
Permalink
Post by Tejun Heo
Even if that's the case, there's no reason to burden everyone with
this distinction. Most users just wanna say "I'm on this node.
Please allocate considering that". There's nothing wrong with using
numa_node_id() for that.
Also, this is minor but don't we also lose fallback information by
doing this from the caller? Please consider the following topology
where each hop is the same distance.

A - B - X - C - D

Where X is the memless node. numa_mem_id() on X would return either B
or C, right? If B or C can't satisfy the allocation, the allocator
would fall back to A from B and to D from C, both of which aren't optimal.
It should first fall back to C or B respectively, which the allocator
can't do anymore because the information is lost when the caller side
performs numa_mem_id().
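
To make the topology argument concrete, here is a minimal userspace sketch,
not kernel code. It assumes five nodes on a line (A=0, B=1, X=2, C=3, D=4),
a distance equal to the hop count, and ties broken toward the lower node id.
build_fallback_order() and nearest_memory_node() are hypothetical helpers
invented for illustration; the latter stands in for what numa_mem_id() would
report on X.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 5            /* A=0, B=1, X=2, C=3, D=4 */

/* Hypothetical topology: only X (node 2) has no memory. */
static const int node_has_memory[NR_NODES] = { 1, 1, 0, 1, 1 };

/* Each hop costs the same, so distance is just the hop count on the line. */
static int node_distance(int a, int b)
{
    return abs(a - b);
}

/*
 * Fill order[] with the memory nodes sorted by distance from 'origin'
 * (ties go to the lower node id) and return how many were written.
 */
static int build_fallback_order(int origin, int order[NR_NODES])
{
    int n = 0;

    assert(origin >= 0 && origin < NR_NODES);

    for (int node = 0; node < NR_NODES; node++) {
        if (!node_has_memory[node])
            continue;
        /* Insertion sort keeps the list ordered and stable. */
        int pos = n++;
        while (pos > 0 &&
               node_distance(origin, order[pos - 1]) >
               node_distance(origin, node)) {
            order[pos] = order[pos - 1];
            pos--;
        }
        order[pos] = node;
    }
    return n;
}

/* Stand-in for numa_mem_id() on 'origin': the nearest node with memory. */
static int nearest_memory_node(int origin)
{
    int order[NR_NODES];
    int n = build_fallback_order(origin, order);

    return n > 0 ? order[0] : -1;
}
```

Starting from X the order is B, C, A, D. Starting from B, which is what the
allocator sees once the caller has substituted the nearest memory node, the
order is B, A, C, D, so A is tried before the nearer C.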

Seems pretty misguided to me.

Thanks.
--
tejun

Christoph Lameter
2014-07-11 15:55:59 UTC
Permalink
Post by Tejun Heo
Post by Tejun Heo
Even if that's the case, there's no reason to burden everyone with
this distinction. Most users just wanna say "I'm on this node.
Please allocate considering that". There's nothing wrong with using
numa_node_id() for that.
Also, this is minor but don't we also lose fallback information by
doing this from the caller? Please consider the following topology
where each hop is the same distance.
A - B - X - C - D
Where X is the memless node. numa_mem_id() on X would return either B
or C, right? If B or C can't satisfy the allocation, the allocator
would fall back to A from B and to D from C, both of which aren't optimal.
It should first fall back to C or B respectively, which the allocator
can't do anymore because the information is lost when the caller side
performs numa_mem_id().
True, but the advantage is that numa_mem_id() allows the use of a
consistent sort of "local" node, which increases allocator performance due
to the ability to cache objects from that node.
Post by Tejun Heo
Seems pretty misguided to me.
IMHO the whole concept of a memoryless node looks pretty misguided to me.

Tejun Heo
2014-07-11 15:58:38 UTC
Permalink
Post by Christoph Lameter
Post by Tejun Heo
Where X is the memless node. numa_mem_id() on X would return either B
or C, right? If B or C can't satisfy the allocation, the allocator
would fall back to A from B and to D from C, both of which aren't optimal.
It should first fall back to C or B respectively, which the allocator
can't do anymore because the information is lost when the caller side
performs numa_mem_id().
True, but the advantage is that numa_mem_id() allows the use of a
consistent sort of "local" node, which increases allocator performance due
to the ability to cache objects from that node.
But the allocator can do the mapping the same. I really don't see why
we'd push the distinction to the individual users.

Thanks.
--
tejun

Christoph Lameter
2014-07-11 16:04:05 UTC
Permalink
Post by Tejun Heo
Post by Christoph Lameter
Post by Tejun Heo
Where X is the memless node. numa_mem_id() on X would return either B
or C, right? If B or C can't satisfy the allocation, the allocator
would fall back to A from B and to D from C, both of which aren't optimal.
It should first fall back to C or B respectively, which the allocator
can't do anymore because the information is lost when the caller side
performs numa_mem_id().
True, but the advantage is that numa_mem_id() allows the use of a
consistent sort of "local" node, which increases allocator performance due
to the ability to cache objects from that node.
But the allocator can do the mapping the same. I really don't see why
we'd push the distinction to the individual users.
The "users" (I guess you mean general kernel code/drivers) can use various
memory allocators which will do the right thing internally regarding
GFP_THISNODE. They do not need to worry too much about this unless there
are reasons beyond optimizing NUMA placement to need memory from a
particular node (e.g. a device that requires memory from a NUMA node that
is local to the PCI bus where the hardware resides).

Christoph Lameter
2014-07-11 15:58:52 UTC
Permalink
Post by Tejun Heo
Hello,
Post by Christoph Lameter
Allocators typically fall back but they won't in some cases if you say
that you want memory from a particular node. A GFP_THISNODE would force a
failure of the alloc. In other cases it should fall back. I am not sure
that all allocations obey these conventions though.
But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
nearest node with memory fallback. Is there any case where the user
would actually want to always fail if it's on the memless node?
GFP_THISNODE allocations must fail if there is no memory available on
the node. No fallback allowed.

If the allocator performs caching for a particular node (like SLAB) then
the allocator *cannot* accept memory from another node and the alloc via
the page allocator must fail so that the allocator can then pick another
node for keeping track of the allocations.
Post by Tejun Heo
Even if that's the case, there's no reason to burden everyone with
this distinction. Most users just wanna say "I'm on this node.
Please allocate considering that". There's nothing wrong with using
numa_node_id() for that.
Well yes that speaks for this patch.

Tejun Heo
2014-07-11 16:01:52 UTC
Permalink
Post by Christoph Lameter
Post by Tejun Heo
But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
nearest node with memory fallback. Is there any case where the user
would actually want to always fail if it's on the memless node?
GFP_THISNODE allocations must fail if there is no memory available on
the node. No fallback allowed.
I don't know. The intention is that the caller wants something on
this node or the caller will fail or fall back itself, right? For
most use cases just considering the nearest memory node as "local" for
memless nodes should work and serve the intentions of the users close
enough. Whether that'd be better or we'd be better off with something
else depends on the details for sure.

Thanks.
--
tejun

Christoph Lameter
2014-07-11 16:19:14 UTC
Permalink
Post by Tejun Heo
Post by Christoph Lameter
Post by Tejun Heo
But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
nearest node with memory fallback. Is there any case where the user
would actually want to always fail if it's on the memless node?
GFP_THISNODE allocations must fail if there is no memory available on
the node. No fallback allowed.
I don't know. The intention is that the caller wants something on
this node or the caller will fail or fall back itself, right? For
most use cases just considering the nearest memory node as "local" for
memless nodes should work and serve the intentions of the users close
enough. Whether that'd be better or we'd be better off with something
else depends on the details for sure.
Yes that works. But if we want a consistent node to allocate from (and
avoid the fallbacks) then we need this patch. I think this is up to those
needing memoryless nodes to figure out what semantics they need.


Tejun Heo
2014-07-11 16:24:51 UTC
Permalink
Post by Christoph Lameter
Yes that works. But if we want a consistent node to allocate from (and
avoid the fallbacks) then we need this patch. I think this is up to those
needing memoryless nodes to figure out what semantics they need.
I'm not following what you're saying. Are you saying that we need to
spread numa_mem_id() all over the place for GFP_THISNODE users on
memless nodes? There aren't that many users of GFP_THISNODE.
Wouldn't it make far more sense to just change them? Or just
introduce a new GFP flag GFP_CLOSE_OR_BUST which allows falling back
to the nearest local node for memless nodes. There's no reason to
leak this information outside allocator proper.

Thanks.
--
tejun

Christoph Lameter
2014-07-11 17:29:30 UTC
Permalink
Post by Tejun Heo
Post by Christoph Lameter
Yes that works. But if we want a consistent node to allocate from (and
avoid the fallbacks) then we need this patch. I think this is up to those
needing memoryless nodes to figure out what semantics they need.
I'm not following what you're saying. Are you saying that we need to
spread numa_mem_id() all over the place for GFP_THISNODE users on
memless nodes? There aren't that many users of GFP_THISNODE.
GFP_THISNODE is mostly used by allocators that need memory from specific
nodes. The use of numa_mem_id() there is useful because one will not
get any memory at all when attempting to allocate from a memoryless
node using GFP_THISNODE.

I meant that relying on fallback to the neighboring nodes without
GFP_THISNODE using numa_node_id() is one approach, which may prevent memory
allocators from caching objects for that node because every allocation may
choose a different neighboring node. The other is the use of
numa_mem_id(), which will always use a specific node and avoid fallback to
a different node.
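
The second approach amounts to a fixed node-to-memory-node table that is
computed once. A hedged userspace sketch follows, using the distances, memory
sizes, and CPU numbering from the numactl output earlier in this thread. The
ex_* names are hypothetical, and the tie-break toward the lowest node id is
an assumption, not necessarily what the kernel does.

```c
#include <assert.h>

#define EX_NR_NODES 4
#define EX_NR_CPUS  120

/* Distances and memory sizes as reported by numactl earlier in the thread. */
static const int ex_distance[EX_NR_NODES][EX_NR_NODES] = {
    { 10, 21, 21, 21 },
    { 21, 10, 21, 21 },
    { 21, 21, 10, 21 },
    { 21, 21, 21, 10 },
};
static const int ex_node_size_mb[EX_NR_NODES] = { 15725, 15862, 0, 0 };

/* Filled once by ex_init_mem_map(): node -> nearest node with memory. */
static int ex_node_to_mem[EX_NR_NODES];

/* Analogue of cpu_to_node() for that machine's CPU numbering. */
static int ex_cpu_to_node(int cpu)
{
    assert(cpu >= 0 && cpu < EX_NR_CPUS);
    return (cpu % 60) / 15;
}

/* Precompute the mapping; ties go to the lowest node id (an assumption). */
static void ex_init_mem_map(void)
{
    for (int node = 0; node < EX_NR_NODES; node++) {
        int best = -1;

        for (int cand = 0; cand < EX_NR_NODES; cand++) {
            if (ex_node_size_mb[cand] == 0)
                continue;   /* skip memoryless candidates */
            if (best < 0 ||
                ex_distance[node][cand] < ex_distance[node][best])
                best = cand;
        }
        ex_node_to_mem[node] = best;
    }
}

/* Analogue of cpu_to_mem(): always the same answer for a given CPU. */
static int ex_cpu_to_mem(int cpu)
{
    return ex_node_to_mem[ex_cpu_to_node(cpu)];
}
```

After ex_init_mem_map(), ex_cpu_to_mem(30) yields node 0 on every call, even
though ex_cpu_to_node(30) is the memoryless node 2, so an allocator can keep
one per-node cache for that CPU.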

The choice is up to those having an interest in memoryless nodes. Which
again I find a pretty strange thing to have that has already proven itself
difficult to maintain in the kernel given the notion of memory
nodes that should have memory but surprisingly have none. Then there are
the esoteric fallback conditions and special cases introduced. It's a mess.

The best solution may be to just get rid of the whole thing and require
all processors to have a node with memory that is local to them. Current
"memoryless" hardware can simply decide on bootup to pick a memory node
that is local and thus we do not have to deal with it in the core.

--
To unsubscribe from this list: send the line "unsubscribe linux-hotplug" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2014-07-11 18:28:14 UTC
Permalink
Hello,
Post by Christoph Lameter
GFP_THISNODE is mostly used by allocators that need memory from specific
nodes. The use of numa_mem_id() there is useful because one will not
get any memory at all when attempting to allocate from a memoryless
node using GFP_THISNODE.
As long as it's in allocator proper, it doesn't matter all that much
but the changes are clearly not contained, are they?

Also, unless this is done where the falling back is actually
happening, numa_mem_id() seems like the wrong interface because you
end up losing information of the originating node. Given that this
isn't a widespread use case, maybe we can do with something like
numa_mem_id() as a compromise but if we're doing that let's at least
make it clear that it's something ugly (give it an ugly name, not
something as generic as numa_mem_id()) and not expose it outside
allocators.

Thanks.
--
tejun

Christoph Lameter
2014-07-11 19:11:02 UTC
Permalink
Post by Tejun Heo
Post by Christoph Lameter
GFP_THISNODE is mostly used by allocators that need memory from specific
nodes. The use of numa_mem_id() there is useful because one will not
get any memory at all when attempting to allocate from a memoryless
node using GFP_THISNODE.
As long as it's in allocator proper, it doesn't matter all that much
but the changes are clearly not contained, are they?
Well there is a proliferation of memory allocators recently. NUMA is often
a second thought in those.

Jiang Liu
2014-07-23 03:16:53 UTC
Permalink
Hi Tejun and Christoph,
Thanks for your suggestions and discussion. Tejun makes a good
point about hiding the memoryless node interface from normal
slab users. I will rework the patch set to go in that direction.
Regards!
Gerry
Post by Christoph Lameter
Post by Tejun Heo
Post by Christoph Lameter
GFP_THISNODE is mostly used by allocators that need memory from specific
nodes. The use of numa_mem_id() there is useful because one will not
get any memory at all when attempting to allocate from a memoryless
node using GFP_THISNODE.
As long as it's in allocator proper, it doesn't matter all that much
but the changes are clearly not contained, are they?
Well there is a proliferation of memory allocators recently. NUMA is often
a second thought in those.
Jiang Liu
2014-07-11 07:37:25 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d88fef9..3307dd840873 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -822,7 +822,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_mem_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
@@ -1111,7 +1111,7 @@ alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_mem_id(), 0);
else
new_page = NULL;

@@ -1255,7 +1255,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int page_nid = -1, this_nid = numa_node_id();
+ int page_nid = -1, this_nid = numa_mem_id();
int target_nid, last_cpupid = -1;
bool page_locked;
bool migrated = false;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:26 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..d6c4b7255ca9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* we use curret node.
*/
if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
+ node = numa_mem_id();

memcg->last_scanned_node = node;
return node;
--
1.7.10.4

Michal Hocko
2014-07-18 07:36:14 UTC
Permalink
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
The change makes a difference only for really tiny memcgs. If we really
have all pages on the unevictable list or anon with no swap allowed and that
is the reason why no node is set in the scan_nodes mask then reclaiming
a memoryless node or any arbitrary close one doesn't make any difference.
The current memcg might not have any memory on that node at all.

So the change doesn't make any practical difference and the changelog is
misleading.
Post by Jiang Liu
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..d6c4b7255ca9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* we use curret node.
*/
if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
+ node = numa_mem_id();
memcg->last_scanned_node = node;
return node;
--
1.7.10.4
--
Michal Hocko
SUSE Labs

Jiang Liu
2014-07-23 03:18:15 UTC
Permalink
Hi Michal,
Thanks for your comments! As discussed, we will
rework the patch set in another direction to hide memoryless
nodes from normal slab users.
Regards!
Gerry
Post by Michal Hocko
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
The change makes a difference only for really tiny memcgs. If we really
have all pages on the unevictable list or anon with no swap allowed and that
is the reason why no node is set in the scan_nodes mask then reclaiming
a memoryless node or any arbitrary close one doesn't make any difference.
The current memcg might not have any memory on that node at all.
So the change doesn't make any practical difference and the changelog is
misleading.
Post by Jiang Liu
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..d6c4b7255ca9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* we use curret node.
*/
if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
+ node = numa_mem_id();
memcg->last_scanned_node = node;
return node;
--
1.7.10.4
Jiang Liu
2014-07-11 07:37:27 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
net/xfrm/xfrm_ipcomp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index ccfdc7115a83..129f469ae75d 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -235,7 +235,7 @@ static void * __percpu *ipcomp_alloc_scratches(void)
for_each_possible_cpu(i) {
void *scratch;

- scratch = vmalloc_node(IPCOMP_SCRATCH_SIZE, cpu_to_node(i));
+ scratch = vmalloc_node(IPCOMP_SCRATCH_SIZE, cpu_to_mem(i));
if (!scratch)
return NULL;
*per_cpu_ptr(scratches, i) = scratch;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:28 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/char/mspec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index f1d7fa45c275..20e893cde9fd 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -206,7 +206,7 @@ mspec_fault(struct vm_area_struct *vma, struct vm_fault *vmf)

maddr = (volatile unsigned long) vdata->maddr[index];
if (maddr == 0) {
- maddr = uncached_alloc_page(numa_node_id(), 1);
+ maddr = uncached_alloc_page(numa_mem_id(), 1);
if (maddr == 0)
return VM_FAULT_OOM;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:29 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/infiniband/hw/qib/qib_file_ops.c | 4 ++--
drivers/infiniband/hw/qib/qib_init.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index b15e34eeef68..55540295e0e3 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -1312,8 +1312,8 @@ static int setup_ctxt(struct qib_pportdata *ppd, int ctxt,
assign_ctxt_affinity(fp, dd);

numa_id = qib_numa_aware ? ((fd->rec_cpu_num != -1) ?
- cpu_to_node(fd->rec_cpu_num) :
- numa_node_id()) : dd->assigned_node_id;
+ cpu_to_mem(fd->rec_cpu_num) : numa_mem_id()) :
+ dd->assigned_node_id;

rcd = qib_create_ctxtdata(ppd, ctxt, numa_id);

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 8d3c78ddc906..85ff56ad1075 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -133,7 +133,7 @@ int qib_create_ctxts(struct qib_devdata *dd)
int local_node_id = pcibus_to_node(dd->pcidev->bus);

if (local_node_id < 0)
- local_node_id = numa_node_id();
+ local_node_id = numa_mem_id();
dd->assigned_node_id = local_node_id;

/*
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:30 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e49f31dbd5d8..e9f6f9efd944 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1342,7 +1342,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
- const int current_node = numa_node_id();
+ const int current_node = numa_mem_id();
struct i40e_vsi *vsi = rx_ring->vsi;
u16 i = rx_ring->next_to_clean;
union i40e_rx_desc *rx_desc;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:31 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 48ebb6cd69f2..5c057ae21c22 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -877,7 +877,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
- const int current_node = numa_node_id();
+ const int current_node = numa_mem_id();
struct i40e_vsi *vsi = rx_ring->vsi;
u16 i = rx_ring->next_to_clean;
union i40e_rx_desc *rx_desc;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:32 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f145adbb55ac..2b74bffa5648 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6518,7 +6518,7 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
unsigned int truesize)
{
/* avoid re-using remote pages */
- if (unlikely(page_to_nid(page) != numa_node_id()))
+ if (unlikely(page_to_nid(page) != numa_mem_id()))
return false;

#if (PAGE_SIZE < 8192)
@@ -6588,7 +6588,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));

/* we can reuse buffer as-is, just make sure it is local */
- if (likely(page_to_nid(page) == numa_node_id()))
+ if (likely(page_to_nid(page) == numa_mem_id()))
return true;

/* this page cannot be reused so discard it */
--
1.7.10.4

Nishanth Aravamudan
2014-07-21 17:42:18 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f145adbb55ac..2b74bffa5648 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6518,7 +6518,7 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
unsigned int truesize)
{
/* avoid re-using remote pages */
- if (unlikely(page_to_nid(page) != numa_node_id()))
+ if (unlikely(page_to_nid(page) != numa_mem_id()))
return false;
#if (PAGE_SIZE < 8192)
@@ -6588,7 +6588,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
/* we can reuse buffer as-is, just make sure it is local */
- if (likely(page_to_nid(page) == numa_node_id()))
+ if (likely(page_to_nid(page) == numa_mem_id()))
return true;
/* this page cannot be reused so discard it */
This doesn't seem to have anything to do with crashes or errors?

The original code is checking if the NUMA node of a page is remote to
the NUMA node current is running on. Your change makes it check if the
NUMA node of a page is not equal to the nearest NUMA node with memory.
That's not necessarily local, though, which seems to be the whole
point. In this case, perhaps the driver author doesn't want to reuse the
memory at all for performance reasons? In any case, I don't think this
patch has appropriate justification.
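
The difference between the two checks can be sketched in a small userspace
model. This is not kernel code: the model_* helpers, the two-CPUs-per-node
layout and the "lowest-numbered node with memory" tie-break are assumptions
made for illustration, loosely based on the cover letter's topology, where
nodes 2 and 3 have CPUs but no memory and all remote distances are equal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical topology: 4 nodes, 2 CPUs each; nodes 2 and 3 are memoryless. */
enum { NR_NODES = 4, NR_CPUS = 8 };

static const bool node_has_memory[NR_NODES] = { true, true, false, false };

/* Stand-in for cpu_to_node(): the node the CPU physically belongs to. */
static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu / 2;
}

/*
 * Stand-in for cpu_to_mem(): the CPU's own node if it has memory,
 * otherwise the lowest-numbered node with memory (an assumed tie-break,
 * since every remote node is equally distant in this model).
 */
static int model_cpu_to_mem(int cpu)
{
	int node = model_cpu_to_node(cpu);

	if (node_has_memory[node])
		return node;
	for (int n = 0; n < NR_NODES; n++)
		if (node_has_memory[n])
			return n;
	return -1;
}

/* The igb-style test: reuse a page only if it sits on the reference node. */
static bool can_reuse_page(int page_nid, int ref_nid)
{
	return page_nid == ref_nid;
}
```

In this model a page can only ever live on node 0 or 1, so for CPU 4 (on
memoryless node 2) the check against model_cpu_to_node() never succeeds and
every buffer is reallocated, while the check against model_cpu_to_mem()
succeeds for pages on node 0. Whether node 0 counts as "local" for CPU 4 is
exactly the question raised above.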

Thanks,
Nish
Alexander Duyck
2014-07-21 19:53:33 UTC
I do agree the description should probably be changed. There shouldn't be
any panics involved, only a performance impact as it will be reallocating
always if it is on a node with no memory.

My intention on this was to make certain that the memory used is from the
closest node possible. As such I believe this change likely honours that.

Thanks,

Alex


Nishanth Aravamudan
2014-07-21 21:09:00 UTC
Post by Alexander Duyck
I do agree the description should probably be changed. There shouldn't be
any panics involved, only a performance impact as it will be reallocating
always if it is on a node with no memory.
Yep, thanks for the review.
Post by Alexander Duyck
My intention on this was to make certain that the memory used is from the
closest node possible. As such I believe this change likely honours that.
Absolutely, just wanted to make it explicit that it's not a functional
fix, just a performance fix (presuming this shows up at all on systems
that have memoryless NUMA nodes).

I'd suggest an update to the comments, as well.

Thanks,
Nish

Jiang Liu
2014-07-23 03:20:20 UTC
Hi Nishanth and Alexander,
Thanks for the review; I will update the comments
in the next version.
Regards!
Gerry
Jiang Liu
2014-07-11 07:37:34 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/thermal/intel_powerclamp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 95cb7fc20e17..9d9be8cd1b50 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -531,7 +531,7 @@ static int start_power_clamp(void)

thread = kthread_create_on_node(clamp_thread,
(void *) cpu,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kidle_inject/%ld", cpu);
/* bind to cpu here */
if (likely(!IS_ERR(thread))) {
@@ -582,7 +582,7 @@ static int powerclamp_cpu_callback(struct notifier_block *nfb,
case CPU_ONLINE:
thread = kthread_create_on_node(clamp_thread,
(void *) cpu,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kidle_inject/%lu", cpu);
if (likely(!IS_ERR(thread))) {
kthread_bind(thread, cpu);
--
1.7.10.4

Nishanth Aravamudan
2014-07-21 17:38:33 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
You used the same changelog for all of the patches, it seems. But the
interface below (kthread_create_on_node) doesn't go into kmalloc_node?

kthread_create_on_node eventually sets the value used by
tsk_fork_get_node(), which is used by alloc_task_struct_node() and
alloc_thread_info_node(). The first uses kmem_cache_alloc_node() and the
second, depending on the relative sizes of THREAD_SIZE and PAGE_SIZE
uses either alloc_kmem_pages_node() or kmem_cache_alloc_node().
kmem_cache_alloc_node() goes into the appropriate slab allocator which
on SLUB for instance, goes down into __alloc_pages_nodemask. But no
failure occurs when memoryless nodes are present, you just get memory
that is remote from the node specified? Similarly,
alloc_kmem_pages_node() calls into __alloc_pages with an appropriate
node_zonelist, which should provide for the correct fallback based upon
NUMA topology?
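
The fallback behaviour described above can be sketched in userspace as
follows. This is a toy model under stated assumptions, not the page
allocator: fallback_order stands in for a per-node zonelist,
model_alloc_page_node() is a hypothetical name, and the free-page counts are
made up.

```c
#include <assert.h>

enum { NR_NODES = 4 };

/* Free pages per node; nodes 2 and 3 are memoryless in this model. */
static int free_pages[NR_NODES] = { 2, 1, 0, 0 };

/* Assumed per-node fallback order, preferred node first. */
static const int fallback_order[NR_NODES][NR_NODES] = {
	{ 0, 1, 2, 3 },
	{ 1, 0, 2, 3 },
	{ 2, 0, 1, 3 },
	{ 3, 0, 1, 2 },
};

/*
 * Take one page, walking the preferred node's fallback list.
 * Returns the node the page came from, or -1 if every node is empty.
 */
static int model_alloc_page_node(int preferred)
{
	assert(preferred >= 0 && preferred < NR_NODES);
	for (int i = 0; i < NR_NODES; i++) {
		int nid = fallback_order[preferred][i];

		if (free_pages[nid] > 0) {
			free_pages[nid]--;
			return nid;
		}
	}
	return -1;
}
```

Passing the memoryless node 2 as the preferred node does not fail in this
model; the request is simply served from node 0, then node 1, and only
returns -1 once all memory is gone.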

What system failure/panic did you see that is resolved by this patch?
Post by Jiang Liu
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/thermal/intel_powerclamp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 95cb7fc20e17..9d9be8cd1b50 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -531,7 +531,7 @@ static int start_power_clamp(void)
thread = kthread_create_on_node(clamp_thread,
(void *) cpu,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
As Tejun has pointed out elsewhere, we lose context here about the
original node we were running on. That information is relevant for a few
reasons:

1) In the underlying allocator, we might not have memory *right now* to
satisfy a request, which, say, causes us to deactivate a slab
(CONFIG_SLUB). But that condition may be relieved in the future and we
want to use the correct node again then.

2) For topologies that are symmetrical around a memoryless node, we
could lose the correct fallback information when we specify a nearest
neighbor with memory.
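
Both points can be illustrated with a small userspace sketch. The ring
topology, the distance values and the nearest_node_with_free() helper are
hypothetical and chosen only for illustration; node 1 is the memoryless node
and sits between nodes 0 and 2.

```c
#include <assert.h>
#include <limits.h>

enum { NR_NODES = 4 };

/* Assumed ring 0-1-2-3-0: neighbours are 20 apart, opposite nodes 40. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 20 },
	{ 20, 10, 20, 40 },
	{ 40, 20, 10, 20 },
	{ 20, 40, 20, 10 },
};

/* Node 1 never has memory; the second scenario also has node 0 exhausted. */
static const int all_free[NR_NODES]        = { 4, 0, 4, 4 };
static const int node0_exhausted[NR_NODES] = { 0, 0, 4, 4 };

/*
 * Pick the node closest to 'hint' that currently has free pages.
 * Ties go to the lowest-numbered node. Returns -1 if nothing is free.
 */
static int nearest_node_with_free(int hint, const int free_pages[NR_NODES])
{
	int best = -1, best_dist = INT_MAX;

	assert(hint >= 0 && hint < NR_NODES);
	for (int n = 0; n < NR_NODES; n++) {
		if (free_pages[n] > 0 && distance[hint][n] < best_dist) {
			best = n;
			best_dist = distance[hint][n];
		}
	}
	return best;
}
```

With everything free, a CPU on node 1 resolves to node 0, so a caller that
passes a precomputed "nearest memory node" passes 0. Once node 0 runs dry,
falling back from hint 0 lands on node 3 (distance 40 from node 1), whereas
falling back from the original hint 1 would have chosen node 2 (distance 20).
The substituted hint has lost the information needed to pick the better
neighbour.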

Thanks,
Nish

Jiang Liu
2014-07-11 07:37:35 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 785d0d71781e..144534a51cbb 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -2453,7 +2453,7 @@ static void bnx2fc_percpu_thread_create(unsigned int cpu)
p = &per_cpu(bnx2fc_percpu, cpu);

thread = kthread_create_on_node(bnx2fc_percpu_io_thread,
- (void *)p, cpu_to_node(cpu),
+ (void *)p, cpu_to_mem(cpu),
"bnx2fc_thread/%d", cpu);
/* bind thread to the cpu */
if (likely(!IS_ERR(thread))) {
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:33 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f5aa3311ea28..46dc083573ea 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1962,7 +1962,7 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));

/* we can reuse buffer as-is, just make sure it is local */
- if (likely(page_to_nid(page) == numa_node_id()))
+ if (likely(page_to_nid(page) == numa_mem_id()))
return true;

/* this page cannot be reused so discard it */
@@ -1974,7 +1974,7 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
rx_buffer->page_offset, size, truesize);

/* avoid re-using remote pages */
- if (unlikely(page_to_nid(page) != numa_node_id()))
+ if (unlikely(page_to_nid(page) != numa_mem_id()))
return false;

#if (PAGE_SIZE < 8192)
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:36 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/bnx2i/bnx2i_init.c b/drivers/scsi/bnx2i/bnx2i_init.c
index 80c03b452d61..f67a5a63134e 100644
--- a/drivers/scsi/bnx2i/bnx2i_init.c
+++ b/drivers/scsi/bnx2i/bnx2i_init.c
@@ -423,7 +423,7 @@ static void bnx2i_percpu_thread_create(unsigned int cpu)
p = &per_cpu(bnx2i_percpu, cpu);

thread = kthread_create_on_node(bnx2i_percpu_io_thread, (void *)p,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"bnx2i_thread/%d", cpu);
/* bind thread to the cpu */
if (likely(!IS_ERR(thread))) {
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:37 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/scsi/fcoe/fcoe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 00ee0ed642aa..779a7af0e410 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1257,7 +1257,7 @@ static void fcoe_percpu_thread_create(unsigned int cpu)
p = &per_cpu(fcoe_percpu, cpu);

thread = kthread_create_on_node(fcoe_percpu_receive_thread,
- (void *)p, cpu_to_node(cpu),
+ (void *)p, cpu_to_mem(cpu),
"fcoethread/%d", cpu);

if (likely(!IS_ERR(thread))) {
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:38 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/irqchip/irq-clps711x.c | 2 +-
drivers/irqchip/irq-gic.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-clps711x.c b/drivers/irqchip/irq-clps711x.c
index 33340dc97d1d..b0acf8b32a1a 100644
--- a/drivers/irqchip/irq-clps711x.c
+++ b/drivers/irqchip/irq-clps711x.c
@@ -186,7 +186,7 @@ static int __init _clps711x_intc_init(struct device_node *np,
writel_relaxed(0, clps711x_intc->intmr[1]);
writel_relaxed(0, clps711x_intc->intmr[2]);

- err = irq_alloc_descs(-1, 0, ARRAY_SIZE(clps711x_irqs), numa_node_id());
+ err = irq_alloc_descs(-1, 0, ARRAY_SIZE(clps711x_irqs), numa_mem_id());
if (IS_ERR_VALUE(err))
goto out_iounmap;

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 7e11c9d6ae8c..a7e6c043d823 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1005,7 +1005,7 @@ void __init gic_init_bases(unsigned int gic_nr, int irq_start,
if (of_property_read_u32(node, "arm,routable-irqs",
&nr_routable_irqs)) {
irq_base = irq_alloc_descs(irq_start, 16, gic_irqs,
- numa_node_id());
+ numa_mem_id());
if (IS_ERR_VALUE(irq_base)) {
WARN(1, "Cannot allocate irq_descs @ IRQ%d, assuming pre-allocated\n",
irq_start);
--
1.7.10.4

Jason Cooper
2014-07-18 12:40:38 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/irqchip/irq-clps711x.c | 2 +-
drivers/irqchip/irq-gic.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
Do you have anything depending on this? Can I apply it to irqchip? If
you need to keep it with other changes,

Acked-by: Jason Cooper <***@lakedaemon.net>

But please do let me know if I can take it.

thx,

Jason.

Jiang Liu
2014-07-23 03:47:55 UTC
Hi Jason,
Thanks for your review. Based on the review comments,
we need to rework the patch set in a different direction, so we
will drop this patch.
Regards!
Gerry
Jiang Liu
2014-07-11 07:37:39 UTC
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
drivers/of/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index b9864806e9b8..40d4772973ad 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return numa_mem_id();
}
#endif
--
1.7.10.4

Nishanth Aravamudan
2014-07-21 17:52:41 UTC
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/of/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/of/base.c b/drivers/of/base.c
index b9864806e9b8..40d4772973ad 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return numa_mem_id();
}
#endif
Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
tree node. The default should be the physically local NUMA node, not the
nearest memory-containing node.

I think the general direction of this patchset is good: asking what NUMA
information we actually care about at each callsite. But the execution
is blind and doesn't consider at all what the code is actually doing.
The changelogs are all identical and don't actually provide any
information about what errors this (or any) specific patch is
resolving.

Thanks,
Nish

Grant Likely
2014-07-28 13:30:40 UTC
Post by Nishanth Aravamudan
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/of/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/of/base.c b/drivers/of/base.c
index b9864806e9b8..40d4772973ad 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return numa_mem_id();
}
#endif
Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
tree node. The default should be the physically local NUMA node, not the
nearest memory-containing node.
That description doesn't match the code. This patch only changes the
default implementation of of_node_to_nid() which doesn't take the device
node into account *at all* when returning a node ID. Just look at the
diff.

I think this patch is correct, and it doesn't affect the override
versions provided by powerpc and sparc.

g.
Post by Nishanth Aravamudan
I think the general direction of this patchset is good: asking what NUMA
information we actually care about at each callsite. But the execution
is blind and doesn't consider at all what the code is actually doing.
The changelogs are all identical and don't actually provide any
information about what errors this (or any) specific patch is
resolving.
Thanks,
Nish
Nishanth Aravamudan
2014-07-28 19:26:02 UTC
Post by Grant Likely
Post by Nishanth Aravamudan
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
drivers/of/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/of/base.c b/drivers/of/base.c
index b9864806e9b8..40d4772973ad 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return numa_mem_id();
}
#endif
Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
tree node. The default should be the physically local NUMA node, not the
nearest memory-containing node.
That description doesn't match the code. This patch only changes the
default implementation of of_node_to_nid() which doesn't take the device
node into account *at all* when returning a node ID. Just look at the
diff.
I meant that of_node_to_nid() seems to be used throughout the call-sites
to indicate caller locality. We want to keep using cpu_to_node() there,
and fall back appropriately in the MM (when allocations occur off-node due
to memoryless nodes), rather than encode memory-specific topology in the
caller itself. There was a long thread between Tejun and me that
discussed what we are aiming for: https://lkml.org/lkml/2014/7/18/278

I understand that the code unconditionally returns current's NUMA node
ID right now (ignoring the device node). That seems correct, to me, for
something like:

of_device_add:
/* device_add will assume that this device is on the same node as
* the parent. If there is no parent defined, set the node
* explicitly */
if (!ofdev->dev.parent)
set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));

I don't think we want the default implementation to set the NUMA node of
a dev to the nearest NUMA node with memory?
Post by Grant Likely
I think this patch is correct, and it doesn't affect the override
versions provided by powerpc and sparc.
Yes, agreed, so maybe it doesn't matter. I guess my point was simply
that it only seems reasonable to change callers of cpu_to_node() outside
the core MM to cpu_to_mem() if they care about memoryless nodes
explicitly. I don't think the OF code does, so I don't think it
should change.

Sorry for my premature NAK and lack of clarity in my explanation.

-Nish

Jiang Liu
2014-07-11 07:37:43 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
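
The difference between the two lookups described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the topology tables and the model_* names are hypothetical, and the kernel itself reads a per-CPU value rather than scanning a distance table on every call.

```c
#include <assert.h>

#define NR_NODES 4
#define NR_CPUS  8

/* Hypothetical topology: nodes 2 and 3 have CPUs but no memory installed. */
static const unsigned long node_mem_mb[NR_NODES] = { 15725, 15862, 0, 0 };

/* Hypothetical distance table: 10 means local, 21 means remote. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 21, 21 },
	{ 21, 10, 21, 21 },
	{ 21, 21, 10, 21 },
	{ 21, 21, 21, 10 },
};

/* Hypothetical CPU layout: two CPUs per node. */
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/* Models cpu_to_node(): the node the CPU physically belongs to. */
int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_node[cpu];
}

/* Models cpu_to_mem(): the nearest node that actually has memory. */
int model_cpu_to_mem(int cpu)
{
	int home = model_cpu_to_node(cpu);
	int best = -1;

	for (int nid = 0; nid < NR_NODES; nid++) {
		if (node_mem_mb[nid] == 0)
			continue;	/* memoryless node, cannot satisfy allocations */
		if (best < 0 || node_distance[home][nid] < node_distance[home][best])
			best = nid;
	}
	return best;	/* -1 only if no node has memory at all */
}
```

In this model a CPU on memoryless node 2 still reports node 2 as its home node, while the memory lookup falls back to node 0 (ties between equally distant nodes go to the lowest node ID, which is a choice of the sketch, not a claim about the kernel).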

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
arch/x86/kernel/cpu/perf_event_amd_uncore.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 +++---
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index beeb7cc07044..ee5120ce3e98 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -347,7 +347,7 @@ static struct amd_nb *amd_alloc_nb(int cpu)
struct amd_nb *nb;
int i;

- nb = kzalloc_node(sizeof(struct amd_nb), GFP_KERNEL, cpu_to_node(cpu));
+ nb = kzalloc_node(sizeof(struct amd_nb), GFP_KERNEL, cpu_to_mem(cpu));
if (!nb)
return NULL;

diff --git a/arch/x86/kernel/cpu/perf_event_amd_uncore.c b/arch/x86/kernel/cpu/perf_event_amd_uncore.c
index 3bbdf4cd38b9..1a7f4129bf4c 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_uncore.c
@@ -291,7 +291,7 @@ static struct pmu amd_l2_pmu = {
static struct amd_uncore *amd_uncore_alloc(unsigned int cpu)
{
return kzalloc_node(sizeof(struct amd_uncore), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
}

static void amd_uncore_cpu_up_prepare(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa62af5..4f48d1bb7608 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1957,7 +1957,7 @@ struct intel_shared_regs *allocate_shared_regs(int cpu)
int i;

regs = kzalloc_node(sizeof(struct intel_shared_regs),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (regs) {
/*
* initialize the locks to keep lockdep happy
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970cb744d..bb0327411bf1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static DEFINE_PER_CPU(void *, insn_buffer);
static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
int max, thresh = 1; /* always use a single PEBS record */
void *buffer, *ibuffer;

@@ -304,7 +304,7 @@ static void release_pebs_buffer(int cpu)
static int alloc_bts_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
int max, thresh;
void *buffer;

@@ -341,7 +341,7 @@ static void release_bts_buffer(int cpu)

static int alloc_ds_buffer(int cpu)
{
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
struct debug_store *ds;

ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index 619f7699487a..9df1ec3b505d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -547,7 +547,7 @@ static int rapl_cpu_prepare(int cpu)
if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &msr_rapl_power_unit_bits))
return -1;

- pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_mem(cpu));
if (!pmu)
return -1;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 65bbbea38b9c..4b77ba4b4e36 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -4011,7 +4011,7 @@ static int uncore_cpu_prepare(int cpu, int phys_id)
if (pmu->func_id < 0)
pmu->func_id = j;

- box = uncore_alloc_box(type, cpu_to_node(cpu));
+ box = uncore_alloc_box(type, cpu_to_mem(cpu));
if (!box)
return -ENOMEM;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:42 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/kvm/vmx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332edefc3..beb7c6d5d51b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2964,7 +2964,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)

static struct vmcs *alloc_vmcs_cpu(int cpu)
{
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
struct page *pages;
struct vmcs *vmcs;
--
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2014-07-11 07:44:33 UTC
Permalink
Post by Jiang Liu
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.
If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().
---
arch/x86/kvm/vmx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332edefc3..beb7c6d5d51b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2964,7 +2964,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
static struct vmcs *alloc_vmcs_cpu(int cpu)
{
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
struct page *pages;
struct vmcs *vmcs;
Acked-by: Paolo Bonzini <***@redhat.com>

Jiang Liu
2014-07-11 07:37:44 UTC
Permalink
According to the x86 boot sequence, early_cpu_to_node() always returns
NUMA_NO_NODE when called from numa_init(). So remove the useless code
to improve readability.

The related code sequence is shown below. x86_cpu_to_node_map is not set
until step 2, so it still holds the default value (NUMA_NO_NODE) when
accessed at step 1.

start_kernel()
setup_arch()
initmem_init()
x86_numa_init()
numa_init()
early_cpu_to_node()
1) return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu];
acpi_boot_init();
sfi_init()
x86_dtb_init()
generic_processor_info()
early_per_cpu(x86_cpu_to_apicid, cpu) = apicid;
init_cpu_to_node()
numa_set_node(cpu, node);
2) per_cpu(x86_cpu_to_node_map, cpu) = node;

rest_init()
kernel_init()
smp_init()
native_cpu_up()
start_secondary()
numa_set_node()
per_cpu(x86_cpu_to_node_map, cpu) = node;
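
The ordering argument above can be modelled in user space as follows. This is a minimal sketch with hypothetical model_* names and a plain array standing in for the per-CPU map; it only shows that a read before the first write sees the default value.

```c
#include <assert.h>

#define NUMA_NO_NODE	(-1)
#define NR_CPUS		4

/* Stand-in for x86_cpu_to_node_map: every entry starts as NUMA_NO_NODE. */
static int cpu_to_node_map[NR_CPUS] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE,
};

/* Step 1 in the sequence: a plain read of the map. */
int model_early_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_to_node_map[cpu];
}

/* Step 2 in the sequence: the first point at which the map is written. */
void model_numa_set_node(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_to_node_map[cpu] = node;
}

/* Counts CPUs that already have a node, i.e. the case the removed loops handled. */
int model_count_assigned_cpus(void)
{
	int count = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (model_early_cpu_to_node(cpu) != NUMA_NO_NODE)
			count++;
	return count;
}
```

If nothing writes the map before numa_init() runs, the count is zero, so a loop that only acts on already-assigned CPUs has nothing to do.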

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/mm/numa.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a32b706c401a..eec4f6c322bb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -545,8 +545,6 @@ static void __init numa_init_array(void)

rr = first_node(node_online_map);
for (i = 0; i < nr_cpu_ids; i++) {
- if (early_cpu_to_node(i) != NUMA_NO_NODE)
- continue;
numa_set_node(i, rr);
rr = next_node(rr, node_online_map);
if (rr == MAX_NUMNODES)
@@ -633,14 +631,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
numa_init_array();

/*
--
1.7.10.4
Jiang Liu
2014-07-11 07:37:41 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/platform/uv/tlb_uv.c | 2 +-
arch/x86/platform/uv/uv_nmi.c | 3 ++-
arch/x86/platform/uv/uv_time.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index dfe605ac1bcd..4612b4396004 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -2116,7 +2116,7 @@ static int __init uv_bau_init(void)

for_each_possible_cpu(cur_cpu) {
mask = &per_cpu(uv_flush_tlb_mask, cur_cpu);
- zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cur_cpu));
+ zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_mem(cur_cpu));
}

nuvhubs = uv_num_possible_blades();
diff --git a/arch/x86/platform/uv/uv_nmi.c b/arch/x86/platform/uv/uv_nmi.c
index c89c93320c12..d17758215a61 100644
--- a/arch/x86/platform/uv/uv_nmi.c
+++ b/arch/x86/platform/uv/uv_nmi.c
@@ -715,7 +715,8 @@ void uv_nmi_setup(void)
nid = cpu_to_node(cpu);
if (uv_hub_nmi_list[nid] == NULL) {
uv_hub_nmi_list[nid] = kzalloc_node(size,
- GFP_KERNEL, nid);
+ GFP_KERNEL,
+ cpu_to_mem(cpu));
BUG_ON(!uv_hub_nmi_list[nid]);
raw_spin_lock_init(&(uv_hub_nmi_list[nid]->nmi_lock));
atomic_set(&uv_hub_nmi_list[nid]->cpu_owner, -1);
diff --git a/arch/x86/platform/uv/uv_time.c b/arch/x86/platform/uv/uv_time.c
index 5c86786bbfd2..c369fb2eb7d3 100644
--- a/arch/x86/platform/uv/uv_time.c
+++ b/arch/x86/platform/uv/uv_time.c
@@ -164,7 +164,7 @@ static __init int uv_rtc_allocate_timers(void)
return -ENOMEM;

for_each_present_cpu(cpu) {
- int nid = cpu_to_node(cpu);
+ int nid = cpu_to_mem(cpu);
int bid = uv_cpu_to_blade_id(cpu);
int bcpu = uv_cpu_hub_info(cpu)->blade_processor_id;
struct uv_rtc_timer_head *head = blade_info[bid];
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:40 UTC
Permalink
When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/kernel/apic/io_apic.c | 10 +++++-----
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/irq_32.c | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 81e08eff05ee..7cb3d58b11e8 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -204,7 +204,7 @@ int __init arch_early_irq_init(void)

cfg = irq_cfgx;
count = ARRAY_SIZE(irq_cfgx);
- node = cpu_to_node(0);
+ node = cpu_to_mem(0);

for (i = 0; i < count; i++) {
irq_set_chip_data(i, &cfg[i]);
@@ -1348,7 +1348,7 @@ static bool __init io_apic_pin_not_connected(int idx, int ioapic_idx, int pin)

static void __init __io_apic_setup_irqs(unsigned int ioapic_idx)
{
- int idx, node = cpu_to_node(0);
+ int idx, node = cpu_to_mem(0);
struct io_apic_irq_attr attr;
unsigned int pin, irq;

@@ -1394,7 +1394,7 @@ static void __init setup_IO_APIC_irqs(void)
*/
void setup_IO_APIC_irq_extra(u32 gsi)
{
- int ioapic_idx = 0, pin, idx, irq, node = cpu_to_node(0);
+ int ioapic_idx = 0, pin, idx, irq, node = cpu_to_mem(0);
struct io_apic_irq_attr attr;

/*
@@ -2662,7 +2662,7 @@ int timer_through_8259 __initdata;
static inline void __init check_timer(void)
{
struct irq_cfg *cfg = irq_get_chip_data(0);
- int node = cpu_to_node(0);
+ int node = cpu_to_mem(0);
int apic1, pin1, apic2, pin2;
unsigned long flags;
int no_pin1 = 0;
@@ -3387,7 +3387,7 @@ int io_apic_set_pci_routing(struct device *dev, int irq,
return -EINVAL;
}

- node = dev ? dev_to_node(dev) : cpu_to_node(0);
+ node = dev ? dev_to_node(dev) : cpu_to_mem(0);

return io_apic_setup_irq_pin_once(irq, node, irq_attr);
}
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index 7db54b5d5f86..289762f4ea06 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -295,7 +295,7 @@ static int ioapic_xlate(struct irq_domain *domain,
set_io_apic_irq_attr(&attr, idx, line, it->trigger, it->polarity);

rc = io_apic_setup_irq_pin_once(irq_find_mapping(domain, line),
- cpu_to_node(0), &attr);
+ cpu_to_mem(0), &attr);
if (rc)
return rc;

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..425bb4b1110a 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -128,12 +128,12 @@ void irq_ctx_init(int cpu)
if (per_cpu(hardirq_stack, cpu))
return;

- irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
+ irqstk = page_address(alloc_pages_node(cpu_to_mem(cpu),
THREADINFO_GFP,
THREAD_SIZE_ORDER));
per_cpu(hardirq_stack, cpu) = irqstk;

- irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
+ irqstk = page_address(alloc_pages_node(cpu_to_mem(cpu),
THREADINFO_GFP,
THREAD_SIZE_ORDER));
per_cpu(softirq_stack, cpu) = irqstk;
--
1.7.10.4

Jiang Liu
2014-07-11 07:37:45 UTC
Permalink
The current kernel only updates _mem_id_[cpu] for online CPUs when the
memory configuration changes. So the kernel may allocate memory from a
remote node for a CPU that is still absent or offline, even if the node
associated with the CPU has already been onlined. This patch tries
to improve performance by updating _mem_id_[cpu] for each possible CPU
when the memory configuration changes, so that the kernel can always
allocate from the local node once the node is onlined.

We check node_online(cpu_to_node(cpu)) because:
1) local_memory_node(nid) needs to access NODE_DATA(nid)
2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of freeing it
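
The change in update policy can be sketched with a user-space model. The arrays and model_* names below are hypothetical, and model_local_memory_node() simply assumes that every online node has memory of its own; the real function consults the zonelists.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define NR_NODES	2
#define NR_CPUS		4

/* Hypothetical state: node 1 and its CPUs (2 and 3) start offline. */
static bool node_is_online[NR_NODES] = { true, false };
static const bool cpu_is_online[NR_CPUS] = { true, true, false, false };
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1 };
static int cpu_numa_mem[NR_CPUS] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE,
};

/* Assumption of this sketch: an online node always has local memory. */
static int model_local_memory_node(int nid)
{
	return nid;
}

void model_set_node_online(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	node_is_online[nid] = true;
}

int model_cpu_numa_mem(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_numa_mem[cpu];
}

/*
 * all_possible == false models the old policy (only online CPUs are
 * updated); all_possible == true models the policy in this patch
 * (every possible CPU is updated, keyed on whether its node is online).
 */
void model_rebuild_numa_mem(bool all_possible)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int nid = cpu_node[cpu];

		if (!all_possible) {
			if (cpu_is_online[cpu])
				cpu_numa_mem[cpu] = model_local_memory_node(nid);
			continue;
		}
		if (node_is_online[nid])
			cpu_numa_mem[cpu] = model_local_memory_node(nid);
		else
			cpu_numa_mem[cpu] = NUMA_NO_NODE;
	}
}
```

With the old policy an offline CPU keeps whatever value it had even after its node comes online; with the new policy its entry already points at the local node before the CPU itself is brought up.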

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ea758b898fd..de86e941ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
/*
* We now know the "local memory node" for each node--
* i.e., the node of the first zone in the generic zonelist.
- * Set up numa_mem percpu variable for on-line cpus. During
- * boot, only the boot cpu should be on-line; we'll init the
- * secondary cpus' numa_mem as they come on-line. During
- * node/memory hotplug, we'll fixup all on-line cpus.
+ * Set up numa_mem percpu variable for all possible cpus
+ * if associated node has been onlined.
*/
- if (cpu_online(cpu))
+ if (node_online(cpu_to_node(cpu)))
set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
+ else
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
}
--
1.7.10.4

Nishanth Aravamudan
2014-07-21 17:47:54 UTC
Permalink
Post by Jiang Liu
Current kernel only updates _mem_id_[cpu] for onlined CPUs when memory
configuration changes. So kernel may allocate memory from remote node
for a CPU if the CPU is still in absent or offline state even if the
node associated with the CPU has already been onlined.
This just sounds like the topology information is being updated at the
wrong place/time? That is, the memory is online, the CPU is being
brought online, but isn't associated with any node?
Post by Jiang Liu
This patch tries to improve performance by updating _mem_id_[cpu] for
each possible CPU when memory configuration changes, thus kernel could
always allocate from local node once the node is onlined.
Ok, what is the impact? Do you actually see better performance?
Post by Jiang Liu
1) local_memory_node(nid) needs to access NODE_DATA(nid)
2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of free it
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ea758b898fd..de86e941ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
/*
* We now know the "local memory node" for each node--
* i.e., the node of the first zone in the generic zonelist.
- * Set up numa_mem percpu variable for on-line cpus. During
- * boot, only the boot cpu should be on-line; we'll init the
- * secondary cpus' numa_mem as they come on-line. During
- * node/memory hotplug, we'll fixup all on-line cpus.
+ * Set up numa_mem percpu variable for all possible cpus
+ * if associated node has been onlined.
*/
- if (cpu_online(cpu))
+ if (node_online(cpu_to_node(cpu)))
set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
+ else
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
Jiang Liu
2014-07-23 08:16:14 UTC
Permalink
Post by Nishanth Aravamudan
Post by Jiang Liu
Current kernel only updates _mem_id_[cpu] for onlined CPUs when memory
configuration changes. So kernel may allocate memory from remote node
for a CPU if the CPU is still in absent or offline state even if the
node associated with the CPU has already been onlined.
This just sounds like the topology information is being updated at the
wrong place/time? That is, the memory is online, the CPU is being
brought online, but isn't associated with any node?
Hi Nishanth,
Yes, that's the case.
Post by Nishanth Aravamudan
Post by Jiang Liu
This patch tries to improve performance by updating _mem_id_[cpu] for
each possible CPU when memory configuration changes, thus kernel could
always allocate from local node once the node is onlined.
Ok, what is the impact? Do you actually see better performance?
No real data to support this yet, just code analysis.
Regards!
Gerry
Post by Nishanth Aravamudan
Post by Jiang Liu
1) local_memory_node(nid) needs to access NODE_DATA(nid)
2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of free it
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ea758b898fd..de86e941ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
/*
* We now know the "local memory node" for each node--
* i.e., the node of the first zone in the generic zonelist.
- * Set up numa_mem percpu variable for on-line cpus. During
- * boot, only the boot cpu should be on-line; we'll init the
- * secondary cpus' numa_mem as they come on-line. During
- * node/memory hotplug, we'll fixup all on-line cpus.
+ * Set up numa_mem percpu variable for all possible cpus
+ * if associated node has been onlined.
*/
- if (cpu_online(cpu))
+ if (node_online(cpu_to_node(cpu)))
set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
+ else
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
Jiang Liu
2014-07-11 07:37:47 UTC
Permalink
With the typical CPU hot-addition flow on x86, PCI host bridges embedded
in a physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
acpi_processor_add()
acpi_processor_get_info()
acpi_processor_hotadd_init()
acpi_map_lsapic()
1.a) acpi_map_cpu2node()

2) Handle PCI host bridge hot-addition notification
acpi_pci_root_add()
pci_acpi_scan_root()
2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;

3) Handle memory hot-addition notification
acpi_memory_device_add()
acpi_memory_enable_device()
add_memory()
3.a) node_set_online();

4) Online CPUs through sysfs interfaces
cpu_subsys_online()
cpu_up()
try_online_node()
4.a) node_set_online();

So the associated node is always offline at step 2.a, because it is not
onlined until step 3.a or 4.a.

We could improve performance by onlining the node at step 1.a. This change
also makes the code symmetric: nodes are always created when handling
CPU/memory hot-addition events rather than when handling user requests from
sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
events.
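
The effect of the ordering on the host bridge can be sketched in user space. The model_* names and the online_node_early switch are hypothetical; only the fallback condition is taken from step 2.a above.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define NR_NODES	4

/* Hypothetical state: every hot-pluggable node starts offline. */
static bool node_is_online[NR_NODES];

/* Step 1.a: optionally online the node while handling the CPU hot-add. */
void model_cpu_hot_add(int nid, bool online_node_early)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (online_node_early)
		node_is_online[nid] = true;
}

/* Step 2.a: a host bridge on an offline node falls back to NUMA_NO_NODE. */
int model_host_bridge_node(int node)
{
	assert(node == NUMA_NO_NODE || (node >= 0 && node < NR_NODES));
	if (node != NUMA_NO_NODE && !node_is_online[node])
		node = NUMA_NO_NODE;
	return node;
}
```

When the node is only onlined later (steps 3.a or 4.a), the bridge has already lost its node affinity; onlining at step 1.a lets step 2.a keep the real node ID.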

It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
which may cause a system panic as shown below.
[ 3663.324476] BUG: unable to handle kernel paging request at 0000000000001f08
[ 3663.332348] IP: [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.339719] PGD 82fe10067 PUD 82ebef067 PMD 0
[ 3663.344773] Oops: 0000 [#1] SMP
[ 3663.348455] Modules linked in: shpchp gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev sb_edac edac_core lpc_ich ipmi_si tpm_tis ipmi_msghandler ioatdma wmi acpi_pad mac_hid lp parport ixgbe isci mpt2sas dca ahci ptp libsas libahci raid_class pps_core scsi_transport_sas mdio hid_generic usbhid hid
[ 3663.394393] CPU: 61 PID: 2416 Comm: cron Tainted: G W 3.14.0-rc5+ #21
[ 3663.402643] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.F03.1403031049 03/03/2014
[ 3663.414299] task: ffff88082fe54b00 ti: ffff880845fba000 task.ti: ffff880845fba000
[ 3663.422741] RIP: 0010:[<ffffffff81172219>] [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.432857] RSP: 0018:ffff880845fbbcd0 EFLAGS: 00010246
[ 3663.439265] RAX: 0000000000001f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 3663.447291] RDX: 0000000000000000 RSI: 0000000000000a8d RDI: ffffffff81a8d950
[ 3663.455318] RBP: ffff880845fbbd58 R08: ffff880823293400 R09: 0000000000000001
[ 3663.463345] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000002052d0
[ 3663.471363] R13: ffff880854c07600 R14: 0000000000000002 R15: 0000000000000000
[ 3663.479389] FS: 00007f2e8b99e800(0000) GS:ffff88105a400000(0000) knlGS:0000000000000000
[ 3663.488514] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3663.495018] CR2: 0000000000001f08 CR3: 00000008237b1000 CR4: 00000000001407e0
[ 3663.503476] Stack:
[ 3663.505757] ffffffff811bd74d ffff880854c01d98 ffff880854c01df0 ffff880854c01dd0
[ 3663.514167] 00000003208ca420 000000075a5d84d0 ffff88082fe54b00 ffffffff811bb35f
[ 3663.522567] ffff880854c07600 0000000000000003 0000000000001f00 ffff880845fbbd48
[ 3663.530976] Call Trace:
[ 3663.533753] [<ffffffff811bd74d>] ? deactivate_slab+0x41d/0x4f0
[ 3663.540421] [<ffffffff811bb35f>] ? new_slab+0x3f/0x2d0
[ 3663.546307] [<ffffffff811bb3c5>] new_slab+0xa5/0x2d0
[ 3663.552001] [<ffffffff81768c97>] __slab_alloc+0x35d/0x54a
[ 3663.558185] [<ffffffff810a4845>] ? local_clock+0x25/0x30
[ 3663.564686] [<ffffffff8177a34c>] ? __do_page_fault+0x4ec/0x5e0
[ 3663.571356] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.578609] [<ffffffff810c77f1>] ? __raw_spin_lock_init+0x21/0x60
[ 3663.585570] [<ffffffff811be476>] kmem_cache_alloc_node_trace+0xa6/0x1d0
[ 3663.593112] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.600363] [<ffffffff810b0054>] alloc_fair_sched_group+0xc4/0x190
[ 3663.607423] [<ffffffff810a359f>] sched_create_group+0x3f/0x80
[ 3663.613994] [<ffffffff810b611f>] sched_autogroup_create_attach+0x3f/0x1b0
[ 3663.621732] [<ffffffff8108258a>] sys_setsid+0xea/0x110
[ 3663.628020] [<ffffffff8177f42d>] system_call_fastpath+0x1a/0x1f
[ 3663.634780] Code: 00 44 89 e7 e8 b9 f8 f4 ff 41 f6 c4 10 74 18 31 d2 be 8d 0a 00 00 48 c7 c7 50 d9 a8 81 e8 70 6a f2 ff e8 db dd 5f 00 48 8b 45 c8 <48> 83 78 08 00 0f 84 b5 01 00 00 48 83 c0 08 44 89 75 c0 4d 89
[ 3663.657032] RIP [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.664491] RSP <ffff880845fbbcd0>
[ 3663.668429] CR2: 0000000000001f08
[ 3663.672659] ---[ end trace df13f08ed9de18ad ]---

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/kernel/acpi/boot.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b5641703a49..00c2ed507460 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
set_apicid_to_node(physid, nid);
+ try_online_node(nid);
numa_set_node(cpu, nid);
if (node_online(nid))
set_cpu_numa_mem(cpu, local_memory_node(nid));
--
1.7.10.4

Nishanth Aravamudan
2014-07-24 23:30:27 UTC
Permalink
Post by Jiang Liu
With typical CPU hot-addition flow on x86, PCI host bridges embedded
in physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
acpi_processor_add()
acpi_processor_get_info()
acpi_processor_hotadd_init()
acpi_map_lsapic()
1.a) acpi_map_cpu2node()
2) Handle PCI host bridge hot-addition notification
acpi_pci_root_add()
pci_acpi_scan_root()
2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
3) Handle memory hot-addition notification
acpi_memory_device_add()
acpi_memory_enable_device()
add_memory()
3.a) node_set_online();
4) Online CPUs through sysfs interfaces
cpu_subsys_online()
cpu_up()
try_online_node()
4.a) node_set_online();
So associated node is always in offline state because it is not onlined
until step 3.a or 4.a.
We could improve performance by online node at step 1.a. This change
also makes the code symmetric. Nodes are always created when handling
CPU/memory hot-addition events instead of handling user requests from
sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
events.
It seems like this patch has little to nothing to do with the rest of
the series and can be sent on its own?
Post by Jiang Liu
It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
To be clear, the race is that on some x86 platforms, there is a period
of time where a node ID returned by cpu_to_node() is offline.

<snip>
Post by Jiang Liu
---
arch/x86/kernel/acpi/boot.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b5641703a49..00c2ed507460 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
set_apicid_to_node(physid, nid);
+ try_online_node(nid);
try_online_node() seems like it can fail? I assume it's a pretty rare
case, but should the return code be checked?

If it does fail, it seems like there are pretty serious problems and we
shouldn't be onlining this CPU, etc.?
Post by Jiang Liu
numa_set_node(cpu, nid);
if (node_online(nid))
set_cpu_numa_mem(cpu, local_memory_node(nid));
Which means you can remove this check presuming try_online_node()
returned 0.
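
The suggestion above (check the return value, then drop the node_online() test) could look roughly like the following user-space sketch. Everything here is hypothetical: the model_* names, the failure hook, the int return type of the mapping function (the function in the patch returns void), and the choice of -ENOMEM as the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define NR_NODES	2
#define NR_CPUS		2

static bool node_is_online[NR_NODES];
static bool fail_next_online;	/* test hook: make the next onlining fail */
static int cpu_node[NR_CPUS] = { NUMA_NO_NODE, NUMA_NO_NODE };
static int cpu_numa_mem[NR_CPUS] = { NUMA_NO_NODE, NUMA_NO_NODE };

void model_fail_next_online(void)
{
	fail_next_online = true;
}

/* Stand-in for try_online_node(): 0 on success, negative errno on failure. */
int model_try_online_node(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (node_is_online[nid])
		return 0;
	if (fail_next_online) {
		fail_next_online = false;
		return -ENOMEM;
	}
	node_is_online[nid] = true;
	return 0;
}

int model_cpu_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_node[cpu];
}

/* Maps a CPU to a node only if the node could be brought online. */
int model_map_cpu2node(int cpu, int nid)
{
	int ret;

	assert(cpu >= 0 && cpu < NR_CPUS);
	ret = model_try_online_node(nid);
	if (ret)
		return ret;	/* leave the CPU unmapped on failure */

	cpu_node[cpu] = nid;
	/* ret == 0 implies the node is online, so no separate check is needed. */
	cpu_numa_mem[cpu] = nid;
	return 0;
}
```

Whether the caller should refuse to online the CPU when the mapping fails is a policy question the sketch leaves open; it only shows that a successful return makes the later online check redundant.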

Thanks,
Nish

Jiang Liu
2014-07-25 01:43:16 UTC
Permalink
Post by Nishanth Aravamudan
Post by Jiang Liu
With typical CPU hot-addition flow on x86, PCI host bridges embedded
in physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
acpi_processor_add()
acpi_processor_get_info()
acpi_processor_hotadd_init()
acpi_map_lsapic()
1.a) acpi_map_cpu2node()
2) Handle PCI host bridge hot-addition notification
acpi_pci_root_add()
pci_acpi_scan_root()
2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
3) Handle memory hot-addition notification
acpi_memory_device_add()
acpi_memory_enable_device()
add_memory()
3.a) node_set_online();
4) Online CPUs through sysfs interfaces
cpu_subsys_online()
cpu_up()
try_online_node()
4.a) node_set_online();
So associated node is always in offline state because it is not onlined
until step 3.a or 4.a.
We could improve performance by online node at step 1.a. This change
also makes the code symmetric. Nodes are always created when handling
CPU/memory hot-addition events instead of handling user requests from
sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
events.
It seems like this patch has little to nothing to do with the rest of
the series and can be sent on its own?
Post by Jiang Liu
It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
To be clear, the race is that on some x86 platforms, there is a period
of time where a node ID returned by cpu_to_node() is offline.
<snip>
Post by Jiang Liu
---
arch/x86/kernel/acpi/boot.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b5641703a49..00c2ed507460 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
set_apicid_to_node(physid, nid);
+ try_online_node(nid);
try_online_node() seems like it can fail? I assume it's a pretty rare
case, but should the return code be checked?
Good suggestion, I should split out this patch to fix the crash.
Post by Nishanth Aravamudan
If it does fail, it seems like there are pretty serious problems and we
shouldn't be onlining this CPU, etc.?
Post by Jiang Liu
numa_set_node(cpu, nid);
if (node_online(nid))
set_cpu_numa_mem(cpu, local_memory_node(nid));
Which means you can remove this check presuming try_online_node()
returned 0.
Yes, that's true.
Post by Nishanth Aravamudan
Thanks,
Nish
Jiang Liu
2014-07-25 01:44:36 UTC
Permalink
Post by Nishanth Aravamudan
Post by Jiang Liu
With the typical CPU hot-addition flow on x86, PCI host bridges embedded
in a physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
acpi_processor_add()
acpi_processor_get_info()
acpi_processor_hotadd_init()
acpi_map_lsapic()
1.a) acpi_map_cpu2node()
2) Handle PCI host bridge hot-addition notification
acpi_pci_root_add()
pci_acpi_scan_root()
2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
3) Handle memory hot-addition notification
acpi_memory_device_add()
acpi_memory_enable_device()
add_memory()
3.a) node_set_online();
4) Online CPUs through sysfs interfaces
cpu_subsys_online()
cpu_up()
try_online_node()
4.a) node_set_online();
So the associated node is always offline at step 2.a, because it is not
onlined until step 3.a or 4.a.
We could improve performance by onlining the node at step 1.a. This change
also makes the code symmetric. Nodes are always created when handling
CPU/memory hot-addition events rather than when handling user requests from
sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
events.
It seems like this patch has little to nothing to do with the rest of
the series and can be sent on its own?
Post by Jiang Liu
It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
To be clear, the race is that on some x86 platforms, there is a period
of time where a node ID returned by cpu_to_node() is offline.
<snip>
Post by Jiang Liu
---
arch/x86/kernel/acpi/boot.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b5641703a49..00c2ed507460 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
set_apicid_to_node(physid, nid);
+ try_online_node(nid);
try_online_node() seems like it can fail? I assume it's a pretty rare
case, but should the return code be checked?
If it does fail, it seems like there are pretty serious problems and we
shouldn't be onlining this CPU, etc.?
Post by Jiang Liu
numa_set_node(cpu, nid);
if (node_online(nid))
set_cpu_numa_mem(cpu, local_memory_node(nid));
Which means you can remove this check presuming try_online_node()
returned 0.
Good suggestion, will try to enhance the error handling path.
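
One possible shape for that error handling, as a user-space sketch rather than the actual patch. try_online_node_stub() is a stub that only assumes the usual "0 on success, negative errno on failure" convention; map_cpu_to_node(), the arrays and the fail_next_online test hook are hypothetical names. The sketch also stores nid directly as the memory node, where the patch calls local_memory_node(nid).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_NODES 4
#define MAX_CPUS 8
#define NO_NODE (-1)

static bool node_online_map[MAX_NODES];
static int cpu_node[MAX_CPUS];
static int cpu_mem_node[MAX_CPUS];
static bool fail_next_online;	/* test hook: force the stub to fail once */

/* Stub following the 0-or-negative-errno convention; not the kernel function. */
static int try_online_node_stub(int nid)
{
	if (nid < 0 || nid >= MAX_NODES)
		return -EINVAL;
	if (fail_next_online) {
		fail_next_online = false;
		return -ENOMEM;
	}
	node_online_map[nid] = true;
	return 0;
}

/*
 * Map a CPU to its node. If the node cannot be onlined, the CPU is left
 * unmapped and the error is passed up so the caller can refuse to bring
 * the CPU up.
 */
static int map_cpu_to_node(int cpu, int nid)
{
	int ret;

	assert(cpu >= 0 && cpu < MAX_CPUS);
	ret = try_online_node_stub(nid);
	if (ret) {
		cpu_node[cpu] = NO_NODE;
		cpu_mem_node[cpu] = NO_NODE;
		return ret;
	}
	cpu_node[cpu] = nid;
	/* Success implies the node is online, so no node_online() re-check. */
	cpu_mem_node[cpu] = nid;
	return 0;
}
```

With the return value propagated, the separate node_online(nid) test becomes redundant on the success path, which is the simplification suggested above.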
Post by Nishanth Aravamudan
Thanks,
Nish
Jiang Liu
2014-07-11 07:37:46 UTC
Permalink
With the current implementation, all CPUs within a NUMA node will be
associated with another NUMA node if the node has no memory installed.

For example, on a four-node system, CPUs on nodes 2 and 3 are associated
with node 0 when no memory is installed on nodes 2 and 3, which may
confuse users.
***@bkd01sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 15014 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15686 MB
node distances:
node 0 1
0: 10 21
1: 21 10

Worse still, the CPU affinity relationship is not fixed even after
memory has been added to those nodes. After memory hot-addition to
node 2, CPUs on node 2 are still associated with node 0. This may cause
sub-optimal performance.
***@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 14743 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15715 MB
node 2 cpus:
node 2 size: 128 MB
node 2 free: 128 MB
node distances:
node 0 1 2
0: 10 21 21
1: 21 10 21
2: 21 21 10

With support for memoryless nodes enabled, the kernel correctly reports
the system hardware topology for nodes without memory installed.
***@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10

With memoryless nodes enabled, CPUs are correctly associated with node 2
after memory hot-addition to node 2.
***@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
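
The relationship between a CPU's own node and the node its memory comes from can be sketched in user-space C using the distance table and node sizes printed above. This is an illustration only: nearest_memory_node() is a hypothetical helper that picks the closest node with memory by distance, which is not necessarily how the kernel's local_memory_node() derives its answer.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NO_NODE (-1)

/* Node distances as printed by numactl above. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 21, 21 },
	{ 21, 10, 21, 21 },
	{ 21, 21, 10, 21 },
	{ 21, 21, 21, 10 },
};

/* Installed memory per node in MB; nodes 2 and 3 start out empty. */
static unsigned long node_size_mb[NR_NODES] = { 15725, 15862, 0, 0 };

/*
 * Nearest node that has memory. A node with memory always selects
 * itself (distance 10); ties go to the lowest node ID.
 */
static int nearest_memory_node(int nid)
{
	int n, best = NO_NODE, best_dist = INT_MAX;

	assert(nid >= 0 && nid < NR_NODES);
	for (n = 0; n < NR_NODES; n++) {
		if (!node_size_mb[n])
			continue;
		if (distance[nid][n] < best_dist) {
			best_dist = distance[nid][n];
			best = n;
		}
	}
	return best;
}
```

With nodes 2 and 3 empty, nearest_memory_node(2) yields node 0 while the CPUs themselves stay on node 2; setting node_size_mb[2] to 128, as in the hot-add case above, makes the same call yield node 2.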

Signed-off-by: Jiang Liu <***@linux.intel.com>
---
arch/x86/Kconfig | 3 +++
arch/x86/kernel/acpi/boot.c | 5 ++++-
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
4 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749ef0fdc..f35b25b88625 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
 	def_bool y
 	depends on NUMA
 
+config HAVE_MEMORYLESS_NODES
+	def_bool NUMA
+
 config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 86281ffb96d6..3b5641703a49 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 	if (nid != -1) {
 		set_apicid_to_node(physid, nid);
 		numa_set_node(cpu, nid);
+		if (node_online(nid))
+			set_cpu_numa_mem(cpu, local_memory_node(nid));
 	}
 #endif
 }
@@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
 {
 #ifdef CONFIG_ACPI_NUMA
 	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+	set_cpu_numa_mem(cpu, NUMA_NO_NODE);
 #endif
 
-	per_cpu(x86_cpu_to_apicid, cpu) = -1;
+	per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
 	set_cpu_present(cpu, false);
 	num_processors--;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..4a5437989ffe 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
 		      __func__, cpuid);
 	}
 
+	set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
 	/*
 	 * the boot CPU has finished the init stage and is spinning
 	 * on callin_map until we finish. We are free to set up this
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index eec4f6c322bb..0d17c05480d2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -22,6 +22,7 @@
 
 int __initdata numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+static nodemask_t numa_nodes_empty __initdata;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
@@ -523,8 +524,12 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start < end)
+		if (start < end) {
 			setup_node_data(nid, start, end);
+		} else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+			setup_node_data(nid, 0, 0);
+			node_set(nid, numa_nodes_empty);
+		}
 	}
 
 	/* Dump memblock with node info and return. */
@@ -541,14 +546,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
  */
 static void __init numa_init_array(void)
 {
-	int rr, i;
+	int i, rr = MAX_NUMNODES;
 
-	rr = first_node(node_online_map);
 	for (i = 0; i < nr_cpu_ids; i++) {
+		/* Search for an onlined node with memory */
+		do {
+			if (rr != MAX_NUMNODES)
+				rr = next_node(rr, node_online_map);
+			if (rr == MAX_NUMNODES)
+				rr = first_node(node_online_map);
+		} while (!node_spanned_pages(rr));
+
 		numa_set_node(i, rr);
-		rr = next_node(rr, node_online_map);
-		if (rr == MAX_NUMNODES)
-			rr = first_node(node_online_map);
 	}
 }
 
@@ -694,9 +703,12 @@ static __init int find_near_online_node(int node)
 {
 	int n, val;
 	int min_val = INT_MAX;
-	int best_node = -1;
+	int best_node = NUMA_NO_NODE;
 
 	for_each_online_node(n) {
+		if (!node_spanned_pages(n))
+			continue;
+
 		val = node_distance(node, n);
 
 		if (val < min_val) {
@@ -737,6 +749,22 @@ void __init init_cpu_to_node(void)
 		if (!node_online(node))
 			node = find_near_online_node(node);
 		numa_set_node(cpu, node);
+		if (node_spanned_pages(node))
+			set_cpu_numa_mem(cpu, node);
+		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
+			node_clear(node, numa_nodes_empty);
+	}
+
+	/* Destroy empty nodes */
+	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+		int nid;
+		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+
+		for_each_node_mask(nid, numa_nodes_empty) {
+			node_set_offline(nid);
+			memblock_free(__pa(node_data[nid]), nd_size);
+			node_data[nid] = NULL;
+		}
 	}
 }
--
1.7.10.4
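
For readers following the numa_init_array() hunk, the round-robin search can be modelled in user-space C as below. This is a sketch under assumptions, not the kernel code: the arrays, next_online() and assign_round_robin() are hypothetical, and the up-front check for at least one online node with memory is an addition of the sketch (the hunk itself appears to assume such a node exists, otherwise its do/while would not terminate).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4
#define NR_CPUS 6

static bool node_online[MAX_NODES];
static unsigned long spanned_pages[MAX_NODES];
static int cpu_node[NR_CPUS];

/* Next online node after nid, or MAX_NODES if there is none. */
static int next_online(int nid)
{
	assert(nid >= -1 && nid < MAX_NODES);
	for (nid++; nid < MAX_NODES; nid++)
		if (node_online[nid])
			return nid;
	return MAX_NODES;
}

/*
 * Spread CPUs round-robin over online nodes that span memory.
 * Returns false if no such node exists.
 */
static bool assign_round_robin(void)
{
	int i, n, rr = MAX_NODES;
	bool found = false;

	for (n = 0; n < MAX_NODES; n++)
		if (node_online[n] && spanned_pages[n])
			found = true;
	if (!found)
		return false;

	for (i = 0; i < NR_CPUS; i++) {
		/* Advance to the next online node, wrapping, until one has memory. */
		do {
			if (rr != MAX_NODES)
				rr = next_online(rr);
			if (rr == MAX_NODES)
				rr = next_online(-1);
		} while (!spanned_pages[rr]);
		cpu_node[i] = rr;
	}
	return true;
}
```

With nodes 0, 1 and 2 online and only nodes 0 and 2 spanning memory, the CPUs alternate between node 0 and node 2, and the memoryless node 1 is skipped.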

Nishanth Aravamudan
2014-07-24 23:26:05 UTC
Permalink
Post by Jiang Liu
With current implementation, all CPUs within a NUMA node will be
associated with another NUMA node if the node has no memory installed.
<snip>
Post by Jiang Liu
---
arch/x86/Kconfig | 3 +++
arch/x86/kernel/acpi/boot.c | 5 ++++-
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
4 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749ef0fdc..f35b25b88625 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
+config HAVE_MEMORYLESS_NODES
+ def_bool NUMA
+
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 86281ffb96d6..3b5641703a49 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
if (nid != -1) {
set_apicid_to_node(physid, nid);
numa_set_node(cpu, nid);
+ if (node_online(nid))
+ set_cpu_numa_mem(cpu, local_memory_node(nid));
How common is it for this method to be called for a CPU on an offline
node? Aren't you fixing this in the next patch (so maybe the order
should be changed?)?
Post by Jiang Liu
}
#endif
}
@@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
{
#ifdef CONFIG_ACPI_NUMA
set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
- per_cpu(x86_cpu_to_apicid, cpu) = -1;
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
I think this is an unrelated change?
Post by Jiang Liu
set_cpu_present(cpu, false);
num_processors--;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..4a5437989ffe 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
__func__, cpuid);
}
+ set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
Note that you might hit the same issue I reported on powerpc, if
smp_callin() is part of smp_init(). The waitqueue initialization code
depends on cpu_to_node() [and eventually cpu_to_mem()] to be initialized
quite early.

Thanks,
Nish

Jiang Liu
2014-07-25 01:41:22 UTC
Permalink
Post by Nishanth Aravamudan
Post by Jiang Liu
With current implementation, all CPUs within a NUMA node will be
associated with another NUMA node if the node has no memory installed.
<snip>
Post by Jiang Liu
---
arch/x86/Kconfig | 3 +++
arch/x86/kernel/acpi/boot.c | 5 ++++-
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
4 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749ef0fdc..f35b25b88625 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
+config HAVE_MEMORYLESS_NODES
+ def_bool NUMA
+
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 86281ffb96d6..3b5641703a49 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
if (nid != -1) {
set_apicid_to_node(physid, nid);
numa_set_node(cpu, nid);
+ if (node_online(nid))
+ set_cpu_numa_mem(cpu, local_memory_node(nid));
How common is it for this method to be called for a CPU on an offline
node? Aren't you fixing this in the next patch (so maybe the order
should be changed?)?
Hi Nishanth,
For physical CPU hot-addition, as opposed to onlining a logical CPU
through sysfs, the node is always offline at this point.
In v2, I have reordered the patch set so that patch 30 goes first.
Post by Nishanth Aravamudan
Post by Jiang Liu
}
#endif
}
@@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
{
#ifdef CONFIG_ACPI_NUMA
set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
- per_cpu(x86_cpu_to_apicid, cpu) = -1;
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
I think this is an unrelated change?
Thanks for the reminder; it's unrelated to memoryless node support.
Post by Nishanth Aravamudan
Post by Jiang Liu
set_cpu_present(cpu, false);
num_processors--;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..4a5437989ffe 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
__func__, cpuid);
}
+ set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
Note that you might hit the same issue I reported on powerpc, if
smp_callin() is part of smp_init(). The waitqueue initialization code
depends on cpu_to_node() [and eventually cpu_to_mem()] to be initialized
quite early.
Thanks for the reminder. Patches 29 and 30 together will set up the
cpu_to_mem() array when enumerating CPUs for hot-add events, so it should
be ready for use when those CPUs are onlined.

Regards!
Gerry
Post by Nishanth Aravamudan
Thanks,
Nish
Peter Zijlstra
2014-07-11 08:29:56 UTC
Permalink
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
Greg KH
2014-07-11 15:33:14 UTC
Permalink
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?

greg k-h

Dave Hansen
2014-07-11 20:02:14 UTC
Permalink
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
This is orthogonal to the problem Jiang Liu is solving, but...

The IBM guys have been hitting the CPU-less and memoryless node issues
forever, but that's mostly because their (traditional) hypervisor had
good NUMA support and ran multi-node guests.

I've never seen it in practice on x86 mostly because the hypervisors
don't have good NUMA support. I honestly think this is something x86 is
going to have to handle eventually anyway. It's essentially a resource
fragmentation problem, and there are going to be times where a guest
needs to be spun up and hypervisor has nodes with either no spare memory
or no spare CPUs.

The hypervisor has 3 choices in this case:
1. Lie about the NUMA layout
2. Waste the resources
3. Tell the guest how it's actually arranged


Andi Kleen
2014-07-11 20:20:51 UTC
Permalink
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
We've always had memoryless nodes.

A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.

There are other cases too.

-Andi
--
***@linux.intel.com -- Speaking for myself only
Peter Zijlstra
2014-07-11 20:51:06 UTC
Permalink
Post by Andi Kleen
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
We've always had memoryless nodes.
A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.
That's an obvious "don't do that, then" case. It's silly.
Post by Andi Kleen
There are other cases too.
Are there any sane ones?

Andi Kleen
2014-07-11 21:58:37 UTC
Permalink
Post by Peter Zijlstra
Post by Andi Kleen
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
We've always had memoryless nodes.
A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.
That's an obvious "don't do that, then" case. It's silly.
True. We should recommend that anyone running Linux will email you
for approval of their configuration first.
Post by Peter Zijlstra
Post by Andi Kleen
There are other cases too.
Are there any sane ones?
Yes.

-Andi
David Rientjes
2014-07-15 01:18:08 UTC
Permalink
Post by Peter Zijlstra
Post by Andi Kleen
There are other cases too.
Are there any sane ones?
They are specifically allowed by the ACPI specification to be able to
include only cpus, I/O, networking cards, etc.

H. Peter Anvin
2014-07-11 23:51:01 UTC
Permalink
Post by Andi Kleen
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
We've always had memoryless nodes.
A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.
There are other cases too.
Yes, like a node controller-based system where the system can be
populated with either memory cards or CPU cards, for example. Now you
can have both memoryless nodes and memory-only nodes...

Memory-only nodes also happen in real life. In some cases they are done
by permanently putting low-frequency CPUs to sleep for their memory
controllers.

-hpa


Jiri Kosina
2014-07-11 22:40:40 UTC
Permalink
Post by Greg KH
Post by Peter Zijlstra
Post by Jiang Liu
Any comments are welcomed!
Why would anybody _ever_ have a memoryless node? That's ridiculous.
I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?
I am pretty sure I've seen a ppc64 machine with a memoryless NUMA node.
--
Jiri Kosina
SUSE Labs

David Rientjes
2014-07-15 01:19:55 UTC
Permalink
Post by Jiri Kosina
I am pretty sure I've seen a ppc64 machine with a memoryless NUMA node.
Yes, Nishanth Aravamudan (now cc'd) has been working diligently on the
problems that have been encountered, including problems in generic kernel
code, on powerpc with memoryless nodes.

Nish Aravamudan
2014-07-18 17:48:04 UTC
Permalink
Hi David,
Post by David Rientjes
Post by Jiri Kosina
I am pretty sure I've seen a ppc64 machine with a memoryless NUMA node.
Yes, Nishanth Aravamudan (now cc'd) has been working diligently on the
problems that have been encountered, including problems in generic kernel
code, on powerpc with memoryless nodes.
Thanks for Cc'ing me on this discussion. I'm going to review Jiang's
patchset now, as best I can, but yes I can confirm we see memoryless nodes
somewhat frequently on powerpc under PowerVM, presumably due to hypervisor
fragmentation (the reason isn't clear to an LPAR, as it's just given a
topology).

I agree with Dave Hansen that this seems like a "good thing" to try and
figure out, unless KVM decides it's going to hide the underlying topology
of a guest's memory from the guest -- which I think could lead (eventually)
to confusing performance results.

I believe I have also seen them in hardware on ia64 (cpu-only and
memory-only drawers), but not sure if those specific models are in
production still.

Finally, I will say that in working on supporting memoryless nodes, I've
come across what look like bugs in the NUMA code. Or more accurately,
assumptions which aren't always true. So it's a useful exercise for that
reason too.

Thanks,
Nish
Nishanth Aravamudan
2014-07-21 17:23:31 UTC
Permalink
Hi Jiang,
Post by Jiang Liu
Previously we posted a patch to fix a memory crash issue caused by
memoryless node on x86 platforms, please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425
As suggested by David Rientjes, the most suitable fix for the issue
should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
So this is the patchset according to David's suggestion.
Hrm, that is initially what David said, but then later on in the thread,
he specifically says he doesn't think memoryless nodes are the problem.
It seems like the issue is the order of onlining of resources on a
specific x86 platform?

Memoryless nodes in and of themselves don't cause the kernel to crash.
powerpc boots with them (both previously without
CONFIG_HAVE_MEMORYLESS_NODES and now with it) and is functional,
although it does lead to some performance issues I'm hoping to resolve.
In fact, David specifically says that the kernel crash you triggered
makes sense as cpu_to_node() points to an offline node?

In any case, a blind s/cpu_to_node/cpu_to_mem/ is not always correct.
There is a semantic difference, and in some cases the allocators already
do the right thing under the covers (falling back to the nearest node) and
in some cases they don't.
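
A small user-space sketch of that semantic difference, under an assumed toy topology (two CPUs per node, nodes 2 and 3 without memory). sketch_cpu_to_node(), sketch_cpu_to_mem() and cpus_on_node() are hypothetical names, not the kernel helpers; the point is only that the two mappings answer different questions.

```c
#include <assert.h>

#define NR_NODES 4
#define NR_CPUS 8

/* Hypothetical topology: two CPUs per node; nodes 2 and 3 have no memory. */
static const int topo_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };
/* Nearest node with memory for each node (assumed, for illustration). */
static const int mem_node[NR_NODES] = { 0, 1, 0, 0 };

/* Which node does this CPU physically belong to? */
static int sketch_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return topo_node[cpu];
}

/* Which node should this CPU's memory allocations come from? */
static int sketch_cpu_to_mem(int cpu)
{
	return mem_node[sketch_cpu_to_node(cpu)];
}

/* Count the CPUs that a given mapping attributes to nid. */
static int cpus_on_node(int nid, int (*map)(int cpu))
{
	int cpu, count = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (map(cpu) == nid)
			count++;
	return count;
}
```

Grouping CPUs by sketch_cpu_to_mem() reports six CPUs on node 0 and none on node 2, which is fine for choosing where to allocate but wrong for anything that describes topology; that is why a mechanical substitution is not safe everywhere.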

Thanks,
Nish

Tony Luck
2014-07-21 17:41:59 UTC
Permalink
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
Yes. When we online a node the BIOS hits us with some ACPI hotplug events:

First: Here are some new cpus
Next: Here is some new memory
Last: Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)

So there is a period where the node is memoryless - although that will generally
be resolved when the memory hot plug event arrives ... that isn't guaranteed to
occur (there might not be any memory on the node, or what memory there is
may have failed self-test and been disabled).

-Tony
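
The ordering described above can be modelled as a tiny state machine in user-space C. This is a sketch only: the event names, struct node_state and the memory_ok flag (standing in for DIMMs that are present and pass self-test) are hypothetical and do not correspond to ACPI or kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

enum hp_event { HP_CPUS, HP_MEMORY, HP_IO };

struct node_state {
	bool has_cpus;
	bool has_memory;
	bool has_io;
};

/* Apply one hot-add event; memory_ok models usable memory actually arriving. */
static void apply_event(struct node_state *ns, enum hp_event ev, bool memory_ok)
{
	assert(ns);
	switch (ev) {
	case HP_CPUS:
		ns->has_cpus = true;
		break;
	case HP_MEMORY:
		if (memory_ok)
			ns->has_memory = true;
		break;
	case HP_IO:
		ns->has_io = true;
		break;
	}
}

/* A node is memoryless while it has CPUs or I/O but no memory. */
static bool node_is_memoryless(const struct node_state *ns)
{
	return (ns->has_cpus || ns->has_io) && !ns->has_memory;
}
```

After HP_CPUS the node is memoryless; it stops being so only if a later HP_MEMORY event arrives with usable memory, and otherwise remains memoryless indefinitely.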
Nishanth Aravamudan
2014-07-21 17:57:36 UTC
Permalink
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?
Post by Tony Luck
Next: Here is some new memory
And then update the NUMA topology at this point? That is,
set_cpu_numa_node/mem as appropriate so the underlying allocators do the
right thing?
Post by Tony Luck
Last; Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)
So there is a period where the node is memoryless - although that will
generally be resolved when the memory hot plug event arrives ... that
isn't guaranteed to occur (there might not be any memory on the node,
or what memory there is may have failed self-test and been disabled).
Right, but the allocator(s) generally does the right thing already in
the face of memoryless nodes -- they fallback to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet? So NODE_DATA
access crashes?

Thanks,
Nish

Jiang Liu
2014-07-23 08:20:24 UTC
Permalink
Post by Nishanth Aravamudan
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining NUMA node
as early as possible. Actually we are trying to enable memoryless node
as you have suggested.

Regards!
Gerry
Post by Nishanth Aravamudan
Post by Tony Luck
Next: Here is some new memory
And then update the NUMA topology at this point? That is,
set_cpu_numa_node/mem as appropriate so the underlying allocators do the
right thing?
Post by Tony Luck
Last; Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)
So there is a period where the node is memoryless - although that will
generally be resolved when the memory hot plug event arrives ... that
isn't guaranteed to occur (there might not be any memory on the node,
or what memory there is may have failed self-test and been disabled).
Right, but the allocator(s) generally does the right thing already in
the face of memoryless nodes -- they fallback to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet? So NODE_DATA
access crashes?
Thanks,
Nish
Nishanth Aravamudan
2014-07-24 23:32:30 UTC
Permalink
Post by Jiang Liu
Post by Nishanth Aravamudan
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining the NUMA node
as early as possible. Actually, we are trying to enable memoryless nodes
as you have suggested.
Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present.

Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?

Thanks,
Nish

Jiang Liu
2014-07-25 01:50:01 UTC
Permalink
Post by Nishanth Aravamudan
Post by Jiang Liu
Post by Nishanth Aravamudan
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining the NUMA node
as early as possible. Actually, we are trying to enable memoryless nodes
as you have suggested.
Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present.
Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?
Hi Nishanth,
After following the two threads related to support of memoryless
nodes and digging through more code, I realized my first-version patch set is
overkill. As Tejun has pointed out, we shouldn't expose the details of
memoryless nodes to normal users, but there are still some special users
who need the details. So I have tried to summarize it as:
1) Arch code should online the corresponding NUMA node before onlining any
CPU or memory; otherwise it may cause an invalid memory access when
accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE set in
gfp_flags, we should prefer numa_node_id()/cpu_to_node() over
numa_mem_id()/cpu_to_mem(), because the latter loses hardware topology
information, as pointed out by Tejun:
A - B - X - C - D
Where X is the memless node. numa_mem_id() on X would return
either B or C, right? If B or C can't satisfy the allocation,
the allocator would fall back to A from B and to D from C, both of
which aren't optimal. It should first fall back to C or B
respectively, which the allocator can't do anymore because the
information is lost when the caller side performs numa_mem_id().
3) For memory allocations with __GFP_THISNODE set in gfp_flags,
numa_node_id()/cpu_to_node() should be used if the caller only wants to
allocate from local memory; numa_mem_id()/cpu_to_mem() should be used
instead if the caller wants to allocate from the nearest node with memory.
4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to check
whether a page was allocated from the nearest node.
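
The A - B - X - C - D argument in rule 2 can be sketched with a toy userspace model. Everything here is an assumption made for illustration: the node ids, the hop-count distance, and the function names are hypothetical, and the kernel's real zonelist construction is more elaborate. The sketch only shows that a fallback order built from X differs from one built from B, which is the information a caller discards by resolving numa_mem_id() up front.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy linear topology from the example: A - B - X - C - D. */
enum { NODE_A, NODE_B, NODE_X, NODE_C, NODE_D, NR_MODEL_NODES };

/* X is the memoryless node; this table is illustrative, not kernel data. */
static const bool model_has_memory[NR_MODEL_NODES] = {
    true, true, false, true, true
};

/* On a line, the distance is simply the number of hops between nodes. */
int model_distance(int from, int to)
{
    return abs(from - to);
}

/*
 * Fill order[] with the memory-bearing nodes sorted by distance from
 * 'origin' (ties go to the lower node id) and return how many were
 * written. Loosely mimics deriving a fallback order for 'origin'.
 */
int model_fallback_order(int origin, int order[NR_MODEL_NODES])
{
    bool used[NR_MODEL_NODES] = { false };
    int count = 0;

    assert(origin >= 0 && origin < NR_MODEL_NODES);
    for (;;) {
        int best = -1;

        for (int n = 0; n < NR_MODEL_NODES; n++) {
            if (!model_has_memory[n] || used[n])
                continue;
            if (best < 0 ||
                model_distance(origin, n) < model_distance(origin, best))
                best = n;
        }
        if (best < 0)
            break;
        used[best] = true;
        order[count++] = best;
    }
    return count;
}

/* Stand-in for numa_mem_id() on a CPU of 'node': nearest node with memory. */
int model_nearest_mem_node(int node)
{
    int order[NR_MODEL_NODES];

    return model_fallback_order(node, order) > 0 ? order[0] : -1;
}
```

With this model, the order built from X is B, C, A, D, whereas the order built from B (what the allocator sees once the caller has already substituted B for X) is B, A, C, D, so the second attempt lands on A rather than the closer C.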

And my v2 patch set is based on the above rules.
Any suggestions here?
Regards!
Gerry
Post by Nishanth Aravamudan
Thanks,
Nish
Nishanth Aravamudan
2014-08-18 23:30:41 UTC
Permalink
Hi Gerry,
Post by Jiang Liu
Post by Nishanth Aravamudan
Post by Jiang Liu
Post by Nishanth Aravamudan
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining the NUMA node
as early as possible. Actually, we are trying to enable memoryless nodes
as you have suggested.
Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present.
Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?
Hi Nishanth,
After following the two threads related to support of memoryless
nodes and digging through more code, I realized my first-version patch set is
overkill. As Tejun has pointed out, we shouldn't expose the details of
memoryless nodes to normal users, but there are still some special users
1) Arch code should online the corresponding NUMA node before onlining any
CPU or memory; otherwise it may cause an invalid memory access when
accessing NODE_DATA(nid).
I think that's reasonable.

A related caveat is that NUMA topology information should be stored as
early as possible in boot for *all* CPUs [I think only cpu_to_* is used,
at least for now], not just the boot CPU, etc. This is because (at least
from my examination) pre-SMP initcalls are not prevented from using
cpu_to_node(), which will falsely return 0 for all CPUs until
set_cpu_numa_node() is called.
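
A minimal userspace sketch of this caveat follows. The model_* names are hypothetical, and the "checked" lookup is an illustrative addition rather than anything proposed in this thread; the point is only that a zero-initialised mapping cannot be told apart from a genuine "node 0" answer.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8
#define MODEL_NO_NODE (-1)

/* Zero-initialised, like a per-CPU variable nobody has written yet. */
static int model_cpu_node[MODEL_NR_CPUS];
/* Hypothetical extra state: has this CPU's node been recorded? */
static bool model_cpu_node_known[MODEL_NR_CPUS];

/* Stand-in for set_cpu_numa_node(): record the CPU's home node. */
void model_set_cpu_numa_node(int cpu, int node)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_cpu_node[cpu] = node;
    model_cpu_node_known[cpu] = true;
}

/* Naive lookup: silently answers 0 for a CPU that was never recorded. */
int model_cpu_to_node(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return model_cpu_node[cpu];
}

/* Checked lookup: reports MODEL_NO_NODE instead of a misleading 0. */
int model_cpu_to_node_checked(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return model_cpu_node_known[cpu] ? model_cpu_node[cpu] : MODEL_NO_NODE;
}
```

Any caller that runs before the mapping is recorded sees node 0 from the naive lookup, which is why recording the topology for all CPUs early matters.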
Post by Jiang Liu
2) For normal memory allocations without __GFP_THISNODE set in
gfp_flags, we should prefer numa_node_id()/cpu_to_node() over
numa_mem_id()/cpu_to_mem(), because the latter loses hardware topology
A - B - X - C - D
Where X is the memless node. numa_mem_id() on X would return
either B or C, right? If B or C can't satisfy the allocation,
the allocator would fall back to A from B and to D from C, both of
which aren't optimal. It should first fall back to C or B
respectively, which the allocator can't do anymore because the
information is lost when the caller side performs numa_mem_id().
Yes, this seems like a very good description of the reasoning.
Post by Jiang Liu
3) For memory allocations with __GFP_THISNODE set in gfp_flags,
numa_node_id()/cpu_to_node() should be used if the caller only wants to
allocate from local memory; numa_mem_id()/cpu_to_mem() should be used
instead if the caller wants to allocate from the nearest node with memory.
4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to check
whether a page was allocated from the nearest node.
I'm less clear on what you mean here; I'll look at your v2 patches. I
mean, numa_node_id()/cpu_to_node() should be used to indicate node-local
preference with appropriate failure handling. But I don't know why one
would prefer numa_node_id() over numa_mem_id() in such a path. The
only time they differ is if memoryless nodes are present, which is what
your local memory allocation would ideally be for those nodes anyways?

Thanks,
Nish

Peter Zijlstra
2014-07-21 20:06:25 UTC
Permalink
Post by Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
Post by Nishanth Aravamudan
It seems like the issue is the order of onlining of resources on a
specific x86 platform?
First: Here are some new cpus
Next: Here is some new memory
Last: Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)
So there is a period where the node is memoryless - although that will generally
be resolved when the memory hot plug event arrives ... that isn't guaranteed to
occur (there might not be any memory on the node, or what memory there is
may have failed self-test and been disabled).
Right, but we could 'easily' capture that in arch code and make it look
like it was done in a 'sane' order. No need to wreck the rest of the
kernel to support this particular BIOS fuckup.
