opensuse:kernel.git
Thomas Gleixner [Sun, 2 May 2010 19:04:12 +0000 (21:04 +0200)]
v2.6.33.3-rt19

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Sun, 2 May 2010 18:55:06 +0000 (20:55 +0200)]
sched: Warn on rt throttling

The default rt-throttling is a source of never-ending questions. Warn
once when we go into throttling so folks have that info in dmesg.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Carsten Emde [Sun, 2 May 2010 13:35:12 +0000 (15:35 +0200)]
init: Fix config items in debug reminder finally

After the recent repair of the debug config reminder, there was a
duplicate config item, while the CONFIG_DEBUG_PAGEALLOC config item
was missing.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Sun, 2 May 2010 18:14:57 +0000 (20:14 +0200)]
v2.6.33.3-rt18

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Sat, 1 May 2010 16:29:35 +0000 (18:29 +0200)]
tclib: Default to tclib timer for RT

RT is not too happy about the shared timer interrupt in AT91
devices. Default to tclib timer for RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Benedikt Spranger [Mon, 8 Mar 2010 17:57:04 +0000 (18:57 +0100)]
Atmel TCLIB: Allow higher clock rates for clock events

By default the TCLIB uses the 32 KiHz base clock rate for clock events.
Add a compile-time selection to allow a higher clock resolution.

Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Benedikt Spranger [Sat, 6 Mar 2010 16:47:10 +0000 (17:47 +0100)]
AT91: PIT: Remove irq handler when clock event is unused

Setup and remove the interrupt handler in clock event mode selection.
This avoids calling the (shared) interrupt handler when the device is
not used.

Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Sat, 1 May 2010 14:23:13 +0000 (16:23 +0200)]
fs: Prevent dput race

dput() drops dentry->d_lock when it fails to lock inode->i_lock or
parent->d_lock. dentry->d_count is 0 at this point, so the dentry can be
killed and freed by someone else. This leaves dput() with a stale
pointer in the retry code, which results in interesting kernel crashes.

Prevent this by incrementing dentry->d_count before dropping the
lock. Go back to start after dropping the lock so d_count is
decremented again.
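A minimal userspace sketch of that retry pattern (illustrative names and a simulated inner trylock, not the kernel code): the fix takes an extra reference before dropping d_lock, so the zero-count dentry cannot be killed and freed while it is unlocked.

```c
#include <pthread.h>

/* Model dentry: d_lock plus a plain (d_lock-protected) refcount. */
struct dentry_model {
    pthread_mutex_t d_lock;
    int d_count;
    int freed;                       /* set when the dentry is killed */
};

/* Stand-in for the inode->i_lock trylock: fails on the first attempt
 * to force one trip through the retry path. */
static int try_inner_lock(int *attempt)
{
    return (*attempt)++ >= 1;
}

/* Returns the number of retries taken. */
static int dput_model(struct dentry_model *d)
{
    int attempt = 0, retries = 0;
again:
    pthread_mutex_lock(&d->d_lock);
    d->d_count--;
    if (d->d_count > 0) {            /* not the last reference */
        pthread_mutex_unlock(&d->d_lock);
        return retries;
    }
    if (!try_inner_lock(&attempt)) {
        /* The fix: re-take a reference BEFORE dropping d_lock, so a
         * concurrent killer sees d_count > 0 and leaves us alone. */
        d->d_count++;
        pthread_mutex_unlock(&d->d_lock);
        retries++;
        goto again;                  /* d_count is decremented again */
    }
    d->freed = 1;                    /* safe: both locks held */
    pthread_mutex_unlock(&d->d_lock);
    return retries;
}
```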

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Fri, 30 Apr 2010 09:40:35 +0000 (11:40 +0200)]
v2.6.33.3-rt17

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Kacur [Thu, 29 Apr 2010 20:07:32 +0000 (22:07 +0200)]
rt: Fix the reminder block accounting for CONFIG_FUNCTION_TRACER

Make the accounting for CONFIG_FUNCTION_TRACER in DEBUG_COUNT
match that in the reminder block reporting.

Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
LKML-Reference: <1272571652-5719-1-git-send-email-jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Fri, 30 Apr 2010 09:26:20 +0000 (11:26 +0200)]
fs: Use s_inodes not s_files for inode lists

The VFS scalability rework broke UP due to a stupid typo which
enqueued inodes on the file list.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Thu, 29 Apr 2010 06:31:49 +0000 (23:31 -0700)]
fs: Fix namespace related hangs

Nick converted the dentry->d_mounted counter to a flag, however with
namespaces, dentries can be mounted multiple times (and more
importantly unmounted multiple times).

If a namespace was created and then released, the unmount_tree would
remove the DCACHE_MOUNTED flag and that would make d_mountpoint fail,
causing the mounts to be lost.

This patch converts it back to a counter, and adds some extra WARN_ONs
to make sure things are accounted properly.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
Cc: Nick Piggin <npiggin@suse.de>
LKML-Reference: <1272522942.1967.12.camel@work-vm>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Thu, 29 Apr 2010 07:31:45 +0000 (09:31 +0200)]
xfs: Make i_count access non-atomic

i_count is no longer atomic. Fix up the leftover.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Wed, 28 Apr 2010 21:22:33 +0000 (23:22 +0200)]
fs: Fix d_count fallout

d_count got converted to int and back to atomic_t. Two instances were
missed in the backward conversion. Fix them up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Wed, 28 Apr 2010 10:39:45 +0000 (12:39 +0200)]
fs: namespace: Fix MNT_MOUNTED handling for cloned rootfs

We don't call attach_mnt on a cloned rootfs so set the MNT_MOUNTED
flag in copy_tree().

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Wed, 28 Apr 2010 10:25:20 +0000 (12:25 +0200)]
fs: namespace: Make put_mnt_ns rt aware

On RT the lock() inside the preempt disabled region of get_cpu_var()
results in a might sleep warning.

Restructure the code and open-code the check for the atomic transition
to 0, to avoid vfsmount_write_lock() when ns->count is > 1.

If ns->count == 1 then do the atomic decrement under full locking of
namespace_sem and vfsmount_write_lock(). In most cases the
atomic_dec_and_test() will have dropped ns->count to 0 so we need the
full locking anyway.
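The fast/slow split can be sketched with C11 atomics (a userspace model; the names and the slow-path locking are stand-ins for namespace_sem and vfsmount_write_lock()):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fast path: decrement only while the count is provably > 1, so no
 * locks are needed. Slow path: the final decrement, which the real
 * code does under namespace_sem + vfsmount_write_lock(). */
static bool put_ns_model(atomic_int *count, int *locked_drops)
{
    int c = atomic_load(count);

    while (c > 1) {
        /* open-coded "decrement unless this could hit zero" */
        if (atomic_compare_exchange_weak(count, &c, c - 1))
            return false;                    /* not the last reference */
    }
    (*locked_drops)++;                       /* would take the locks here */
    return atomic_fetch_sub(count, 1) == 1;  /* true: namespace is freed */
}
```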

Based on a patch from John Stultz

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Wed, 28 Apr 2010 09:50:56 +0000 (11:50 +0200)]
fs: namespace: Fix potential deadlock

do_unmount() does a lock() instead of unlock() in a return path, which
leads to a deadlock when this code path is taken. Fix the typo.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Kacur [Wed, 28 Apr 2010 19:34:27 +0000 (21:34 +0200)]
rt: Remove irrelevant CONFIGS from reminder block.

- remove CONFIGS that do not have an impact on -rt from the reminder block.
- reformat DEBUG_COUNT so that it is more easily human readable.
- correct CONFIG_FTRACE to CONFIG_FUNCTION_TRACER.

Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
LKML-Reference: <1272483267-10900-1-git-send-email-jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Mel Gorman [Fri, 23 Apr 2010 17:17:56 +0000 (13:17 -0400)]
hugetlb: fix infinite loop in get_futex_key() when backed by huge pages

If a futex key happens to be located within a huge page mapped
MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a
page->mapping that will never exist.

See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details
about the problem.

This patch makes page->mapping a poisoned value that includes
PAGE_MAPPING_ANON for huge pages mapped MAP_PRIVATE.  This is enough for
futex to continue, but because of PAGE_MAPPING_ANON the poisoned value is
not dereferenced or used by futex.  No other part of the VM should be
dereferencing the page->mapping of a hugetlbfs page, as its page cache is
not on the LRU.
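The idea can be sketched as follows (a hedged userspace model: the constants below stand in for POISON_POINTER_DELTA and PAGE_MAPPING_ANON and are not the kernel's actual values):

```c
#include <stdint.h>

#define ANON_BIT_MODEL    ((uint64_t)0x1)                   /* PAGE_MAPPING_ANON stand-in */
#define POISON_BASE_MODEL ((uint64_t)0xdead000000000000ULL) /* poison-delta stand-in */

/* The poisoned mapping: the anon bit is set, so PageAnon()-style
 * checks succeed, while the rest of the value is poison that would
 * fault if it were ever dereferenced as a pointer. */
static uint64_t poison_hugetlb_mapping(void)
{
    return POISON_BASE_MODEL + ANON_BIT_MODEL;
}

static int page_anon_model(uint64_t mapping)
{
    return (mapping & ANON_BIT_MODEL) != 0;
}
```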

This patch fixes the problem with the test case described in the bugzilla.

[ upstream commit: 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703 ]

[akpm@linux-foundation.org: mel cant spel]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <darren@dvhart.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 27 Apr 2010 15:38:49 +0000 (17:38 +0200)]
v2.6.33.3-rt16

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Sat, 17 Apr 2010 01:30:04 +0000 (18:30 -0700)]
Fixup some compilation warnings and errors

Amit Arora noticed some compile issues with coda and an fs.h include
issue, so this patch fixes those along with some btrfs warnings.

Thanks to Amit for the testing!

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
tytso@mit.edu [Fri, 9 Apr 2010 02:38:13 +0000 (19:38 -0700)]
Remove j_state_lock usage in jbd2_journal_stop()

On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
> Further using lockstat I was able to isolate it the contention down to
> the journal j_state_lock, and then adding some lock owner tracking, I
> was able to see that the lock owners were almost always in
> start_this_handle, and jbd2_journal_stop when we saw contention (with
> the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> start_this_handle).

Hmm....  I've taken a very close look at jbd2_journal_stop(), and I
don't think we need to take j_state_lock() at all except if we need to
call jbd2_log_start_commit().  t_outstanding_credits,
h_buffer_credits, and t_updates are all documented (and verified by
me) to be protected by the t_handle_lock spinlock.

So I ***think*** the following might be safe.  WARNING!  WARNING!!  No
real testing done on this patch, other than "it compiles!  ship it!!".

I'll let other people review it, and maybe you could give this a run
and see what happens with this patch?

                                        - Ted

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Fri, 12 Mar 2010 00:18:14 +0000 (16:18 -0800)]
Revert Nick's fs-scale-pseudo

After adding an xfs partition to my system, I started seeing
boot time NULL pointer oopses, and bisected it down to the
fs-scale-pseudo change.

Not sure what the right fix is, but this change avoids the issue.

Here's the bug I was seeing on boot:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
PGD 42b12e067 PUD 42cb2a067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/block/md0/dev
CPU 7
Pid: 2993, comm: vgs Not tainted 2.6.33-rc8john #272 Server Blade/IBM eServer BladeCenter HS21 -[7995AC1]-
RIP: 0010:[<ffffffff81103d42>]  [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
RSP: 0018:ffff88042a929b78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88042ab41000 RCX: ffff88042ab41028
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88042aa0fcc0
RBP: ffff88042a929c28 R08: ffff88042aa0fcc0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88042c6a40b0
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88042a929dc8
FS:  00007f6f8c481710(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000030 CR3: 000000042b310000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vgs (pid: 2993, threadinfo ffff88042a928000, task ffff88042ab41000)
Stack:
 ffff88042ab41000 ffff88042ab41000 ffff88042ab41000 ffff88042ab41000
<0> 0000000100000000 ffff88042a929de8 ffff880400000000 0000000000000000
<0> ffff88042f6b5610 0000000000000000 0000000000000000 ffff88042f418920
Call Trace:
 [<ffffffff811006c2>] ? path_get+0x32/0x50
 [<ffffffff81103c50>] link_path_walk+0xc20/0xda0
 [<ffffffff811006c2>] ? path_get+0x32/0x50
 [<ffffffff81103f7c>] path_walk+0x5c/0xd0
 [<ffffffff811041de>] do_path_lookup+0x1ee/0x250
 [<ffffffff81103ff0>] ? do_path_lookup+0x0/0x250
 [<ffffffff81104ebb>] user_path_at+0x7b/0xb0
 [<ffffffff81112bb1>] ? vfsmount_read_unlock+0x31/0x60
 [<ffffffff81114788>] ? mntput_no_expire+0x48/0x190
 [<ffffffff810fb293>] ? cp_new_stat+0xe3/0xf0
 [<ffffffff810fb4ac>] vfs_fstatat+0x3c/0x80
 [<ffffffff810fb616>] vfs_stat+0x16/0x20
 [<ffffffff810fb63f>] sys_newstat+0x1f/0x50
 [<ffffffff81994a33>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff810025eb>] system_call_fastpath+0x16/0x1b
Code: ec e8 93 c8 ff ff 0f 1f 00 e9 46 ff ff ff 41 83 7f 34 04 66 0f 1f 44 00 00 0f 85 38 ff ff ff 4d 8b 67 08 49 8b 84 24 b8 00 00 00 <48> 8b 40 30 f6 40 09 40 0f 84 1e ff ff ff 49 8b 44 24 70 4c 89
RIP  [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
 RSP <ffff88042a929b78>
CR2: 0000000000000030
---[ end trace 0dd94d94b1b27094 ]---

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Thu, 4 Mar 2010 00:52:14 +0000 (16:52 -0800)]
Call synchronize_rcu in unregister_filesystem

Quoting Nick:
"BTW there are a few issues Al pointed out. We have to synchronize RCU
after unregistering a filesystem so d_ops/i_ops doesn't go away, and
mntput can sleep so we can't do it under RCU read lock."

This patch simply calls synchronize_rcu() in unregister_filesystem() to
avoid this issue.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Fri, 26 Feb 2010 04:03:00 +0000 (20:03 -0800)]
Make sure MNT_MOUNTED isn't cleared on remount

Originally found by Anton Blanchard, this patch makes sure
we keep the MNT_MOUNTED flag set in do_remount(). Without this,
scalability suffers pretty badly.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Tue, 23 Feb 2010 03:58:20 +0000 (19:58 -0800)]
Revert d_count back to an atomic_t

This patch reverts the portion of Nick's vfs scalability patch that
converts the dentry d_count from an atomic_t to an int protected by
the d_lock.

This greatly improves vfs scalability with the -rt kernel, as
the extra lock contention on the d_lock hurts very badly when
CONFIG_PREEMPT_RT is enabled and the spinlocks become rtmutexes.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Tue, 23 Feb 2010 02:41:36 +0000 (18:41 -0800)]
Fixup get_cpu_var holds over spinlock() calls.

In Nick's patches, there are a few spots that use get_cpu_var to access
a per-cpu spinlock. However, the put_cpu_var isn't called until after the
lock is acquired and released. This causes might-sleep warnings with -rt.

Move the put_cpu_var above the spin_lock/unlock call to avoid this.

Not sure if this is 100% right, but it seems to work. Not sure what
holding the get does for the lock, since once we have the lock,
the reference shouldn't change. Other users of the same lock don't bother
with the get_cpu_var method and just use per_cpu.
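The problem and the fix can be modeled in plain C (a sketch with counters standing in for preempt_disable() and an -rt sleeping lock; none of these are the real kernel helpers):

```c
/* get_cpu_var() disables preemption; put_cpu_var() re-enables it.
 * Taking a sleeping lock (an -rt spinlock) with preemption disabled
 * is the bug; we just count when it happens. */
static int preempt_count_model;
static int might_sleep_splats;

static int *get_cpu_var_model(int *var) { preempt_count_model++; return var; }
static void put_cpu_var_model(void)     { preempt_count_model--; }

static void sleeping_lock_model(void)
{
    if (preempt_count_model > 0)
        might_sleep_splats++;        /* would warn on -rt */
}
static void sleeping_unlock_model(void) { }

/* Before the fix: the lock sits inside the get/put_cpu_var window. */
static void before_fix(int *pcpu)
{
    int *p = get_cpu_var_model(pcpu);
    sleeping_lock_model();
    (*p)++;
    sleeping_unlock_model();
    put_cpu_var_model();
}

/* After the fix: drop the per-cpu reference first, then lock. */
static void after_fix(int *pcpu)
{
    int *p = get_cpu_var_model(pcpu);
    put_cpu_var_model();
    sleeping_lock_model();
    (*p)++;
    sleeping_unlock_model();
}
```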

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Tue, 23 Feb 2010 02:09:56 +0000 (18:09 -0800)]
Fix inc/dec_mnt_count for -rt

With Nick's vfs patches, inc/dec_mnt_count use per-cpu counters, so
this patch makes sure we disable preemption before calling.

It's not a great fix, but it works because count_mnt_count() sums all the
per-cpu values, so no individual counter needs to be zeroed out.

I suspect the better fix for -rt is to revert the mnt_count back to an atomic
counter.
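Why the per-cpu slots don't need zeroing can be shown with a small model (NR_CPUS_MODEL and the explicit cpu argument are illustrative; the kernel uses smp_processor_id() under preempt_disable()):

```c
#define NR_CPUS_MODEL 4

static int mnt_count_model[NR_CPUS_MODEL];

/* Writers only touch their own cpu's slot. */
static void inc_mnt_count_on(int cpu) { mnt_count_model[cpu]++; }
static void dec_mnt_count_on(int cpu) { mnt_count_model[cpu]--; }

/* Readers sum every slot, so an individual slot may go negative as
 * long as the total across all cpus is right. */
static int count_mnt_count_model(void)
{
    int cpu, sum = 0;
    for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        sum += mnt_count_model[cpu];
    return sum;
}
```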

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Tue, 23 Feb 2010 00:48:26 +0000 (16:48 -0800)]
Fix vfsmount_read_lock to work with -rt

Because vfsmount_read_lock() acquires the vfsmount spinlock for the
current cpu, it causes problems with -rt, as you might migrate between
cpus between a lock and unlock.

This patch fixes the issue by having the caller pick a cpu, then
consistently use that cpu between the lock and unlock. We may migrate in
between, but that's ok because we're not doing anything cpu-specific,
other than avoiding contention on the read side across the cpus.

It's not pretty, but it works and statistically shouldn't hurt performance.
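A sketch of that calling convention (a userspace model: the per-cpu spinlocks become an array of mutexes, and the caller-chosen cpu index is passed to both sides):

```c
#include <pthread.h>

#define NR_CPUS_MODEL 4

static pthread_mutex_t vfsmount_lock_model[NR_CPUS_MODEL];

static void vfsmount_locks_init(void)
{
    int i;
    for (i = 0; i < NR_CPUS_MODEL; i++)
        pthread_mutex_init(&vfsmount_lock_model[i], NULL);
}

/* The caller samples a cpu once and hands the same index to both
 * calls, so migrating in between still releases the lock that was
 * actually taken. */
static void vfsmount_read_lock_model(int cpu)
{
    pthread_mutex_lock(&vfsmount_lock_model[cpu]);
}

static void vfsmount_read_unlock_model(int cpu)
{
    pthread_mutex_unlock(&vfsmount_lock_model[cpu]);
}
```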

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Mon, 22 Feb 2010 23:56:56 +0000 (15:56 -0800)]
Fixup rt hack for mnt_want_write

The rt hack in mnt_want_write needs to be changed to work with
Nick's VFS patches.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
John Stultz [Thu, 18 Feb 2010 03:23:07 +0000 (19:23 -0800)]
Fix MNT_MOUNTED WARN_ON

I was seeing MNT_MOUNTED already set WARN_ON messages in commit_tree.
This seems to be caused by clone_mnt copying the flag of an already mounted
mnt to the mount before it is used by commit_tree.

My fix (which may not be correct) is to unmark MNT_MOUNTED on the cloned
mnt.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:44:00 +0000 (15:44 -0800)]
Fixups from 09102009.patch.gz

This patch is just the delta from Nick's 06102009 and his 09102009 megapatches.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:35 +0000 (15:38 -0800)]
fs-fixes

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:34 +0000 (15:38 -0800)]
fs-inode-hash-rcu

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:34 +0000 (15:38 -0800)]
fs-sb-inodes-percpu

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Eric Dumazet [Fri, 29 Jan 2010 23:38:34 +0000 (15:38 -0800)]
fs-nr_inodes-percpu

fs: inode per-cpu nr_inodes counter

Avoids cache-line ping-pong between cpus and prepares for the next patch,
because updates of nr_inodes don't need inode_lock anymore.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Eric Dumazet [Fri, 29 Jan 2010 23:38:33 +0000 (15:38 -0800)]
fs-last_ino-percpu

fs: inode per-cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing each cpu with a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino and gives the same
spread of ino numbers as before
(same wraparound after 2^32 allocations).
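The allocator can be sketched in userspace C11 (a model: the per-cpu variable becomes an explicit cache struct, and the batch size of 1024 is taken from the description above):

```c
#include <stdatomic.h>

#define LAST_INO_BATCH 1024

static atomic_uint shared_last_ino_model;

/* Stand-in for the per-cpu variable. */
struct ino_cache_model {
    unsigned int ino;
};

static unsigned int new_ino_model(struct ino_cache_model *c)
{
    unsigned int res = c->ino;

    /* Only once every LAST_INO_BATCH allocations touch the shared,
     * contended counter; all other allocations stay cpu-local. */
    if (res % LAST_INO_BATCH == 0)
        res = atomic_fetch_add(&shared_last_ino_model, LAST_INO_BATCH);

    c->ino = res + 1;
    return res;
}
```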

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:33 +0000 (15:38 -0800)]
fs-inode-nr_inodes

XXX: this should be folded back into the individual locking patches

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:32 +0000 (15:38 -0800)]
fs-scale-pseudo

Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.

XXX: is this right? I can't see any reason why they need to have a real
parent.

TODO: add a d_instantiate_something() and avoid adding the extra checks
for !d_parent

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:32 +0000 (15:38 -0800)]
fs-inode_lock-scale-11

This enables locking to be reduced and simplified.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:32 +0000 (15:38 -0800)]
fs-inode_rcu

RCU free the struct inode. This will allow:

- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- eventually, completely write-free RCU path walking. The inode must be
  consulted for permissions when walking, so a write-free reference (i.e.
  RCU) is helpful.
- can potentially simplify things a bit in VM land. May not need to take the
  page lock to get back to the page->mapping.
- can remove some nested trylock loops in dcache code

todo: convert all filesystems

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:31 +0000 (15:38 -0800)]
fs-inode_lock-scale-10

Implement lazy inode lru similarly to dcache. This should reduce inode list
lock acquisition (todo: measure).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:31 +0000 (15:38 -0800)]
fs-inode_lock-scale-9

Remove the global inode_hash_lock and replace it with per-hash-bucket locks.

Todo: should use bit spinlock in hlist_head pointer to save space.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:30 +0000 (15:38 -0800)]
fs-inode_lock-scale-8

Make inode_hash_lock private by adding a function __remove_inode_hash
that can be used by filesystems defining their own drop_inode functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:30 +0000 (15:38 -0800)]
fs-inode_lock-scale-7

Remove the global inode_lock

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:29 +0000 (15:38 -0800)]
fs-inode_lock-scale-6c

Make last_ino atomic in preparation for removing inode_lock.
Make a new lock for iunique counter, for removing inode_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:29 +0000 (15:38 -0800)]
fs-inode_lock-scale-6b

Protect i_hash, i_sb_list etc members with i_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:29 +0000 (15:38 -0800)]
fs-inode_lock-scale-5

Protect inodes_stat statistics with atomic ops rather than inode_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:28 +0000 (15:38 -0800)]
fs-inode_lock-scale-6

Add a new lock, wb_inode_list_lock, to protect i_list and various lists
which the inode can be put onto.

XXX: haven't audited ocfs2

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:28 +0000 (15:38 -0800)]
fs-inode_lock-scale-4

Protect inode->i_count with i_lock, rather than having it atomic.
Next step should also be to move things together (eg. the refcount increment
into d_instantiate, which will remove a lock/unlock cycle on i_lock).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:27 +0000 (15:38 -0800)]
fs-inode_lock-scale-3

Protect i_state updates with i_lock

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:27 +0000 (15:38 -0800)]
fs-inode_lock-scale-2

Add a new lock, inode_hash_lock, to protect the inode hash table lists.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:26 +0000 (15:38 -0800)]
fs-inode_lock-scale

Protect sb->s_inodes with a new lock, sb_inode_list_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:26 +0000 (15:38 -0800)]
dcache-percpu-nr_dentry

The nr_dentry stat is a globally touched cacheline and an atomic operation
twice over the lifetime of a dentry. It is used for the benefit of userspace
only. We could make a per-cpu counter or something for it, but it is only
accessed via proc, so we could use slab stats.

XXX: must implement slab routines to return stats for a single cache, and
implement the proc handler.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:25 +0000 (15:38 -0800)]
dcache-split-inode_lock

dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:25 +0000 (15:38 -0800)]
fs-dcache-d_delete-less-lock

dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:24 +0000 (15:38 -0800)]
dcache-chain-hashlock

We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

XXX: should probably use a bit lock in the first bit of the hash pointers
to avoid any space bloating (non-atomic unlock means no extra atomics either)
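Per-bucket locking can be sketched like this (a userspace model: an array of mutexes stands in for the per-bucket locks, and a counter per bucket stands in for the hlist):

```c
#include <pthread.h>

#define HASH_BUCKETS_MODEL 64

static pthread_mutex_t bucket_lock_model[HASH_BUCKETS_MODEL];
static int bucket_len_model[HASH_BUCKETS_MODEL];   /* hlist stand-in */

static void hash_locks_init(void)
{
    int i;
    for (i = 0; i < HASH_BUCKETS_MODEL; i++)
        pthread_mutex_init(&bucket_lock_model[i], NULL);
}

/* The same hash picks both the bucket and its lock, so inserts into
 * different buckets never contend on a global lock. */
static unsigned int hash_fn_model(unsigned long key)
{
    return (unsigned int)((key * 2654435761UL) % HASH_BUCKETS_MODEL);
}

static void hash_insert_model(unsigned long key)
{
    unsigned int b = hash_fn_model(key);

    pthread_mutex_lock(&bucket_lock_model[b]);
    bucket_len_model[b]++;           /* would hlist_add_head() here */
    pthread_mutex_unlock(&bucket_lock_model[b]);
}
```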

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:24 +0000 (15:38 -0800)]
dcache-dput-less-dcache_lock

It is possible to run dput without taking locks up-front. In many cases
where we don't kill the dentry anyway, these locks are not required.

(I think... this needs more thought.) It further changes ->d_delete
locking, which is not fully audited.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:23 +0000 (15:38 -0800)]
fs-dcache_lock-remove

dcache_lock no longer protects anything (I hope). Remove it.

This breaks a lot of the tree where I haven't thought about the problem,
but it simplifies the dcache.c code quite a bit (and it's also probably
a good thing to break unconverted code). So I include this here before
making further changes to the locking.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:23 +0000 (15:38 -0800)]
fs-dcache_lock-multi-step

The remaining usages of dcache_lock are to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree,
and to walk in the leaf->root direction in the tree, where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky: when walking up the directory tree, our parent might have been
deleted while we dropped locks, so we also need to check and retry for that.

XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
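The read-retry pattern described here is the classic seqcount loop; a minimal C11 model (the names are illustrative, the kernel uses the rename seqlock API):

```c
#include <stdatomic.h>

static atomic_uint rename_seq_model;

/* Readers wait for an even (no writer active) sequence value... */
static unsigned int read_seqbegin_model(void)
{
    unsigned int s;
    do {
        s = atomic_load(&rename_seq_model);
    } while (s & 1);                 /* odd: writer in progress */
    return s;
}

/* ...and must retry their multi-step walk if the value changed. */
static int read_seqretry_model(unsigned int start)
{
    return atomic_load(&rename_seq_model) != start;
}

/* Writers bump the sequence to odd on entry, back to even on exit. */
static void write_seqlock_model(void)   { atomic_fetch_add(&rename_seq_model, 1); }
static void write_sequnlock_model(void) { atomic_fetch_add(&rename_seq_model, 1); }
```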

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:22 +0000 (15:38 -0800)]
fs-dcache-scale-i_dentry

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nick Piggin [Fri, 29 Jan 2010 23:38:22 +0000 (15:38 -0800)]
fs-dcache-scale-d_subdirs

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

XXX: we probably don't need the parent lock in inotify (because the child
lock should stabilize the parent). Also, some filesystems possibly don't
need so much locking (eg. of the child dentry when modifying d_child, so
long as the parent is locked)... but be on the safe side. Hmm, maybe we
should just say the d_child list is protected by d_parent->d_lock.
d_parent could remain protected with d_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-dcache-scale-d_unhashed
Nick Piggin [Fri, 29 Jan 2010 23:38:21 +0000 (15:38 -0800)]
fs-dcache-scale-d_unhashed

Protect d_unhashed(dentry) condition with d_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-dcache-scale-d_count
Nick Piggin [Fri, 29 Jan 2010 23:38:21 +0000 (15:38 -0800)]
fs-dcache-scale-d_count

Make d_count non-atomic and protect it with d_lock. This allows us to
ensure a 0 refcount dentry remains 0 without dcache_lock. It is also
fairly natural when we start protecting many other dentry members with
d_lock.

XXX: This patch does not boot on its own

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-dcache-scale-nr_dentry
Nick Piggin [Fri, 29 Jan 2010 23:38:21 +0000 (15:38 -0800)]
fs-dcache-scale-nr_dentry

Make dentry_stat_t.nr_dentry an atomic_t type, and move it from under
dcache_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-dcache-scale-d_lru
Nick Piggin [Fri, 29 Jan 2010 23:38:20 +0000 (15:38 -0800)]
fs-dcache-scale-d_lru

Add a new lock, dcache_lru_lock, to protect the dcache LRU list from
concurrent modification. d_lru is also protected by d_lock.

Move lru scanning out from underneath dcache_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-dcache-scale-d_hash
Nick Piggin [Fri, 29 Jan 2010 23:38:20 +0000 (15:38 -0800)]
fs-dcache-scale-d_hash

Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-mntget-scale
Nick Piggin [Fri, 29 Jan 2010 23:38:19 +0000 (15:38 -0800)]
fs-mntget-scale

Improve scalability of mntget/mntput by using per-cpu counters protected
by the reader side of the brlock vfsmount_lock. mnt_mounted keeps track of
whether the vfsmount is actually attached to the tree so we can shortcut
expensive checks in mntput.

XXX: count_mnt_count needs write lock. Document this and/or revisit locking
(eg. look at writers count)
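The per-cpu counter idea can be sketched as follows (a hypothetical simplification; the real code uses the kernel per-cpu API and takes the vfsmount_lock brlock, which this sketch omits):

```c
#define NR_CPUS_SKETCH 4

/* Each CPU touches only its own slot on get/put, so the fast path
 * never bounces a shared cache line. The total is computed by
 * summing every slot, which is why counting needs to exclude all
 * writers (i.e. take the write side of the brlock). */
static int mnt_count[NR_CPUS_SKETCH];

static void mntget_sketch(int cpu) { mnt_count[cpu]++; }
static void mntput_sketch(int cpu) { mnt_count[cpu]--; }

static int count_mnt_count_sketch(void)
{
    int cpu, sum = 0;
    for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += mnt_count[cpu];
    return sum;
}
```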

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-vfsmount_lock-scale
Nick Piggin [Fri, 29 Jan 2010 23:38:19 +0000 (15:38 -0800)]
fs-vfsmount_lock-scale

Use a brlock for the vfsmount lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-files_lock-scale
John Stultz [Tue, 23 Mar 2010 00:22:15 +0000 (17:22 -0700)]
fs-files_lock-scale

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with per-cpu locking, effectively turning it into a big-writer
lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-files_list-improve
John Stultz [Tue, 23 Mar 2010 00:18:36 +0000 (17:18 -0700)]
fs-files_list-improve

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agonfs-use-__iget
Nick Piggin [Fri, 29 Jan 2010 23:38:18 +0000 (15:38 -0800)]
nfs-use-__iget

We aren't checking the return code from igrab anyway, so the existing
usage already looks a bit racy. This change is required to avoid i_lock
recursion in
future patches.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agofs-remount-coherency
Nick Piggin [Fri, 29 Jan 2010 23:38:17 +0000 (15:38 -0800)]
fs-remount-coherency

Fixes a problem reported by "Jorge Boncompte [DTI2]" <jorge@dti2.net>
who is seeing corruption trying to snapshot a minix filesystem image.
Some filesystems modify their metadata via a path other than the bdev
buffer cache (eg. they may use a private linear mapping for their
metadata, or implement directories in pagecache, etc). Also, file
data modifications usually go to the bdev via their own mappings.

These updates are not coherent with buffercache IO (eg. via /dev/bdev)
and never have been. However, one could reasonably expect that after a
mount -oremount,ro operation the buffercache is subsequently coherent
with previous filesystem modifications.

So invalidate the bdev mappings on a remount,ro operation to provide
a coherency point.

The problem was exposed when we switched the old rd to brd because old rd
didn't really function like a normal block device and updates to rd via
mappings other than the buffercache would still end up going into its
buffercache. But the same problem has always affected other "normal" block
devices, including loop.

Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agov2.6.33.3-rt15
Thomas Gleixner [Tue, 27 Apr 2010 15:31:28 +0000 (17:31 +0200)]
v2.6.33.3-rt15

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agonet: gianfar: More RT fixups
Xianghua Xiao [Fri, 23 Apr 2010 21:57:52 +0000 (16:57 -0500)]
net: gianfar: More RT fixups

stop_gfar() needs the same fixup (local_irq_save/restore_nort()) as
adjust_link().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agortmutex: Preserve TASK_STOPPED state when blocking on a "spin_lock"
Kevin Hao [Tue, 2 Mar 2010 21:51:58 +0000 (16:51 -0500)]
rtmutex: Preserve TASK_STOPPED state when blocking on a "spin_lock"

When a process handles a SIGSTOP signal, it will set the state to
TASK_STOPPED, acquire tasklist_lock and notify the parent of the
status change. But in the rt kernel the process state will change
to TASK_UNINTERRUPTIBLE if it blocks on the tasklist_lock. So if
we send a SIGCONT signal to this process at this time, the SIGCONT
signal just does nothing because this process is not in TASK_STOPPED
state. Of course this is not what we wanted. Preserving the
TASK_STOPPED state when blocking on a "spin_lock" can fix this bug.

Signed-off-by: Kevin Hao <kexin.hao@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
LKML-Reference: <18e240905fcfd72457930322ee187e7ff9313aec.1267566249.git.paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agopowerpc: Replace kmap_atomic with kmap in pte_offset_map
Kevin Hao [Tue, 2 Mar 2010 21:51:57 +0000 (16:51 -0500)]
powerpc: Replace kmap_atomic with kmap in pte_offset_map

The pte_offset_map/pte_offset_map_nested use kmap_atomic to get the
virtual address for the pte table, but kmap_atomic will disable preempt.
Hence there will be call trace if we acquire a spin lock after invoking
pte_offset_map/pte_offset_map_nested in preempt-rt.  To fix it, I've
replaced kmap_atomic with kmap in these macros.

Signed-off-by: Kevin Hao <kexin.hao@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
LKML-Reference: <ffaf532c138188b526a8c623ed3c7f5067da6d68.1267566249.git.paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agov2.6.33.3-rt14
Thomas Gleixner [Tue, 27 Apr 2010 09:11:54 +0000 (11:11 +0200)]
v2.6.33.3-rt14

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agoMerge branch 'master' of
Thomas Gleixner [Tue, 27 Apr 2010 09:11:07 +0000 (11:11 +0200)]
Merge branch 'master' of

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y

into rt/2.6.33

Conflicts:
Makefile
arch/x86/include/asm/rwsem.h

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agonet: Fix iptables get_counters()
Thomas Gleixner [Tue, 27 Apr 2010 08:05:28 +0000 (10:05 +0200)]
net: Fix iptables get_counters()

The preempt-rt changes to iptables get_counters() left the counters
array uninitialized which results in random packet statistic numbers.

Reported-by: prd.gtt@operamail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agoLinux 2.6.33.3
Greg Kroah-Hartman [Mon, 26 Apr 2010 14:48:30 +0000 (07:48 -0700)]
Linux 2.6.33.3

7 years agor8169: clean up my printk uglyness
Neil Horman [Thu, 1 Apr 2010 07:30:07 +0000 (07:30 +0000)]
r8169: clean up my printk uglyness

commit 93f4d91d879acfcb0ba9c2725e3133fcff2dfd1e upstream.

Fix formatting on r8169 printk

Brandon Philips noted that I had a spacing issue in my printk for the
last r8169 patch that made it quite ugly.  Fix that up and add the PFX
macro to it as well so it looks like the other r8169 printks.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agox86/gart: Disable GART explicitly before initialization
Joerg Roedel [Wed, 7 Apr 2010 10:57:35 +0000 (12:57 +0200)]
x86/gart: Disable GART explicitly before initialization

commit 4b83873d3da0704987cb116833818ed96214ee29 upstream.

If we boot into a crash-kernel the gart might still be
enabled and its caches might be dirty. This can result in
undefined behavior later. Fix it by explicitly disabling the
gart hardware before initialization and flushing the caches
after enablement.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: x86: Fix TSS size check for 16-bit tasks
Jan Kiszka [Wed, 14 Apr 2010 14:57:11 +0000 (16:57 +0200)]
KVM: x86: Fix TSS size check for 16-bit tasks

(Cherry-picked from commit e8861cfe2c75bdce36655b64d7ce02c2b31b604d)

A 16-bit TSS is only 44 bytes long. So make sure to test for the correct
size on task switch.
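The check can be sketched like this (the 44-byte figure is from the commit text; 104 bytes for a 32-bit TSS comes from the x86 architecture, and the helper name is illustrative, not KVM's):

```c
/* A descriptor limit encodes size - 1, so a usable 16-bit TSS needs
 * limit >= 43 and a 32-bit TSS needs limit >= 103. */
static int tss_limit_ok(int is_16bit, unsigned int limit)
{
    unsigned int min_size = is_16bit ? 44 : 104;
    return limit >= min_size - 1;
}
```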

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: Increase NR_IOBUS_DEVS limit to 200
Sridhar Samudrala [Tue, 30 Mar 2010 23:48:25 +0000 (16:48 -0700)]
KVM: Increase NR_IOBUS_DEVS limit to 200

(Cherry-picked from commit e80e2a60ff7914dae691345a976c80bbbff3ec74)

This patch increases the current hardcoded limit of NR_IOBUS_DEVS
from 6 to 200. We are hitting this limit when creating a guest with more
than 1 virtio-net device using vhost-net backend. Each virtio-net
device requires 2 such devices to service notifications from rx/tx queues.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: fix the handling of dirty bitmaps to avoid overflows
Takuya Yoshikawa [Mon, 12 Apr 2010 10:35:35 +0000 (19:35 +0900)]
KVM: fix the handling of dirty bitmaps to avoid overflows

(Cherry-picked from commit 87bf6e7de1134f48681fd2ce4b7c1ec45458cb6d)

Int is not long enough to store the size of a dirty bitmap.

This patch fixes this problem with the introduction of a wrapper
function to calculate the sizes of dirty bitmaps.

Note: in mark_page_dirty(), we have to consider the fact that
  __set_bit() takes the offset as int, not long.
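The wrapper's arithmetic can be sketched as follows (the function name here is illustrative; the point is that the whole calculation stays in unsigned long):

```c
/* Compute the dirty bitmap size in bytes using unsigned long
 * arithmetic throughout, so a memslot with more pages than fit in
 * an int cannot overflow the calculation. */
static unsigned long dirty_bitmap_bytes_sketch(unsigned long npages)
{
    unsigned long bits_per_long = sizeof(unsigned long) * 8;

    /* round up to a whole number of longs, then convert to bytes */
    return (npages + bits_per_long - 1) / bits_per_long
           * sizeof(unsigned long);
}
```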

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: MMU: fix kvm_mmu_zap_page() and its calling path
Xiao Guangrong [Fri, 16 Apr 2010 08:34:42 +0000 (16:34 +0800)]
KVM: MMU: fix kvm_mmu_zap_page() and its calling path

(Cherry-picked from commit 77662e0028c7c63e34257fda03ff9625c59d939d)

This patch fixes:

- calculate the zapped page count properly in mmu_zap_unsync_children()
- calculate the freed page count properly in kvm_mmu_change_mmu_pages()
- restart the hlist walk after zapping a child page

KVM-Stable-Tag.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: VMX: Save/restore rflags.vm correctly in real mode
Avi Kivity [Thu, 8 Apr 2010 15:19:35 +0000 (18:19 +0300)]
KVM: VMX: Save/restore rflags.vm correctly in real mode

(Cherry-picked from commit 78ac8b47c566dd6177a3b9b291b756ccb70670b7)

Currently we set eflags.vm unconditionally when entering real mode emulation
through virtual-8086 mode, and clear it unconditionally when we enter protected
mode.  This means that the following sequence

  KVM_SET_REGS  (rflags.vm=1)
  KVM_SET_SREGS (cr0.pe=1)

Ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode().

Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode:
reads and writes to those bits access a shadow register instead of the actual
register.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: allow bit 10 to be cleared in MSR_IA32_MC4_CTL
Andre Przywara [Wed, 24 Mar 2010 16:46:42 +0000 (17:46 +0100)]
KVM: allow bit 10 to be cleared in MSR_IA32_MC4_CTL

(Cherry-picked from commit 114be429c8cd44e57f312af2bbd6734e5a185b0d)

There is a quirk for AMD K8 CPUs in many Linux kernels (see
arch/x86/kernel/cpu/mcheck/mce.c:__mcheck_cpu_apply_quirks()) that
clears bit 10 in that MCE related MSR. KVM can only cope with all
zeros or all ones, so it will inject a #GP into the guest, which
will make it panic.
So let's add a quirk to the quirk and ignore this single cleared bit.
This fixes -cpu kvm64 on all machines and -cpu host on K8 machines
with some guest Linux kernels.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: Don't spam kernel log when injecting exceptions due to bad cr writes
Avi Kivity [Thu, 11 Mar 2010 10:20:03 +0000 (12:20 +0200)]
KVM: Don't spam kernel log when injecting exceptions due to bad cr writes

(Cherry-picked from commit d6a23895aa82353788a1cc5a1d9a1c963465463e)

These are guest-triggerable.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: SVM: Fix memory leaks that happen when svm_create_vcpu() fails
Takuya Yoshikawa [Tue, 9 Mar 2010 05:55:19 +0000 (14:55 +0900)]
KVM: SVM: Fix memory leaks that happen when svm_create_vcpu() fails

(Cherry-picked from commit b7af40433870aa0636932ad39b0c48a0cb319057)

svm_create_vcpu() does not free the pages allocated during the creation
when it fails to complete the allocations. This patch fixes it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoKVM: VMX: Update instruction length on intercepted BP
Jan Kiszka [Tue, 23 Feb 2010 16:47:53 +0000 (17:47 +0100)]
KVM: VMX: Update instruction length on intercepted BP

(Cherry-picked from commit c573cd22939e54fc1b8e672054a505048987a7cb)

We intercept #BP while in guest debugging mode. As VM exits due to
intercepted exceptions do not necessarily come with valid
idt_vectoring, we have to update event_exit_inst_len explicitly in such
cases. At least in the absence of migration, this ensures that
re-injections of #BP will find and use the correct instruction length.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agosched: Use proper type in sched_getaffinity()
KOSAKI Motohiro [Wed, 17 Mar 2010 00:36:58 +0000 (09:36 +0900)]
sched: Use proper type in sched_getaffinity()

commit 8bc037fb89bb3104b9ae290d18c877624cd7d9cc upstream.

Using the proper type fixes the following compiler warning:

  kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: torvalds@linux-foundation.org
Cc: travis@sgi.com
Cc: peterz@infradead.org
Cc: drepper@redhat.com
Cc: rja@sgi.com
Cc: sharyath@in.ibm.com
Cc: steiner@sgi.com
LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoext4: fix async i/o writes beyond 4GB to a sparse file
Eric Sandeen [Fri, 5 Feb 2010 04:58:38 +0000 (23:58 -0500)]
ext4: fix async i/o writes beyond 4GB to a sparse file

commit a1de02dccf906faba2ee2d99cac56799bda3b96a upstream.

The "offset" member in ext4_io_end holds bytes, not blocks, so
ext4_lblk_t is wrong - and too small (u32).

This caused the async i/o writes to sparse files beyond 4GB to fail
when they wrapped around to 0.

Also fix up the type of the arguments to ext4_convert_unwritten_extents();
it gets ssize_t from ext4_end_aio_dio_nolock() and
ext4_ext_direct_IO().
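The wrap-around is easy to demonstrate (a hypothetical illustration of storing a byte offset in a 32-bit field, as using ext4_lblk_t effectively did):

```c
#include <stdint.h>

/* Storing a byte offset past 4 GiB in a 32-bit field silently
 * truncates it, which is how the i/o "wrapped around to 0". */
static uint32_t offset_as_u32(uint64_t byte_offset)
{
    return (uint32_t)byte_offset;
}
```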

Reported-by: Giel de Nijs <giel@vectorwise.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: maximilian attems <max@stro.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agopowerpc: Fix SMP build with disabled CPU hotplugging.
Adam Lackorzynski [Sat, 27 Feb 2010 07:07:59 +0000 (07:07 +0000)]
powerpc: Fix SMP build with disabled CPU hotplugging.

commit 5b72d74ce2fccca2a301de60f31b16ddf5c93984 upstream.

Compiling 2.6.33 with SMP enabled and HOTPLUG_CPU disabled gives me the
following link errors:

  LD      init/built-in.o
  LD      .tmp_vmlinux1
arch/powerpc/platforms/built-in.o: In function `.smp_xics_setup_cpu':
smp.c:(.devinit.text+0x88): undefined reference to `.set_cpu_current_state'
smp.c:(.devinit.text+0x94): undefined reference to `.set_default_offline_state'
arch/powerpc/platforms/built-in.o: In function `.smp_pSeries_kick_cpu':
smp.c:(.devinit.text+0x13c): undefined reference to `.set_preferred_offline_state'
smp.c:(.devinit.text+0x148): undefined reference to `.get_cpu_current_state'
smp.c:(.devinit.text+0x1a8): undefined reference to `.get_cpu_current_state'
make: *** [.tmp_vmlinux1] Error 1

The following change fixes that for me and seems to work as expected.

Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agomd: deal with merge_bvec_fn in component devices better.
NeilBrown [Wed, 31 Mar 2010 01:07:16 +0000 (12:07 +1100)]
md: deal with merge_bvec_fn in component devices better.

commit 627a2d3c29427637f4c5d31ccc7fcbd8d312cd71 upstream.

If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to.  Currently this is done by setting
max_sector to 1 PAGE; however, this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_phys_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agomodule: fix __module_ref_addr()
Mathieu Desnoyers [Tue, 20 Apr 2010 14:38:10 +0000 (10:38 -0400)]
module: fix __module_ref_addr()

The __module_ref_addr() problem disappears in 2.6.34-rc kernels because these
percpu accesses were re-factored.

__module_ref_addr() should use per_cpu_ptr() to obfuscate the pointer
(RELOC_HIDE is needed for per cpu pointers).

This non-standard per-cpu pointer use has been introduced by commit
720eba31f47aeade8ec130ca7f4353223c49170f

It causes a NULL pointer exception on some configurations when CONFIG_TRACING is
enabled on 2.6.33. This patch fixes the problem (acknowledged by Randy who
reported the bug).

It did not appear to hurt previously because most of the accesses were done
through local_inc, which probably obfuscated the access enough that no compiler
optimizations were done. But with local_read() done when CONFIG_TRACING is
active, this becomes a problem. Non-CONFIG_TRACING is probably affected as well
(module.c contains local_set and local_read that use __module_ref_addr()), but I
guess nobody noticed because we've been lucky enough that the compiler did not
generate the inappropriate optimization pattern there.

This patch should be queued for the 2.6.29.x through 2.6.33.x stable branches.
(tested on 2.6.33.1 x86_64)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
CC: Eric Dumazet <dada1@cosmosbay.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Tejun Heo <tj@kernel.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agolockdep: fix incorrect percpu usage
Mathieu Desnoyers [Tue, 20 Apr 2010 14:33:50 +0000 (10:33 -0400)]
lockdep: fix incorrect percpu usage

The mainline kernel as of 2.6.34-rc5 is not affected by this problem because
commit 10fad5e46f6c7bdfb01b1a012380a38e3c6ab346 fixed it by refactoring.

Should use per_cpu_ptr() to obfuscate the per cpu pointers (RELOC_HIDE is needed
for per cpu pointers).

git blame points to commit:

lockdep.c: commit 8e18257d29238311e82085152741f0c3aa18b74d

But that commit really just moved the code around; it's enough to say that
the problem appeared before Jul 19 01:48:54 2007, which brings us back to 2.6.23.

It should be applied to stable 2.6.23.x to 2.6.33.x (or whichever of these
stable branches are still maintained).

(tested on 2.6.33.1 x86_64)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Randy Dunlap <randy.dunlap@oracle.com>
CC: Eric Dumazet <dada1@cosmosbay.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Tejun Heo <tj@kernel.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agomodules: fix incorrect percpu usage
Mathieu Desnoyers [Tue, 20 Apr 2010 14:34:57 +0000 (10:34 -0400)]
modules: fix incorrect percpu usage

Mainline does not need this fix, as commit
259354deaaf03d49a02dbb9975d6ec2a54675672 fixed the problem by refactoring.

Should use per_cpu_ptr() to obfuscate the per cpu pointers (RELOC_HIDE is needed
for per cpu pointers).

Introduced by commit:

module.c: commit 6b588c18f8dacfa6d7957c33c5ff832096e752d3

This patch should be queued for the stable branch, for kernels 2.6.29.x to
2.6.33.x.  (tested on 2.6.33.1 x86_64)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Randy Dunlap <randy.dunlap@oracle.com>
CC: Eric Dumazet <dada1@cosmosbay.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Tejun Heo <tj@kernel.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
7 years agoACPI: EC: Limit burst to 64 bits
Alexey Starikovskiy [Fri, 16 Apr 2010 19:36:40 +0000 (15:36 -0400)]
ACPI: EC: Limit burst to 64 bits

commit 2060c44576c79086ff24718878d7edaa7384a985 upstream.

The access_bit_width field is u8 in ACPICA, so a value of 256 written to
it becomes 0, causing a divide by zero later.

The proper fix would be to remove access_bit_width entirely, since we
already have access_byte_width, which is access_bit_width / 8.
Limit the access width to 64 bits for now.
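The truncation itself is trivial to show (an illustration, not ACPICA code):

```c
#include <stdint.h>

/* The field is u8: assigning 256 wraps to 0, and any later division
 * by the stored width then faults. */
static uint8_t store_bit_width(unsigned int width)
{
    return (uint8_t)width;
}
```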

https://bugzilla.kernel.org/show_bug.cgi?id=15749
fixes regression caused by the fix for:
https://bugzilla.kernel.org/show_bug.cgi?id=14667

Signed-off-by: Alexey Starikovskiy <astarikovskiy@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>